PRE2020 3 Group6

From Control Systems Technology Group
Jump to navigation Jump to search

Sign to text software

Group Members

Name Student ID Department Email address
Ruben Wolters 1342355 Computer Science r.wolters@student.tue.nl
Pim Rietjens 1321617 Computer Science p.g.e.rietjens@student.tue.nl
Pieter Michels 1307789 Computer Science p.michels@student.tue.nl
Sterre van der Horst 1227255 Psychology and Technology s.a.m.v.d.horst1@student.tue.nl
Sven Bierenbroodspot 1334859 Automotive Technology s.a.k.bierenbroodspot@student.tue.nl

Problem Statement and Objective

At the moment 466 million people suffer from hearing loss, it has been predicted that this number will increase to 900 million by 2050. Hearing loss has, among other things, a social and emotional impact on ones life. The inability to communicate easily with others can cause an array of negative emotions such as loneliness, feeling of isolation and sometimes also frustration [1]. Although there are many different types of speech recognition technologies for live subtitling that can help people that are deaf or hard of hearing (DHH), these feelings can still be exacerbated during online meetings. DHH individuals must concentrate on the person talking, the interpretation, and of any potential interruptions that can occur [2]. Furthermore, to be able to take part in the discussion, they must be able to spontaneously react in conversation. However, not everyone understands sign language which makes communicating even more difficult. Nowadays, especially due to the COVID-19 pandemic, it is becoming more normal to work from home and therefore the number of online meetings is increasing quickly [3]. This leads us to our objective: to deveop software that translates Sign Language to text to help DHH individuals communicate in an online environment. This system will be a tool that DHH individuals can use to communicate during online meetings. The number of people that have to work or be educated from home has rapidly increased due to the COVID-19 pandemic [4]. This means that the number of DHH individuals that have to work in online environments also increases. Previous studies have shown that DHH individuals obtain lower score on an Academic Engagement Form for communication compared to students with no disability [5]. This finding can be explained by the fact that DHH people are usually unable to understand speech without aid. This aid can be a hearing aid, technology that convert speech to text, or even an interpreter, however the latter is expensive and not available for most DHH individuals. To talk to or react to other people, DHH individuals can use pen and paper, or in an online environment by typing. However, this is a lot slower than speech or sign language which makes it almost impossible for DHH individual to keep up with the impromptu nature of discussions or online meetings [6]. Therefore, by creating software that can convert sign language to text, or even to speech, DHH individuals will be able to actively participate in meeting. To do this, it is important to understand what sign language is. The following section of this wiki page, will explain the different elements of sign language and what it is.

Sign Language: what is it?

Sign language is a natural language that is predominantly used by people that are deaf or hard of hearing, but also by hearing people as well. Of all the children who are born deaf, 9 out of 10 are born to hearing parents. This means that the parents often have to learn sign language alongside the child [7].

Sign language is comparable to spoken language in the sense that it differs per country. American Sign Language (ASL) and British Sign Language (BSL) were developed separately and are therefore incomparable, meaning that people that use ASL will not necessarily be able to understand BSL [7].

It does not express single words, it expresses meanings. For example, the word right has two definitions. It means correct, and opposite of left. In spoken English, right is used for both meanings. In sign language, there are different signs for the different definitions of the word right. A single sign can also mean a whole entire sentence. By varying the hand orientation and direction, the meaning of the sign, and therefore the sentence, changes [8].

Having said that, all sign languages rely on certain parameters, or a selection of these parameters, to indicate meaning. These parameters are [9]:

  • Handshape: the general shape ones hands, and fingers make;
  • Location: where the sign is located in space, body and face are used as reference points to indicate location;
  • Movement: how the hands move;
  • Number of hands: this naturally refers to how many hands are used for the sign, and it also refers to the ‘relationship of the hands to each other’ ;
  • Palm orientation: this is how the forearm and wrist rotate when signing;
  • Non-manuals: this refers to the face and body. Facial expressions can be used for different meanings, or lexical distinctions. They can also be used to indicate mood, topics and aspect.

According to the study by Tatman, the first three parameters are universal in all sign languages. However, using facial expressions for lexical distinctions is something that is not used in most languages. The use of parameters also depends on cultural and cognitive context and feasibility of that parameters [9].

https://books.google.nl/books?id=dnxDgCvEnJoC&lpg=PR11&ots=03dwMQ7zBO&dq=sign%20language&lr&pg=PA1#v=onepage&q=sign%20language&f=false

USE

User

The target user group of our software are people who are not able to express themselves with speech. There are different causes for people not being able to express themselves. People who are deaf or hard of hearing often cannot speak. There are also people who cannot speak due to physical disabilities. When a person cannot express itself through speech the person usually is able to express itself through sign language. Because of this non-verbal manner of expression the target group is not able to function like a person who is able to speak in a meeting. This holds for both online and offline meetings. The solution for this is to hire an interpreter to translate the sign language into speech. This is somewhat expensive and not accessible by everyone. The software is meant to be accessible for everyone to make meetings possible for the target group.


Our product will enable the user to express itself using not only text but also their facial expressions. This will help the user to function in meetings more fluid. This will allow the user to have more possibilities in the job market.

Society

Enterprise

Design concept

The user will participate in an online meeting using a camera with a sufficient quality. The video feed of the user will be extracted by the sign to text software. It will be able to detect when it should be extracting frames through the detection of a hand in the screen or through manual activation. The extracted frames will be used to compute the output using a neural network. The neural network will be trained using a dataset of sign language. It will be trained to detect single letters and numbers. The program will take the frames that are input and match it with one of the letters or numbers.


The output of the program is the sequence of letters and numbers the user has signed. The output will be displayed like subtitles would be. The delay of these subtitles will have to be as low as possible since this will increase the effect of facial expressions in their sentences.

State of the art

Research into sign language, luckily, has already been done many times before this project. The challenges that previous investigations have run into are well documented and defined, allowing us to work around them and being able to identify them early in the project. Furthermore, due to the large amount of research already done into sign language recognition it is easier for us to identify what methods to pick for this project.


In this state-of-the-art section the most up to date findings will be shortly discussed. First, the challenges that this project can face are considered. Second, several different approaches that have been taken are reviewed. Finally, out of all this information the method that has been chosen for this project will be explained.

Challenges

The first and most obvious challenge originates from sign language itself. Signs are often not identifiable based on one frame or image. Almost all signs require some sort of motion which is simply not possible to capture in an image. Therefore, whatever classifier is decided on for this project, it is most likely going to have to be able to identify signs from video, which is significantly harder than recognizing signs from just one image. Classifying signs from video, although much more difficult than classifying from images, is not impossible. An example of a successful sign language recognizer is given by Pu, Junfu, Zhou, Wengang and Li, Houqiang in their 2019 paper “Iterative Alignment Network for Continuous Sign Language Recognition” [10]. In this paper a method is proposed to map some T number of frames to an amount L of words or signs. Sadly, this method combines several highly advanced neural network techniques, which are for sure out of the scope of this course. Furthermore, this method is designed specifically for use with 3D videos, which is something users of this product might not have access to.


Furthermore, signs can be executed at different speeds. Obviously, there will be marginal differences between every other signer, but a significant speed difference in the execution of a sign can also carry some meaning. For instance, by doing the sign for running much faster than normal the signer can convey that he had to run very fast to get somewhere.


One more issue that sign language recognizers can run into is the use of depth in signs. A good example of a sign where this can become problematic is the sign for ‘gray’ in American Sign Language (ASL), which is done as shown in this video: https://www.signingsavvy.com/search/gray. Notice how the signer must wave his or her hands through each other for a few seconds. Using a 2D camera this depth is impossible to capture. Evidently, requiring users of this product to have a 3D camera or a setup with multiple cameras is not a feasible option.


Moreover, signs can look very much like each other. Again, the sign for ‘gray’ comes to mind. The signs for both ‘gray’ and ‘whatever’ are very similar (see this link for how ‘whatever’ can be signed: https://www.signingsavvy.com/search/whatever). Both signs have the signer moving their hands back and forth with very little difference in hand gesture. A way around this problem can be found in the facial expression of the signer. As shown by U. von Agris, M. Knorr and K. Kraiss the facial expressions that are made during signing are key to recognizing the correct sign [11]. This is also something that can be seen if attention is paid to the face of the signers in the videos for the ‘gray’ and ‘whatever’ signs. During execution of the sign, it is almost like the signer is saying the word out loud. The only difference of course being that no sound is made. However, as shown in the paper, this is not the only facial feature that can help with classification. As shown in the paper, the facial expression of the signer can change the meaning of a sign. Therefore, to make even better sign language recognition software, the facial expressions should be considered as well.


The fact that signs can look very much alike is not really an issue that is specific to this project. In general, for all neural networks it will be very hard to classify two images that are very similar. Even when adding another dimension this problem is not easy to solve for hand gestures, as shown by L. K. Phadtare, R. S. Kushalnagar and N. D. Cahill [12]. In this paper an algorithm is proposed to detect hand gestures using the Kinect, a camera that has depth perception. The algorithm is shown to be incredibly accurate in distinguishing hand gestures that are very different from one another. However, the authors also show that the smaller the difference between gestures, the less accurate the classification will be.


However, the fact that the face can impact the meaning of signs so much means that the classifier should look at more than just the hands, adding another layer of difficulty to this already hard problem. Luckily, this is an issue that has already been looked at before by K. Imagawa, Shan Lu and S. Igi [13]. In this paper methods to distinguish specifically the face and hands from a 2D image are discussed. The idea of the proposed solution is to use a color mapping based on the skin color of the signer to identify hand and face ‘blobs’ in the image. Although the findings in this paper are very interesting, the implementation difficulty and time must be considered.


Another thing that might be problematic specifically for this project is the fact that the background used in the images or videos that need to be classified can be very cluttered. This can lead to the classifier not being able to distinguish between the hands of the signer and the background, leading to wrong classifications. Although in meetings it is usually customary to have a clear background, this is not something that can be expected of potential users of this product.

Approaches

To make sign language recognizers several approaches have been taken throughout the years. Some of these approaches have been highlighted by Cheok, M.J., Omar, Z. and Jaward, M.H. in their article reviewing existing hand gesture and sign recognition techniques [14]. In this article the several approaches have been split into two major classes: vision and sensor based.

Vision based

Vision based sign language recognition, as the name implies, uses videos or images acquired using one or more cameras and possibly extended with extra more sophisticated techniques.


The easiest approach – to set up at least - in the vision-based category is the one using only one camera. This would result in some recognizer classifying signs based on 2D images or videos. However, this one camera could be extended to use a combination of several cameras in order to give the data representation some sense of depth.


Both approaches can be expanded on using markers to more easily identify and even track hands. Examples of this are mentioned in the article by Cheok, M.J., Omar, Z. and Jaward, M.H. as well: signers can be asked to wear two differently colored gloves or little wristbands. These colored gloves for instance were used in the paper mentioned earlier by K. Imagawa, Shan Lu and S. Igi [15].

Sensor based

Sensor based sign language recognition although very promising and interesting is also much harder to make and to set up. Sensor based approaches require the measuring of something using one or more sensors or instruments. These sensors could be used to track for instance the position of the hands during signing, resulting in a more abstract ‘image’ of where the hands are during a sign. However, the sensors could also be used to measure things like velocity and acceleration, which as discussed earlier, is probably very helpful since most signs can not be made by holding your hands in one position. One step further could mean biological signals, like the number of electrical pulses through the muscles required to make some sign.

Our solution

The solution that was used in this project mostly came down to what can be expected from users of the product. The goal of this project is a sign language to text interpreter so deaf people can participate in online meetings more easily. Since it just can not be expected of users to have expansive 3D cameras or to get all kinds of equipment before the classifier works it seemed kind of obvious to go for the easiest to implement solution: the product will be convolutional neural network that can classify 2D images to the correct signs. Do note that this being the ‘easiest’ solution does not imply that it will work very well, in fact, due to the lack of information caused by only having access to a 2D image or video feed the product might actually end up being very bad at recognizing sign language.

Technical specifications

Classifier part: To accurately detect what a user (hearing impaired/mute/deaf person) wants to convey from video footage of them using sign language we have decided to look into making classifier models. In this scenario a classifier model should correctly classify (parts of) video footage to contain the signed language actually within that video footage. To make this classifier we have opted to use neural networks. To build a classifier you will need a dataset containing labelled images for the classes which the classifier is supposed to detect. This data is then divided in train, test and validation datasets where no 2 sets overlap. The train data set is the data on which the models of the neural network is trained, the validation set is used to compare performances between models and the test dataset is used to test the accuracy of the eventual generated models of the neural network. For sign language we have found two distinctly different types of sign language which require different approaches in how to make an effective classifier for them. Signs which consist of a single hand position and do not contain movement, such as the sign for the number 1, and signs throughout which the hands appear in different positions and thus naturally contain movement, such as the sign for butterfly. The main difference for the classifiers for these 2 types of signs is that all the information for the first type of signs can be detected from a single frame, while for the second type of signs you will require multiple frames to relay all the relevant information regarding the sign. As such henceforth we will refer to these types of signs as single-frame and multiple-frame signs. Building a classifier for strictly single-frame signs is (considerably) easier as the classifier only needs to look at a single frame. The input of the classifier will consist of a single image file with consistent resolutions. As such there is by default uniformity within the data and there is little need for pre-processing of the data. When inputted into the neural network, the image file will be turned by the computer as a matrix of pixels. For the images which we used for the single-frame signs for classifying alphabet signs, we used images of 200X200 pixels which were scaled grey. Scaling the images grey means the input for the neural network and its models will consist of a single matrix of 200 by 200 values. If we were to have used coloured images, then an image would consist of 3 layers of 2 dimensional matrices, one for each colour in RGB (Red, Green, Blue). To create the classifier for single-frame signs, we used a neural network with convolution layers and max pooling layers. The convolution layers in the neural network convolute a group of values in the matrix by applying a kernel on the group of values and returning a single value for the next layer of the neural network. The kernel can be for example 3X3 pixels and give only count the left most pixels and the centre pixel, the resulting value of this kernel would give the summed up values of the counted pixels. The kernel goes over all groups of the matrix and returns a new matrix with new values, the new layer of the neural network (this does not decrease the amount of values in the matrix, it would still be 200X200 as the values in the groups are not exclusive). In this way the convolutional layers are meant to extract high-level features from the images, such as where the edges of images or critical sections of images are. In addition to this max pooling layers are used to decrease the size of the matrices. A max pooling layer runs a kernel over a matrix but with a larger stride (the distance between the placement of the kernel) so less values are outputted as less groups of pixels are inspected. Max pooling simply returns the largest value in a kernel, which is meant to summarize the information in a group of pixels in a single pixel to smoothen out the layers in the neural network. After the convolution layers and max pooling layers have been applied to the original matrix representing the inputted image, a flattening layer is applied to turn the matrix into a single vector of values. On this single vector which now represents all the relevant information from the original image, we apply fully connected layers (also known as dense layers) which predicts the correct label for the inputted image. The fully connected layers apply weight to the values in the inputted vector and calculate the predicted probabilities for each class within our classifier. In the first few fully connected layers the ReLU (Rectified Linear Units) activation function is used to reduce the size of the vector. In the last fully connected layer which outputs the results for all the classes the softmax activation function is used to normalize the vector to output vectors between 0 and 1, denoting the probabilities for the inputted image to be each class.


Realization

Testing

Design evaluation

Week 1

Week 1 mostly consisted of putting together a group and decide upon a topic. We settled on the topic of emotion recognition on children with ASD. Research has been done on this topic and references to similar projects have been gathered. The focus for next week is to explore what is possible to achieve within this topic.

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 8,5 meeting with group - deciding subject (1h 30m), gathering and reading sources (2h 30m), summarizing and further reading of sources (4h 30m)
Sterre van der Horst 1227255 10 preparing group meeting (30m), meeting with group - deciding subject (1h 30m), finding relevant sources (2h 30m), summarizing and reading sources (4h 30m), finding more sources (1h)
Pieter Michels 1307789 8,5 meeting with group - deciding subject (1h 30m), gathering and reading sources (2h), summarizing and further reading of sources (5h)
Pim Rietjes 1321617 8,5 meeting with group - deciding subject (1h 30m), gathering and reading sources (3h), summarizing and further reading of sources (4h)
Ruben Wolters 1342355 8,5 meeting with group - deciding subject (1h 30m), gathering and reading sources (3h), summarizing and further reading of sources (4h)

Week 2

In week 2 we decided after discussing the possible deliverables and came to the conclusion that it is difficult to find a dataset which we could use. The creation of a dataset is nearly impossible due to the slim target group and the current corona measures. for these reasons we abandoned the subject and discussed a new topic. The selected topic is to develop software which can convert sign language into text using video as an input.

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 7 meeting with supervisor (1h), ideation for new topics (2h), meeting deciding on new subject (1h), reading about new subject (3h)
Sterre van der Horst 1227255 13.75 preparing meeting with supervisor (45m), meeting with supervisor (1h), finding new sources about ASD (3h), analyzing new sources (2h), meeting deciding new subject (1h), finding new sources about new subject (3h), summarizing new sources (3h)
Pieter Michels 1307789 10 meeting with supervisor (1h), reading on old subject (2h), looking for databases on old subject (2h), meeting deciding on new subject (1h), reading about new subject (4h)
Pim Rietjes 1321617 11.5 meeting with supervisor (1h), reading on old subject (3h), looking for databases on old subject (3h), meeting deciding on new subject (1h), looking at databases for new subject (3.5h)
Ruben Wolters 1342355 description

Week 3

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 6 meeting with supervisor (1h), introducing wiki structure (1h), reading and summarizing sources (4h)
Sterre van der Horst 1227255 10.5 preparing meeting with supervisor (30m), meeting with supervisor (1h), finding and reading new sources (5h), writing introduction and problem statement (2h), rewriting problem statement/introduction and adding to wiki (2h)
Pieter Michels 1307789 11 meeting with supervisor (1h), reading and summarizing sources (4h), setting up coding environment (3h), Getting familiar with Tensorflow (3h)
Pim Rietjes 1321617 10 meeting with supervisor (1h), looking into example classifiers (2h), downloading and exploring datasets (3h), setting up coding environment (3h), getting familiar with Tensorflow (1h)
Ruben Wolters 1342355 meeting with supervisor (1h)

Week 4

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 5 research sign language (1h), group meeting (30m), research into user (3h), user text on wiki (30m)
Sterre van der Horst 1227255 6.5 research sign language (2h), writing section about what is sign language (2h), group meeting (30m), adding references to previously written problem statement and what is sign language (30m), first draft questionnaire (1.5h)
Pieter Michels 1307789 7,5 meeting with supervisor (1h), adding timetables to wiki (30m), investigate into tensorflow and Keras (2h 30m), group meeting (30m), Working on classifier (3h)
Pim Rietjes 1321617 12,5 meeting with supervisor (1h), looking into example classifiers (2h), downloading and exploring datasets (1h), preprocessing data (4h), group meeting (30m), Working on classifier (4h)
Ruben Wolters 1342355 description

Week 5

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 6,5 Video for different display possibilities (5h), Flowchart global design specs (1h 30m)
Sterre van der Horst 1227255 description
Pieter Michels 1307789 10h meeting with supervisor (1h), worked on the preprocessor for the classifier (5h), making sure preprocessor works (1h), working on state of the art section - which then got deleted for some unkown reason so I have to do it again :))))) (3h)
Pim Rietjes 1321617 13.5h meeting with supervisor (1h), writing text on the classifier (2.5h), calculating optical flow (moving signs) (4h), working on preprocessor (1h), working on neural network (static signs) (2h), working on neural network (moving signs) (3h)
Ruben Wolters 1342355 description

Week 6

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 Meeting with supervisor (1h)
Sterre van der Horst 1227255 description
Pieter Michels 1307789 10h Meeting with supervisor (1h), rewriting/checking the state-of-the-art section (5h), investigating how to integrate classifier with teams (4h)
Pim Rietjes 1321617 description
Ruben Wolters 1342355 description

Week 7

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 description
Sterre van der Horst 1227255 description
Pieter Michels 1307789 description
Pim Rietjes 1321617 description
Ruben Wolters 1342355 description

Week 8

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 description
Sterre van der Horst 1227255 description
Pieter Michels 1307789 description
Pim Rietjes 1321617 description
Ruben Wolters 1342355 description

References

  1. [1] Deafness and Hearing Loss - World Health Organization. (2021) WHO.
  2. [2] Peruma, A., & El-Glaly, Y. N. (2017). CollabAll: Inclusive discussion support system for deafand hearing students. ASSETS 2017 - Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, 315–316.
  3. [3] Microsoft Teams reaches 115 million DAU—plus, a new daily collaboration minutes metric for Microsoft 365 - Microsoft 365 Blog. (2021).
  4. [4] European Commission. (2020). Telework in the EU before and after the COVID-19 : where we were , where we head to. Science for Policy Briefs, 2009, 8 .
  5. [5] Richardson, J. T. E., Long, G. L., & Foster, S. B. (2004). Academic engagement in students with a hearing loss in distance education. Journal of Deaf Studies and Deaf Education, 9(1), 68–85.
  6. Glasser, A., Kushalnagar, K., & Kushalnagar, R. (2019). Deaf, Hard of Hearing, and Hearing perspectives on using Automatic Speech Recognition in Conversation. ArXiv, 427–432.
  7. 7.0 7.1 [6]Scarlett, W. G. (2015). American Sign Language. The SAGE Encyclopedia of Classroom Management.
  8. Perlmutter, D. M. (2013). What is Sign Language ? Linguistic Society of America, 6501(202).
  9. 9.0 9.1 [7] Tatman, R. (2015). The Cross-linguistic Distribution of Sign Language Parameters. Proceedings of the Annual Meeting of the Berkeley Linguistics Society, 41(January).
  10. Pu, Junfu & Zhou, Wengang & Li, Houqiang. (2019). Iterative Alignment Network for Continuous Sign Language Recognition. 4160-4169. https://doi.org/10.1109/CVPR.2019.00429
  11. U. von Agris, M. Knorr and K. Kraiss, "The significance of facial features for automatic sign language recognition," 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, Amsterdam, Netherlands, 2008, pp. 1-6, doi: 10.1109/AFGR.2008.4813472.
  12. L. K. Phadtare, R. S. Kushalnagar and N. D. Cahill, "Detecting hand-palm orientation and hand shapes for sign language gesture recognition using 3D images," 2012 Western New York Image Processing Workshop, Rochester, NY, USA, 2012, pp. 29-32, doi: 10.1109/WNYIPW.2012.6466652.
  13. K. Imagawa, Shan Lu and S. Igi, "Color-based hands tracking system for sign language recognition," Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 1998, pp. 462-467, doi: 10.1109/AFGR.1998.670991.
  14. Cheok, M.J., Omar, Z. & Jaward, M.H. A review of hand gesture and sign language recognition techniques. Int. J. Mach. Learn. & Cyber. 10, 131–153 (2019). https://doi.org/10.1007/s13042-017-0705-5
  15. K. Imagawa, Shan Lu and S. Igi, "Color-based hands tracking system for sign language recognition," Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 1998, pp. 462-467, doi: 10.1109/AFGR.1998.670991.