PRE2020 3 Group6


Sign to text software

Group Members

Name Student ID Department Email address
Ruben Wolters 1342355 Computer Science r.wolters@student.tue.nl
Pim Rietjens 1321617 Computer Science p.g.e.rietjens@student.tue.nl
Pieter Michels 1307789 Computer Science p.michels@student.tue.nl
Sterre van der Horst 1227255 Psychology and Technology s.a.m.v.d.horst1@student.tue.nl
Sven Bierenbroodspot 1334859 Automotive Technology s.a.k.bierenbroodspot@student.tue.nl

Problem Statement and Objective

At the moment, 466 million people suffer from hearing loss, and it has been predicted that this number will increase to 900 million by 2050. Hearing loss has, among other things, a social and emotional impact on one's life. The inability to communicate easily with others can cause an array of negative emotions such as loneliness, a feeling of isolation, and sometimes also frustration [1]. Although there are many different types of speech recognition technologies for live subtitling that can help people who are deaf or hard of hearing (DHH), these feelings can still be exacerbated during online meetings. DHH individuals must concentrate on the person talking, on the interpretation, and on any potential interruptions that can occur [2]. Furthermore, to take part in the discussion, they must be able to react spontaneously in conversation. However, not everyone understands sign language, which makes communicating even more difficult. Nowadays, especially due to the COVID-19 pandemic, it is becoming more normal to work from home, and the number of online meetings is therefore increasing quickly [3].

This leads us to our objective: to develop software that translates sign language to text to help DHH individuals communicate in an online environment. This system will be a tool that DHH individuals can use to communicate during online meetings.

The number of people who have to work or be educated from home has rapidly increased due to the COVID-19 pandemic [4]. This means that the number of DHH individuals who have to work in online environments also increases. Previous studies have shown that DHH individuals obtain lower scores on an Academic Engagement Form for communication compared to students with no disability [5]. This finding can be explained by the fact that DHH people are usually unable to understand speech without aid. This aid can be a hearing aid, technology that converts speech to text, or even an interpreter; however, the latter is expensive and not available to most DHH individuals. To talk to or react to other people, DHH individuals can use pen and paper, or, in an online environment, typing. However, this is a lot slower than speech or sign language, which makes it almost impossible for DHH individuals to keep up with the impromptu nature of discussions or online meetings [6]. Therefore, by creating software that can convert sign language to text, or even to speech, DHH individuals will be able to actively participate in meetings. To do this, it is important to understand what sign language is. The following section of this wiki page explains what sign language is and what its different elements are.

Sign Language: what is it?

Sign language is a natural language that is predominantly used by people who are deaf or hard of hearing, but also by hearing people. Of all children who are born deaf, 9 out of 10 are born to hearing parents. This means that the parents often have to learn sign language alongside the child [7].

Sign language is comparable to spoken language in the sense that it differs per country. American Sign Language (ASL) and British Sign Language (BSL) were developed separately and are mutually unintelligible, meaning that people who use ASL will not necessarily be able to understand BSL [7].

Sign language does not express single words, it expresses meanings. For example, the word 'right' has two definitions: it means correct, and it means the opposite of left. In spoken English, 'right' is used for both meanings, but in sign language there are different signs for the different definitions of the word. A single sign can also mean an entire sentence. By varying the hand orientation and direction, the meaning of the sign, and therefore the sentence, changes [8].

Having said that, all sign languages rely on certain parameters, or a selection of these parameters, to indicate meaning. These parameters are [9]:

  • Handshape: the general shape one's hands and fingers make;
  • Location: where the sign is located in space; the body and face are used as reference points to indicate location;
  • Movement: how the hands move;
  • Number of hands: this naturally refers to how many hands are used for the sign, but it also refers to the 'relationship of the hands to each other';
  • Palm orientation: how the forearm and wrist rotate when signing;
  • Non-manuals: this refers to the face and body. Facial expressions can be used for different meanings, or lexical distinctions. They can also be used to indicate mood, topic, and aspect.

According to the study by Tatman, the first three parameters are universal across all sign languages. However, using facial expressions for lexical distinctions is not found in most sign languages. The use of parameters also depends on the cultural and cognitive context and on the feasibility of those parameters [9].


USE

User

The target user group of our software consists of people who are not able to express themselves through speech. There are different causes for this: people who are deaf or hard of hearing often cannot speak, and there are also people who cannot speak due to physical disabilities. When a person cannot express themselves through speech, they are usually able to express themselves through sign language. Because of this non-verbal manner of expression, the target group cannot function in a meeting the way a speaking person can. This holds for both online and offline meetings. One solution is to hire an interpreter to translate the sign language into speech, but interpreters are expensive and not accessible to everyone. Our software is meant to be accessible to everyone, so that meetings become possible for the target group.


Implementation

Our product will enable the user to express themselves using not only text but also their facial expressions. This will help the user function more fluidly in meetings, which in turn gives the user more possibilities on the job market.

Society

Enterprise

Design concept

The user will participate in an online meeting using a camera of sufficient quality. The video feed of the user will be processed by the sign-to-text software. The software will be able to detect when it should extract frames, either by detecting a hand on screen or through manual activation. The extracted frames will be fed to a neural network to compute the output. The neural network will be trained on a sign language dataset to recognize single letters and numbers, and the program will match each input frame to one of these letters or numbers.

The output of the program is the sequence of letters and numbers the user has signed. This output will be displayed like subtitles. The delay of these subtitles will have to be as low as possible, so that the text stays synchronized with the user's facial expressions during their sentences.
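As a rough sketch of this design, the snippet below shows how such a capture-and-subtitle loop could look in Python. It assumes a webcam read through OpenCV and a hypothetical classify_frame() helper that wraps the trained classifier described under Technical specifications; it illustrates the intended flow, not the final implementation.

 import cv2

 def classify_frame(frame):
     """Hypothetical placeholder for the trained sign classifier: returns one predicted character."""
     raise NotImplementedError

 # Hand detection / manual activation from the design is omitted in this sketch.
 capture = cv2.VideoCapture(0)                          # open the default webcam
 subtitle = ""                                          # characters recognised so far

 while True:
     ok, frame = capture.read()                         # grab one frame from the video feed
     if not ok:
         break
     grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)     # classifier works on greyscale frames
     resized = cv2.resize(grey, (200, 200))             # match the training resolution
     subtitle += classify_frame(resized)                # append the predicted letter or number
     cv2.putText(frame, subtitle, (10, 30),             # overlay the running subtitle
                 cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), 2)
     cv2.imshow("Sign to text", frame)
     if cv2.waitKey(1) & 0xFF == ord("q"):              # press q to stop
         break

 capture.release()
 cv2.destroyAllWindows()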

Technical specifications

Classifier part: To accurately detect what a user (a deaf, hard of hearing, or mute person) wants to convey from video footage of them using sign language, we have decided to look into building classifier models. In this scenario, a classifier model should correctly classify (parts of) the video footage as the signed language actually contained in that footage. To build this classifier, we have opted to use neural networks. A classifier requires a dataset containing labelled images for the classes that the classifier is supposed to detect. This data is divided into a training, a validation, and a test dataset, where no two sets overlap. The training set is the data on which the models of the neural network are trained, the validation set is used to compare the performance of different models, and the test set is used to measure the accuracy of the eventually generated models.

For sign language we have found two distinctly different types of signs which require different approaches to making an effective classifier: signs which consist of a single hand position and do not contain movement, such as the sign for the number 1, and signs throughout which the hands appear in different positions and thus naturally contain movement, such as the sign for butterfly. The main difference between the classifiers for these two types of signs is that all the information for the first type can be detected from a single frame, while the second type requires multiple frames to relay all the relevant information. Henceforth we will refer to these types as single-frame and multiple-frame signs.

Building a classifier for strictly single-frame signs is considerably easier, as the classifier only needs to look at a single frame. The input of the classifier consists of a single image file with a consistent resolution, so there is uniformity in the data by default and little need for pre-processing. When input into the neural network, the image file is treated by the computer as a matrix of pixels. For the single-frame alphabet signs we used images of 200x200 pixels which were converted to greyscale, meaning that the input to the neural network and its models consists of a single matrix of 200 by 200 values. Had we used coloured images, each image would instead consist of three two-dimensional matrices, one for each colour channel in RGB (red, green, blue).

To create the classifier for single-frame signs, we used a neural network with convolution layers and max pooling layers. A convolution layer convolutes a group of values in the matrix by applying a kernel to that group and returning a single value for the next layer of the network. A kernel could, for example, be 3x3 pixels and only count the left-most pixels and the centre pixel; the resulting value would then be the sum of the counted pixels. The kernel slides over all groups of the matrix and returns a new matrix of values, which forms the next layer of the network (this does not decrease the number of values in the matrix, which stays 200x200, as the groups of pixels overlap). In this way the convolutional layers are meant to extract high-level features from the images, such as where the edges or critical sections of an image are.
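As a toy illustration of this operation (not part of the actual classifier), the snippet below slides a made-up 3x3 kernel over a small matrix with zero padding, so the output keeps the same size as the input.

 import numpy as np

 image = np.arange(25, dtype=float).reshape(5, 5)   # stand-in for a 200x200 greyscale image
 kernel = np.array([[1, 0, 0],                      # example kernel: count only the left-most
                    [1, 1, 0],                      # column and the centre pixel
                    [1, 0, 0]], dtype=float)

 padded = np.pad(image, 1)                          # zero padding keeps the output the same size
 convolved = np.zeros_like(image)
 for i in range(image.shape[0]):
     for j in range(image.shape[1]):
         patch = padded[i:i + 3, j:j + 3]           # the group of pixels under the kernel
         convolved[i, j] = np.sum(patch * kernel)   # weighted sum becomes one value in the next layer
 print(convolved)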
In addition to the convolution layers, max pooling layers are used to decrease the size of the matrices. A max pooling layer runs a kernel over a matrix but with a larger stride (the distance between successive placements of the kernel), so fewer values are output because fewer groups of pixels are inspected. Max pooling simply returns the largest value within a kernel, which summarizes the information in a group of pixels in a single pixel and smoothens out the layers in the neural network. After the convolution and max pooling layers have been applied to the original matrix representing the input image, a flattening layer is applied to turn the matrix into a single vector of values. On this single vector, which now represents all the relevant information from the original image, we apply fully connected layers (also known as dense layers), which predict the correct label for the input image. The fully connected layers apply weights to the values in the input vector and calculate the predicted probabilities for each class in our classifier. In the first few fully connected layers, which reduce the size of the vector, the ReLU (Rectified Linear Unit) activation function is used. In the last fully connected layer, which outputs the results for all the classes, the softmax activation function is used to normalize the output to values between 0 and 1, denoting the probability that the input image belongs to each class.
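A minimal Keras sketch of such a single-frame classifier is shown below. The exact layer sizes and the number of classes are assumptions made for illustration; in practice the model is fitted on the training set while the validation set is used to compare candidate models.

 from tensorflow.keras import layers, models

 NUM_CLASSES = 26                                     # assumption: one class per alphabet sign

 model = models.Sequential([
     layers.Input(shape=(200, 200, 1)),               # 200x200 greyscale image, single channel
     layers.Conv2D(32, (3, 3), activation="relu"),    # convolution layer: extract low-level features
     layers.MaxPooling2D((2, 2)),                     # max pooling: shrink the feature maps
     layers.Conv2D(64, (3, 3), activation="relu"),    # convolution layer: higher-level features
     layers.MaxPooling2D((2, 2)),
     layers.Flatten(),                                # flattening layer: feature maps to one vector
     layers.Dense(128, activation="relu"),            # fully connected layer with ReLU
     layers.Dense(NUM_CLASSES, activation="softmax")  # softmax output: one probability per sign class
 ])

 model.compile(optimizer="adam",
               loss="categorical_crossentropy",
               metrics=["accuracy"])
 # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)   # hypothetical data arrays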


Realization

Testing

Design evaluation

Week 1

Week 1 mostly consisted of putting together a group and deciding upon a topic. We settled on the topic of emotion recognition for children with ASD. Research has been done on this topic and references to similar projects have been gathered. The focus for next week is to explore what is possible to achieve within this topic.

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 8.5 meeting with group - deciding subject (1h 30m), gathering and reading sources (2h 30m), summarizing and further reading of sources (4h 30m)
Sterre van der Horst 1227255 10 preparing group meeting (30m), meeting with group - deciding subject (1h 30m), finding relevant sources (2h 30m), summarizing and reading sources (4h 30m), finding more sources (1h)
Pieter Michels 1307789 8.5 meeting with group - deciding subject (1h 30m), gathering and reading sources (2h), summarizing and further reading of sources (5h)
Pim Rietjes 1321617 8.5 meeting with group - deciding subject (1h 30m), gathering and reading sources (3h), summarizing and further reading of sources (4h)
Ruben Wolters 1342355 description

Week 2

In week 2, after discussing the possible deliverables, we came to the conclusion that it is difficult to find a dataset we could use. Creating a dataset ourselves is nearly impossible due to the small target group and the current corona measures. For these reasons we abandoned the subject and discussed a new topic. The selected topic is to develop software which can convert sign language into text using video as input.

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 7 meeting with supervisor (1h), ideation for new topics (2h), meeting deciding on new subject (1h), reading about new subject (3h)
Sterre van der Horst 1227255 13.75 preparing meeting with supervisor (45m), meeting with supervisor (1h), finding new sources about ASD (3h), analyzing new sources (2h), meeting deciding new subject (1h), finding new sources about new subject (3h), summarizing new sources (3h)
Pieter Michels 1307789 10 meeting with supervisor (1h), reading on old subject (2h), looking for databases on old subject (2h), meeting deciding on new subject (1h), reading about new subject (4h)
Pim Rietjes 1321617 11.5 meeting with supervisor (1h), reading on old subject (3h), looking for databases on old subject (3h), meeting deciding on new subject (1h), looking at databases for new subject (3.5h)
Ruben Wolters 1342355 description

Week 3

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 6 meeting with supervisor (1h), introducing wiki structure (1h), reading and summarizing sources (4h)
Sterre van der Horst 1227255 10.5 preparing meeting with supervisor (30m), meeting with supervisor (1h), finding and reading new sources (5h), writing introduction and problem statement (2h), rewriting problem statement/introduction and adding to wiki (2h)
Pieter Michels 1307789 11 meeting with supervisor (1h), reading and summarizing sources (4h), setting up coding environment (3h), Getting familiar with Tensorflow (3h)
Pim Rietjes 1321617 10 meeting with supervisor (1h), looking into example classifiers (2h), downloading and exploring datasets (3h), setting up coding environment (3h), getting familiar with Tensorflow (1h)
Ruben Wolters 1342355 meeting with supervisor (1h)

Week 4

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 5 research sign language (1h), group meeting (30m), research into user (3h), user text on wiki (30m)
Sterre van der Horst 1227255 6.5 research sign language (2h), writing section about what is sign language (2h), group meeting (30m), adding references to previously written problem statement and what is sign language (30m), first draft questionnaire (1.5h)
Pieter Michels 1307789 7.5 meeting with supervisor (1h), adding timetables to wiki (30m), investigate into tensorflow and Keras (2h 30m), group meeting (30m), Working on classifier (3h)
Pim Rietjes 1321617 12.5 meeting with supervisor (1h), looking into example classifiers (2h), downloading and exploring datasets (1h), preprocessing data (4h), group meeting (30m), Working on classifier (4h)
Ruben Wolters 1342355 description

Week 5

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 6.5 Video for different display possibilities (5h), Flowchart global design specs (1h 30m)
Sterre van der Horst 1227255 description
Pieter Michels 1307789 10 meeting with supervisor (1h), worked on the preprocessor for the classifier (5h), making sure preprocessor works (1h), working on state of the art section - which then got deleted for some unknown reason so I have to do it again :))))) (3h)
Pim Rietjes 1321617 13.5 meeting with supervisor (1h), writing text on the classifier (2.5h), calculating optical flow (moving signs) (4h), working on preprocessor (1h), working on neural network (static signs) (2h), working on neural network (moving signs) (3h)
Ruben Wolters 1342355 description

Week 6

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 Meeting with supervisor (1h)
Sterre van der Horst 1227255 description
Pieter Michels 1307789 description
Pim Rietjes 1321617 description
Ruben Wolters 1342355 description

Week 7

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 description
Sterre van der Horst 1227255 description
Pieter Michels 1307789 description
Pim Rietjes 1321617 description
Ruben Wolters 1342355 description

Week 8

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 description
Sterre van der Horst 1227255 description
Pieter Michels 1307789 description
Pim Rietjes 1321617 description
Ruben Wolters 1342355 description

References

  1. Deafness and Hearing Loss. (2021). World Health Organization (WHO).
  2. Peruma, A., & El-Glaly, Y. N. (2017). CollabAll: Inclusive discussion support system for deaf and hearing students. ASSETS 2017 - Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, 315–316.
  3. Microsoft Teams reaches 115 million DAU—plus, a new daily collaboration minutes metric for Microsoft 365. (2021). Microsoft 365 Blog.
  4. European Commission. (2020). Telework in the EU before and after the COVID-19: where we were, where we head to. Science for Policy Briefs.
  5. Richardson, J. T. E., Long, G. L., & Foster, S. B. (2004). Academic engagement in students with a hearing loss in distance education. Journal of Deaf Studies and Deaf Education, 9(1), 68–85.
  6. Glasser, A., Kushalnagar, K., & Kushalnagar, R. (2019). Deaf, Hard of Hearing, and Hearing perspectives on using Automatic Speech Recognition in Conversation. ArXiv, 427–432.
  7. Scarlett, W. G. (2015). American Sign Language. The SAGE Encyclopedia of Classroom Management.
  8. Perlmutter, D. M. (2013). What is Sign Language? Linguistic Society of America.
  9. Tatman, R. (2015). The Cross-linguistic Distribution of Sign Language Parameters. Proceedings of the Annual Meeting of the Berkeley Linguistics Society, 41.