PRE2020 3 Group6: Difference between revisions

From Control Systems Technology Group
Jump to navigation Jump to search
Line 145: Line 145:
To create the classifier for single-frame signs, we used a neural network with convolution layers and max pooling layers. The convolution layers in the neural network convolute a group of values in the matrix by applying a kernel on the group of values and returning a single value for the next layer of the neural network. The kernel can be for example 3X3 pixels and give only count the left most pixels and the centre pixel, the resulting value of this kernel would give the summed up values of the counted pixels. The kernel goes over all groups of the matrix and returns a new matrix with new values, the new layer of the neural network (this does not decrease the amount of values in the matrix, it would still be 200X200 as the values in the groups are not exclusive). In this way the convolutional layers are meant to extract high-level features from the images, such as where the edges of images or critical sections of images are. In addition to this max pooling layers are used to decrease the size of the matrices. A max pooling layer runs a kernel over a matrix but with a larger stride (the distance between the placement of the kernel) so less values are outputted as less groups of pixels are inspected. Max pooling simply returns the largest value in a kernel, which is meant to summarize the information in a group of pixels in a single pixel to smoothen out the layers in the neural network.
To create the classifier for single-frame signs, we used a neural network with convolution layers and max pooling layers. The convolution layers in the neural network convolute a group of values in the matrix by applying a kernel on the group of values and returning a single value for the next layer of the neural network. The kernel can be for example 3X3 pixels and give only count the left most pixels and the centre pixel, the resulting value of this kernel would give the summed up values of the counted pixels. The kernel goes over all groups of the matrix and returns a new matrix with new values, the new layer of the neural network (this does not decrease the amount of values in the matrix, it would still be 200X200 as the values in the groups are not exclusive). In this way the convolutional layers are meant to extract high-level features from the images, such as where the edges of images or critical sections of images are. In addition to this max pooling layers are used to decrease the size of the matrices. A max pooling layer runs a kernel over a matrix but with a larger stride (the distance between the placement of the kernel) so less values are outputted as less groups of pixels are inspected. Max pooling simply returns the largest value in a kernel, which is meant to summarize the information in a group of pixels in a single pixel to smoothen out the layers in the neural network.


[[File:Kernel3x3.gif|800px|Image: 800 pixels|thumb|center|Kernel 3x3|]]{{cn}}
[[File:Kernel3x3.gif|800px|Image: 800 pixels|center|thumb|Kernel 3x3|]]  


After the convolution layers and max pooling layers have been applied to the original matrix representing the inputted image, a flattening layer is applied to turn the matrix into a single vector of values. On this single vector which now represents all the relevant information from the original image, we apply fully connected layers (also known as dense layers) which predicts the correct label for the inputted image. The fully connected layers apply weight to the values in the inputted vector and calculate the predicted probabilities for each class within our classifier. In the first few fully connected layers the ReLU (Rectified Linear Units) activation function is used to reduce the size of the vector. In the last fully connected layer which outputs the results for all the classes the softmax activation function is used to normalize the vector to output vectors between 0 and 1, denoting the probabilities for the inputted image to be each class.
After the convolution layers and max pooling layers have been applied to the original matrix representing the inputted image, a flattening layer is applied to turn the matrix into a single vector of values. On this single vector which now represents all the relevant information from the original image, we apply fully connected layers (also known as dense layers) which predicts the correct label for the inputted image. The fully connected layers apply weight to the values in the inputted vector and calculate the predicted probabilities for each class within our classifier. In the first few fully connected layers the ReLU (Rectified Linear Units) activation function is used to reduce the size of the vector. In the last fully connected layer which outputs the results for all the classes the softmax activation function is used to normalize the vector to output vectors between 0 and 1, denoting the probabilities for the inputted image to be each class.

Revision as of 18:23, 6 April 2021

Sign to text software

Group Members

Name Student ID Department Email address
Ruben Wolters 1342355 Computer Science r.wolters@student.tue.nl
Pim Rietjens 1321617 Computer Science p.g.e.rietjens@student.tue.nl
Pieter Michels 1307789 Computer Science p.michels@student.tue.nl
Sterre van der Horst 1227255 Psychology and Technology s.a.m.v.d.horst1@student.tue.nl
Sven Bierenbroodspot 1334859 Automotive Technology s.a.k.bierenbroodspot@student.tue.nl

Problem Statement and Objective

At the moment 466 million people suffer from hearing loss, it has been predicted that this number will increase to 900 million by 2050. Hearing loss has, among other things, a social and emotional impact on one's life. The inability to communicate easily with others can cause an array of negative emotions such as loneliness, feelings of isolation, and sometimes also frustration [1]. Although there are many different types of speech recognition technologies for live subtitling that can help people that are deaf or hard of hearing, hereafter referred to as DHH, these feelings can still be exacerbated during online meetings. DHH individuals must concentrate on the person talking, the interpretation, and of any potential interruptions that can occur [2]. Furthermore, to be able to take part in the discussion, they must be able to spontaneously react in conversation. However, not everyone understands sign language, which makes communicating even more difficult. Nowadays, especially due to the COVID-19 pandemic, it is becoming more normal to work from home and therefore the number of online meetings is increasing quickly [3]. This leads us to our objective: to develop software that translates Sign Language to text to help DHH individuals communicate in an online environment. This system will be a tool that DHH individuals can use to communicate during online meetings. The number of people that have to work or be educated from home has rapidly increased due to the COVID-19 pandemic [4]. This means that the number of DHH individuals that have to work in online environments also increases. Previous studies have shown that DHH individuals obtain a lower score on an Academic Engagement Form for communication compared to students with no disability [5]. This finding can be explained by the fact that DHH people are usually unable to understand speech without aid. This aid can be a hearing aid, technology that converts speech to text, or even an interpreter, however the latter is expensive and not available for most DHH individuals. To talk to or react to other people, DHH individuals can use pen and paper, or in an online environment by typing. However, this is a lot slower than speech or sign language which makes it almost impossible for DHH individual to keep up with the impromptu nature of discussions or online meetings [6]. Therefore, by creating software that can convert sign language to text, or even to speech, DHH individuals will be able to actively participate in meetings. To do this, it is important to understand what sign language is. The following section of this wiki page, will explain the different elements of sign language and what it is.

Sign Language: what is it?

Sign language is a natural language that is predominantly used by people who are deaf or hard of hearing, but also by hearing people as well. Of all the children who are born deaf, 9 out of 10 are born to hearing parents. This means that the parents often have to learn sign language alongside the child [7].

Sign language is comparable to spoken language in the sense that it differs per country. American Sign Language (ASL) and British Sign Language (BSL) were developed separately and are therefore incomparable, meaning that people that use ASL will not necessarily be able to understand BSL [7].

Sign language does not express single words, it expresses meanings. For example, the word right has two definitions. It could mean correct, or opposite of left. In spoken English, right is used for both meanings. In sign language, there are different signs for the different definitions of the word right. A single sign can also mean a whole entire sentence. By varying the hand orientation and direction, the meaning of the sign, and therefore the sentence, changes [8].

Having said that, all sign languages rely on certain parameters, or a selection of these parameters, to indicate meaning. These parameters are [9]:

  • Handshape: the general shape one's hands and fingers make;
  • Location: where the sign is located in space; body and face are used as reference points to indicate location;
  • Movement: how the hands move;
  • Number of hands: this naturally refers to how many hands are used for the sign, and it also refers to the ‘relationship of the hands to each other’ ;
  • Palm orientation: this is how the forearm and wrist rotate when signing;
  • Non-manuals: this refers to the face and body. Facial expressions can be used for different meanings or lexical distinctions. They can also be used to indicate mood, topics, and aspects.

According to the study by Tatman, the first three parameters are universal in all sign languages. However, using facial expressions for lexical distinctions is something that is not used in most languages. The use of parameters also depends on the cultural and cognitive context and feasibility of those parameters [9].

USE

In this section, we will highlight how each of the aspects of USE relates to our system.

User

The target user group of our software is people who are not able to express themselves with speech. There are different causes for people not being able to express themselves. People who are deaf or hard of hearing often cannot speak. There are also people who cannot speak due to physical disabilities. When a person cannot express themselves through speech, they are usually able to express themselves through sign language. Because of this non-verbal manner of expression, the target group is not able to function like a person who is able to speak in a meeting. This holds for both online and offline meetings. A direct solution for this is to hire an interpreter to translate sign language into speech, working as a proxy for the DHH person. An interpreter costs $18-$50 per hour, however, this can increase up to $125 an hour [10]. The average hourly wage in the USA is around $35 [11], meaning that DHH individuals that need to pay for interpretation out of pocket may not be able to do so. Furthermore, individuals that need this extra help, may be less attractive financially to employers because of the extra costs. The proposed software would eliminate the need for interpreters, and therefore all the aforementioned problems linked to this. Online meetings would become more inclusive for individuals that use sign-language as the software would allow those individuals to participate actively.


Besides this, people who work with, or are related to, DHH individuals are also members of our target group. Communication between DHH and non-DHH people can be difficult for both parties. Facilitating this communication either requires both persons involved to speak some form of sign language or both using text. Some DHH individuals can also Neither of these options seem realistic in a professional meeting environment. Using only text is extremely slow and inefficient, but is accessible to everyone. On the other hand, sign language is much more responsive, but not everyone has access to it. While our software does not directly affect this side of the target group, it does allow for more inclusivity of DHH people in the professional world, particularly in the current state of working life due to the COVID crisis.

Finally, our product will enable the user to express themselves using not only text but also their facial expressions. This will help the user to function in meetings more easily. In the past 10 years, the number of people working from home, teleworking, has increased slowly, depending on the sector and occupation. However, since the start of the COVID-19 pandemic, it has been estimated that around 40% of working individuals in the EU switched to working from home full-time [12]. It has also been suggested that this shift in work culture is something that could stay, even after the COVID-19 pandemic. Therefore, by allowing individuals that rely on sign-language to communicate naturally, teleworking will become as easy for them as anyone else. Which could also eventually lead to more possibilities in the job market as well.

Society

Since the beginning of the Covid-19 pandemic a lot of people started working from home. A study shows that 60% of all UK citizens are working from home at this moment [13]. But this will not be only temporary, the same research shows that 26% of those people plan to continue working from home permanentely of occasionally after the lockdown. Working from home opposes challenges for certain groups of people, for example, the DHH community. The society will be less inclusive when this community is not able to function properly. Today inclusivity is more important than ever before and the society is working on improving it. The government, employers, and educational institutions are affected in the sense that they can live up to any inclusivity policies they have. As mentioned before, communicating online can be difficult for people that use sign language. The inability of those individuals to respond (quickly) decreases inclusivity. This software would increase inclusivity again, which is important to stakeholders such as the government, employers, and educational institutions.

Working from home has the benefit that there is no commute to spend time and money on. These advantages will also benefit the environment due to the reduced travelling polution. These advantages should be accesible for every individual when it fits their job. The proposed technology will give them the possibility to take advantage of these benefits. Also due to lack of communication skills of the DHH community some coworkers might not be preferring working together with a DHH coworker. The technology could help combat the preconcieved notions of these people

Enterprise

As mentioned previously, an important aspect of the software is the integration into a professional work environment. Since most work-related meetings happen online nowadays, it is only natural that through the use of technology we provide as much opportunity for people who are usually unable to be involved in such a line of work to now do so. An especially common platform for online meetings is Microsoft Teams, which is used in over 500,000 organizations to facilitate meetings[14]. Within this platform, there is the possibility of integrated apps. The vision we have for our software is to then create such an integration with Microsoft Teams to allow DHH to more easily take part in discussions.

During an online meeting, it is important to speak clearly but due to the conversational nature it is faster than simple clear speech. Clear speech has a speaking rate of 100 words per minute (wpm), while conversational speech is 200 wpm. One could therefore assume that for online meetings, a comfortable rate would be around 150 wpm [15]. The average typing rate is around 50 words per minute for an untrained typist [16]. By having to solely rely on typing during online meetings, the communication rate decreases by 67% which is very inefficient. Financially this would mean that an hour of (sign-to-)speech communication would take 1.67 hours if the employee had to type, which is, based on average wage, $58 instead of $35. This is an extreme waste of money for employers, meaning that they would profit greatly from this software as well.

Design concept

The user will participate in an online meeting using a camera with a sufficient quality. The video feed of the user will be extracted by the sign to text software. It will be able to detect when it should be extracting frames through the detection of a hand in the screen or through manual activation. The extracted frames will be used to compute the output using a neural network. The neural network will be trained using a dataset of sign language. It will be trained to detect single letters and numbers. The program will take the frames that are input and match it with one of the letters or numbers.


The output of the program is the sequence of letters and numbers the user has signed. The output will be displayed like subtitles would be. The delay of these subtitles will have to be as low as possible since this will increase the effect of facial expressions in their sentences.

As mentioned before, the software will be integrated in teams. Users will be able to select whether they want speech, video and/or text using a pop-up menu. By navigating to the advanced settings menu (by clicking on the elipses in the bottom right corner of the main pop-up menu), they will be able to adjust speech rate of the text-to-speech software, the font, font colour and the size of the displayed text.

0LAUK0 Teams Integration.jpg

Unfortunately, it appears to be impossible to capture the video feed that is sent to Microsoft Teams meetings.


Luckily, a solution was found for this problem. Using the openCV library for python a program can be written to display footage captured by a camera or webcam to a window. Furthermore, before displaying a frame of a video it is possible to also add some text to this frame. Extending this from one frame to multiple frames comes down to the trivial task of keeping track of a counter.


Sadly, getting the frames on this window to the teams meeting requires the use of a third-party program. This program needs the functionality of virtual cameras. One program that comes to mind is the widely used OBS or Open Broadcaster Software. This software allows users to capture the window with the video and text and broadcast it to the virtual camera. Teams can then use this virtual camera as its source, instead of the standard webcam.

State of the art

Research into sign language, luckily, has already been done many times before this project. The challenges that previous investigations have run into are well documented and defined, allowing us to work around them and being able to identify them early in the project. Furthermore, due to the large amount of research already done into sign language recognition it is easier for us to identify what methods to pick for this project.


In this state-of-the-art section the most up to date findings will be shortly discussed. First, the challenges that this project can face are considered. Second, several different approaches that have been taken are reviewed. Finally, out of all this information the method that has been chosen for this project will be explained.

Challenges

The first and most obvious challenge originates from sign language itself. Signs are often not identifiable based on one frame or image. Almost all signs require some sort of motion which is simply not possible to capture in an image. Therefore, whatever classifier is decided on for this project, it is most likely going to have to be able to identify signs from video, which is significantly harder than recognizing signs from just one image. Classifying signs from video, although much more difficult than classifying from images, is not impossible. An example of a successful sign language recognizer is given by Pu, Junfu, Zhou, Wengang and Li, Houqiang in their 2019 paper “Iterative Alignment Network for Continuous Sign Language Recognition” [17]. In this paper a method is proposed to map some T number of frames to an amount L of words or signs. Sadly, this method combines several highly advanced neural network techniques, which are for sure out of the scope of this course. Furthermore, this method is designed specifically for use with 3D videos, which is something users of this product might not have access to.


Furthermore, signs can be executed at different speeds. Obviously, there will be marginal differences between every other signer, but a significant speed difference in the execution of a sign can also carry some meaning. For instance, by doing the sign for running much faster than normal the signer can convey that he had to run very fast to get somewhere.


One more issue that sign language recognizers can run into is the use of depth in signs. A good example of a sign where this can become problematic is the sign for ‘gray’ in American Sign Language (ASL), which is done as shown in this video: https://www.signingsavvy.com/search/gray. Notice how the signer must wave his or her hands through each other for a few seconds. Using a 2D camera this depth is impossible to capture. Evidently, requiring users of this product to have a 3D camera or a setup with multiple cameras is not a feasible option.


Moreover, signs can look very much like each other. Again, the sign for ‘gray’ comes to mind. The signs for both ‘gray’ and ‘whatever’ are very similar (see this link for how ‘whatever’ can be signed: https://www.signingsavvy.com/search/whatever). Both signs have the signer moving their hands back and forth with very little difference in hand gesture. A way around this problem can be found in the facial expression of the signer. As shown by U. von Agris, M. Knorr and K. Kraiss the facial expressions that are made during signing are key to recognizing the correct sign [18]. This is also something that can be seen if attention is paid to the face of the signers in the videos for the ‘gray’ and ‘whatever’ signs. During execution of the sign, it is almost like the signer is saying the word out loud. The only difference of course being that no sound is made. However, as shown in the paper, this is not the only facial feature that can help with classification. As shown in the paper, the facial expression of the signer can change the meaning of a sign. Therefore, to make even better sign language recognition software, the facial expressions should be considered as well.


The fact that signs can look very much alike is not really an issue that is specific to this project. In general, for all neural networks it will be very hard to classify two images that are very similar. Even when adding another dimension this problem is not easy to solve for hand gestures, as shown by L. K. Phadtare, R. S. Kushalnagar and N. D. Cahill [19]. In this paper an algorithm is proposed to detect hand gestures using the Kinect, a camera that has depth perception. The algorithm is shown to be incredibly accurate in distinguishing hand gestures that are very different from one another. However, the authors also show that the smaller the difference between gestures, the less accurate the classification will be.


However, the fact that the face can impact the meaning of signs so much means that the classifier should look at more than just the hands, adding another layer of difficulty to this already hard problem. Luckily, this is an issue that has already been looked at before by K. Imagawa, Shan Lu and S. Igi [20]. In this paper methods to distinguish specifically the face and hands from a 2D image are discussed. The idea of the proposed solution is to use a color mapping based on the skin color of the signer to identify hand and face ‘blobs’ in the image. Although the findings in this paper are very interesting, the implementation difficulty and time must be considered.


Another thing that might be problematic specifically for this project is the fact that the background used in the images or videos that need to be classified can be very cluttered. This can lead to the classifier not being able to distinguish between the hands of the signer and the background, leading to wrong classifications. Although in meetings it is usually customary to have a clear background, this is not something that can be expected of potential users of this product.

Approaches

To make sign language recognizers several approaches have been taken throughout the years. Some of these approaches have been highlighted by Cheok, M.J., Omar, Z. and Jaward, M.H. in their article reviewing existing hand gesture and sign recognition techniques [21]. In this article the several approaches have been split into two major classes: vision and sensor based.

Vision based

Vision based sign language recognition, as the name implies, uses videos or images acquired using one or more cameras and possibly extended with extra more sophisticated techniques.


The easiest approach – to set up at least - in the vision-based category is the one using only one camera. This would result in some recognizer classifying signs based on 2D images or videos. However, this one camera could be extended to use a combination of several cameras in order to give the data representation some sense of depth.


Both approaches can be expanded on using markers to more easily identify and even track hands. Examples of this are mentioned in the article by Cheok, M.J., Omar, Z. and Jaward, M.H. as well: signers can be asked to wear two differently colored gloves or little wristbands. These colored gloves for instance were used in the paper mentioned earlier by K. Imagawa, Shan Lu and S. Igi [22].

Sensor based

Sensor based sign language recognition although very promising and interesting is also much harder to make and to set up. Sensor based approaches require the measuring of something using one or more sensors or instruments. These sensors could be used to track for instance the position of the hands during signing, resulting in a more abstract ‘image’ of where the hands are during a sign. However, the sensors could also be used to measure things like velocity and acceleration, which as discussed earlier, is probably very helpful since most signs can not be made by holding your hands in one position. One step further could mean biological signals, like the number of electrical pulses through the muscles required to make some sign.

Our solution

The solution that was used in this project mostly came down to what can be expected from users of the product. The goal of this project is a sign language to text interpreter so deaf people can participate in online meetings more easily. Since it just can not be expected of users to have expansive 3D cameras or to get all kinds of equipment before the classifier works it seemed kind of obvious to go for the easiest to implement solution: the product will be convolutional neural network that can classify 2D images to the correct signs. Do note that this being the ‘easiest’ solution does not imply that it will work very well, in fact, due to the lack of information caused by only having access to a 2D image or video feed the product might actually end up being very bad at recognizing sign language.

Technical specifications

To accurately detect what the user wants to convey from video footage of them using sign language we have decided to look into making classifier models. In this scenario a classifier model should correctly classify certain frames of the video footage to contain the signed language actually within that video footage.

To make this classifier we have opted to use neural networks. To build a classifier you will need a dataset containing labelled images for the classes which the classifier is supposed to detect. This data is then divided in train, test and validation datasets where no 2 sets overlap. The train data set is the data on which the models of the neural network is trained, the validation set is used to compare performances between models and the test dataset is used to test the accuracy of the eventual generated models of the neural network.

For sign language we have found two distinctly different types of sign language which require different approaches in how to make an effective classifier for them. Signs which consist of a single hand position and do not contain movement, such as the sign for the number 1, and signs throughout which the hands appear in different positions and thus naturally contain movement, such as the sign for butterfly. The main difference for the classifiers for these 2 types of signs is that all the information for the first type of signs can be detected from a single frame, while for the second type of signs you will require multiple frames to relay all the relevant information regarding the sign. As such henceforth we will refer to these types of signs as single-frame and multiple-frame signs.

Building a classifier for strictly single-frame signs is (considerably) easier as the classifier only needs to look at a single frame. The input of the classifier will consist of a single image file with consistent resolutions. As such there is by default uniformity within the data and there is little need for pre-processing of the data. When inputted into the neural network, the image file will be turned by the computer as a matrix of pixels. For the images which we used for the single-frame signs for classifying alphabet signs, we used images of 200X200 pixels which were scaled grey. Scaling the images grey means the input for the neural network and its models will consist of a single matrix of 200 by 200 values. If we were to have used coloured images, then an image would consist of 3 layers of 2 dimensional matrices, one for each colour in RGB (Red, Green, Blue).

To create the classifier for single-frame signs, we used a neural network with convolution layers and max pooling layers. The convolution layers in the neural network convolute a group of values in the matrix by applying a kernel on the group of values and returning a single value for the next layer of the neural network. The kernel can be for example 3X3 pixels and give only count the left most pixels and the centre pixel, the resulting value of this kernel would give the summed up values of the counted pixels. The kernel goes over all groups of the matrix and returns a new matrix with new values, the new layer of the neural network (this does not decrease the amount of values in the matrix, it would still be 200X200 as the values in the groups are not exclusive). In this way the convolutional layers are meant to extract high-level features from the images, such as where the edges of images or critical sections of images are. In addition to this max pooling layers are used to decrease the size of the matrices. A max pooling layer runs a kernel over a matrix but with a larger stride (the distance between the placement of the kernel) so less values are outputted as less groups of pixels are inspected. Max pooling simply returns the largest value in a kernel, which is meant to summarize the information in a group of pixels in a single pixel to smoothen out the layers in the neural network.

Kernel3x3.gif

After the convolution layers and max pooling layers have been applied to the original matrix representing the inputted image, a flattening layer is applied to turn the matrix into a single vector of values. On this single vector which now represents all the relevant information from the original image, we apply fully connected layers (also known as dense layers) which predicts the correct label for the inputted image. The fully connected layers apply weight to the values in the inputted vector and calculate the predicted probabilities for each class within our classifier. In the first few fully connected layers the ReLU (Rectified Linear Units) activation function is used to reduce the size of the vector. In the last fully connected layer which outputs the results for all the classes the softmax activation function is used to normalize the vector to output vectors between 0 and 1, denoting the probabilities for the inputted image to be each class.

Conventional CNN structure
[citation needed]

Realization

Data

For the models which we have made we have made use of several databases. For the single frame classifiers we have made use of ASL (American sign language) alphabet datasets [23] [24]. Both of the datasets which we used for this contained around 3000 images per class (27-29 classes, alphabet + space/delete/empty).

For the multi-frame database we made use of the Isolated Gesture Recognition (ICPR '16) dataset [25]. This dataset comes from Chalearn LAP (Looking at People) organisation, and contains video-files of people performing a sign. The dataset contains 249 classes/sign gestures, which are each performed by 21 different individuals. In total the dataset contains 47933 video files.

Classifiers

Single-frame Iteration 1

The first iteration of a classifier which we made was trained on a data set containing only single framed signs. This was easier to start out with as single frame signs require far less pre-processing and is far easier to incorporate into a video stream from a webcam. For the first iteration we started with the numbers 1 to 10 as these signs are relatively easy to distinguish from each other and only require you to single out the handshape of a single hand. For this first iteration we made the used data ourselves and did not yet use a (large) tested dataset. To make this data we used a clear background and calculated the average pixel-values of the captured image of the background, to then when a hand moves into the picture we simply have to look at which saw a significant change in value to find the contours of the hand. This allowed us to capture frames containing the hand contours easily, which contains the most relevant data in the picture. We used a webcam to record our hands in the positions of the 10 numbers for a few seconds while moving our hands slightly to create some variance, till we had captured about 700 frames (a few seconds of footage). We then trained a CNN on this data to create a classifier model. Having made a classifier model, we started looking into connecting this to a video stream from our webcam to test the effectivity of our model when applied in real life. Since our model takes in only black and white pictures containing the contours of the hand, this also means that the frames captured from the video stream of a webcam first need to be turned into such frames. As such in the demo which we made for this classifier, you first need to calculate the background of the environment of the user. As such it is also crucial that this background (and the lighting within it) is constant, as you do not want to recalculate the background of the image before every time you sign. (input something showing the demo + some explanation). The downside of this implementation is that you require the user to have a meticulously specific setup for their computer and work environment as the background cannot be to cluttered and obstructive, nor can people walk or move in the background without confusing the model. Furthermore it is important that the lighting in the image is from a good angle as to not cast shadows on the background, which are taken in as significant changes in the model. Therefore we found that a classifier constructed in this way is to restrictive to yield models useful for a wider array of users.

Single-frame Iteration 2

Therefore in the second classifier which we constructed we did not pre-process our data by calculating the background of images and calculating the threshold values to create an image solely containing the hand contours. For our second classifier model we opted to just use frames containing hands in single framed signs (this time we looked at signs for the Roman alphabetical letters) and used larger (pretested) databases. The databases which we initially used can be found on (insert Kaggle website) and contains about 3000 image files per class. We trained these once again on a CNN. On the first models which we made for this data we did not find satisfying results, as the model did not classify to a high accuracy when tested on new data. We assume this was a result of overfitting, as the image files on which the model was trained were all of the same persons hands and with the same background. Therefore we searched for a more diversified dataset to run the neural network on. After having found a more varied dataset (in this case a dataset with hands from multiple different people with multiple different backgrounds), we ran the CNN on this data to form new models. Sadly though the new models which were created did not yield satisfying accuracy rates, and still misclassified test images most of the time. One of the factors which may have prevented the CNN from creating models with a high accuracy is that not all the images were of the same resolution and thus had to be reshaped to acquire uniformity. This means some pictures were inputted flattened or widened, and may have prevented the CNN from finding useful patterns to indicate which class a picture belonged to.


Multi-frame Iteration 1

While we were working on our second single-frame classifier we also started to look at creating a classifier for multiple frame signs. For the multiple-frame classifier we have used the Isolated Gesture Recognition (ICPR '16) dataset [25]. This dataset contains a large degree of variance, as the recorded signs have been performed by many different people. This helps when making a classifier as it reduces the chance of producing an overfitted classifier model. To work with this dataset we have looked among others at this source (https://github.com/FrederikSchorr/sign-language), and have pre-processed the data by turning the video files which are in the dataset in frames and optical flow frames. The video files were all turned into 40 frame selections, to adhere to the uniformity which is required for a neural network input (since every input should contain an equal amount of values). Calculating the optical flow is done as research (input some sources) has indicated that this largely improves upon the accuracy of the produced models. This does however give the downside of a larger computational cost when effectively using this model as computing optical flow is rather expensive computationally. Since the input of the neural network will consist of a set of optical flow images, the input of the produced classifier will also require optical flow images. This means that when the model is used live, that the optical flow must also be computed live. This means there will be a relative delay in relation to the computable power of the device used by the user. We have tested the model on a desktop with a dedicated graphics card (Nvidia GeForce GTX 1070 Ti), with these specifications computing the optical flow takes at most 2-3 seconds. Thus the use of a classifier which intakes optical flow images can limit the accessibility of the model for users, especially those with devices with limited computational power. As such the model’s speed and thus usefulness may vary.

To create this model we made use of a (partly) pre-trained Inflated 3d Inception architecture neural network (https://arxiv.org/abs/1705.07750) which is also trained on optical flow data. A keras implementation of this model was used (https://github.com/dlpbc/keras-kinetics-i3d).

We first made a multiple-frame classifier for all the classes (249 signs) in the chalearn dataset, though this proved to be computationally very expensive as running a single epoch could take upwards of 2 hours. The model which we produced with it was only run on 3 epochs and merely reached an accuracy of around 56%. The sources which we had looked at also indicated that making a classifier for all the classes of this dataset is complex, as there is such a large number of classes to test. Since the dataset also contains signs in multiple different languages, this may also add to the difficulty of making an accurate classifier for the dataset as a whole.

As such we afterwards ran the neural network with a subset of the (Chalearn) dataset. We used in total 31 classes, which consisted of 2 coherent sets of signs within the dataset. These 2 sets of signs contain signs on SWAT hand signals and Canadian aviation ground circulation. On this smaller set we ran 20 epochs, as it is was not as computationally expensive. This yielded a model with around 80% accuracy.

Model Usage

After the classifying models were made, we made some demo program to connect them in python to a webcam video stream so the models can be used to classify live footage. In the pictures (insert picture reference) you can see the demo program capture footage of the sign Freins (a Canadian aviation related sign) and return an output to the user. In the first video you can see the default state of the program, nothing is being computed and at the top of the screen there is an indication that the user should press the spacebar to start. The default state is also indicated by black text and bars as visual cue. In picture 2 you can see that after the spacebar has been pressed that the program will enter a waiting stage which lasts 3 seconds before the sign recording starts, which is also indicated by orange colouring of the text and bars. In picture 3 you can see that the user being recorded while the user is performing a sign, this recording always lasts 5 seconds. The recording state is also indicated by the red text and bars. After the recording is finished the program enters a translating state, where the video footage is extrapolated to 40 frames and the optical flow is computed as can be seen to the right of the camera view. Afterwards (still in the translating state) the optical flow frames are ran through the model to get a prediction of the sign. This state is once again indicated in orange. After the program is done with translating the sign, it will return to the default state where the predicted sign and its prediction rate are shown in the bottom left of the camera view. This can be seen in picture 5.

Frein Classification 1
[citation needed]
Frein Classification 2
[citation needed]
Frein Classification 3
[citation needed]
Frein Classification 4
[citation needed]
Frein Classification 5
[citation needed]

Testing

Test plan

Scope

The first scenario to be tested will be the 'out of context' scenario. In this situation we will test the software by making sure everything is set up correctly for a person to make some signs. There will be no other speech or discussions during this scenario. This scenario will be used to validate if the software will actually function as expected. There will be as few as possible variables considered.

The second scenario is the real world testing. In this scenario the group members will start a discussion, then one of the members will express itself through sign language. This scenario is to test whether the software correctly realises when it will need to translate, but also to test if there is not too much delay to keep the conversation alive.


Out of scope

We are not going to cover the actual implementation in MS teams. We are also not going to cover the multiple-frame signs. We will also not include subject who actually suffer from not being able to talk.


Assumptions


Schedule

preparation: Subject person has to have to program installed. The camera of the subject needs to be of sufficient quality with sufficient lighting. The subject has to know the signs of the letters of its name and its age.

Test 1: Camera responsible starts a recording of the screen. The subject will also start a recording. Both recordings also need to include a clock with the accurate time. The software is prepared to translate and in the case this is necessary the subject will indicate the start of each sign to the software. The subject will sign its name and its age. The output of the program will be logged with timestamps.

Test 2: Camera responsible starts a recording of the screen. The subject will also start a recording. Both recordings also need to include a clock with the accurate time. An observer will ask the subject what its name and age is. The subject will respond in sign language. The output of the program will be logged with timestamps.


Roles and responsibilities

Subject: Camera responsible: observer(s):


Tools


Exit criteria

Design evaluation

Week 1

Week 1 mostly consisted of putting together a group and decide upon a topic. We settled on the topic of emotion recognition on children with ASD. Research has been done on this topic and references to similar projects have been gathered. The focus for next week is to explore what is possible to achieve within this topic.

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 8,5 meeting with group - deciding subject (1h 30m), gathering and reading sources (2h 30m), summarizing and further reading of sources (4h 30m)
Sterre van der Horst 1227255 10 preparing group meeting (30m), meeting with group - deciding subject (1h 30m), finding relevant sources (2h 30m), summarizing and reading sources (4h 30m), finding more sources (1h)
Pieter Michels 1307789 8,5 meeting with group - deciding subject (1h 30m), gathering and reading sources (2h), summarizing and further reading of sources (5h)
Pim Rietjes 1321617 8,5 meeting with group - deciding subject (1h 30m), gathering and reading sources (3h), summarizing and further reading of sources (4h)
Ruben Wolters 1342355 8,5 meeting with group - deciding subject (1h 30m), gathering and reading sources (3h), summarizing and further reading of sources (4h)

Week 2

In week 2 we decided after discussing the possible deliverables and came to the conclusion that it is difficult to find a dataset which we could use. The creation of a dataset is nearly impossible due to the slim target group and the current corona measures. for these reasons we abandoned the subject and discussed a new topic. The selected topic is to develop software which can convert sign language into text using video as an input.

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 7 meeting with supervisor (1h), ideation for new topics (2h), meeting deciding on new subject (1h), reading about new subject (3h)
Sterre van der Horst 1227255 13.75 preparing meeting with supervisor (45m), meeting with supervisor (1h), finding new sources about ASD (3h), analyzing new sources (2h), meeting deciding new subject (1h), finding new sources about new subject (3h), summarizing new sources (3h)
Pieter Michels 1307789 10 meeting with supervisor (1h), reading on old subject (2h), looking for databases on old subject (2h), meeting deciding on new subject (1h), reading about new subject (4h)
Pim Rietjes 1321617 11.5 meeting with supervisor (1h), reading on old subject (3h), looking for databases on old subject (3h), meeting deciding on new subject (1h), looking at databases for new subject (3.5h)
Ruben Wolters 1342355 9 meeting with supervisor (1h), reading on old subject (2h), looking for databases on old subject (2h), meeting deciding on new subject (1h), looking at databases for new subject (2h)

Week 3

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 6 meeting with supervisor (1h), introducing wiki structure (1h), reading and summarizing sources (4h)
Sterre van der Horst 1227255 10.5 preparing meeting with supervisor (30m), meeting with supervisor (1h), finding and reading new sources (5h), writing introduction and problem statement (2h), rewriting problem statement/introduction and adding to wiki (2h)
Pieter Michels 1307789 11 meeting with supervisor (1h), reading and summarizing sources (4h), setting up coding environment (3h), Getting familiar with Tensorflow (3h)
Pim Rietjes 1321617 10 meeting with supervisor (1h), looking into example classifiers (2h), downloading and exploring datasets (3h), setting up coding environment (3h), getting familiar with Tensorflow (1h)
Ruben Wolters 1342355 5 meeting with supervisor (1h), setting up coding environment (3h), getting familiar with Tensorflow (1h)

Week 4

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 5 research sign language (1h), group meeting (30m), research into user (3h), user text on wiki (30m)
Sterre van der Horst 1227255 6.5 research sign language (2h), writing section about what is sign language (2h), group meeting (30m), adding references to previously written problem statement and what is sign language (30m), first draft questionnaire (1.5h)
Pieter Michels 1307789 7,5 meeting with supervisor (1h), adding timetables to wiki (30m), investigate into tensorflow and Keras (2h 30m), group meeting (30m), Working on classifier (3h)
Pim Rietjes 1321617 12,5 meeting with supervisor (1h), looking into example classifiers (2h), downloading and exploring datasets (1h), preprocessing data (4h), group meeting (30m), Working on classifier (4h)
Ruben Wolters 1342355 4,5 meeting with supervisor (1h), group meeting (30m), working on classifier (3h)

Week 5

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 6,5 Video for different display possibilities (5h), Flowchart global design specs (1h 30m)
Sterre van der Horst 1227255 description
Pieter Michels 1307789 10h meeting with supervisor (1h), worked on the preprocessor for the classifier (5h), making sure preprocessor works (1h), working on state of the art section - which then got deleted for some unkown reason so I have to do it again :))))) (3h)
Pim Rietjes 1321617 13.5h meeting with supervisor (1h), writing text on the classifier (2.5h), calculating optical flow (moving signs) (4h), working on preprocessor (1h), working on neural network (static signs) (2h), working on neural network (moving signs) (3h)
Ruben Wolters 1342355 3 meeting with supervisor (1h), working on preprocessor (2h)

Week 6

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 8h Meeting with supervisor (1h), Design concept writing (2h), User writing (2h), Test plan writing (3h)
Sterre van der Horst 1227255 description
Pieter Michels 1307789 10h Meeting with supervisor (1h), rewriting/checking the state-of-the-art section (5h), investigating how to integrate classifier with teams (4h)
Pim Rietjes 1321617 13.5h Meeting with supervisor (1h), working on single frame classifier/demo(2h 30m), fixing bugs/errors in multiple frame CNN (5h), creating classifier model with mulitiple frame CNN (3h), working on demo multiple frame CNN (1h), writing text technical spec (1h)
Ruben Wolters 1342355 5h Meeting with supervisor (1h), working on classifier (2h 30m), proof-reading and correcting wiki sections (1h 30m)

Week 7

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 description
Sterre van der Horst 1227255 description
Pieter Michels 1307789 description
Pim Rietjes 1321617 8.5h Meeting with supervisor (1h), Meeting feedback user side (.5h), Analysing/preprocessing larger single-frame dataset (1h), Running single-frame model (2.5h), Making subset multi-frame dataset (1h), Running model multi-frame model (2h), testing multi-frame model (.5h)
Ruben Wolters 1342355 description

Week 8

Name Student ID Hours Description
Sven Bierenbroodspot 1334859 description
Sterre van der Horst 1227255 description
Pieter Michels 1307789 description
Pim Rietjes 1321617 description
Ruben Wolters 1342355 description

References

  1. [1] Deafness and Hearing Loss - World Health Organization. (2021) WHO.
  2. [2] Peruma, A., & El-Glaly, Y. N. (2017). CollabAll: Inclusive discussion support system for deaf and hearing students. ASSETS 2017 - Proceedings of the 19th International ACM SIGACCESS Conference on Computers and Accessibility, 315–316.
  3. [3] Microsoft Teams reaches 115 million DAU—plus, a new daily collaboration minutes metric for Microsoft 365 - Microsoft 365 Blog. (2021).
  4. [4] European Commission. (2020). Telework in the EU before and after the COVID-19 : where we were, where we head to. Science for Policy Briefs, 2009, 8 .
  5. [5] Richardson, J. T. E., Long, G. L., & Foster, S. B. (2004). Academic engagement in students with a hearing loss in distance education. Journal of Deaf Studies and Deaf Education, 9(1), 68–85.
  6. Glasser, A., Kushalnagar, K., & Kushalnagar, R. (2019). Deaf, Hard of Hearing, and Hearing perspectives on using Automatic Speech Recognition in Conversation. ArXiv, 427–432.
  7. 7.0 7.1 [6]Scarlett, W. G. (2015). American Sign Language. The SAGE Encyclopedia of Classroom Management.
  8. Perlmutter, D. M. (2013). What is Sign Language ? Linguistic Society of America, 6501(202).
  9. 9.0 9.1 [7] Tatman, R. (2015). The Cross-linguistic Distribution of Sign Language Parameters. Proceedings of the Annual Meeting of the Berkeley Linguistics Society, 41(January).
  10. [8] How Much Does A Sign Language Interpreter Cost | UTS. (n.d.) Retrieved March 25, 2021.
  11. [9] Table B-3. Average hourly and weekly earnings of all employees on private nonfarm payrolls by industry sector, seasonally adjusted. Retrieved March 25, 2021.
  12. [10] European Commission. (2020). Telework in the EU before and after the COVID-19 : where we were , where we head to. Science for Policy Briefs, 2009, 8.
  13. [11]
  14. [12] David Curry, BussinessofApps.com, Microsoft Teams Revenue and Usage Statistics (2021)
  15. [13] Yoo, J., Oh, H., Jeong, S., & Jin, I. K. (2019). Comparison of speech rate and long-term average speech spectrum between Korean clear speech and conversational speech.
  16. [14] Dhakal, V., Feit, A. M., Kristensson, O., & Oulasvirta, A. (2018). Observations on Typing from 136 Million Keystrokes.
  17. Pu, Junfu & Zhou, Wengang & Li, Houqiang. (2019). Iterative Alignment Network for Continuous Sign Language Recognition. 4160-4169. https://doi.org/10.1109/CVPR.2019.00429
  18. U. von Agris, M. Knorr and K. Kraiss, "The significance of facial features for automatic sign language recognition," 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, Amsterdam, Netherlands, 2008, pp. 1-6, doi: 10.1109/AFGR.2008.4813472.
  19. L. K. Phadtare, R. S. Kushalnagar and N. D. Cahill, "Detecting hand-palm orientation and hand shapes for sign language gesture recognition using 3D images," 2012 Western New York Image Processing Workshop, Rochester, NY, USA, 2012, pp. 29-32, doi: 10.1109/WNYIPW.2012.6466652.
  20. K. Imagawa, Shan Lu and S. Igi, "Color-based hands tracking system for sign language recognition," Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 1998, pp. 462-467, doi: 10.1109/AFGR.1998.670991.
  21. Cheok, M.J., Omar, Z. & Jaward, M.H. A review of hand gesture and sign language recognition techniques. Int. J. Mach. Learn. & Cyber. 10, 131–153 (2019). https://doi.org/10.1007/s13042-017-0705-5
  22. K. Imagawa, Shan Lu and S. Igi, "Color-based hands tracking system for sign language recognition," Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 1998, pp. 462-467, doi: 10.1109/AFGR.1998.670991.
  23. [15] ASL Alphabet dataset - Kaggle (Accessed March 2021).
  24. [16] Significant ASL Alphabet dataset - Kaggle (Accessed March 2021).
  25. 25.0 25.1 [17] Isolated Gesture Recognition dataset (ICPR '16) - Chalearn Looking at People(Accessed March 2021).