CS298 Proposal

Sign Language Assistant.

Charulata Lodha (charulatamahesh.lodha@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun, Shruti Kothari

Abstract:

American Sign Language (ASL) is not a signed form of English. It is a visual language whose signs convey ideas and concepts rather than individual words. As a result, there is a real communication barrier between ASL users and the English-speaking population. Deaf and mute people need accessibility technology that lets them communicate fully and have the same experience anywhere, at any time. We are developing a prototype computer vision system to help the deaf and mute communicate in a shopping center. The goal of our system is to use video feeds to recognize ASL gestures and notify shop clerks of deaf and mute patrons' intents. Our prototype will operate on videos, created in Unity, of 3D humanoid models performing ASL signs in a shop setting.

CS297 Results:

Studied the LeNet-5 architecture and implemented it to detect ASL alphabet letters using Kaggle's ASL image datasets.
Learned to build custom animations on a humanoid avatar in Unity.
Created a synthetic dataset in Unity of 50 videos, recorded from different angles by cameras placed randomly in the Unity scene using a C# script.
Used the OpenPose model to detect bone points of the humanoid avatar in the video dataset.
Studied the Skeleton-Based Action Recognition paper to gain insight into how to apply it to ASL gesture recognition.
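
The bone points detected by OpenPose can be read back frame by frame for later processing. The following is a minimal sketch, assuming OpenPose was run with JSON output enabled, so that each frame yields a `people` list whose entries hold a flat `pose_keypoints_2d` array of x, y, confidence triples. The helper name `parse_pose_keypoints` and the sample values are hypothetical and serve only as illustration.

```python
import json


def parse_pose_keypoints(frame_json, person_index=0):
    """Return a list of (x, y, confidence) tuples for one person in one frame.

    Assumes the OpenPose JSON layout:
    {"people": [{"pose_keypoints_2d": [x0, y0, c0, x1, y1, c1, ...]}]}
    Returns an empty list when the requested person is not in the frame.
    """
    frame = json.loads(frame_json)
    people = frame.get("people", [])
    if person_index >= len(people):
        return []
    flat = people[person_index].get("pose_keypoints_2d", [])
    # Group the flat array into consecutive (x, y, confidence) triples,
    # ignoring any incomplete trailing values.
    usable = len(flat) - len(flat) % 3
    return [tuple(flat[i:i + 3]) for i in range(0, usable, 3)]


# Hypothetical two-joint frame, shortened for illustration.
sample = '{"people": [{"pose_keypoints_2d": [320.5, 110.0, 0.92, 318.0, 190.25, 0.88]}]}'
keypoints = parse_pose_keypoints(sample)
```

Here `keypoints` holds one tuple per detected joint; hand and face keypoints could be read the same way from their own arrays if they were enabled when OpenPose was run.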

Why:

Between 6 and 8 million people in the United States have some form of language impairment [1]. American Sign Language (ASL) is the leading minority language in the U.S. after the "big four": Spanish, Italian, German, and French.

Proposed Schedule:

Week 1: Jan 27 - Feb 4	Kick-off meeting and review of the CS298 proposal.
Week 2, 3: Feb 4 - Feb 23	Implement Skeleton-Based Action Recognition on the bone-points dataset to recognize hand gestures.
Week 4, 5: Feb 23 - Mar 2	Build animations for other types of ASL words and generate a synthetic dataset for them.
Week 6, 7, 8: Mar 2 - Mar 23	Generate the bone-points dataset using the OpenPose model. Train the Skeleton-Based Action Recognition model to recognize these new ASL words.
Week 9, 10, 11: Mar 23 - Apr 13	Work on passing real-time video streams to the model. Improve the accuracy of the model.
Week 12: Apr 13 - Apr 20	Design the user interface.
Week 13, 14, 15: Apr 20 - May 4	Complete the write-up of the CS298 final report and prepare the project presentation for the defense.

Key Deliverables:

Design:
A computer vision model that uses current neural network frameworks to detect American Sign Language in a video feed, helping deaf and mute people communicate better in a shop setting.

Software:

AI model, based on the LeNet-5 architecture, capable of recognizing the ASL alphabet from a video feed in a shop setting.
Program to generate a synthetic MP4 video dataset of the most popular ASL words, animated in Unity in a shop setting and recorded from various angles.
Program to parse the animated videos generated with the Unity game engine, which consist of shopping center scenes where a humanoid avatar communicates via ASL. These animated MP4 videos will be passed to the OpenPose model as input.
Generation of a bone key points dataset, using the OpenPose model to jointly detect human body, hand, facial, and foot key points in the input animated MP4 video feeds.
Convolutional neural network (CNN) based framework to recognize hand gestures and predict the ASL word from the video feed. This model will be trained on labelled skeleton sequences from the bone key points video dataset.
User interface for identifying the intent of the person communicating via ASL, based on the implemented CNN framework. Recognized gestures will be matched against the backend dataset to reply in ASL to the queries of deaf and mute patrons.
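
Before a CNN can be trained on skeleton sequences, clips of different lengths usually have to be brought to a common shape. The sketch below shows one possible preprocessing step, not the project's actual pipeline: it resamples a clip to a fixed number of frames and expresses each joint relative to a root joint. The function name `to_fixed_length`, the root joint index, and the target length are all assumptions made for illustration.

```python
def to_fixed_length(frames, target_len, root_index=0):
    """Resample a clip to target_len frames and center each frame on a root joint.

    frames: list of frames, each a list of (x, y) joint coordinates.
    Source frames are picked at evenly spaced indices, so short clips repeat
    frames and long clips skip frames.
    """
    if not frames or target_len <= 0:
        return []
    n = len(frames)
    resampled = []
    for t in range(target_len):
        # Map output index t to a source index spread evenly over the clip.
        src = min(n - 1, (t * n) // target_len)
        frame = frames[src]
        rx, ry = frame[root_index]
        # Subtract the root joint so the pose no longer depends on where
        # the avatar stands in the image.
        resampled.append([(x - rx, y - ry) for x, y in frame])
    return resampled


# Hypothetical three-frame clip with two joints per frame.
clip = [
    [(10.0, 10.0), (12.0, 15.0)],
    [(11.0, 10.0), (14.0, 16.0)],
    [(12.0, 10.0), (16.0, 17.0)],
]
fixed = to_fixed_length(clip, target_len=4)
```

Centering on a root joint is one simple way to make the input less sensitive to camera placement, which matters here because the videos are recorded from varying angles and positions.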

Report:

CS 298 Report.
CS 298 Presentation.

Innovations and Challenges:

Understanding the peculiarities of American Sign Language and creating animations for them in a shop setting with complex software such as the Unity game engine is quite challenging.
The most challenging part of building this AI model is obtaining a quality dataset built with meticulous attention to hand gestures and their specific meanings. A dataset that is not expressive enough risks miscommunication.
Understanding the intricacies of OpenPose is a complicated task. Using this model to obtain the labelled skeleton sequence of a person's bone points in 3D space from video feeds demands an in-depth understanding of its architecture. Any inaccuracy in the 3D coordinates might jeopardize the predictions of the action recognition model.
Creating an economically viable and easily integrable end-to-end system for a shopping center that recognizes the intent of deaf and mute patrons, helping them shop independently and have a seamless shopping experience, is innovative.
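
One way to limit the effect of inaccurate key points on the action recognition model is to discard low-confidence detections and fill the gaps from neighbouring frames. The following is a rough sketch under stated assumptions, not part of the proposed system: the confidence threshold of 0.3 and the function name `fill_low_confidence` are hypothetical, and the input is a single joint's track of (x, y, confidence) values over time.

```python
def fill_low_confidence(track, threshold=0.3):
    """Replace low-confidence points in one joint's track by linear interpolation.

    track: list of (x, y, confidence) tuples, one per frame.
    Returns a list of (x, y) tuples. Gaps at the start or end copy the nearest
    reliable point; if no point is reliable, the raw coordinates are returned.
    """
    good = [i for i, (_, _, c) in enumerate(track) if c >= threshold]
    if not good:
        return [(x, y) for x, y, _ in track]
    filled = []
    for i, (x, y, c) in enumerate(track):
        if c >= threshold:
            filled.append((x, y))
            continue
        before = [g for g in good if g < i]
        after = [g for g in good if g > i]
        if before and after:
            # Interpolate between the closest reliable frames on each side.
            a, b = before[-1], after[0]
            w = (i - a) / (b - a)
            xa, ya, _ = track[a]
            xb, yb, _ = track[b]
            filled.append((xa + w * (xb - xa), ya + w * (yb - ya)))
        else:
            # No reliable frame on one side: copy the nearest reliable point.
            nearest = before[-1] if before else after[0]
            filled.append(track[nearest][:2])
    return filled


# Hypothetical track: the middle frame is an unreliable outlier.
track = [(0.0, 0.0, 0.9), (50.0, 50.0, 0.1), (4.0, 8.0, 0.8)]
smoothed = fill_low_confidence(track)
```

In this example the outlier in the middle frame is replaced by the midpoint of its two reliable neighbours. A suitable threshold would have to be chosen empirically on the synthetic dataset.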

References:

[1] Statistics on Voice, Speech, and Language - NIDCD

[2] Hundred basic ASL signs that are frequently used between parents and their young children

[3] Humanoid Avatars