
CS298 Proposal

Detecting and Predicting Visual Affordance of Objects in a Given Environment

Bhumika Kaur Matharu (bhumika.matharu@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun, Sunhera Paul

Abstract:

Rapid progress in artificial intelligence has enabled the development of autonomous robots. Although autonomous robots are transforming industry in many ways, they still face many challenges. One of these challenges is their inability to manipulate an unknown object in a physical environment: if a robot is trained to perform a task in a constrained environment, even a small variation in that environment requires the robot to be retrained [1]. The concept of affordance was introduced to overcome this problem. Affordance allows autonomous robots to perform human-like actions with novel objects [1]. It defines the set of possible actions a user (or robot) can perform on an object in a given environment, and it helps robots recognize objects and plan and predict actions without human supervision [2]. The focus of this research project is to develop a neural-network model that predicts the visual affordance of an object in a grocery store setting. The model is trained on a synthetic video dataset generated using the Unity game engine.

CS297 Results

  • Learned to perform classification and detection on a self-generated dataset using Keras and OpenCV.
  • Created a synthetic dataset with the Unity game engine consisting of 50 videos of seven scenes, generated programmatically from different camera angles.
  • Researched various techniques for performing affordance prediction from videos.
  • Learned to develop Long Short-Term Memory (LSTM) networks with Keras (a minimal sketch follows this list).
  • Studied and partially implemented the Demo2Vec model.
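
The following is a minimal, illustrative Keras LSTM sketch of the kind explored in CS297. The input shape, layer sizes, and number of classes are assumptions for illustration, not the exact configuration used in the deliverables.

    # Minimal Keras LSTM for clip classification (illustrative sketch).
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    TIMESTEPS = 30    # assumed: frames per clip
    FEATURES = 2048   # assumed: per-frame feature vector length
    NUM_CLASSES = 7   # assumed: one class per scene/affordance type

    model = Sequential([
        LSTM(128, input_shape=(TIMESTEPS, FEATURES)),  # encode the frame sequence
        Dense(NUM_CLASSES, activation="softmax"),      # classify the whole clip
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])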

Proposed Schedule

Week 1: Jan 27 - Feb 2: Kick-off meeting and review of the CS298 proposal
Week 2-3: Feb 3 - Feb 16
  • Continue the implementation of the Demo2Vec model [7] on the OPRA dataset and present the results achieved.
  • Fix the crooked hand animation of the humanoid (Ethan) in the synthetic dataset.
Week 4-6: Feb 17 - Mar 9
  • Generate synthetic data for other types of affordances.
  • Annotate the synthetic dataset: create one GameObject with annotations and one without, and use these two GameObjects to produce two variations of each video, one annotated and one unannotated.
Week 7-8: Mar 10 - Mar 24
  • Write code that uses ffmpeg to segment the videos into frames to be fed to the model as input (see the sketch after this schedule).
  • Apply the implemented Demo2Vec model to the generated synthetic dataset and present the results.
Week 9-11: Mar 25 - Apr 13: Improve the accuracy of the model by making the necessary adjustments
Week 12-16: Apr 14 - May 11: Complete the final project report and prepare the presentation slides
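
Below is a hedged sketch of the planned ffmpeg preprocessing: splitting a long Unity capture into fixed-length clips and sampling frames from each clip. The paths, clip length, and frame rate are assumptions for illustration.

    # Sketch: split a rendered movie into clips, then sample frames per clip.
    import subprocess
    from pathlib import Path

    def split_into_clips(movie, out_dir, clip_seconds=10):
        """Cut the movie into equal-length clips without re-encoding."""
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        subprocess.run([
            "ffmpeg", "-i", movie,
            "-c", "copy", "-map", "0",
            "-f", "segment", "-segment_time", str(clip_seconds),
            "-reset_timestamps", "1",
            f"{out_dir}/clip_%03d.mp4",
        ], check=True)

    def extract_frames(clip, out_dir, fps=5):
        """Sample frames from one clip; these become the model's input."""
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        subprocess.run([
            "ffmpeg", "-i", clip, "-vf", f"fps={fps}",
            f"{out_dir}/frame_%04d.png",
        ], check=True)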

Key Deliverables:

  • Design
    • Design a neural network architecture that detects and predicts the affordance of an object from an input image. The model learns affordances from synthetic videos.
  • Software
    • Write a program that parses the movie generated with the Unity game engine, which consists of multiple scenes, and splits it into individual videos using ffmpeg; then extract frames from the segmented videos to pass to the model as input.
    • Create two versions of the synthetic videos, one annotated and one unannotated. Collect images of objects whose affordances the model will predict.
    • Implement an encoder that combines two ConvLSTM networks with an attention model to create a demonstration embedding vector for the video. Using CNNs, develop an affordance predictor that uses the demonstration vector to predict the affordance and generate an interaction heatmap on the object (see the architecture sketch after this list).
    • Evaluate and improve the accuracy of the model on the synthetic video dataset.
  • Report
    • CS298 Report
    • CS298 Presentation
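
The following Keras sketch illustrates one plausible reading of the Demo2Vec-style architecture described above: two ConvLSTM branches encode the demonstration video, attention-weighted pooling produces the demonstration embedding, and a small CNN conditioned on that embedding predicts the affordance class and an interaction heatmap for a query image. All input names, shapes, and layer sizes here are illustrative assumptions, not the finalized design.

    # Sketch of a ConvLSTM encoder + CNN affordance predictor (assumed shapes).
    from tensorflow.keras import layers, Model

    T, H, W, C = 16, 64, 64, 3   # assumed clip length and frame size
    NUM_AFFORDANCES = 7          # assumed number of affordance classes

    video_in = layers.Input(shape=(T, H, W, C), name="demo_video")
    motion_in = layers.Input(shape=(T, H, W, C), name="demo_motion")  # e.g. frame differences
    image_in = layers.Input(shape=(H, W, C), name="query_image")

    # Two ConvLSTM branches over the demonstration video.
    appearance = layers.ConvLSTM2D(32, 3, padding="same")(video_in)   # (H, W, 32)
    motion = layers.ConvLSTM2D(32, 3, padding="same")(motion_in)      # (H, W, 32)
    fused = layers.Concatenate()([appearance, motion])                # (H, W, 64)

    # Spatial soft-attention pools the fused features into one embedding vector.
    attn = layers.Conv2D(1, 1, activation="sigmoid")(fused)
    embedding = layers.GlobalAveragePooling2D()(layers.Multiply()([fused, attn]))

    # CNN over the query image, conditioned on the demonstration embedding.
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(image_in)
    cond = layers.Reshape((1, 1, 64))(layers.Dense(64)(embedding))
    x = layers.Add()([x, layers.UpSampling2D(size=(H, W))(cond)])     # broadcast-add

    affordance = layers.Dense(NUM_AFFORDANCES, activation="softmax", name="affordance")(
        layers.GlobalAveragePooling2D()(x))
    heatmap = layers.Conv2D(1, 1, activation="sigmoid", name="heatmap")(x)

    model = Model([video_in, motion_in, image_in], [affordance, heatmap])
    model.compile(optimizer="adam",
                  loss={"affordance": "categorical_crossentropy",
                        "heatmap": "binary_crossentropy"})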

Innovations and Challenges

  • Generating the synthetic video dataset with the Unity game engine required becoming properly familiar with the platform, which is challenging and complex.
  • Learning affordances from synthetic videos is innovative: in previous research it has been done with RGB images and real-life online videos.
  • The model architecture uses multiple ConvLSTM networks, which is challenging because it took time to work out how to combine the outputs of the two networks into the demonstration vector for the video.
  • Generating the interaction heatmap with the affordance predictor is challenging, as it requires proper annotation of the dataset to produce an accurate heatmap around the interaction region.
  • The model architecture must be made as accurate on the synthetic dataset as it is on real-life videos.

References:

[1] P. Ardon, E. Pairet, K. Lohan, S. Ramamoorthy, and R. Petrick, "Affordances in Robotic Tasks -- A Survey," in IEEE Transactions on Robotics, 2019.

[2] H. Min, C. Yi, R. Luo, J. Zhu, and S. Bi, "Affordance Research in Developmental Robotics: A Survey," in IEEE Transactions on Cognitive and Developmental Systems, 2016.

[3] N. Joshi, "Are you aware of these 7 challenges in robotics?" Available online: https://www.allerin.com/blog/are-you-aware-of-these-7-challenges-in-robotics

[4] V. Kaur, V. Khullar, and N. Verma, "Review of Artificial Intelligence with retailing sector," in Journal of Computer Science Research, 2020.

[5] S. Harwood and N. Hafezieh, "Affordance - what does this mean?" in 22nd UKAIS Annual Conference, Oxford, UK, 2017.

[6] M. Hassanin, S. Khan, and M. Tahtali, "Visual Affordance and Function Understanding: A Survey," in Computer Vision and Pattern Recognition, Utah, USA, 2018.

[7] K. Fang, T. Wu, D. Yang, S. Savarese, and J. Lim, "Demo2Vec: Reasoning Object Affordances From Online Videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), USA, 2018, pp. 2139-2147.

[8] T. Nagarajan, C. Feichtenhofer, and K. Grauman, "Grounded Human-Object Interaction Hotspots from Video," in CoRR, 2019.