CS298 Proposal
Adding a source code searching capability to Yioop!
Snigdha Parvatneni (snigdha.parvatneni@gmail.com)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Sami Khuri and Dr. Teng Moh
Abstract:
The objective of this project is to incorporate two algorithms for source code search into Yioop, a PHP-based search engine. Source code search lets users paste a code snippet into a search bar and search large collections of source code repositories directly. The project will implement and test two methods. The first builds inverted indexes of source code using logarithmic char-gramming; the second builds inverted indexes of source code using suffix trees. The two techniques will then be compared experimentally to determine which performs better.
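To make the first method concrete, the following is a minimal sketch of an inverted index built over character n-grams. It is an illustration only: the proposal does not define "logarithmic char-gramming", so the sketch substitutes plain fixed-length character n-grams, and the function names, the default n = 3, and the two-file corpus are all assumptions rather than part of Yioop.

```python
from collections import defaultdict


def char_grams(text, n):
    """Return the set of all length-n character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n=3):
    """Map each character n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_grams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, docs, snippet, n=3):
    """Return sorted ids of documents that contain snippet as a substring."""
    grams = char_grams(snippet, n)
    if not grams:
        # Snippet shorter than n: fall back to checking every document.
        candidates = set(docs)
    else:
        # A document containing the snippet must contain every one of its n-grams.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all n-grams is necessary but not sufficient, so verify each candidate.
    return sorted(d for d in candidates if snippet in docs[d])


# Hypothetical two-file corpus, for illustration only.
docs = {
    "Hello.java": 'System.out.println("hello");',
    "hello.py": 'print("hello")',
}
index = build_index(docs)
```

Calling search(index, docs, "println") looks up the posting list of each trigram in the snippet, intersects them, and then confirms the match against the stored text. The final verification step is what keeps the result exact even though the n-gram filter alone can produce false positives.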
CS297 Results:
- Wrote a Naive Bayes classifier, in Java and in Python, that recognizes whether a given code snippet is Java or Python
- Extended Yioop to process Java and Python source code
- Wrote a PHP program to reproduce the functionality of Git clone using cURL requests
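The classifier in the first result can be pictured with a small sketch. The code below is a generic multinomial Naive Bayes with add-one smoothing over code tokens, not the program written for CS297; the tokenizer, the class name NaiveBayes, and the four training snippets are assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split code into identifier/keyword tokens and single punctuation marks.

    Numeric literals are ignored by this simple pattern.
    """
    return re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]", code)


class NaiveBayes:
    def __init__(self):
        self.token_counts = {}       # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training snippets
        self.vocab = set()           # every token seen during training

    def train(self, code, label):
        tokens = tokenize(code)
        self.token_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, code):
        tokens = tokenize(code)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.token_counts.items():
            # Log prior plus the sum of log likelihoods with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set; a real classifier needs far more data.
clf = NaiveBayes()
clf.train('public static void main(String[] args) { System.out.println("hi"); }', "java")
clf.train("int x = 0; for (int i = 0; i < 10; i++) { x += i; }", "java")
clf.train('def main():\n    print("hi")', "python")
clf.train("for i in range(10):\n    x += i", "python")
```

With this toy data, tokens such as ";" and "System" pull a snippet toward Java, while "def" and ":" pull it toward Python. Working in log space avoids numeric underflow when a snippet contains many tokens.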
Proposed Schedule:
Week 1 | Aug 27 - Sept 2 | Discuss in depth various aspects of the project with advisor.
Week 2, 3 | Sept 3 - Sept 16 | Deliverable 1: Integrate feature to reproduce the effect of Git clone in Yioop.
Week 4, 5, 6 | Sept 17 - Oct 7 | Deliverable 2: Build inverted index in Yioop for source code search using logarithmic char-gramming.
Week 7, 8, 9 | Oct 8 - Oct 28 | Deliverable 3: Build inverted index in Yioop for source code search using suffix trees.
Week 10, 11, 12 | Oct 29 - Nov 18 | Deliverable 4: Conduct experiments to compare the results obtained by the techniques used in Deliverable 2 and Deliverable 3.
Week 13 | Nov 19 - Nov 25 | Deliverable 5: First Draft of CS298 Report.
Week 14 | Nov 26 - Dec 2 | Deliverable 5: Revised Draft of CS298 Report.
Week 15 | Dec 3 - Dec 9 | Deliverable 5: Final Draft of CS298 Report. Submitted to committee members.
Week 16 | Dec 10 - Dec 17 | Defense
Key Deliverables:
- Software
  - New functionality that reproduces Git clone to fetch Java and Python source code files from web servers that allow crawling.
  - Feature to build an inverted index based on logarithmic char-gramming for searching source code.
  - Feature to build an inverted index based on suffix trees for searching source code.
  - New source code search feature.
- Report
  - CS298 Report
  - Project code and test result documentation
Innovations and Challenges:
- Building inverted indexes using the logarithmic char-gramming and suffix tree methods is the main challenge at hand.
- Reproducing Git clone functionality using cURL requests is an accomplishment.
- Indexing a huge collection of source code files efficiently is a challenge.
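The suffix-based indexing named in the first challenge can be illustrated with a short sketch. Note that it uses a suffix array rather than a suffix tree: the array answers the same substring queries and is much shorter to write, but it is a stand-in and not the data structure this proposal plans to build. The function names and the sample text are assumptions, and the naive construction shown here would be too slow for the large collections mentioned in the last challenge.

```python
def build_suffix_array(text):
    """Return start positions of all suffixes of text in lexicographic order."""
    # Naive construction that sorts full suffixes; fine for a sketch only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + m]

    # Binary search for the first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Binary search for the first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [first, lo) starts with pattern.
    return sorted(suffix_array[first:lo])


# Hypothetical one-line source file, for illustration only.
text = "x = foo(a); y = foo(b);"
suffix_array = build_suffix_array(text)
```

Because all suffixes that begin with the query sit next to each other in sorted order, two binary searches are enough to locate every occurrence. A production index would need a linear or near-linear construction algorithm and a mapping from text positions back to files.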
References:
[1] Patil, M., Thankachan, S. V., Shah, R., Hon, W., Vitter, J. S., Chandrasekaran, S. (2011). Inverted Indexes for Phrases and Strings. In ACM SIGIR, pp. 555-564.
[2] Büttcher, S., Clarke, C., Cormack, G. V. (2010). Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press.
[3] Ramisch, C. (2008). N-gram models for language detection. Retrieved August 26, 2013, from http://www.inf.ufrgs.br/~ceramisch/download_files/courses/Master_FRANCE/ENSIMAG_2008_2/Ingenierie_des_Langues_et_de_la_Parole/Rapport.pdf