CS298 Proposal

Adding a source code searching capability to Yioop!

Snigdha Parvatneni (snigdha.parvatneni@gmail.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Sami Khuri and Dr. Teng Moh

Abstract:

The objective of this project is to incorporate two source code search algorithms into Yioop, a PHP-based search engine. Source code search lets users paste a code snippet into a search bar and search large collections of source code repositories directly. The project will implement and test two methods. The first builds inverted indexes of source code using logarithmic char-gramming; the second builds inverted indexes of source code using suffix trees. The two techniques will then be compared to determine which is better suited to source code search.
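To make the first method concrete, here is a minimal Python sketch of an inverted index over character grams. It is an illustration only, not the Yioop implementation. The proposal does not define "logarithmic" char-gramming, so the sketch assumes one possible reading: gram lengths grow as powers of two (1, 2, 4, 8). The names `char_grams`, `build_index`, `search`, and the `max_len` parameter are hypothetical.

```python
from collections import defaultdict


def char_grams(text, max_len=8):
    """Yield every character gram of length 1, 2, 4, ... up to max_len.

    ASSUMPTION: power-of-two lengths are one reading of "logarithmic";
    the proposal itself does not define the term.
    """
    length = 1
    while length <= max_len:
        for offset in range(len(text) - length + 1):
            yield text[offset:offset + length]
        length *= 2


def build_index(docs, max_len=8):
    """Map each gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_grams(text, max_len):
            index[gram].add(doc_id)
    return index


def search(index, docs, query, max_len=8):
    """Return sorted ids of documents that contain query as a substring.

    Candidates come from intersecting the posting sets of the query's
    grams; a final substring check removes false positives.
    """
    if not query:
        return []
    candidates = None
    for gram in char_grams(query, max_len):
        postings = index.get(gram, set())
        candidates = postings if candidates is None else candidates & postings
        if not candidates:
            return []
    return sorted(d for d in candidates if query in docs[d])
```

Given a dictionary of file names to file contents, `build_index` is called once and `search` is then called per query. The final substring check is needed because sharing all grams with the query does not guarantee that they occur contiguously in a document.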

CS297 Results:

  • Wrote a Naive Bayes classifier, in Java and in Python, that recognizes whether a given code snippet is written in Java or in Python.
  • Extended Yioop to process Java and Python source code.
  • Wrote a PHP program that reproduces the functionality of Git clone using cURL requests.
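The classifier in the first result can be sketched as follows. This is a minimal multinomial Naive Bayes with add-one smoothing, written under stated assumptions rather than taken from the CS297 code: the tokenizer, the class name `NaiveBayes`, and its methods are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Split code into identifiers/keywords and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|[^\w\s]", code)


class NaiveBayes:
    def __init__(self):
        self.token_counts = {}        # label -> Counter of tokens seen
        self.doc_counts = Counter()   # label -> number of training snippets
        self.vocab = set()            # all tokens seen during training

    def train(self, label, code):
        tokens = tokenize(code)
        self.token_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, code):
        """Return the label with the highest log posterior score."""
        tokens = tokenize(code)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.token_counts.items():
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

After a few calls to `train("java", ...)` and `train("python", ...)`, `classify` returns whichever label scores higher. Working in log space avoids numeric underflow on long snippets.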

Proposed Schedule:

Week 1 (Aug 27 - Sept 2): Discuss various aspects of the project in depth with the advisor.
Weeks 2, 3 (Sept 3 - Sept 16): Deliverable 1: Integrate a feature that reproduces the effect of Git clone in Yioop.
Weeks 4, 5, 6 (Sept 17 - Oct 7): Deliverable 2: Build an inverted index in Yioop for source code search using logarithmic char-gramming.
Weeks 7, 8, 9 (Oct 8 - Oct 28): Deliverable 3: Build an inverted index in Yioop for source code search using suffix trees.
Weeks 10, 11, 12 (Oct 29 - Nov 18): Deliverable 4: Conduct experiments to compare the results obtained with the techniques from Deliverable 2 and Deliverable 3.
Week 13 (Nov 19 - Nov 25): Deliverable 5: First draft of the CS298 Report.
Week 14 (Nov 26 - Dec 2): Deliverable 5: Revised draft of the CS298 Report.
Week 15 (Dec 3 - Dec 9): Deliverable 5: Final draft of the CS298 Report, submitted to committee members.
Week 16 (Dec 10 - Dec 17): Defense.

Key Deliverables:

  • Software
    • New functionality to perform a Git clone of Java and Python source code files from web servers that allow crawling.
    • Feature to build an inverted index based on logarithmic char-gramming for searching source code.
    • Feature to build an inverted index based on suffix trees for searching source code.
    • New source code search feature.
  • Report
    • CS298 Report
    • Project code and test result documentation
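For the suffix-based deliverable, the following Python sketch shows the general idea of answering substring queries from sorted suffixes. Note the swap: it uses a suffix array, a simpler relative of the suffix tree named in the proposal, and it is built naively by sorting, which is only practical for small inputs. The function names `build_suffix_array` and `search` are hypothetical and are not part of Yioop.

```python
def build_suffix_array(docs):
    """Return (doc_id, offset) pairs sorted by the suffix they point to.

    Naive construction: sorting compares whole suffixes, so this is
    quadratic or worse in the text size and meant only as an illustration.
    """
    entries = [
        (doc_id, offset)
        for doc_id, text in docs.items()
        for offset in range(len(text))
    ]
    entries.sort(key=lambda e: docs[e[0]][e[1]:])
    return entries


def search(entries, docs, query):
    """Return sorted ids of documents that contain query as a substring."""
    if not query:
        return []
    n = len(query)
    # Binary search for the first suffix whose n-character prefix is
    # not smaller than the query.
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        doc_id, offset = entries[mid]
        if docs[doc_id][offset:offset + n] < query:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes that start with the query are contiguous from here.
    hits = set()
    while lo < len(entries):
        doc_id, offset = entries[lo]
        if docs[doc_id][offset:offset + n] != query:
            break
        hits.add(doc_id)
        lo += 1
    return sorted(hits)
```

Every occurrence of a query is a prefix of some suffix, so one binary search followed by a short scan finds all matching documents without a separate verification pass.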

Innovations and Challenges:

  • Building inverted indexes using the logarithmic char-gramming and suffix tree methods is the main challenge.
  • Reproducing Git clone functionality using cURL requests is an accomplishment.
  • Indexing a huge collection of source code files efficiently is a challenge.
