CS299 Proposal

Experiments with and Implementation of a Context Sensitive Text Summarizer

Charles Bocage (charles.bocage@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun and Dr. Thomas Austin.

Abstract:

This project focuses on text summarization: conveying the key ideas of a text passage in as few words as possible. Dr. Pollett's search engine, Yioop, uses a centroid-based summarizer (CBS) to summarize the documents it crawls. A CBS relies on a centroid (a set of words that are statistically important to the document) to capture the document's main idea. It then computes term frequencies and cosine similarity to build the summary. In CS297 I cloned the Seek Quarry Git repository, performed a simple crawl, and tested its summarizers using ROUGE. I also implemented a Dutch stemmer that is used in the Yioop search engine. Furthermore, I found a paper on a graph-based summarizer, coded it, tested it using ROUGE, and added it to the Yioop search engine. Lastly, I implemented Dr. Pollett's summarizer algorithm and added it to the Yioop search engine.
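The centroid idea described above can be illustrated with a minimal Python sketch. This is not Yioop's implementation (Yioop is written in PHP). The function names, the naive sentence splitter, and the choice of raw term frequency as the centroid weight are all assumptions made for illustration. A fuller version would likely also remove stop words and apply stemming.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_sentences(text):
    """Naive splitter on ., ! and ? followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid_summary(text, centroid_size=5, max_sentences=2):
    """Pick the sentences most similar to the document centroid."""
    sentences = split_sentences(text)
    term_freq = Counter(tokenize(text))
    # Centroid: the most frequent terms, weighted by their frequency.
    centroid = dict(term_freq.most_common(centroid_size))
    scored = [
        (cosine(Counter(tokenize(s)), centroid), i)
        for i, s in enumerate(sentences)
    ]
    # Highest score first; ties go to the earlier sentence.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    # Emit the chosen sentences in their original order.
    chosen = sorted(i for _, i in best)
    return " ".join(sentences[i] for i in chosen)
```

Calling `centroid_summary(document_text, centroid_size=10, max_sentences=3)` would return up to three sentences, in document order, that score highest against a ten-term centroid.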

In this project I will go deeper into text summarization. First, I will compare a few more summarization methods, at least one of which differs from any examined so far. The candidates include the graph-based summarizer (already implemented), the centroid summarizer with weights, and the well-known summarizer that Nick D'Aloisio wrote for the Summly app, which he eventually sold to Yahoo for $30 million. In addition, I will run more extensive experiments using an existing collection of documents and summaries, such as the Document Understanding Conference (DUC) 2002 data set. A larger data set will allow us to determine more reliably how well the summarizers we test perform.
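To make the evaluation idea concrete, here is a simplified Python sketch of an n-gram recall score in the spirit of ROUGE-N, averaged over a set of document and reference-summary pairs. It is not the official ROUGE toolkit, which supports further options such as stemming and stop-word removal. The names `rouge_n_recall`, `average_recall`, `summarize`, and `pairs` are hypothetical and used only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped overlap: a reference n-gram is matched at most as many
    # times as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def average_recall(summarize, pairs, n=1):
    """Mean recall of a summarizer over (document, reference) pairs."""
    scores = [rouge_n_recall(summarize(doc), ref, n) for doc, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Here `summarize` stands for any function that maps a document string to a summary string, so the same loop could score each summarizer under test against the same data set.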

CS297 Results

  • Cloned the Seek Quarry Git repository, performed a simple crawl, and tested its summarizers using ROUGE
  • Implemented a Dutch stemmer that is used in the Yioop search engine
  • Found a paper on a graph-based summarizer, coded it, tested it using ROUGE, and added it to the Yioop search engine
  • Implemented Dr. Pollett's summarizer algorithm and added it to the Yioop search engine

Proposed Schedule

Week 1 (Aug. 20 - Aug. 26): I will get the proposal ready.
Week 2 (Aug. 27 - Sep. 2): Deliverable #1: I will find a large document set and write the code to automate testing all of the summarizers against it.
Week 3 (Sep. 3 - Sep. 9): Continue work from week 2.
Week 4 (Sep. 10 - Sep. 16): Continue work from week 3.
Week 5 (Sep. 17 - Sep. 23): Deliverable #2: Get Dr. Pollett's algorithm to produce better ROUGE results.
Week 6 (Sep. 24 - Sep. 30): Continue work from week 5.
Week 7 (Oct. 1 - Oct. 7): Continue work from week 6.
Week 8 (Oct. 8 - Oct. 14): Continue work from week 7.
Week 9 (Oct. 15 - Oct. 21): Deliverable #3: Implement Nick D'Aloisio's summarizer and integrate it into Yioop.
Week 10 (Oct. 22 - Oct. 28): Continue work from week 9.
Week 11 (Oct. 29 - Nov. 4): Continue work from week 10.
Week 12 (Nov. 5 - Nov. 11): Continue work from week 11.
Week 13 (Nov. 12 - Nov. 18): Deliverable #4: At this point I will be ready to write the CS299 report. I will gather all of the work completed over the semester and create the report.
Week 14 (Nov. 19 - Nov. 25): Continue work from week 13.
Week 15 (Nov. 26 - Dec. 2): Continue work from week 14.
Week 16 (Dec. 3 - Dec. 9): Defend my project before my committee.

Key Deliverables:

  • Software
    • Dr. Pollett's algorithm completely implemented
    • Implementation of Nick D'Aloisio's summarizer integrated into Yioop
  • Report
    • CS299 Report
    • Documented results of all ROUGE experiments
    • All code written during the semester and its documentation

Innovations and Challenges

  • Getting Dr. Pollett's algorithm to produce better ROUGE results
  • Implementing Nick D'Aloisio's summarizer and integrating it into Yioop
  • Finding a large document set and writing the code to automate testing all of the summarizers against it
