CS299 Proposal

Experiments with and Implementation of a Context Sensitive Text Summarizer

Charles Bocage (charles.bocage@sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Robert Chun and Dr. Thomas Austin.

Abstract:

This project focuses on text summarization: the ability to capture the key ideas of a text passage in as few words as possible. Dr. Pollett has a search engine, Yioop, that uses a centroid-based summarizer (CBS) to summarize the documents it crawls. A CBS hinges on a centroid (a set of words that are statistically important to the document) to capture the document's main idea. It then computes term frequencies and cosine similarity to build the summary. In CS297 I cloned the Seek Quarry Git repository, performed a simple crawl, and tested its summarizers using ROUGE. I also implemented a Dutch summarizer that is used in the Yioop search engine. Furthermore, I found a paper on a graph-based summarizer, coded it, tested it using ROUGE, and added it to the Yioop search engine. Lastly, I implemented Dr. Pollett's summarizer algorithm and added it to the Yioop search engine.
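To make the centroid idea concrete, here is a minimal Python sketch. It is not Yioop's implementation: it assumes raw term frequency in place of a tf-idf weighting, a naive regular-expression sentence splitter, and no stop-word removal. The names centroid_summary, centroid_size and num_sentences are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption made for this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid_summary(text, centroid_size=5, num_sentences=1):
    """Pick the sentences most similar to the document centroid."""
    sentences = split_sentences(text)
    # Centroid: the most frequent terms of the whole document
    # (assumption: raw counts stand in for a tf-idf weighting).
    doc_tf = Counter(tokenize(text))
    centroid = Counter(dict(doc_tf.most_common(centroid_size)))
    # Score every sentence by cosine similarity to the centroid.
    scored = [(cosine(Counter(tokenize(s)), centroid), i)
              for i, s in enumerate(sentences)]
    # Keep the best-scoring sentences, then restore document order.
    top = sorted(scored, key=lambda p: (-p[0], p[1]))[:num_sentences]
    return [sentences[i] for _, i in sorted(top, key=lambda p: p[1])]
```

With the text "Cats chase mice. Cats sleep all day. The weather is nice." and centroid_size=1, the centroid is the single word "cats", and the shortest sentence containing it scores highest because cosine similarity normalizes by sentence length.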

In this project I will go deeper into text summarization. First, I will compare a few more summarization methods, at least one of which differs from any examined so far: the graph-based summarizer (which I have already implemented), the centroid summarizer with weights, and the well-known summarizer that Nick D'Aloisio wrote for the Summly app, which he eventually sold to Yahoo for $30 million. In addition, I will run more extensive experiments, for example using an existing collection of documents and reference summaries such as the Document Understanding Conference (DUC) 2002 data set. A larger data set will let us determine more reliably how well the summarizers we test perform.
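Since the experiments rely on ROUGE, a small Python sketch of the ROUGE-N recall computation may help show what is being measured: the fraction of the reference summary's n-grams that also appear in the candidate summary, with counts clipped. This is only an illustration under simplifying assumptions (whitespace tokenization, one reference summary, no stemming or stop-word handling); the actual ROUGE toolkit offers more options. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the reference n-gram count."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it
    # occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", five of the six reference unigrams are matched, and three of the five reference bigrams are matched.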

CS297 Results

Cloned the Seek Quarry Git repository, performed a simple crawl, and tested its summarizers using ROUGE
Implemented a Dutch summarizer that is used in the Yioop search engine
Found a paper on a graph-based summarizer, coded it, tested it using ROUGE and added it to the Yioop search engine
Implemented Dr. Pollett's summarizer algorithm and added it to the Yioop search engine

Proposed Schedule

Week 1: Aug. 20 - Aug. 26	I will get the proposal ready.
Week 2: Aug. 27 - Sep. 2	Deliverable #1: Find a large document set and write the code to automate testing all of the summarizers against it.
Week 3: Sep. 3 - Sep. 9	Continue work from week 2.
Week 4: Sep. 10 - Sep. 16	Continue work from week 3.
Week 5: Sep. 17 - Sep. 23	Deliverable #2: Get Dr. Pollett's algorithm to produce better ROUGE results.
Week 6: Sep. 24 - Sep. 30	Continue work from week 5.
Week 7: Oct. 1 - Oct. 7	Continue work from week 6.
Week 8: Oct. 8 - Oct. 14	Continue work from week 7.
Week 9: Oct. 15 - Oct. 21	Deliverable #3: Implement Nick D'Aloisio's summarizer and integrate it into Yioop.
Week 10: Oct. 22 - Oct. 28	Continue work from week 9.
Week 11: Oct. 29 - Nov. 4	Continue work from week 10.
Week 12: Nov. 5 - Nov. 11	Continue work from week 11.
Week 13: Nov. 12 - Nov. 18	Deliverable #4: Write the CS299 report. I will gather the results and material produced over the semester and assemble them into the report.
Week 14: Nov. 19 - Nov. 25	Continue work from week 13.
Week 15: Nov. 26 - Dec. 2	Continue work from week 14.
Week 16: Dec. 3 - Dec. 9	Defend my project before my committee.

Key Deliverables:

Software
- Dr. Pollett's algorithm completely implemented
- Implementation of Nick D'Aloisio's summarizer integrated into Yioop
Report
- CS299 Report
- Documented results of all ROUGE experiments
- All code written during the semester and its documentation

Innovations and Challenges

Getting Dr. Pollett's algorithm to produce better ROUGE results
Implementing Nick D'Aloisio's summarizer and integrating it into Yioop
Finding a large document set and writing the code to automate testing all of the summarizers against it
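The automated testing mentioned above could be organized roughly as follows. This Python sketch assumes the data set is available as (document, reference summary) pairs and that each summarizer is a function from a document to a summary. The two baseline summarizers and the unigram_recall scorer are hypothetical stand-ins, not the Yioop summarizers or the ROUGE toolkit.

```python
from statistics import mean


def evaluate(summarizers, dataset, score):
    """Return the mean score of each summarizer over the data set.

    summarizers: dict mapping a name to a function document -> summary
    dataset: non-empty list of (document, reference_summary) pairs
    score: function (candidate_summary, reference_summary) -> float
    """
    results = {}
    for name, summarize in summarizers.items():
        results[name] = mean(score(summarize(doc), ref)
                             for doc, ref in dataset)
    return results


# Hypothetical baselines, for illustration only.
def lead_sentence(doc):
    """Use the text up to the first period as the summary."""
    return doc.split(".")[0] + "."


def whole_document(doc):
    """Use the entire document as the summary."""
    return doc


def unigram_recall(candidate, reference):
    """Share of distinct reference words that appear in the candidate."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    return len(ref & cand) / len(ref) if ref else 0.0


if __name__ == "__main__":
    data = [("Cats chase mice. Dogs bark.", "dogs bark.")]
    table = evaluate({"lead": lead_sentence, "whole": whole_document},
                     data, unigram_recall)
    for name, value in table.items():
        print(name, value)
```

Passing the scoring function as a parameter keeps the harness independent of any one metric, so the same loop could report several ROUGE variants for every summarizer.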

References:

[LinOch2004] Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. Chin-Yew Lin and Franz Josef Och. Association for Computational Linguistics. 2004.

[Lin2004] Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough? Chin-Yew Lin. NTCIR. 2004.

[Lovins1968] Development of a Stemming Algorithm. Julie Beth Lovins. Mechanical Translation and Computational Linguistics. 1968.

[Porter1980] An algorithm for suffix stripping. M.F. Porter. Program: Electronic Library and Information Systems. 1980.

[Willett2006] The Porter stemming algorithm: then and now. Peter Willett. Program: Electronic Library and Information Systems. 2006.

[Porter2001] Snowball: A language for stemming algorithms. M.F. Porter. Snowball. 2001.

[Samei2014] Multi-Document Summarization Using Graph-Based Iterative Ranking Algorithms and Information Theoretical Distortion Measures. Borhan Samei, Marzieh Eshtiagh, Fazel Keshtkar and Sattar Hashemi. The Florida AI Research Society. 2014.

[Agrawal2014] A Graph Based Ranking Strategy for Automated Text Summarization. Nitin Agrawal, Shikhar Sharma, Prashant Sinha and Shobha Bagai. DU Journal of Undergraduate Research and Innovation. 2014.

[Langville2006] Google's PageRank and Beyond: The Science of Search Engine Rankings. Amy Langville and Carl Meyer. Princeton University Press. 2006.

[Cutler1997] Using the Structure of HTML Documents to Improve Retrieval. Michal Cutler, Yungming Shih and Weiyi Meng. Proceedings of the USENIX Symposium on Internet Technologies and Systems. 1997.

[Bassil2012] Semantic-Sensitive Web Information Retrieval Model for HTML Documents. Youssef Bassil and Paul Semaan. European Journal of Scientific Research. 2012.

[Bonnington2011] Teen's iOS App Uses Complex Algorithms to Summarize the Web. Christina Bonnington. Wired.com. 2011.