CS297 Proposal
Experiments with and Implementation of a Context Sensitive Text Summarizer
Charles Bocage (email)
Advisor: Dr. Chris Pollett
Description: This project focuses on text summarization. Text summarization is the ability to obtain the key ideas from a text passage using as few words as possible. Dr. Pollett has a search engine, Yioop, that uses a centroid-based summarizer (CBS) to summarize its crawled documents. A CBS relies on a centroid (a set of words that are statistically important to the document) to capture the main idea of the document. It then computes term frequencies and cosine similarity to build the summary. This project will attempt to improve the existing CBS by weighting sentences based on their location in the content. For example, a sentence within an H1 tag will carry a greater weight than a sentence within an H2 tag. The results from the current CBS and the improved version will be compared using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) software package, which is currently regarded as the gold standard for calculating summarization metrics.
Schedule:
Deliverables: The full project will be done when CS298 is completed. The following will be done by the end of CS297:
1. Clone the Seek Quarry Git repository, perform a simple crawl, and test its summarizers.
2. Implement a Dutch summarizer that can be used in the Yioop search engine.
3. Find a paper on the summarization topic that contains an algorithm and code it.
4. Implement Dr. Pollett's summarizer algorithm.
5. Complete the CS 297 report.
References:
[Shen2004] Web-page classification through summarization. Dou Shen, Zheng Chen, Qiang Yang, Hua-Jun Zeng, Benyu Zhang, Yuchang Lu, Wei-Ying Ma. ACM New York, NY, USA. 2004.
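
The weighting idea in the Description can be illustrated with a minimal Python sketch. It is not Yioop's implementation: the function names, the stop-word list, the numeric values in `TAG_WEIGHTS`, and the use of the most frequent terms as the centroid are all assumptions made for illustration. The proposal only states that a sentence in an H1 tag should outweigh one in an H2 tag.

```python
import math
import re
from collections import Counter

# Hypothetical weights: the proposal only says H1 should outweigh H2.
TAG_WEIGHTS = {"h1": 2.0, "h2": 1.5, "p": 1.0}

# Tiny illustrative stop-word list (an assumption, not Yioop's list).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "for", "on"}


def tokenize(text):
    """Lowercase the text and return its non-stop-word terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def summarize(tagged_sentences, centroid_size=5, summary_size=2):
    """Pick the highest-scoring sentences from (tag, sentence) pairs."""
    # Term frequencies over the whole document.
    doc_tf = Counter()
    for _, sentence in tagged_sentences:
        doc_tf.update(tokenize(sentence))

    # Centroid: here simply the most frequent terms, as a stand-in for
    # "statistically important" words.
    centroid = Counter(dict(doc_tf.most_common(centroid_size)))

    # Score = similarity to the centroid, scaled by the tag weight.
    scored = []
    for idx, (tag, sentence) in enumerate(tagged_sentences):
        sim = cosine(Counter(tokenize(sentence)), centroid)
        scored.append((sim * TAG_WEIGHTS.get(tag, 1.0), idx, sentence))

    # Keep the best sentences, then restore document order.
    top = sorted(scored, key=lambda x: (-x[0], x[1]))[:summary_size]
    return [s for _, _, s in sorted(top, key=lambda x: x[1])]


doc = [
    ("p", "centroid summarizer basics"),
    ("h1", "centroid summarizer overview"),
    ("p", "unrelated words here"),
]
print(summarize(doc, centroid_size=2, summary_size=1))
```

In this sketch the two "centroid summarizer" sentences are equally similar to the centroid, so the tag weight alone decides which one is selected. A real comparison against the unweighted CBS would score both outputs with ROUGE, as the Description proposes.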