Advisor: Dr. Chris Pollett


This project focuses on text summarization: extracting the key ideas from a passage of text in as few words as possible. Dr. Pollett's search engine, Yioop, uses a centroid-based summarizer (CBS) to summarize the documents it crawls. A CBS relies on a centroid, a set of words that are statistically important to the document, to capture the document's main idea. It then computes term frequencies and cosine similarity to build the summary. This project will attempt to improve the existing CBS by weighting sentences according to their location in the content. For example, a sentence inside an H1 tag will carry more weight than a sentence inside an H2 tag. The results from the current CBS and the improved version will be compared using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) software package, which is currently the gold standard for calculating summarization metrics.
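
As a rough illustration of the idea described above, the following Python sketch scores each sentence by its cosine similarity to a centroid and then scales that score by a weight tied to the enclosing HTML tag. It is a minimal sketch under several assumptions and is not Yioop's actual implementation: the centroid is simply the most frequent terms (raw term counts, no stop-word removal or inverse document frequency), and the names `TAG_WEIGHTS`, `tokenize`, `cosine`, and `summarize`, as well as the weight values, are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical weights: the proposal only states that H1 should outweigh H2.
TAG_WEIGHTS = {"h1": 2.0, "h2": 1.5, "p": 1.0}


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(tagged_sentences, centroid_size=5, summary_size=2):
    """Return the top-scoring sentences, kept in document order.

    tagged_sentences is a list of (tag, sentence) pairs.
    """
    vectors = [Counter(tokenize(s)) for _, s in tagged_sentences]

    # Centroid (simplified): the most frequent terms in the whole document.
    totals = Counter()
    for vector in vectors:
        totals.update(vector)
    centroid = Counter(dict(totals.most_common(centroid_size)))

    # Score = similarity to the centroid, scaled by the tag's weight.
    scored = []
    for index, ((tag, _), vector) in enumerate(zip(tagged_sentences, vectors)):
        score = cosine(vector, centroid) * TAG_WEIGHTS.get(tag, 1.0)
        scored.append((score, index))

    # Pick the best sentences, then restore their original order.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:summary_size]
    return [tagged_sentences[i][1] for _, i in sorted(top, key=lambda pair: pair[1])]


doc = [("p", "yioop crawls pages"), ("h1", "yioop summarizes pages")]
print(summarize(doc, centroid_size=2, summary_size=1))
```

In this toy document both sentences are equally similar to the centroid, so the tag weight alone decides which one is selected.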


Week 1 (Jan. 22 - Jan. 28): I will get the proposal ready.
Week 2 (Jan. 29 - Feb. 4): Deliverable #1: I will clone the Seek Quarry Git repository and get it running. After that I will get a simple crawl going. Once I have crawled some content, I will test each of the summarizers on a sample of web pages of my choice.
Week 3 (Feb. 5 - Feb. 11): Continue work from week 2.
Week 4 (Feb. 12 - Feb. 18): Deliverable #2: I will create a stemmer for the Dutch language that can be used in the Yioop search engine.
Week 5 (Feb. 19 - Feb. 25): Continue work from week 4.
Week 6 (Feb. 26 - Mar. 4): Continue work from week 5.
Week 7 (Mar. 5 - Mar. 11): Deliverable #3: I will find a paper that covers an algorithm used for summarization. Once I have found the right paper, I will implement the algorithm. I will choose the best language for the job, but it will most likely be PHP.
Week 8 (Mar. 12 - Mar. 18): Continue work from week 7.
Week 9 (Mar. 19 - Mar. 25): Continue work from week 8.
Week 10 (Mar. 26 - Apr. 1): Deliverable #4: I will implement Dr. Pollett's summarizer algorithm in PHP. The goal is to evaluate how well it performs and to add it as one of the summarizer choices in the Yioop search engine.
Week 11 (Apr. 2 - Apr. 8): Continue work from week 10.
Week 12 (Apr. 9 - Apr. 15): Continue work from week 11.
Week 13 (Apr. 16 - Apr. 22): Continue work from week 12.
Week 14 (Apr. 23 - Apr. 29): Deliverable #5: At this point, I will be ready to write the CS 297 report. I will gather all of the work done over the semester and write a report of close to 10 pages.
Week 15 (Apr. 30 - May 6): Continue work from week 14.
Week 16 (May 7 - May 13): Continue work from week 15.


The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Clone the Seek Quarry Git repository, perform a simple crawl, and test its summarizers.

2. Implement a Dutch stemmer that can be used in the Yioop search engine.

3. Find a paper on the summarization topic that contains an algorithm and code it.

4. Implement Dr. Pollett's summarizer algorithm.

5. Complete the CS 297 report.
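
The proposal plans to compare summarizers with the ROUGE software package. To give a sense of what such a comparison measures, the sketch below computes ROUGE-N recall: the fraction of the reference summary's n-grams that also appear in the candidate summary. This is only an illustration and not the ROUGE package itself, which offers further options such as stemming and multiple reference summaries. The function names `ngrams` and `rouge_n_recall` are hypothetical.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Count the word n-grams in a lowercased string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that the candidate summary recovers."""
    ref = ngrams(reference, n)
    cand = ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match so a repeated n-gram is not credited more often
    # than it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat"
candidate = "the cat sat"
print(rouge_n_recall(candidate, reference, n=1))  # 3 of 6 unigrams -> 0.5
print(rouge_n_recall(candidate, reference, n=2))  # 2 of 5 bigrams -> 0.4
```

A higher recall means the candidate summary covers more of the wording in the human-written reference, which is how two summarizers could be ranked against each other on the same set of documents.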


[Shen2004] Web-page classification through summarization. Dou Shen, Zheng Chen, Qiang Yang, Hua-Jun Zeng, Benyu Zhang, Yuchang Lu, Wei-Ying Ma. ACM New York, NY, USA. 2004