## Create a New Summarizer for the Yioop Search Engine

## Aim

The goal of this deliverable is to find a paper that covers an algorithm used for summarization. Once the right paper is found, the algorithm will be implemented in PHP and integrated into the Yioop search engine.

## Overview

To reiterate from my CS297 proposal, text summarization is the ability to capture the key ideas of a text passage in as few words as possible. Here I will cover the new algorithm chosen, the work that went into implementing it and integrating it into the Yioop search engine, and its results.

## Work Performed
As stated in the aim, my first objective was to find a good summarization algorithm to implement. I felt it had to be completely different from the existing summarizers in the Yioop search engine, which are covered in Deliverable 1. I came across a worthy candidate known as Luhn's algorithm, but Dr. Pollett did not feel the same way, so I searched further. I eventually stumbled upon a graph-based algorithm that looked promising. This time my suggestion passed Dr. Pollett's evaluation, and I proceeded to dig into it.
Those four steps may sound confusing, and they are. The final step, after computing the adjacency matrix, is to extract the summary. This is done by leveraging the PageRank algorithm created by Larry Page and Sergey Brin, the founders of Google. It was originally intended for ranking web pages, but it is useful here too. The algorithm starts by computing an initial vector of equal probabilities; for example, if there are 5 sentences, the initial vector has 5 entries, each set to 1/5. The adjacency matrix is multiplied by the probability vector to generate a new probability vector, and then the squared difference between the old and new probability vectors is computed. The process is repeated until the squared difference falls below some threshold.

The paper's algorithm deviates from the original PageRank algorithm here: it does not use the squared-difference convergence test, and instead simply iterates 10 times. "As few as 10 iterations produced a good approximate ordering, competitive with the exact ordering produced by the traditional convergence measure" (Langville2006). When the 10 iterations are complete, the resulting probability vector is used to generate the summary: each sentence is added to the summary in order of highest probability to lowest until the allowed summary length has been reached.

## Results

In conclusion, the Graph-Based Summarizer (GBS) work was some of the most interesting work I have done in a while. The results I got were just below the results stated in the paper, but dramatically lower than those of the Basic Summarizer (BS) and Centroid-Based Summarizer (CBS) already in the Yioop search engine, probably because of the way we chose to weight each sentence. The highest GBS score is nowhere near the lowest score of either of the other summarizers; for example, in the ROUGE-1 test, the CBS 95%-conf int value was 0.76663, the BS value was 0.80587, and the GBS value was 0.32603. It seems that there is more work to do here. The following files were changed:

- configure.ini
- crawl_component.php
- crawl_constants.php
- fetcher.php
- graph_based_summarizer.php
- html_processor.php
- page_processor.php
- phrase_parser.php
- text_processor.php
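The power-iteration step described above can be sketched as follows. This is a Python illustration only (the actual Yioop implementation is in PHP, in graph_based_summarizer.php), and it assumes the adjacency matrix is column-normalized into a stochastic matrix so the probabilities keep summing to 1; the toy 5-sentence similarity graph is a hypothetical example.

```python
def rank_sentences(adjacency, iterations=10):
    """Return one probability per sentence via PageRank-style power iteration."""
    n = len(adjacency)
    # Column-normalize so each column sums to 1; repeated multiplication
    # then preserves a total probability of 1.
    col_sums = [sum(row[j] for row in adjacency) or 1.0 for j in range(n)]
    m = [[adjacency[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    # Initial vector: equal probability 1/n for every sentence.
    p = [1.0 / n] * n
    # Per the paper, skip the squared-difference convergence test and
    # simply iterate a fixed 10 times.
    for _ in range(iterations):
        p = [sum(m[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p

# Hypothetical 5-sentence similarity graph (symmetric 0/1 weights).
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0],
]
probs = rank_sentences(adj)
# Sentences join the summary from highest to lowest probability
# until the allowed summary length is reached.
order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
```

In this toy graph, sentence 2 is connected to every other sentence, so it receives the highest probability and would be the first sentence added to the summary.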
## References
[Samei2014] Borhan Samei, Marzieh Eshtiagh, Fazel Keshtkar, and Sattar Hashemi. Multi-Document Summarization Using Graph-Based Iterative Ranking Algorithms and Information Theoretical Distortion Measures. The Florida AI Research Society. 2014.