Improve the ROUGE Results for Dr. Pollett's Summarization Algorithm

Aim

The goal of this deliverable is to get Dr. Pollett’s algorithm to produce better results.

Overview

Before I started working with Dr. Pollett on automatic text summarization, Yioop had two summarizers: the Basic Summarizer (BS) and the Centroid Based Summarizer (CBS). The original goal was to improve the results of Dr. Pollett’s algorithm by adding weights to words, also known as terms, within certain tags inside an HTML document. It turned out that increasing the frequencies of terms within the HTML document did not improve the summary results. Dr. Pollett and I then stumbled onto another idea that proved to be a great way to produce better summary results.

Instead of weighting terms within certain tags, we decided to experiment with trimming out the unimportant content by detecting the Content Management System (CMS) that was used to create the web page. Most CMS-generated web pages follow a specific pattern when they are rendered by a browser. For example, the page content, navigation content, side-bar content, and so on are located in HTML tags that are decorated with the same names or attributes. In our case, we only want the content the web page is meant to deliver and not extraneous content such as the author’s YouTube channel information. We thought that returning only this subset of the web page’s content would improve the summaries, and it did.

Work Performed

The first step was to have Dr. Pollett explain his algorithm to me. After Dr. Pollett described his brilliant summarization algorithm, I looked at the existing CBS to see if it was implemented as he described. I found that the CBS was not implemented as Dr. Pollett envisioned. I relayed this information to Dr. Pollett and he confirmed my findings. Instead of modifying the CBS to perform the actions Dr. Pollett intended, a new summarizer was born: the Centroid Based Weighted Summarizer (CBWS). The only thing the CBWS has in common with the CBS is that both output a word cloud.

The main idea of the CBWS is to find the sentences that are closest to the average sentence. First, the term frequencies are calculated for every sentence. Next, the term frequencies of each sentence are normalized. Then the average sentence is calculated by treating the frequencies as a matrix with one row per sentence and one column per term, summing each column, and dividing each column sum by the number of rows. With the average sentence in hand, we compare every sentence to it by calculating the dot product of the sentence and the average sentence. Lastly, the dot product values are sorted with the largest at the top. At this point we can generate the summary by pulling sentences from the top of the list down until the maximum number of characters has been reached.
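The steps above can be sketched in Python as follows. This is only a minimal illustration and not the code that went into Yioop. The function names, the naive sentence splitter, the tokenizer, and the choice of length (L2) normalization are all my assumptions, since the description only says that the frequencies are normalized.

```python
import math
import re
from collections import Counter, defaultdict


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def term_frequencies(sentence):
    # Lowercased word counts for a single sentence.
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def normalize(freqs):
    # L2 normalization is an assumption; the text only says "normalized".
    length = math.sqrt(sum(v * v for v in freqs.values()))
    return {t: v / length for t, v in freqs.items()} if length else {}


def average_sentence(vectors):
    # Sum each term "column" over all sentence "rows", then divide by the row count.
    totals = defaultdict(float)
    for vec in vectors:
        for term, value in vec.items():
            totals[term] += value
    return {term: total / len(vectors) for term, total in totals.items()}


def dot(a, b):
    return sum(value * b.get(term, 0.0) for term, value in a.items())


def summarize(text, max_chars):
    sentences = split_sentences(text)
    if not sentences:
        return ""
    vectors = [normalize(term_frequencies(s)) for s in sentences]
    centroid = average_sentence(vectors)
    # Rank sentence indexes by dot product with the average sentence, largest first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: dot(vectors[i], centroid), reverse=True)
    summary, used = [], 0
    for i in ranked:
        needed = len(sentences[i]) + (1 if summary else 0)  # +1 for the joining space
        if used + needed > max_chars:
            break  # stop once the character budget is reached
        summary.append(sentences[i])
        used += needed
    return " ".join(summary)
```

Calling `summarize(page_text, 200)` on a page's text would return the highest ranked sentences that fit in 200 characters. Note that the sentences come out in ranked order rather than document order, which matches the description above.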

After I understood the algorithm and wrote the code for it, I needed to integrate it into the Yioop search engine. I then generated its ROUGE results so that I would have a baseline to improve upon. The initial ROUGE results were not that bad: they were not as good as those of the BS and CBS, but they were better than those of the Graph Based Summarizer (GBS). I only mention the results here; they are discussed later. At this point we are ready to discuss the work that went into weighting the terms within the HTML tags.

I started work on weighting during Deliverable #4 of my CS 297 research. Dr. Pollett and I had some more ideas to try to see if we could produce better results. We tried varying the order of the tag importance, increasing the weights to ridiculous values, and adding weights based on the CMS that was detected. None of our ideas bore any fruit, and we decided we would need to come up with another approach. Dr. Pollett and I discussed alternatives and came up with an idea that did not involve weighting the terms at all: to see if we could use CMS detection to focus the summarizer’s attention on only the important content.

Our reasoning for focusing only on important content is that most CMS-generated web pages are full of unnecessary content that most of us humans ignore. The summarizers are not as smart as we are, so they need us to help them. As stated in my overview, CMSs like WordPress generate pages using the same structure. For example, all of the WordPress pages used for my blog have the important content in a div tag with the id of content. All of the content above and below it is irrelevant for summarization. We also found that there are other common tags within the important content container that can easily be removed too. For example, within the content div tag is a div tag that contains information about the author of the page. While that may be important to the reader, it is not important to the summarizer. On all of my WordPress pages the author information is located within a div tag called post-by.
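To make the trimming idea concrete, here is a small Python sketch that keeps only the text inside the content div and drops the post-by div nested in it. It is an illustration built on the standard library's html.parser and is not Yioop's code. The class name ContentExtractor and the helper extract_main_text are hypothetical, and because the text above does not say whether post-by is an id or a class, the sketch assumes it could be either.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects text inside <div id="content">, skipping named inner divs."""

    def __init__(self, content_id="content", skip_names=("post-by",)):
        super().__init__()
        self.content_id = content_id
        self.skip_names = set(skip_names)
        self.content_depth = 0  # div nesting depth inside the content container
        self.skip_depth = 0     # div nesting depth inside a skipped block
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        if self.skip_depth:
            self.skip_depth += 1
        elif self.content_depth:
            # Assumption: "post-by" may appear as either an id or a class.
            names = {attrs.get("id")} | set((attrs.get("class") or "").split())
            if names & self.skip_names:
                self.skip_depth = 1
            else:
                self.content_depth += 1
        elif attrs.get("id") == self.content_id:
            self.content_depth = 1

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif self.content_depth:
            self.content_depth -= 1

    def handle_data(self, data):
        # Keep text only while inside the content div and outside skipped divs.
        if self.content_depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The text returned by `extract_main_text(page_html)` is what would then be handed to a summarizer in place of the full page. The sketch assumes reasonably well-formed markup with balanced div tags.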

How does Yioop detect CMSs, you ask? I did some research to see what other CMS detectors were doing. It turns out that most, if not all, CMS-generated web pages leave a fingerprint behind in the head tag. For example, the head tag on my blog’s WordPress pages has paths to stylesheets that almost always contain the words wp-content or wp-include. Any web page could have a similar entry in its head tag without being generated by WordPress, but for our experiment we assume that a page with this fingerprint was generated by WordPress.
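A fingerprint check along these lines might look like the following Python sketch. It is not Yioop's detector. The function names and the WORDPRESS_MARKERS tuple are hypothetical, and only the two WordPress markers mentioned above are used. No Drupal fingerprint is shown because none is given in this write-up.

```python
import re

# The two markers named in the text; a match on either counts as WordPress.
WORDPRESS_MARKERS = ("wp-content", "wp-include")


def head_section(html):
    # Return the markup between <head> and </head>, or "" if there is none.
    match = re.search(r"<head\b[^>]*>(.*?)</head>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1) if match else ""


def detect_cms(html):
    # Look for the fingerprint only inside the head tag, as described above.
    head = head_section(html).lower()
    if any(marker in head for marker in WORDPRESS_MARKERS):
        return "wordpress"
    return None
```

As noted above, a match is only a heuristic: a hand-written page that happens to reference a wp-content path in its head tag would also be reported as WordPress.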

After I created two CMS detectors, one for WordPress and one for Drupal, I ran ROUGE again to see if the results had improved. Lo and behold, the results did improve, by ten to twenty percent. For example, the ROUGE-1 Average_R score went up from 0.66665 to 0.80863 for the CBWS summarizer. Naturally, the next thing to do was to see if I could get the same results from the other summarizers. The only one that did not improve was the CBS. I am not sure why, but I considered these results to be a success.
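For the ROUGE-1 Average_R example, the size of the improvement depends on whether it is read as an absolute or a relative change. The short Python calculation below, using only the two scores quoted above, shows both: the absolute gain is about 0.142 and the relative gain is about 21 percent. The variable names are only for illustration.

```python
before = 0.66665  # CBWS ROUGE-1 Average_R without CMS-based trimming
after = 0.80863   # CBWS ROUGE-1 Average_R with CMS-based trimming

absolute_gain = after - before            # difference in ROUGE score
relative_gain = absolute_gain / before    # gain as a fraction of the baseline

print(f"absolute gain: {absolute_gain:.5f}")
print(f"relative gain: {relative_gain:.1%}")
```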

Results

In conclusion, we were able to improve the results of the summarizers. Although weighting the terms did not turn out to be fruitful, zeroing in on the important content did. While I have only scratched the surface of CMS detection, I feel we have laid a good foundation for more work to come. Maybe we could come up with a way to automate generating the CMS detectors, or at least teach the system how to detect a CMS on its own.

Yioop Source Files Added

Summarizer ROUGE Results

CBWS ROUGE Result Before
---------------------------------------------
11 ROUGE-1 Average_R: 0.66665 (95%-conf.int. 0.56143 - 0.78024)
11 ROUGE-1 Average_P: 0.59174 (95%-conf.int. 0.44167 - 0.75417)
11 ROUGE-1 Average_F: 0.61670 (95%-conf.int. 0.49659 - 0.74103)
---------------------------------------------
11 ROUGE-2 Average_R: 0.59220 (95%-conf.int. 0.47333 - 0.72667)
11 ROUGE-2 Average_P: 0.52933 (95%-conf.int. 0.35952 - 0.71357)
11 ROUGE-2 Average_F: 0.54518 (95%-conf.int. 0.41064 - 0.68843)
---------------------------------------------
11 ROUGE-3 Average_R: 0.47023 (95%-conf.int. 0.32833 - 0.63333)
11 ROUGE-3 Average_P: 0.44098 (95%-conf.int. 0.24762 - 0.65000)
11 ROUGE-3 Average_F: 0.43500 (95%-conf.int. 0.27667 - 0.60444)
---------------------------------------------
11 ROUGE-4 Average_R: 0.21573 (95%-conf.int. 0.00000 - 0.45000)
11 ROUGE-4 Average_P: 0.20247 (95%-conf.int. 0.00000 - 0.42667)
11 ROUGE-4 Average_F: 0.20307 (95%-conf.int. 0.00000 - 0.40952)
---------------------------------------------
11 ROUGE-L Average_R: 0.46695 (95%-conf.int. 0.35840 - 0.59688)
11 ROUGE-L Average_P: 0.59174 (95%-conf.int. 0.44167 - 0.75417)
11 ROUGE-L Average_F: 0.50737 (95%-conf.int. 0.38835 - 0.65018)
---------------------------------------------
11 ROUGE-W-1.2 Average_R: 0.27444 (95%-conf.int. 0.16971 - 0.37699)
11 ROUGE-W-1.2 Average_P: 0.43489 (95%-conf.int. 0.25417 - 0.63333)
11 ROUGE-W-1.2 Average_F: 0.32801 (95%-conf.int. 0.19819 - 0.46248)
---------------------------------------------
11 ROUGE-S* Average_R: 0.43092 (95%-conf.int. 0.28524 - 0.60095)
11 ROUGE-S* Average_P: 0.38901 (95%-conf.int. 0.19166 - 0.60952)
11 ROUGE-S* Average_F: 0.37903 (95%-conf.int. 0.22536 - 0.54553)
---------------------------------------------
11 ROUGE-SU* Average_R: 0.52030 (95%-conf.int. 0.38368 - 0.67646)
11 ROUGE-SU* Average_P: 0.44916 (95%-conf.int. 0.25394 - 0.65976)
11 ROUGE-SU* Average_F: 0.45378 (95%-conf.int. 0.30683 - 0.60927)

CBWS ROUGE Result After
---------------------------------------------
11 ROUGE-1 Average_R: 0.80863 (95%-conf.int. 0.70071 - 0.89833)
11 ROUGE-1 Average_P: 0.79195 (95%-conf.int. 0.69143 - 0.88214)
11 ROUGE-1 Average_F: 0.79771 (95%-conf.int. 0.69705 - 0.88261)
---------------------------------------------
11 ROUGE-2 Average_R: 0.71814 (95%-conf.int. 0.58000 - 0.83667)
11 ROUGE-2 Average_P: 0.70826 (95%-conf.int. 0.57000 - 0.83000)
11 ROUGE-2 Average_F: 0.71019 (95%-conf.int. 0.57272 - 0.82778)
---------------------------------------------
11 ROUGE-3 Average_R: 0.63218 (95%-conf.int. 0.45833 - 0.78833)
11 ROUGE-3 Average_P: 0.62551 (95%-conf.int. 0.45167 - 0.79000)
11 ROUGE-3 Average_F: 0.62487 (95%-conf.int. 0.45143 - 0.78135)
---------------------------------------------
11 ROUGE-4 Average_R: 0.45574 (95%-conf.int. 0.20000 - 0.68333)
11 ROUGE-4 Average_P: 0.46348 (95%-conf.int. 0.21667 - 0.71667)
11 ROUGE-4 Average_F: 0.45503 (95%-conf.int. 0.21333 - 0.68952)
---------------------------------------------
11 ROUGE-L Average_R: 0.55198 (95%-conf.int. 0.44967 - 0.67684)
11 ROUGE-L Average_P: 0.79195 (95%-conf.int. 0.69143 - 0.88214)
11 ROUGE-L Average_F: 0.63472 (95%-conf.int. 0.55068 - 0.73380)
---------------------------------------------
11 ROUGE-W-1.2 Average_R: 0.35800 (95%-conf.int. 0.27424 - 0.46042)
11 ROUGE-W-1.2 Average_P: 0.68562 (95%-conf.int. 0.57119 - 0.79762)
11 ROUGE-W-1.2 Average_F: 0.45893 (95%-conf.int. 0.36787 - 0.56848)
---------------------------------------------
11 ROUGE-S* Average_R: 0.65210 (95%-conf.int. 0.48857 - 0.79953)
11 ROUGE-S* Average_P: 0.62371 (95%-conf.int. 0.46000 - 0.78143)
11 ROUGE-S* Average_F: 0.62855 (95%-conf.int. 0.47200 - 0.77133)
---------------------------------------------
11 ROUGE-SU* Average_R: 0.72396 (95%-conf.int. 0.57849 - 0.84921)
11 ROUGE-SU* Average_P: 0.69407 (95%-conf.int. 0.55905 - 0.82167)
11 ROUGE-SU* Average_F: 0.70080 (95%-conf.int. 0.56488 - 0.81945)