Test Yioop Summarizers Against a Large Data Set

Aim

The goal of this deliverable is to find a large document set and write code that automates testing all of the Yioop summarizers against it.

Overview

Continuing the work from my CS 297 deliverables 1 through 4, Dr. Pollett and I felt that we needed to test the set of summarizers in Yioop against a larger set of data. I searched the Internet for a large document set that contained both human summaries and computer-generated summaries. After an exhaustive search, most of the resources I read pointed to a document set created for a yearly event called the Document Understanding Conference (DUC).

The DUC is an event organized by the National Institute of Standards and Technology (NIST) that consists of many summarization evaluations. "Its goal is to further progress in automatic text summarizations and enable researchers to participate in large-scale experiments in both the development and evaluations of summarization systems" (NIST2007). The DUC was held each year from 2001 to 2007. Its tasks have since been merged into the Text Analysis Conference (TAC), which is also organized by NIST.

Although the DUC data is a few years old, it works perfectly for our purposes. Before getting to the data itself, we need to cover how it is constructed. The DUC data is broken into two tasks: a main task and an update task. The tasks are independent of each other, and participants in the conference can choose to do one or both. The main task focuses on the question-answering problem: contestants are given a topic and a set of 25 relevant documents and must produce a 250-word summary that answers the questions in the topic. In the update task, contestants produce a 100-word summary based on a set of newswire articles from the AQUAINT corpus, which contains "newswire articles from the Associated Press and New York Times (1998 - 2000) and Xinhua News Agency (1996 - 2000)" (NIST2007). The purpose of these summaries is to update readers with new information, under the assumption that they have already read an earlier set of articles.

My results pertain only to the update task, so I will not mention the main task again. As stated above, the update task summaries are limited to 100 words. Each topic in the update task is a subset of the main DUC documents. The data consists of 10 topics with 25 documents per topic, and the documents in each topic are broken into three sets: A, B and C. Set A has 10 documents, set B has 8 documents and set C has 7 documents. To obtain human summaries, four NIST assessors wrote a summary for each set of documents. To Dr. Pollett and me, those human summaries are the Holy Grail, because the most difficult part of this research is getting people to read documents and summarize them.

Now that we have found our human summaries in the DUC data, we also need automated summaries to use in the comparison. The 2007 DUC had 22 contestants, which supply those computer-generated summaries. The effectiveness of the summaries is compared using Recall-Oriented Understudy for Gisting Evaluation (ROUGE), just as in my previous deliverables. The DUC also evaluated the summaries using Basic Elements (BE) and a manual pyramid evaluation, but we are only concerned with the results from the ROUGE evaluation.

Work Performed

After determining that the DUC data was the large data set I was looking for, I needed to gain access to it. I emailed the NIST representative to confirm what was required, relayed that information to Dr. Pollett, and he requested access to the data. After a week or so, the NIST representative responded with what we needed to gain access to the DUC data.

Next, I downloaded the data from 2001-2007 and started to review it. I quickly found that the organization of the data was not clear; Dr. Pollett and I spent time in a few of our weekly meetings before we finally figured it out. Each set of documents was broken down into .scu files, which are essentially custom-formatted XML files. The summaries for those .scu files were in a separate Zip file that contained the ROUGE configuration, the summaries and the results. Since I had worked with ROUGE before, this Zip file was easy to interpret. It also included a README.txt that explained which tests were run and how they were run. For example, the ROUGE command line was in there:

    ROUGE-1.5.5.pl -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -a -d rougejk.in

Now that I knew where the data was located, I needed the topic data in a different format so I could run each summarizer against it. I wrote a C# utility that converted the .scu files to raw text files with each sentence on its own line. Next, I configured Yioop to use the Basic Summarizer (BS). After that, I wrote some PHP code that copied each converted .scu file into the correct location and kicked off the Yioop summarization process. For this automated process to work, each summarizer was configured to save its summarized output to disk. After each summary was created, the same PHP code moved the summarized output into the ROUGE directory and named the file appropriately. I repeated this process for each of the summarizers (BS, CBS, GBS and CBWS) in the Yioop system.
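To make that loop concrete, here is a minimal sketch of the kind of PHP glue code involved. The directory paths, the summarize_file() helper and the file-naming scheme are hypothetical placeholders, not the actual deliverable code, which hooks into Yioop's own summarization pipeline.

    <?php
    // Hypothetical driver: feed each converted topic file to a Yioop summarizer
    // and copy the saved summary into the directory the ROUGE config points at.
    $converted_dir  = "/data/duc2007/converted"; // .scu files converted to one sentence per line
    $rouge_peer_dir = "/data/rouge/peers";       // where the ROUGE configuration expects system summaries
    $summarizer     = "GBS";                     // repeat the loop for BS, CBS, CBWS and GBS

    // Stand-in for kicking off the Yioop summarization process; the real code
    // relied on Yioop being configured to save its summary output to disk.
    function summarize_file($topic_file, $summarizer)
    {
        $out = "$topic_file.$summarizer.summary";
        // e.g. shell out to a small wrapper script around Yioop (placeholder):
        // exec("php yioop_summarize.php $summarizer " . escapeshellarg($topic_file) . " $out");
        return $out;
    }

    foreach (glob("$converted_dir/*.txt") as $topic_file) {
        $summary_path = summarize_file($topic_file, $summarizer);
        // Name the peer file so it matches its entry in the ROUGE configuration.
        $topic_id = basename($topic_file, ".txt");
        copy($summary_path, "$rouge_peer_dir/$topic_id.$summarizer");
    }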

Once all of the summaries were generated, it was time to run the ROUGE tests. I used the same configuration NIST used and added entries for the Yioop-generated summaries. On a virtual machine with 4 processors and 4 GB of memory, ROUGE took 15 hours to complete. Once it finished, I compared the results and present my findings below.
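For reference, ROUGE-1.5.5 is driven by an XML evaluation configuration, so adding the Yioop summaries mostly meant adding peer entries to each topic's <EVAL> block. The snippet below only illustrates the general shape of such an entry; the IDs, paths and file names are made up rather than taken from the actual NIST configuration.

    <EVAL ID="D0703-A">
      <PEER-ROOT>peers</PEER-ROOT>
      <MODEL-ROOT>models</MODEL-ROOT>
      <INPUT-FORMAT TYPE="SPL"></INPUT-FORMAT>
      <PEERS>
        <!-- the DUC contestants are already listed; a Yioop summarizer is added as one more peer -->
        <P ID="GBS">D0703-A.GBS</P>
      </PEERS>
      <MODELS>
        <!-- the NIST assessors' human summaries for this document set -->
        <M ID="A">D0703-A.human.A</M>
        <M ID="B">D0703-A.human.B</M>
      </MODELS>
    </EVAL>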

Results

Remember, Yioop has four summarizers: the Centroid Based Summarizer (CBS), the Basic Summarizer (BS), the Graph Based Summarizer (GBS) and the Centroid Based Weighted Summarizer (CBWS). Various ROUGE tests were run using the NIST configuration, which lets us compare the Yioop summarizers against each other, against the human summaries, and against the contestants' generated summaries. Out of the 38 summary sources ranked (our 4, 12 human and 22 contestants), the tables below show where each Yioop summarizer placed and on which measure it did best: F-measure (F), precision (P) or recall (R).

As you can see below, the summarizer results all average fairly close together, with the GBS being the best on average. The CBS achieved the single best (lowest) rank, while the BS ranked the worst. Although our generated summaries beat some of the other summarizers, at best they still have 13 computer-generated summarizers ahead of them (17 ahead in total, minus 4 human). Doing some analysis, the BS performed similarly across the F-measure, precision and recall tests, while the other summarizers did notably well on the recall tests. Perhaps the student who follows me can take this work to an upcoming TAC and get the Yioop summarizers more exposure.

DUC ROUGE Test Results

The results make me wonder how sentences are separated within the Yioop system. Each summarizer has its own method for splitting text into sentences, and since this test input was pristine, with each sentence already on its own line, we may need to look into this further.
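As an illustration of why sentence splitting matters (this is not Yioop's actual code), a naive splitter like the sketch below would re-split input that is already one sentence per line and stumble over abbreviations:

    <?php
    // Naive sentence splitter, for illustration only.
    function naive_split_sentences($text)
    {
        // Split on ., ! or ? followed by whitespace; this mishandles
        // abbreviations and ignores existing line breaks.
        return preg_split('/(?<=[.!?])\s+/', trim($text));
    }

    print_r(naive_split_sentences("Dr. Pollett met the NIST assessors. They wrote the summaries."));
    // The period in "Dr." causes an extra, incorrect split.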

I have abbreviated the ROUGE test names for easier viewing. Each name starts with an R for ROUGE. The middle character identifies the test, so R1 is the ROUGE-1 test and RL is the ROUGE-L test. The last character is the metric: F for F-measure, P for precision and R for recall. For example, R1R is the recall metric from the ROUGE-1 test.

Metric                        CBS      BS         CBWS     GBS
Lowest (best) rank            18       21         24       22
ROUGE test at lowest rank     R1R      RLR; RWR   R3R      RLR
Median rank                   37       38         36       34
Average rank                  32.10    33.29      32.43    31.86
Highest (worst) rank          37       38         36       35


Summarizer   Rank   ROUGE Test
BS           38     R1F
CBS          37     R1F
CBWS         36     R1F
GBS          35     R1F
BS           38     R1P
CBS          37     R1P
CBWS         36     R1P
GBS          35     R1P
BS           23     R1R
CBS          18     R1R
CBWS         25     R1R
GBS          24     R1R
BS           38     R2F
CBS          37     R2F
CBWS         36     R2F
GBS          33     R2F
BS           38     R2P
CBS          37     R2P
CBWS         36     R2P
GBS          35     R2P
BS           27     R2R
CBS          23     R2R
CBWS         28     R2R
GBS          30     R2R
BS           38     R3F
CBS          37     R3F
CBWS         34     R3F
GBS          33     R3F
BS           38     R3P
CBS          37     R3P
CBWS         36     R3P
GBS          34     R3P
BS           23     R3R
CBS          27     R3R
CBWS         24     R3R
GBS          32     R3R
BS           38     R4F
CBS          36     R4F
CBWS         32     R4F
GBS          29     R4F
BS           38     R4P
CBS          37     R4P
CBWS         35     R4P
GBS          34     R4P
BS           24     R4R
CBS          28     R4R
CBWS         26     R4R
GBS          30     R4R
BS           38     RLF
CBS          37     RLF
CBWS         36     RLF
GBS          35     RLF
BS           38     RLP
CBS          37     RLP
CBWS         36     RLP
GBS          35     RLP
BS           21     RLR
CBS          19     RLR
CBWS         25     RLR
GBS          22     RLR
BS           38     RSF
CBS          37     RSF
CBWS         36     RSF
GBS          35     RSF
BS           38     RSP
CBS          37     RSP
CBWS         36     RSP
GBS          35     RSP
BS           28     RSR
CBS          22     RSR
CBWS         30     RSR
GBS          29     RSR
BS           38     RWF
CBS          37     RWF
CBWS         36     RWF
GBS          35     RWF
BS           38     RWP
CBS          37     RWP
CBWS         36     RWP
GBS          35     RWP
BS           21     RWR
CBS          20     RWR
CBWS         26     RWR
GBS          24     RWR
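As a sanity check, the summary statistics in the first table can be recomputed from the per-test ranks above. The following sketch assumes the ranks have been transcribed by hand into an array; only a few entries per summarizer are shown here.

    <?php
    // Recompute best, worst, median and average rank per summarizer from the
    // 21 per-test ranks listed above (array entries abbreviated).
    $ranks = [
        "BS"   => [38, 38, 23, 38, 38, 27 /* ... remaining 15 ranks ... */],
        "CBS"  => [37, 37, 18, 37, 37, 23 /* ... */],
        "CBWS" => [36, 36, 25, 36, 36, 28 /* ... */],
        "GBS"  => [35, 35, 24, 33, 35, 30 /* ... */],
    ];

    foreach ($ranks as $name => $r) {
        sort($r);
        $average = array_sum($r) / count($r);
        $median  = $r[intdiv(count($r), 2)]; // middle element once all 21 ranks are present
        printf("%-5s best=%d worst=%d median=%d average=%.2f\n",
            $name, $r[0], $r[count($r) - 1], $median, $average);
    }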

References

[NIST2007] DUC 2007: Task, Documents, and Measures. National Institute of Standards and Technology (NIST), 2007.