Test Yioop Summarizers Against a Large Data Set

Aim

The goal of this deliverable is to find a large document set and write code that automates testing all of the Yioop summarizers against it.

Overview

Continuing the work from my CS 297 deliverables 1 through 4, Dr. Pollett and I felt that we needed to test the set of summarizers in Yioop against a larger set of data. I searched the Internet for a large document set that contained both human summaries and computer-generated summaries. After an exhaustive search, most of the resources I read pointed to a document set created for a yearly event called the Document Understanding Conference (DUC).

The DUC is an event organized by the National Institute of Standards and Technology (NIST) that consists of many summarization evaluations. "Its goal is to further progress in automatic text summarizations and enable researchers to participate in large-scale experiments in both the development and evaluations of summarization systems" (NIST2007). The DUC was held each year from 2001 to 2007. Its tasks have since been merged into the Text Analysis Conference (TAC), which is also organized by NIST.

Although the DUC data is a few years old, it works perfectly for our purposes. Before getting to the data itself, we need to cover how it is constructed. The DUC data is broken into two tasks: a main task and an update task. The tasks are independent of each other, and participants in the conference can choose to do one or both. The main task focuses on the question-answering problem: contestants are given a topic and a set of 25 relevant documents and must produce a 250-word summary that answers the questions in the topic. In the update task, contestants produce a 100-word summary based on a set of newswire articles from the AQUAINT corpus, which contains "newswire articles from the Associated Press and New York Times (1998 - 2000) and Xinhua News Agency (1996 - 2000)" (NIST2007). The purpose of these summaries is to update readers with new information, under the assumption that they have already read an earlier set of articles.

My results pertain only to the update task, so I will not mention the main task again. As stated above, the update task summaries are limited to 100 words. Each topic in the update task is a subset of the main DUC documents. The data consists of 10 topics with 25 documents per topic, and the documents in each topic are broken into three sets: A, B and C. Set A has 10 documents, set B has 8 documents and set C has 7 documents. To obtain human summaries, four NIST assessors wrote a summary for each set of documents. To Dr. Pollett and me, those human summaries are the Holy Grail, because the most difficult part of this research is getting people to read documents and summarize them.

Now that we have found our human summaries in the DUC data, we also need automated summaries to use in the comparison. The 2007 DUC had 22 contestants, which supply those computer-generated summaries. The effectiveness of the summaries is compared using Recall-Oriented Understudy for Gisting Evaluation (ROUGE), just as in my previous deliverables. The DUC also evaluated the summaries using Basic Elements (BE) and a manual pyramid evaluation, but we are only concerned with the results from the ROUGE evaluation.

Work Performed

After determining that the DUC data was the large data set I was looking for, I needed to gain access to it. I emailed the NIST representative to confirm what was required, relayed that information to Dr. Pollett, and he requested access to the data. After a week or so, the NIST representative responded with what we needed to gain access to the DUC data.

Next, I downloaded the data from 2001-2007 and started to review it. I quickly found that the organization of the data was not clear; Dr. Pollett and I spent time in a few of our weekly meetings before we finally figured it out. Each set of documents was broken down into .scu files, which are essentially custom-formatted XML files. The summaries for those .scu files were in a separate Zip file that contained the ROUGE configuration, the summaries and the results. Since I had worked with ROUGE before, this Zip file was easy to interpret. It also included a README.txt that explained which tests were run and how they were run. For example, the ROUGE command line was in there:

    ROUGE-1.5.5.pl -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -a -d rougejk.in

Now that I knew where the data was located, I needed the topic data in a different format so I could run each summarizer against it. I wrote a C# utility that converted the .scu files to raw text files with each sentence on its own line. Next, I configured Yioop to use the Basic Summarizer (BS). After that, I wrote some PHP code that copied each converted .scu file into the correct location and kicked off the Yioop summarization process. For this automated process to work, each summarizer was configured to save its summarized output to disk. After each summary was created, the same PHP code moved the summarized output into the ROUGE directory and named the file appropriately. I repeated this process for each of the summarizers (BS, CBS, GBS and CBWS) in the Yioop system.
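To make that loop concrete, here is a minimal sketch of the kind of PHP glue code involved. The directory paths, the summarize_file() helper and the file-naming scheme are hypothetical placeholders, not the actual deliverable code, which hooks into Yioop's own summarization pipeline.

    <?php
    // Hypothetical driver: feed each converted topic file to a Yioop summarizer
    // and copy the saved summary into the directory the ROUGE config points at.
    $converted_dir  = "/data/duc2007/converted"; // .scu files converted to one sentence per line
    $rouge_peer_dir = "/data/rouge/peers";       // where the ROUGE configuration expects system summaries
    $summarizer     = "GBS";                     // repeat the loop for BS, CBS, CBWS and GBS

    // Stand-in for kicking off the Yioop summarization process; the real code
    // relied on Yioop being configured to save its summary output to disk.
    function summarize_file($topic_file, $summarizer)
    {
        $out = "$topic_file.$summarizer.summary";
        // e.g. shell out to a small wrapper script around Yioop (placeholder):
        // exec("php yioop_summarize.php $summarizer " . escapeshellarg($topic_file) . " $out");
        return $out;
    }

    foreach (glob("$converted_dir/*.txt") as $topic_file) {
        $summary_path = summarize_file($topic_file, $summarizer);
        // Name the peer file so it matches its entry in the ROUGE configuration.
        $topic_id = basename($topic_file, ".txt");
        copy($summary_path, "$rouge_peer_dir/$topic_id.$summarizer");
    }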

Once all of the summaries were generated, it was time to run the ROUGE tests. I used the same configuration NIST used and added entries for the Yioop-generated summaries. On a virtual machine with 4 processors and 4 GB of memory, ROUGE took 15 hours to complete. Once it finished, I compared the results and present my findings below.
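For reference, ROUGE-1.5.5 is driven by an XML evaluation configuration, so adding the Yioop summaries mostly meant adding peer entries to each topic's <EVAL> block. The snippet below only illustrates the general shape of such an entry; the IDs, paths and file names are made up rather than taken from the actual NIST configuration.

    <EVAL ID="D0703-A">
      <PEER-ROOT>peers</PEER-ROOT>
      <MODEL-ROOT>models</MODEL-ROOT>
      <INPUT-FORMAT TYPE="SPL"></INPUT-FORMAT>
      <PEERS>
        <!-- the DUC contestants are already listed; a Yioop summarizer is added as one more peer -->
        <P ID="GBS">D0703-A.GBS</P>
      </PEERS>
      <MODELS>
        <!-- the NIST assessors' human summaries for this document set -->
        <M ID="A">D0703-A.human.A</M>
        <M ID="B">D0703-A.human.B</M>
      </MODELS>
    </EVAL>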

Results

Remember, Yioop has four summarizers: the Centroid Based Summarizer (CBS), the Basic Summarizer (BS), the Graph Based Summarizer (GBS) and the Centroid Based Weighted Summarizer (CBWS). Various ROUGE tests were run using the NIST configuration, which lets us compare the Yioop summarizers against each other, against the human summaries, and against the contestants' generated summaries. Out of the 38 summary sources ranked (our 4, 12 human and 22 contestants), the tables below show where each Yioop summarizer placed and on which measure it did best: F-measure (F), precision (P) or recall (R).

As you can see below, the summarizer results all average fairly close together, with the GBS being the best on average. The CBS achieved the single best (lowest) rank, while the BS ranked the worst. Although our generated summaries beat some of the other summarizers, at best they still have 13 computer-generated summarizers ahead of them (17 ahead in total, minus 4 human). Doing some analysis, the BS performed similarly across the F-measure, precision and recall tests, while the other summarizers did notably well on the recall tests. Perhaps the student who follows me can take this work to an upcoming TAC and get the Yioop summarizers more exposure.

DUC ROUGE Test Results

The results make me wonder how sentences are separated within the Yioop system. Each summarizer has its own method for splitting text into sentences, and since this test input was pristine, with each sentence already on its own line, we may need to look into this further.
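As an illustration of why sentence splitting matters (this is not Yioop's actual code), a naive splitter like the sketch below would re-split input that is already one sentence per line and stumble over abbreviations:

    <?php
    // Naive sentence splitter, for illustration only.
    function naive_split_sentences($text)
    {
        // Split on ., ! or ? followed by whitespace; this mishandles
        // abbreviations and ignores existing line breaks.
        return preg_split('/(?<=[.!?])\s+/', trim($text));
    }

    print_r(naive_split_sentences("Dr. Pollett met the NIST assessors. They wrote the summaries."));
    // The period in "Dr." causes an extra, incorrect split.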

I have abbreviated the ROUGE test names for easier viewing. Each name starts with an R for ROUGE. The middle character identifies the test, so R1 is the ROUGE-1 test and RL is the ROUGE-L test. The last character is the metric: F for F-measure, P for precision and R for recall. For example, R1R is the recall metric from the ROUGE-1 test.

Metric                        CBS      BS         CBWS     GBS
Lowest (best) rank            18       21         24       22
ROUGE test at lowest rank     R1R      RLR; RWR   R3R      RLR
Median rank                   37       38         36       34
Average rank                  32.10    33.29      32.43    31.86
Highest (worst) rank          37       38         36       35


Summarizer   Rank   ROUGE Test
BS           38     R1F
CBS          37     R1F
CBWS         36     R1F
GBS          35     R1F
BS           38     R1P
CBS          37     R1P
CBWS         36     R1P
GBS          35     R1P
BS           23     R1R
CBS          18     R1R
CBWS         25     R1R
GBS          24     R1R
BS           38     R2F
CBS          37     R2F
CBWS         36     R2F
GBS          33     R2F
BS           38     R2P
CBS          37     R2P
CBWS         36     R2P
GBS          35     R2P
BS           27     R2R
CBS          23     R2R
CBWS         28     R2R
GBS          30     R2R
BS           38     R3F
CBS          37     R3F
CBWS         34     R3F
GBS          33     R3F
BS           38     R3P
CBS          37     R3P
CBWS         36     R3P
GBS          34     R3P
BS           23     R3R
CBS          27     R3R
CBWS         24     R3R
GBS          32     R3R
BS           38     R4F
CBS          36     R4F
CBWS         32     R4F
GBS          29     R4F
BS           38     R4P
CBS          37     R4P
CBWS         35     R4P
GBS          34     R4P
BS           24     R4R
CBS          28     R4R
CBWS         26     R4R
GBS          30     R4R
BS           38     RLF
CBS          37     RLF
CBWS         36     RLF
GBS          35     RLF
BS           38     RLP
CBS          37     RLP
CBWS         36     RLP
GBS          35     RLP
BS           21     RLR
CBS          19     RLR
CBWS         25     RLR
GBS          22     RLR
BS           38     RSF
CBS          37     RSF
CBWS         36     RSF
GBS          35     RSF
BS           38     RSP
CBS          37     RSP
CBWS         36     RSP
GBS          35     RSP
BS           28     RSR
CBS          22     RSR
CBWS         30     RSR
GBS          29     RSR
BS           38     RWF
CBS          37     RWF
CBWS         36     RWF
GBS          35     RWF
BS           38     RWP
CBS          37     RWP
CBWS         36     RWP
GBS          35     RWP
BS           21     RWR
CBS          20     RWR
CBWS         26     RWR
GBS          24     RWR
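As a sanity check, the summary statistics in the first table can be recomputed from the per-test ranks above. The following sketch assumes the ranks have been transcribed by hand into an array; only a few entries per summarizer are shown here.

    <?php
    // Recompute best, worst, median and average rank per summarizer from the
    // 21 per-test ranks listed above (array entries abbreviated).
    $ranks = [
        "BS"   => [38, 38, 23, 38, 38, 27 /* ... remaining 15 ranks ... */],
        "CBS"  => [37, 37, 18, 37, 37, 23 /* ... */],
        "CBWS" => [36, 36, 25, 36, 36, 28 /* ... */],
        "GBS"  => [35, 35, 24, 33, 35, 30 /* ... */],
    ];

    foreach ($ranks as $name => $r) {
        sort($r);
        $average = array_sum($r) / count($r);
        $median  = $r[intdiv(count($r), 2)]; // middle element once all 21 ranks are present
        printf("%-5s best=%d worst=%d median=%d average=%.2f\n",
            $name, $r[0], $r[count($r) - 1], $median, $average);
    }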

References

[NIST2007] DUC 2007: Task, Documents, and Measures. National Institute of Standards and Technology (NIST), 2007.