Chris Pollett > Students > Charles

Print View

[Bio]

[Blog]

﻿

# Term Frequency Weighting in the Centroid-Based Summarizer

## Aim

The goal of this deliverable was to implement Dr. Pollett’s summarizer algorithm. Once the algorithm is implemented, I will test the algorithm using ROUGE and compare its results to the other summarizers.

## Overview

This deliverable is an extension of Deliverable 1 and Deliverable 3. Here I will cover Dr. Pollett’s algorithm and its results.

## Work Performed

Before any code could be written, Dr. Pollett and I needed to discuss his algorithm. His idea is to add a weighting algorithm to the centroid-based summarizer in order to improve its results. In short if you increase the frequency you increase the weight. While that sounds great, more than just the idea needed to be defined. We need to define how the terms will be weighted and what we will do with the additional weight values. Dr. Pollett was firm on the idea that we could use the location of the terms in the html document. For example, if the term is in an h1 (heading) tag we would add some numerical weight to that terms frequency. We also discussed weighting terms based on if they appeared in the beginning, middle or at the end of the document. Furthermore, we discussed looking at the sentence length of the sentence the term was found in and increase its weight that way.

After we had all of our ideas on the table, we decided to go with weighting by tag idea. Then I started doing some research on the Internet and found two documents;
Using the Structure of HTML Documents to Improve Retrieval and Semantic-Sensitive Web Information Retrieval Model for HTML Documents that looked promising. Their idea is to organize certain html tags into classes and weight them according to a hierarchy. In other words, if a term was found in an A (anchor) tag, then it does not matter if the term is found in any other tag. Using the Structure of HTML Documents to Improve Retrieval mentioned that it used 6 classes of html tags. The first class contains only the A (anchor) tag. The second class contains the H1 and H2 heading tags. The third class contains the H3, H4, H5 and H6 heading tags. The fourth class contains the text emphasizing and list tags. They are STRONG, B, EM, I, U, DL, OL and UL tags. The fifth class contains the TITLE tag and the sixth class is everything else. Terms that fall through to the sixth class gets no additional weight.

Now that we know how we planned on weighting the terms, we will get into what the weighting algorithm will do to get the new weight. Instead of having a hierarchy like in the paper Using the Structure of HTML Documents to Improve Retrieval, all class occurrences will be counted. That is if the term is in an anchor tag and a heading tag, they will both be counted. Furthermore, the weight of each tag will be multiplied by its term frequencies in that tag and added up. In other words, the weight of a term is the sum of all weights multiplied by its respective frequency in that tag. Lastly, we needed to consider the efficiency of this algorithm. Dr. Pollett made it clear that the current algorithm is pretty slow and my weighting algorithm should not slow it down any more than it needs to.

At this point I had my marching orders and began coding the weighting algorithm. Since this method would probably be shared amongst many summarizers, I decided to reorganize the code first. I created a folder to hold the summarizers under the lib folder. Then I took the basic summarizer in-lined code and put it in its own file called scrape_summarizer.php and added it to the summarizer folder. I moved the graph-based and centroid-based summarizer to the summarizer folder also. Next, I created a summarizer.php file in the summarizer folder to be the base summarizer for all of the summarizers and added the base addition weight method. After that I modified the two summarizers to extend from the base summarizer and updated the code files that referenced the summarizers in their old location.

After reorganizing the code, I finally started writing the code for the additional weighting algorithm. Although I have used regular expressions for some time, I struggled with them in this implementation. It took a couple of weekly meetings before we had it nailed down. The problems were stemming from the regular expressions ability to count multiple occurrences of the same term in a tag. We decided to take a performance hit for quality results. After the regular expression found a match, the substr_count() method would be used to count the term frequencies in that tag. In order to prove my code would perform correctly, I wrote unit tests. The unit tests prove invaluable. The unit tests uncovered a few bugs within my code. I was able to fix the bugs and get all of the test to pass. I wrote code that would set all of the weights to 1 and feed it a regular expression. I tested each case and put the code back to the way it found it.

Lastly, I needed to regression test the additional weighting algorithm to expose optimal weights that I may use when making the algorithm live. I took what I did with the unit test framework to automate the entire process of updating the code with some weighting value, generating a summary for my 10 test documents, formatted the output for ROUGE, ran ROUGE on the results and repeated the process for some limit. This may sound east but it was not. I was able to create such a system that did one class weighting value at a time and tested its results.

## Results

In conclusion, creating this weighting algorithm was a ton of work. Unfortunately, the work is not done. I was not able to render any results that were improved from the original results. It seemed that no matter what the weight was, the results were consistently lower. The CBS summarizer Average_R value without weighting was 0.76663 and with weigths was 0.24689. Maybe next time around we should incorporate some of the other ideas or change the placement of the additional weight algorithm in the centroid-based summarizer. Although there is no real execution changes to the code, I will still submit an issue to the Mantis BT code repository to be able to continue this work.

CBS Weighted ROUGE Result
---------------------------------------------
11 ROUGE-1 Average_R: 0.24689 (95%-conf.int. 0.12667 - 0.37333)
11 ROUGE-1 Average_P: 0.21218 (95%-conf.int. 0.11607 - 0.31071)
11 ROUGE-1 Average_F: 0.21092 (95%-conf.int. 0.11915 - 0.30427)
---------------------------------------------
11 ROUGE-2 Average_R: 0.14330 (95%-conf.int. 0.05000 - 0.25000)
11 ROUGE-2 Average_P: 0.08230 (95%-conf.int. 0.02857 - 0.14286)
11 ROUGE-2 Average_F: 0.10366 (95%-conf.int. 0.03636 - 0.18182)
---------------------------------------------
11 ROUGE-3 Average_R: 0.06350 (95%-conf.int. 0.03333 - 0.13333)
11 ROUGE-3 Average_P: 0.03175 (95%-conf.int. 0.01667 - 0.06667)
11 ROUGE-3 Average_F: 0.04233 (95%-conf.int. 0.02222 - 0.08889)
---------------------------------------------
11 ROUGE-4 Average_R: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
11 ROUGE-4 Average_P: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
11 ROUGE-4 Average_F: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
---------------------------------------------
11 ROUGE-L Average_R: 0.18470 (95%-conf.int. 0.09365 - 0.29961)
11 ROUGE-L Average_P: 0.21218 (95%-conf.int. 0.11607 - 0.31071)
11 ROUGE-L Average_F: 0.18124 (95%-conf.int. 0.10222 - 0.26444)
---------------------------------------------
11 ROUGE-W-1.2 Average_R: 0.11842 (95%-conf.int. 0.05341 - 0.20543)
11 ROUGE-W-1.2 Average_P: 0.17366 (95%-conf.int. 0.08750 - 0.27679)
11 ROUGE-W-1.2 Average_F: 0.12677 (95%-conf.int. 0.06425 - 0.20894)
---------------------------------------------
11 ROUGE-S* Average_R: 0.07818 (95%-conf.int. 0.03000 - 0.14000)
11 ROUGE-S* Average_P: 0.02791 (95%-conf.int. 0.01071 - 0.05000)
11 ROUGE-S* Average_F: 0.04038 (95%-conf.int. 0.01579 - 0.07427)
---------------------------------------------
11 ROUGE-SU* Average_R: 0.13299 (95%-conf.int. 0.05656 - 0.22476)
11 ROUGE-SU* Average_P: 0.09831 (95%-conf.int. 0.03429 - 0.20513)
11 ROUGE-SU* Average_F: 0.08041 (95%-conf.int. 0.03828 - 0.12862)

CBS ROUGE Configuration File
<ROUGE-EVAL version="1.0">
<EVAL ID="1">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
</PEERS>
<MODELS>
<M ID="A">
</MODELS>
</EVAL>
<EVAL ID="2">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
DeliveringaprojectandpresentingtoamultilevelaudienceCS200WBlog.html</P>
</PEERS>
<MODELS>
<M ID="A">
DeliveringaprojectandpresentingtoamultilevelaudienceCS200WBlog.html</M>
</MODELS>
</EVAL>
<EVAL ID="3">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
HandingoffaprojecttoaclientwhataretherisksandchallengesCS20.html</P>
</PEERS>
<MODELS>
<M ID="A">
HandingoffaprojecttoaclientwhataretherisksandchallengesCS20.html</M>
</MODELS>
</EVAL>
<EVAL ID="4">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
</PEERS>
<MODELS>
<M ID="A">
</MODELS>
</EVAL>
<EVAL ID="5">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
SocialMediaandBrandingCS200WBlog.html</P>
</PEERS>
<MODELS>
<M ID="A">
SocialMediaandBrandingCS200WBlog.html</M>
</MODELS>
</EVAL>
<EVAL ID="6">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
TheAgileTeamandwhatisaBacklogWhataretheyforandwhyaretheyimp.html</P>
</PEERS>
<MODELS>
<M ID="A">
TheAgileTeamandwhatisaBacklogWhataretheyforandwhyaretheyimp.html</M>
</MODELS>
</EVAL>
<EVAL ID="7">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
</PEERS>
<MODELS>
<M ID="A">
</MODELS>
</EVAL>
<EVAL ID="8">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
WhatisAgileandwhatareuserstoriesCS200WBlog.html</P>
</PEERS>
<MODELS>
<M ID="A">
WhatisAgileandwhatareuserstoriesCS200WBlog.html</M>
</MODELS>
</EVAL>
<EVAL ID="9">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
WhatisanAgileSprintRetrospectiveAbusylifeofagirlgamer.html</P>
</PEERS>
<MODELS>
<M ID="A">
WhatisanAgileSprintRetrospectiveAbusylifeofagirlgamer.html</M>
</MODELS>
</EVAL>
<EVAL ID="10">
<PEER-ROOT>
./Yioop-testGraphBased/systemsAreGenerated </PEER-ROOT>
<MODEL-ROOT>
./Yioop-testGraphBased/modelsAreHuman </MODEL-ROOT>
<INPUT-FORMAT TYPE="SEE">
</INPUT-FORMAT>
<PEERS>
<P ID="11">
WhatisanAgileSprintRetrospectiveCS200WBlog.html</P>
</PEERS>
<MODELS>
<M ID="A">
WhatisanAgileSprintRetrospectiveCS200WBlog.html</M>
</MODELS>
</EVAL>
</ROUGE-EVAL>

Human Generated Input System Generated Input
<html>
<body bgcolor="white">
<a name="1">[1]</a> <a href="#1" id=1>Agile tasks lists, what does done </a>
<a name="2">[2]</a> <a href="#2" id=2>In life just as at work, you may have had someone ask you the dreaded question Are you done yet?</a>
<a name="3">[3]</a> <a href="#3" id=3>That is why to ensure transparency and improve quality in an agile environment, the definition of done (DoD) must be clearly defined and have a consensus among the team.</a>
<a name="4">[4]</a> <a href="#4" id=4>We will walk through what the DoD is, an example of how to create a DoD and what value it brings to the sprint cycle.</a>
<a name="5">[5]</a> <a href="#5" id=5>According to the Agile Alliance and Institute (2014) the DoD is a list of criteria which must be met before a product increment often a user story is considered done. </a>
<a name="6">[6]</a> <a href="#6" id=6>The most important feature of the DoD is it keeps hidden work or scope creep from happening.</a>
<a name="7">[7]</a> <a href="#7" id=7>The DoD gets iteratively worked just like the user stories within each sprint. According to Scrum.org (2013), the Definition of Done is not changed during a Sprint, but should change periodically between Sprints to reflect improvements the Development Team has made in its processes and capabilities to deliver software.</a>
<a name="8">[8]</a> <a href="#8" id=8>Moreover, you will find the risk is reduced, teams are more focused, and communication between the client is better.</a>
<a name="9">[9]</a> <a href="#9" id=9>By making the DoD a way of life and committing to exceptional work, the client will be able to visualize what complete really is. </a>
</body>
</html>
<html>
<body bgcolor="white">
<a name="1">[1]</a> <a href="#1" id=1>Agile tasks lists, what does done mean in Agile.</a>
<a name="2">[2]</a> <a href="#2" id=2>Agile tasks lists, what does done mean in Agile.</a>
<a name="3">[3]</a> <a href="#3" id=3>According to the Agile Alliance and Institute 2014 the DoD is a list of criteria which must be met before a product increment often a user story is considered done.</a>
<a name="4">[4]</a> <a href="#4" id=4>It can be in the form of a “Done List” or a “Done Checklist”.</a>
<a name="5">[5]</a> <a href="#5" id=5>Furthermore, when making the list, you can fill the list s with initial categories and then put items into those categories.</a>
<a name="6">[6]</a> <a href="#6" id=6>The story completed category could contain tasks like programming done, testing done and/or documentation done.</a>
<a name="7">[7]</a> <a href="#7" id=7>It is recommended that the team members and the stakeholders come up with a current or realistic version and an ideal version.</a>
<a name="8">[8]</a> <a href="#8" id=8>In conclusion, a good DoD will change the “Are you done yet?” question into a much better question “Is the story completed yet?” Moreover, you will find the risk is reduced, teams are more focused, and communication between the client is better.</a>
<a name="9">[9]</a> <a href="#9" id=9>By making the DoD a way of life and committing to exceptional work, the client will be able to visualize what complete really is.</a>
<a name="11">[11]</a> <a href="#11" id=11>References.</a>
<a name="12">[12]</a> <a href="#12" id=12>Agile, A.</a>
<a name="13">[13]</a> <a href="#13" id=13>A 2013, February 13.</a>
<a name="14">[14]</a> <a href="#14" id=14>Guide to Agile Practices.</a>
<a name="16">[16]</a> <a href="#16" id=16>Scrum.org.</a>
<a name="17">[17]</a> <a href="#17" id=17>2014, April 25.</a>
<a name="18">[18]</a> <a href="#18" id=18>Improving the Profession of Software Development.</a>
<a name="20">[20]</a> <a href="#20" id=20>Life Cycle Image.</a>
<a name="22">[22]</a> <a href="#22" id=22>Definition of Done Image.</a>
<a name="23">[23]</a> <a href="#23" id=23>https://www.scrumalliance.org/community/articles/2014/january/why-using-a-definition-of- done-in-an-agile-project.</a>
<a name="24">[24]</a> <a href="#24" id=24>Categories: Agile.</a>
<a name="26">[26]</a> <a href="#26" id=26>Comment.</a>
<a name="27">[27]</a> <a href="#27" id=27>Search.</a>
<a name="28">[28]</a> <a href="#28" id=28>Join Me On Facebook.</a>
<a name="29">[29]</a> <a href="#29" id=29>Recent Posts.</a>
<a name="30">[30]</a> <a href="#30" id=30>LinkedIn profiles, how to use them, how to market yourself, how to network.</a>
<a name="31">[31]</a> <a href="#31" id=31>Find me on YouTube.</a>
<a name="32">[32]</a> <a href="#32" id=32>Categories.</a>
<a name="33">[33]</a> <a href="#33" id=33>Agile.</a>
<a name="34">[34]</a> <a href="#34" id=34>Project Management.</a>
<a name="35">[35]</a> <a href="#35" id=35>Skills.</a>
<a name="36">[36]</a> <a href="#36" id=36>Social Media.</a>
<a name="37">[37]</a> <a href="#37" id=37>Copyright 2015 CS200W Blog.</a>