Project Blog:
May 17, 2016
- We discussed that I will be getting my paper Friday or Monday
- I will get the corrections in as soon as I can
- Dr. Pollett wants to extend Yioop to support ROUGE
- I showed Dr. Pollett how to run ROUGE
- I will send Dr. Pollett all of my ROUGE runner scripts and my ROUGE input files
- I will send Dr. Pollett the C# code for my ROUGE file prepper
- I successfully completed my defense
- This is the last meeting of the semester
May 10, 2016
- We discussed that I still have not heard from GAPE; I will contact them this week
- We went over the presentation
- We cut some slides
- I need to move the overviews down to their respective experiment areas
- I moved the Yioop overview slide up
- We removed the future work bullet
- I need to have a stronger conclusion
- On the graph-based formula slide, try to elaborate on what the sum actually is
May 03, 2016
- We discussed that I need to submit my paper to turnitin.com
- The CS master's website has the information on its Project Guidelines page
- We discussed that I have not heard from GAPE
- We talked about scheduling my defense
- I will send my report to my committee and ask them for good times, suggesting May 17th at 10:00
- I then need to fill out the oral defense request form
- It must have two different dates and times
- Not two times on the same day
- I will get my presentation completed
April 26, 2016
- We discussed the overall goal of my thesis
- The goal for summarizing is to get the best summary
- My project was to look at the existing approaches and see which is best
- We tried things that had some effect and things that did not
- We discussed the defense presentation
- I should talk about each experiment briefly, with a one-slide overview
- I should talk about the background
- I should talk about Yioop!'s search engine
- I should talk about the types of summarizers
- I should talk about ROUGE in general
- I should talk about the coding I did in detail
- I should talk about the experiments I did
- Then conclusion
- I should be ready to talk for 35 minutes and then 15 minutes of questions
April 19, 2016
April 12, 2016
- I turned in my paper to the English department, so there was nothing to discuss this week
April 05, 2016
- Dr. Pollett did a quick review of my paper
- He suggested in the first paragraph of chapter 2 change Dr. Pollett to Dr. Christopher Pollett at San Jose State University
- He looked at the quotes and noticed I need to fix the start quotes to be two backticks
- He suggested I change Dr. Pollett's algorithm name to something about the average sentences that it compares
- He suggested I check that no "Dr." has two spaces after it
- I will read the entire SJSU thesis guidelines
- We discussed what is left to write
- I still need to write the abstract, acknowledgements, introduction and conclusion
- Dr. Pollett will review and send me his changes by early Friday
- We looked at the forms we need to submit
- I need to get the approval form to each committee member
March 29, 2016
- No meeting ... Spring Break
March 22, 2016
- Dr. Pollett suggested changing the chapter titles to make it read like a story
- The subheadings should be changed to reflect the original title
- For example, chapter 2's title would be: Understanding the Preprocessing before Summarization
- The aim would be: Stemming Using a Dutch Stemmer
- The overview would be: An Overview of How Stemmers Work
- The work performed would be: Writing and Testing a Dutch Stemmer or Implementing a Dutch Stemmer
- The results would be: Experiments with a Dutch Stemmer
- I need to swap chapter 2 and 3
- The title of the paper will be: Experiments with and Implementation of a Context Sensitive Text Summarizer
- We discussed that I will be adding more content where necessary to fill in any gaps
- I will include result files and configurations in the appendix
- Update the Human Generated Input and System Generated Input to be ROUGE Human Generated Input and ROUGE System Generated Input
- We discussed the defense
- We want to get it done early so we can stay away from the end of the semester rush
March 15, 2016
- We talked about the content I posted for the sentence compression experiment
- Dr. Pollett was okay with all but one sentence and he suggested I reword it
- We talked about how we will organize my final paper
- I will have to come up with an abstract
- I will have a one to two page introduction
- The section headers will be: Intro; Background (consisting of an overview, summarization, ROUGE, stemming, CS 297 work, etcetera); Experiments; Integrations; and a two-page conclusion
- I will work on putting the paper together and Dr. Pollett will review it next week
March 08, 2016
- We talked about the sentence compression
- Dr. Pollett checked in the code I wrote for sentence compression
- We discussed how the way Yioop stores phrases may have caused the sentence compression to not be as successful as it could have been
- At this point we are ready to write the paper
- The goal of the paper should be we found some approaches that worked and some things that did not work
- I will start writing the paper in LaTeX so I do not run into problems at the end
- I will look at SJSU ScholarWorks to see more examples
March 01, 2016
- We discussed the sentence compression method I implemented as described on 2/16/2016, using 4 items from the Back to Basics: CLASSY 2006 paper
- Dr. Pollett suggested I change some of my multi-word variable names to conform to his standard
- He also noted that the TextProcessor.php file needs to be updated so it uses the sentence compression algorithm
- While looking at the TextProcessor.php file we noticed that it is only using the basic summarizer
- I will update the TextProcessor.php file to utilize all of the summarizers just like the HtmlProcessor.php file does
- Next I need to run the ROUGE tests, compare the results and write it up
- I also need to create a patch as soon as all of my code is completed
February 23, 2016
- We discussed my progress in general this week
- I was not able to make any progress because of work I was doing for my CS 280 class
- We agreed that I will take a few weeks off from CS 280 so we can get the rest of the CS 299 work completed
February 16, 2016
- We discussed the resources I read about sentence compression
- I found 8 papers on sentence compression
- Most of them were too complex for what we are trying to do
- 1 of them was simple and I will implement it
- My changes will occur as follows
- In the en-us tokenizer put a sentence compression method
- Add a static reference for compressing the sentence that calls the current sentence compression method in the tokenizer
- Call it PhraseParser::sentenceCompression and invoke it in each summarizer as each sentence is added
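The plan above could look roughly like the following Python sketch (the project code itself is PHP); the trimming rules here are hypothetical stand-ins, loosely in the spirit of the CLASSY 2006 paper, not the exact rules chosen for the project.

```python
import re

# Hypothetical rule-based sentence compression pass; the actual rules
# implemented in the project may differ.
LEAD_WORDS = ("However,", "Moreover,", "In addition,", "Also,")

def compress_sentence(sentence):
    s = sentence.strip()
    # Drop a discourse connective at the start of the sentence
    for lead in LEAD_WORDS:
        if s.startswith(lead):
            s = s[len(lead):].lstrip()
            break
    # Remove parenthetical asides, e.g. "(version 2)"
    s = re.sub(r"\s*\([^)]*\)", "", s)
    # Collapse any doubled spaces left behind
    return re.sub(r"\s{2,}", " ", s).strip()

print(compress_sentence("However, the summarizer (version 2) ran quickly."))
# prints: the summarizer ran quickly.
```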
February 09, 2016
- We discussed the state of deliverable #3 and decided it was a bust
- I will document our experiments with the Lanczos algorithm over the next few weeks
- We found that the English department deadline for the master's thesis is April 8th
- I will start writing the thesis by March 11th
- We discussed my last deliverable and it will be on sentence compression
- I will search the Nenkova paper for anything that has been done with sentence compression
- Next week we will go over what I read and finalize what we are going to do for sentence compression
- We are going to experiment with and without sentence compression
- I contacted Shawn Tice by email
- He gave me some solid feedback that I can use on the thesis paper
- Next week we will fill out the candidacy form and submit it
December 16, 2015
- We discussed my CS280 Proposal
- We made a few edits and completed it
- I will mail it to the CS department tomorrow
- We discussed the Lanczos algorithm further
- We confirmed what we were already doing wrong
- We looked into the alternative method
- We need to get more clarification from the original paper written by Paige
- I will see if I can find the paper or check it out in the library
December 11, 2015
- We discussed the Hachey and Gong paper
- Neither of them helps us with a better method to calculate the SVD
- Dr. Pollett has a book with another method in it
- Dr. Pollett will scan the section and email it to me
- We will meet again after I read the scanned items Dr. Pollett sends me
December 07, 2015
- We went over my findings with the current Lanczos algorithm I converted
- I found that there is a point where the repeated dot product calculation produces values of infinity
- These results destroy the process to summarize the content
- Dr. Pollett and I read that this algorithm can run into this problem, and this algorithm is indeed faithful to its flaw
- We decided not to pursue it any further
- We will either attempt to implement an approach (Hachey) that does not suffer from this known failure, or move on and call Lanczos a bust
- We talked about what needs to be done for CS280
- We wrote up a summary so I can write a proposal and turn it in to the department
- We wrote my paragraph summarizing the semester
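The failure mode discussed above can be illustrated with a minimal pure-Python Lanczos sketch on a symmetric matrix; this is a hedged illustration with a guard for the blow-up, not the project's converted Java code.

```python
import math
import random

# Pure-Python sketch of the Lanczos iteration on a symmetric matrix
# (given as a list of rows), with a guard for the failure we observed:
# without reorthogonalization the repeated dot products can blow up.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def lanczos(A, k):
    n = len(A)
    random.seed(0)
    q = [random.random() for _ in range(n)]
    norm = math.sqrt(dot(q, q))
    q = [x / norm for x in q]
    q_prev = [0.0] * n
    beta = 0.0
    alphas, betas = [], []
    for _ in range(k):
        w = [wi - beta * qp for wi, qp in zip(matvec(A, q), q_prev)]
        alpha = dot(q, w)  # the repeated dot product that overflowed
        w = [wi - alpha * qi for wi, qi in zip(w, q)]
        beta = math.sqrt(dot(w, w))
        if not math.isfinite(beta) or beta < 1e-12:
            break  # breakdown or numerical blow-up: stop early
        alphas.append(alpha)
        betas.append(beta)
        q_prev, q = q, [wi / beta for wi in w]
    return alphas, betas
```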
November 30, 2015
- I demonstrated the Lanczos algorithm PHP code I converted from Java
- I showed that I can produce the same result from the same input
- We also tried to run the original summarizer on some arbitrary text
- It did not run as expected
- We were getting NaN results and I am going to look into it
- We reviewed my latest patch and Dr. Pollett committed the code to the global repository
- We talked about next semester
- I need to post my deliverables
- I need to write a conclusion of this semester
- Next week we will come up with the CS280 summary for the proposal submittal on the CMS Detector GUI topic
November 23, 2015
- I sent the drafts of the deliverables to Dr. Pollett
- He took a quick look at my papers and said they are okay
- I had questions about the QR decompose method that was being used in the Lanczos algorithm I was converting
- Dr. Pollett explained the QR decompose to me and it made sense why my conversion of the code was failing
- Dr. Pollett recommends padding the triangular array with 0s below the diagonal
- We are near the end of the semester so we will spend time discussing what is next in our next meeting
- I will continue to work on converting the Lanczos algorithm and hopefully complete the conversion
November 16, 2015
- We discussed my current status on writing the Lanczos algorithm
- I had converted the code and am troubleshooting a bug
- We troubleshot a little with no success
- We discussed the write-ups for deliverables 1 and 2
- This week I will proofread what I have written and send them to Dr. Pollett for review
November 09, 2015
- We discussed the current status of the thesis
- I completed the updates to the patch I submitted
- I started writing the PHP code for implementing the Lanczos algorithm
November 02, 2015
- We discussed the current status of the thesis
- I still need to make updates to the patch I submitted
- I wrote a rough draft for deliverable 2
- I looked into the Lanczos algorithm and understand what needs to be done
October 26, 2015
- We discussed the current status of the thesis
- I need to make updates to the patch I submitted
- I need to resave the test input files because they have strange unicode characters in them
- I need to remove the dependency on the simple_dom class
- I need to write a rough draft for deliverable 2
- For Deliverable 3, we are going to choose to implement the Lanczos algorithm instead
October 19, 2015
October 12, 2015
- We looked at the HTML processor and I showed that all summarizers are using the detector
- We looked at the ROUGE results to compare against the non-detected scores
- We looked at the DUC scores and noticed the graph one had the best numbers for the ROUGE 1 test
- Next I will create a patch and write up the documentation
October 05, 2015
- We discussed if I can create another CMS detector
- I will try to create a CMS detector for Drupal
- We discussed the work I have done to figure out the DUC document ROUGE tests
- I will finish the code to get the summaries in order and run the DUC document ROUGE tests for each summarizer
- We discussed how we can quantify and understand why the weighting is not affecting the results
- Up to this point, tag weighting or learning weights is not going to buy us anything
- We may try looking into weighting based on parts of speech
- We think we know what we can quantify
- Scores did not increase with weighting but went way up with the detector, because summarizing the important content is the silver bullet
- The above is quantifiable
- We discussed the coding aspect of the detectors
- Dr. Pollett feels manually coding a detector is fine, but can code learn a CMS detection system?
- Can we figure out a way to set up such a system so that we do not have to code any further CMS detectors?
- If I have time, I will review some Drupal and Wordpress sites to see if I can find the tags that are in the content div and count them
September 28, 2015
- I have completed the first CMS detector
- It is for Wordpress because the documents we are using for our sample just happen to be built by Wordpress
- I have also written tests for the Wordpress CMS detector
- The test checks the headers of 8 files to see if they are Wordpress or not
- The files do not have any content in the body, only the headers
- The tests were difficult because I had to configure composer and PhantomJS
- We discussed how CMS framework detectors will be enabled
- I will change the CMS detector files to have a distinct naming convention
- Each detector in the directory will be loaded based on that distinct file name
- We discussed that each detector will have a method that will return the important section of the html to remove extraneous data from being summarized
- We discussed that each detector will add additional weights to content that is in certain tags based on the work we did last semester
- The only caveat is that I will remove the a tag from the query, then weight each in increments of 1 for starters
- I figured out how the DUC documents are constructed
- There is a cluster of documents that are summarized as a whole
- Then the first 100 words are extracted for the summary used in the ROUGE tests
- When I do the DUC ROUGE test, I will remove the summaries from the other contestants and add ours
- To get to the ROUGE tests, code needs to be written so I can run our summarizer against the data
- The clustered documents are XML files with a .scu extension
- I wrote code to convert the .scu files to a flat text file
- Dr. Pollett suggested I change my uppercase boolean values to lower case
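A header-based Wordpress check like the one described above could be sketched as follows in Python (the actual Yioop detector is PHP); the fingerprints used are common Wordpress signals and are assumptions, not taken verbatim from the project code.

```python
import re

# Hedged sketch of a header-based Wordpress detector: it inspects only
# the <head> section, matching the test files that had headers but no
# body content. The fingerprint patterns below are illustrative.
WORDPRESS_SIGNS = (
    re.compile(r'<meta[^>]*name=["\']generator["\'][^>]*wordpress', re.I),
    re.compile(r'wp-content/', re.I),
    re.compile(r'wp-includes/', re.I),
)

def is_wordpress(html):
    head = re.search(r"<head.*?</head>", html, re.S | re.I)
    section = head.group(0) if head else html
    return any(sign.search(section) for sign in WORDPRESS_SIGNS)
```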
September 21, 2015
- We discussed the ROUGE results of the new Weighted Centroid Summarizer (WCS)
- The results were better than my Graph Based Summarizer (GBS) but not the Centroid Based Summarizer (CBS) or Basic Summarizer (BS)
- We discussed the next steps for the WCS
- I will work on a framework that can detect a web page's framework
- For example, a site could be built using Wordpress or Drupal etc
- Then we would check to see where each framework puts its most important content and weight that content higher
- Dr. Pollett suggested I start looking for where the getPages() method is called and insert the code there and pass its detection downstream
- After the work is completed I will rerun the ROUGE tests
- If I get time I will look more into the DUC data
September 14, 2015
- We reviewed the code I wrote up to this point
- Dr. Pollett feels the code is solid and now we want to work on adding weights
- Dr. Pollett would like me to write code to detect a page's framework
- For example, detect if a page was deployed via Wikipedia or Wordpress
- After the framework is detected we would do searches in the page source
- External configurations would be needed, because certain tags would weigh more for different frameworks
- Each tag of importance would then get an increased weight
- This would also help gather statistics about what percentage of web sites are really wordpress or wikipedia etcetera
- Since the code is good up to this point, I will run the ROUGE tests for the weighted summarizer and see how the results compare to the other Yioop summarizers
- Lastly, we looked more at the DUC data
- We still have not found a good way to automate the extraction of the content and their summaries
- We did find that there is some correlation between the file names of the summary and the folder of the content
September 09, 2015
- We discussed what was left to do on the weighted centroid summarizer
- We looked deep into the current summarizer because Dr. Pollett thought I redid what was already done in the current centroid summarizer
- After an in depth look at the current centroid summarizer, we found out that it was not what I had written
- It was something similar using Inverse Document Frequency (IDF) and cosine similarity
- I am to create a vector of frequencies in each sentence and then calculate how close each sentence resembles the average sentence
- We got the DUC documents
- We looked at the documents briefly and chose to look at them more closely during the next meeting
August 31, 2015
- Discussed next week's meeting day because Monday, September 7 is Labor Day
- We will meet on Wed next week instead
- We discussed getting access to the AQUAINT, TIPSTER and TREC datasets
- Dr. Pollett ran it by the chair and he was good with it
- Dr. Pollett filled out the forms
- I will scan them when I get home and email them to Dr. Pollett
- We discussed the Yioop implementation of Composer
- Yioop uses it to keep its libraries current
- I will be referencing its configuration when I add code to the Yioop repository
- Since we have to wait for the datasets to be provided, we discussed working on weighting the centroid based algorithm
- I will create a new WeightedCentroidBased summarizer
- I need to create or verify:
- There is a vector where each row in the vector (associative array) is the frequencies of the words in that sentence
- There is a normalized vector and its goal is to have each value be between zero and one, with the squared values adding up to 1
- To get it normalized we divide the original by the square root of the sum of the values squared for every row (sentence)
- There is an average sentence vector, which adds the columns and divides by the number of rows to get the average sentence
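The vector steps above can be sketched in Python (the actual summarizer is PHP), assuming each sentence row is a word-frequency associative array.

```python
import math

# Sketch of the normalization and average-sentence steps described
# above; each row is a dict mapping word -> frequency in that sentence.
def normalize(row):
    # Divide each value by the square root of the sum of squares
    norm = math.sqrt(sum(v * v for v in row.values()))
    return {w: v / norm for w, v in row.items()} if norm else row

def average_sentence(rows):
    # Add the columns, then divide by the number of rows
    totals = {}
    for row in rows:
        for w, v in row.items():
            totals[w] = totals.get(w, 0.0) + v
    return {w: v / len(rows) for w, v in totals.items()}
```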
August 25, 2015
- We confirmed our meeting time to be Mondays at 12:00
- We discussed that I need to download the latest Yioop git repository
- I need to check http://www.seekquarry.com/p/Composer to read about the new composer Dr. Pollett implemented
- I need to start on deliverable 1
- I need to find a large document set
- I was thinking of using the DUC 2002 data set
May 13, 2015
- We went over my report
- I need to fix some typos, reword a few things, rewrite the first three sentences of the introduction and resubmit it
- We discussed that I have confirmed my two instructors to be on my CS299 committee
- We looked over my CS299 proposal
- Dr. Pollett put his rubber stamp on it
- We also talked about the Summly summarizing algorithm and that we may not be able to get a hold of it
- I need to start filling out the CS299 paperwork and put it in Dr. Pollett's box to get the proper signatures
May 6, 2015
- We went over the patch I created
- I added the ability to output the summarizer results to a file
- Dr. Pollett suggested I move the Boolean flag and output folder variable to the top of the file as constants
- Dr. Pollett also scanned through the code and suggested I change my camel case variable names to multi-word variable names with an underscore between each word
- We discussed my CS297 final report
- Dr. Pollett will review it and give me some feedback next week
- At that point, I will update the report and resubmit it
- We discussed what is needed to begin CS299
- By next week I need to have solicited instructors to be on my CS 299 committee
- I need 2 instructors
- I also need to fill out the 2 page CS299 proposal template
- I will use the last paragraph of my final report for the CS299 proposal, along with bullet points
Apr 29, 2015
- We reviewed the regex again because it was not matching multiples
- We decided to add parentheses to the inside of the anchor tags; (?s)<a.*?>(.*?the.*?)<\/a> and then use the substr_count() function to get an accurate frequency count
- I will update the code and the test cases to match
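In Python terms (the project used a PHP regex with substr_count()), the fix amounts to something like the following hedged sketch; the capture group here is generalized to (.*?) for illustration.

```python
import re

# Capture each anchor body with a non-greedy group, then count the word
# inside each captured body, mirroring the substr_count() idea above.
ANCHOR = re.compile(r"(?s)<a.*?>(.*?)</a>")

def anchor_word_count(html, word):
    return sum(body.count(word) for body in ANCHOR.findall(html))

html = '<a href="#">the cat and the dog</a><p>the</p><a href="#">the end</a>'
print(anchor_word_count(html, "the"))  # prints: 3 (ignores the <p> text)
```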
- We talked about the CS 297 report and the CS 299 topic
- I need to use the following format
- A one page intro
- A paragraph describing each section
- Trim what is posted to two pages for each deliverable
- A one page conclusion that encompasses what will be done in CS 299
- For CS 299, we want to keep improving on summarizers
- We want to do experiments that will generate a paper
- The key will be to get the paper done well before graduation
- We want to compare a few different summarizer methods, one of which is different from any that have been looked at before
- For example, a graph summarizer, the centroid summarizer with weights and look into the one the teenager sold to Yahoo
- We also want to do more extensive experiments, like using millions of documents, to basically use a larger data set
Apr 22, 2015
- We discussed and confirmed that I had the correct regular expression to find matching words in the 6 html tags we plan on using for weighting
- We talked about the weighting algorithm in detail because it does not seem to be contributing to new summary results
- I convinced Dr. Pollett that I am using the weighting algorithm correctly for the graph based summarizer
- Since the weighting algorithm has not produced new results Dr. Pollett is suggesting two experiments
- Make the summaries shorter, for example 500 bytes
- Use the summaries Mangesh Dahale used in his research in order to get a set with more html tags
- Make some unit tests for the weighting algorithm to make sure it is producing the correct output
Apr 15, 2015
- We looked into how to automate web site input to generate summaries for the regression testing
- Dr. Pollett pointed me to the crawl_component.php file to look at the pageOptions method
- $_REQUEST['option_type'] value has to be test_options
- $_REQUEST['TESTPAGE'] value has to be the data (summary text) to test
- $_REQUEST['page_type'] value has to be text/html
- Dr. Pollett suggested I change the name of the basic_summarizer.php file to scrape_summarize.php
- Dr. Pollett suggested I change the base_summarizer.php file to just summarizer.php
- Dr. Pollett suggested I change the regexes to handle new lines
- Dr. Pollett suggested I change the regex not to match on string like: <h1>ssfsd</h1>do not match this<h1>ssfsd</h1>
- Dr. Pollett wants me to calculate all weights and then use the formula: sum of weights times frequency in that tag, to get the additional weight
- My Queue server is not starting and I need to check starting the queue server from the command line
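The additional-weight formula above could be sketched as follows in Python (the project code is PHP); the tag weights here are illustrative placeholders, not the project's values.

```python
import re

# For a term, sum over the tags of (tag weight x frequency of the term
# in that tag), per the formula described above.
TAG_WEIGHTS = {"h1": 3, "h2": 2, "b": 1}

def additional_weight(html, term):
    total = 0
    for tag, weight in TAG_WEIGHTS.items():
        # (?s) so tag bodies spanning new lines are matched too, and a
        # non-greedy group so text between two tags is not captured
        bodies = re.findall(r"(?s)<%s.*?>(.*?)</%s>" % (tag, tag), html)
        total += weight * sum(body.count(term) for body in bodies)
    return total

print(additional_weight("<h1>the title</h1><p>the body</p><b>the</b>", "the"))
# prints: 4  (3*1 from the h1 plus 1*1 from the b)
```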
Apr 8, 2015
- We talked about the paper I found and the weighting algorithm we are going to implement
- We plan on using 6 categories which was described in the paper
- I will do some regression testing
- The plan is to start each category at 0 and increase each value until we do not see an improvement
- Then repeat by setting each to the base and incrementing one category at a time
- After we get the results, I will try to use those values in combination
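The tuning plan above resembles a simple coordinate ascent; here is a hedged Python sketch in which rouge_score stands in for a real ROUGE evaluation run (the names are hypothetical).

```python
# Start every tag category at 0 and raise one category at a time while
# the score keeps improving, per the regression-testing plan above.
def tune_weights(categories, rouge_score, max_weight=10):
    weights = {c: 0 for c in categories}
    for cat in categories:
        best = rouge_score(weights)
        while weights[cat] < max_weight:
            weights[cat] += 1
            score = rouge_score(weights)
            if score <= best:
                weights[cat] -= 1  # no improvement: back off and stop
                break
            best = score
    return weights

# Toy score whose optimum is h1=3, b=1, just to exercise the search
toy = lambda w: -(w["h1"] - 3) ** 2 - (w["b"] - 1) ** 2
print(tune_weights(["h1", "b"], toy))  # prints: {'h1': 3, 'b': 1}
```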
- We discussed how to organize the summarizers to include the weighting algorithm
- Since the weighting algorithm will be common to all summarizers that use term frequencies I will move the summarizers to their own folder
- The basic summarizer is not in its own file, so I will find the basic code and make it its own file
- Then I will create a base class that has weighting method(s)
- I need to update the few places where it loads these files by searching for the require lines
- We also discussed a long term goal of adding ROUGE to Yioop
- The goal would be to get Yioop to participate in text retrieval contests and make a paper out of it for the Text REtrieval Conference (TREC)
Apr 1, 2015
- We took a look at the graph-based summarizer I wrote
- Dr. Pollett did a few tests, made some minor modifications and committed the new code to the repository
- We discussed my Deliverable 3 post
- Dr. Pollett made a suggestion to add a for example sentence in the conclusion about the BS and CBS ROUGE results
- We went over the two page rank fixes to some known issues
- Issue one was when all web pages point to one page and that web page does not point out to other web sites
- Issue two was when you have a disconnected cluster of web pages
- After our discussion, we determined that we did not suffer from those issues and chose not to include the fixes
- We discussed what I would do for Deliverable 4
- We looked into the current centroid code to see where I need to make the changes
- I believe I know where the changes need to be made
- I have to make sure I do not slow down the algorithm, as it is already pretty slow
- The idea is to add weights to certain terms depending on where they are in the html document
- For example, increase the word frequency for a word in an h1 tag by 3
- I will do some research to see if there is a method to this madness solved already
- I will check to see what the most important spots in an HTML doc are, by tag type
- I may try to weight by where the word appears: beginning, middle or end
- I could also try to use the sentence length to add weight
- The goal is to increase the frequency of those words thus increasing the weight
- I can also look at Fall 2012 September 12 slide 4 of Dr. Pollett's CS 267 class
Mar 25, 2015
- No meeting ... Spring Break
Mar 18, 2015
- I completed most of the coding on the Graph Based Summarizer
- We discussed the sentence rank algorithm part
- After the discussion we had the code needed to complete the algorithm
- In order to get the new Graph Based summarizer into Yioop we went over how to search the repository
- I will use the code_tool.php script to help with that
- For example, to find where the centroid summarizer is hooked in, search using "php code_tool.php search .. centroid" from the bin directory of the code base
- We also discussed that after I get it into Yioop, I will perform the same summary tests as in Deliverable 1
Mar 11, 2015
- We reviewed the presentation I made for the graph based algorithm I am working on
- We went over the page rank algorithm and adjacency matrix portion thoroughly
- Dr. Pollett recommended I look at Who's #1? and PageRank and Beyond by Amy Langville and Carl Meyer
- We discussed the code I will be writing
- I will leverage the existing stop word, sentence splitter and punctuation methods in the PhraseParser
- There is also a frequency matrix creator that I may use in the PhraseParser
May 6, 2015
- We went over the patch I created
- I added the ability to output the summarizer results to a file
- Dr. Pollett suggested I move the Boolean flag and output folder variable to the top of the file as constants
- Dr. Pollett also scanned through the code suggested I change my camel case variable names to multi-word variable names with an underscore between each word
- We discussed my CS297 final report
- Dr. Pollett will review it and give me some feedback next week
- At that point, I will update the report resubmit it
- We discussed what is needed to begin CS299
- By next week I need to have solicited instructors to be on my CS 299 committee together
- I need 2 instructors
- I also need to fill out the 2 page CS299 proposal template
- I will use the last paragraph of my final report for CS299 and bullet points
Apr 29, 2015
- We reviewed the regex again because it was not matching multiples
- We decided to add parenthesis to the inside of the anchor tags; (?s)<a.*?>(.*?the.*?)<\/a> and then use substr_count() function to get an accurate frequency count
- I will update the code and the test cases to match
- We talked about the CS 297 report and the CS 299 topic
- I need to use the following format
- A one page intro
- A paragraph describing each section
- Trim what is posted to two pages for each deliverable
- and a one page conclusion that encompasses what will be done in CS 299
- For CS 299, we want to keep improving on summarizers
- We want to do experiments that will generate a paper
- The key will be to get the paper done well before graduation
- We want to compare a few different summarizer methods one of which is different than any that others that have been looked at
- For example, a graph summarizer, the centroid summarizer with weights and look into the one the teenager sold to Yahoo
- We also want to do more extensive experiments like using millions of documents to basically use a larger data set
Apr 22, 2015
- We discussed and confirmed that I had the correct regular expression to find matching words in the 6 html tags we plan on using for weighting
- We talked about the weighting algorithm in detail because it does not seem to be contributing to new summary results
- I convinced Dr. Pollett that it I am using the weighting algorithm correctly for the graph based summarizer
- Since the weighting algorithm has not produced new results Dr. Pollett is suggesting two experiments
- To make the summaries shorter for example, 500 bytes
- Use the summaries Mangesh Dahale used in his research in order to get a set with more html tags
- Make some unit tests for the weighting algorithm to make sure it is producing the correct output
Apr 15, 2015
- We looked into how to automate web site input to generate summaries for the regression testing
- Dr. Pollet pointed me to the crawl_component.php file to look at the pageOptions method
- $_REQUEST[option_type] value has to be test_options
- $_REQUEST[TESTPAGE] value has to be the data (summary text) to test
- $_REQUEST[page_type] = text/html
- Dr. Pollett suggested I change the name of the basic_summarizer.php file to scrape_summarize.php
- Dr. Pollett suggested I change the base_summarizer.php file to just summarizer.php
- Dr. Pollett suggested I change the regexes to handle new lines
- Dr. Pollett suggested I change the regex not to match on string like: <h1>ssfsd</h1>do not match this<h1>ssfsd</h1>
- Dr. Pollett wants me to calculate all weights then and use the formula: sum of weights times frequency in that tag to get the additional weight
- My Queue server is not starting and I need to check starting the queue server from the command line
Apr 8, 2015
- We talked about the paper I found and the weighting algorithm we are going to implement
- We plan on using 6 categories which was described in the paper
- I will do some regression testing
- The plan is to start each category at 0 and increase each value until we do not see an improvement
- Then repeat by setting the each to the base and increment one category at a time
- After we get the results, I will try to use those values in combination
- We discussed how to organize the summarizers to include the weighting algorithm
- Since the weighting algorithm will be common to all summarizers that use term frequencies I will move the summarizers to their own folder
- The basic summarizer is not in its own file so I will find the basic code and make its own
- Then I will create a base class that has weighting method(s)
- I need to update the few places where it loads these files by searching for the require lines
- We also discussed a long term goal of adding ROUGE to Yioop
- The goal would to get Yioop to participate text retrieval contests and make a paper out of it for the Text REtrieval Conference (TREC)
Apr 1, 2015
- We took a look at the graph-based summarizer I wrote
- Dr. Pollett did a few tests, made some minor modifications and committed the new code to the repository
- We discussed my Deliverable 3 post
- Dr. Pollett suggested adding an example sentence in the conclusion about the BS and CBS ROUGE results
- We went over the two page rank fixes to some known issues
- Issue one was when all web pages point to one page and that page does not link out to any other web sites
- Issue two was when you have a disconnected cluster of web pages
- After our discussion, we determined that we did not suffer from those issues and chose not to include the fixes
- We discussed what I would do for Deliverable 4
- We looked into the current centroid code to see where I need to make the changes
- I believe I know where the changes need to be made
- I have to make sure I do not slow down the algorithm, as it is already pretty slow
- The idea is to add weights to certain terms depending on where they are in the html document
- For example, increase the word frequency for a word in an h1 tag by 3
- I will do some research to see if there is a method to this madness solved already
- I will check which are the most important spots in an HTML doc by tag type
- I may try to weight by where the word appears; beginning, middle or end
- I could also try to use the sentence length to add weight
- The goal is to increase the frequency of those words thus increasing the weight
- I can also look at Fall 2012 September 12 slide 4 of Dr. Pollett's CS 267 class
Mar 25, 2015
- No meeting ... Spring Break
Mar 18, 2015
- I completed most of the coding on the Graph Based Summarizer
- We discussed the sentence rank algorithm part
- After the discussion we had the code needed to complete the algorithm
- In order to get the new Graph Based summarizer into Yioop we went over how to search the repository
- I will use the code_tool.php script to help with that
- For example, to find where the centroid summarizer is hooked in, run "php code_tool.php search .. centroid" from the bin directory of the code base
- We also discussed that after I get it into Yioop, I will perform the same summary tests as in Deliverable 1
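The graph-based sentence ranking discussed above can be roughly sketched as follows (a TextRank-style illustration in Python, an assumption about the general design rather than the actual summarizer code): build an adjacency matrix from word overlap between sentences, run a damped power iteration, and keep the top-ranked sentences.

```python
def summarize(sentences, top_k=2, iters=30):
    """Rank sentences by word-overlap similarity, TextRank style."""
    words = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    # adjacency matrix: word overlap between each pair of sentences
    sim = [[len(words[i] & words[j]) if i != j else 0 for j in range(n)]
           for i in range(n)]
    score = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            total = 0.0
            for j in range(n):
                out = sum(sim[j])
                if sim[j][i] and out:
                    total += score[j] * sim[j][i] / out
            new.append(0.15 / n + 0.85 * total)
        score = new
    ranked = sorted(range(n), key=lambda i: -score[i])[:top_k]
    return [sentences[i] for i in sorted(ranked)]

# Keeps the two similar sentences and drops the isolated one:
print(summarize(["the cat sat", "the cat ran", "dogs bark"]))
```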
Mar 11, 2015
- We reviewed the presentation I made for the graph based algorithm I am working on
- We went over the page rank algorithm and adjacency matrix portion thoroughly
- Dr. Pollett recommended I look at Who's #1 and Page Rank and Beyond by Amy Langville and Carl Meyer
- We discussed the code I will be writing
- I will leverage the existing stop-word, sentence-splitter and punctuation methods in the PhraseParser
- There is also a frequency matrix creator that I may use in the PhraseParser
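For reference, a frequency matrix of the kind mentioned above is just a sentence-by-term count table; a minimal sketch (not the PhraseParser implementation):

```python
from collections import Counter

def frequency_matrix(sentences):
    """Build a sentence-by-term frequency matrix: one row per sentence,
    one column per vocabulary term, sorted alphabetically."""
    counts = [Counter(s.lower().split()) for s in sentences]
    vocab = sorted(set(t for c in counts for t in c))
    return [[c[t] for t in vocab] for c in counts], vocab

matrix, vocab = frequency_matrix(["the cat sat", "the dog sat"])
print(vocab)   # ['cat', 'dog', 'sat', 'the']
print(matrix)  # [[1, 0, 1, 1], [0, 1, 1, 1]]
```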
Mar 6, 2015
- Dr. Pollett helped me generate the patch for the Dutch stemmer I wrote
- I updated the bug/issue I opened to include the patch
- Dr. Pollett gave the code his initial okay
- I was reminded that I need to complete the contributor agreement
- After creating the first patch, we looked at what needed to be changed to have nl (Nederlands) available as a default locale option
- I will edit the createdb.php file and the config.php file with the appropriate changes
- I will generate another patch and post it to the bug/issue
- We discussed the new deliverable
- I need to use Google Scholar to find papers that have cited "The Automatic Creation of Literature Abstracts" to get better references
- After finding some new references, I will create a presentation (at most 8 slides) that summarizes the paper in order to demonstrate my understanding
- We reviewed my deliverable #2 posting
- Dr. Pollett gave me a few suggestions for referencing websites for better SEO
- Dr. Pollett also suggested I end each entry in my references section with a period
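The Dutch stemmer above follows the Snowball algorithm; as a rough illustration of the suffix-stripping idea only, here is a toy sketch (the real Dutch algorithm has R1/R2 regions, undoubling, and vowel rules that this omits, and the suffix list here is made up):

```python
# A few Dutch-like endings, for illustration only
SUFFIXES = ["heden", "ene", "en", "se", "s"]

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least
    three characters; otherwise return the word unchanged."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(toy_stem("katten"))  # "katt" (the real stemmer would undouble to "kat")
print(toy_stem("boek"))    # "boek" (no suffix matches)
```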
Feb 25, 2015
- We looked at my configuration.ini file for my Nl locale again
- I was instructed to remove all single quotes from the file
- I was instructed to remove all of the backslashes
- We set my locale to Nl on my instance of the Yioop search engine and tested my locale
- During the test we noticed some of the converted strings needed attention
- We ran a search and it errored out because of the missing segment() method, which I will fix
- We discussed entering an issue into the Mantis site
- I created an account and Dr. Pollett upgraded me so I can post patches
- I will follow the instructions for making a patch
- We looked at validating the HTML files I have been posting, e.g. bio, proposal, etc.
- I will remove all of the validation errors on all existing pages
- I will continue to validate any page I add or update
- I will email Dr. Pollett the paper I found for deliverable #3
Feb 18, 2015
- We discussed the work I have done on the stemmer so far
- We discussed that I need to add unit tests based on the current Yioop standards
- We looked at my locale configuration.ini file
- The Yioop UI is not showing all of the strings translated
- I have to quote every string and see if the Yioop UI shows all of the settings
- We discussed the Yioop contributor agreement
- I will read it and sign or not sign it
- If I do sign it, I will be able to upload my code for Dr. Pollett to review and post to the production release
- If I do not sign it I will not be able to get any code I write into the production release
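Yioop has its own unit-test conventions; purely as an illustration of the kind of table-driven test the stemmer needs, here is a sketch in Python's unittest (the stem function is a hypothetical stand-in, and the word pairs echo the Snowball approach of checking against published vocabulary/output files):

```python
import unittest

def stem(word):
    """Stand-in for the Dutch stemmer under test (hypothetical)."""
    return word[:-2] if word.endswith("en") and len(word) > 4 else word

class DutchStemmerTest(unittest.TestCase):
    # Table-driven: each pair is (input word, expected stem)
    CASES = [("katten", "katt"), ("huis", "huis")]

    def test_known_pairs(self):
        for word, expected in self.CASES:
            self.assertEqual(stem(word), expected)

unittest.main(argv=["stemmer_test"], exit=False)
```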
Feb 11, 2015
- We reviewed my ROUGE configuration and what it took to get meaningful results
- We reviewed my results for deliverable 1; particularly the ROUGE results
- We discussed what would be due for each deliverable
- We discussed what needs to be worked on for the next deliverable
- We looked at what code needs to be manipulated within the Yioop search engine to incorporate the new stemmer
- We discussed how to create a new locale with the Yioop search engine
- We discussed getting some sample code from snowball.tartarus.org
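For the ROUGE results mentioned above, the core measure is easy to state: ROUGE-N recall is the clipped n-gram overlap divided by the number of n-grams in the reference summary. A minimal sketch for ROUGE-1 (an illustration, not the official ROUGE script):

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """ROUGE-1 recall: clipped unigram overlap divided by the
    number of unigrams in the reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[t], ref[t]) for t in ref)
    return overlap / sum(ref.values())

print(rouge_1_recall("the cat sat", "the cat sat down"))  # 0.75
```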
Feb 4, 2015
- We got this site up and we went over how to view/modify its items
- We decided on a title for the project
- I entered my meeting time in the correct place on the wiki page
- We reviewed the work I have done for my deliverable due next week
- We discussed the correct approach for handling my CS297 and CS299