Project Blog:

May 17, 2016

  • We discussed that I will be getting my paper back on Friday or Monday
  • I will get the corrections in as soon as I can
  • Dr. Pollett wants to extend Yioop to support ROUGE
  • I showed Dr. Pollett how to run ROUGE
  • I will send Dr. Pollett all of my ROUGE runner scripts and my ROUGE input files
  • I will send Dr. Pollett the C# code for my ROUGE file prepper
  • I successfully completed my defense
  • This is the last meeting of the semester

May 10, 2016

  • I mentioned that I still have not heard from GAPE; I will contact them this week
  • We went over the presentation
  • We cut some slides
  • I need to move each overview down to its experiment area
  • I moved the Yioop overview slide up
  • We removed the future work bullet
  • I need to have a stronger conclusion
  • On the graph-based formula slide, try to elaborate on what the sum actually is

May 03, 2016

  • We discussed that I need to submit my paper to turnitin.com
  • The CS masters website has the information on it in the Project Guidelines page
  • We discussed that I have not heard from GAPE
  • We talked about scheduling my defense
  • I will send my report to my committee and ask them for good times, suggesting May 17th at 10:00
  • I then need to fill out the oral defense request form
  • It must have two different dates and times
  • Not two times on the same day
  • I will get my presentation completed

April 26, 2016

  • We discussed the overall goal of my thesis
  • The goal for summarizing is to get the best summary
  • My project was to look at the existing ones and see what is the best approach
  • We tried approaches, some of which had an effect and some of which did not
  • We discussed the defense presentation
  • I should talk about each experiment briefly, with a one-slide overview
  • I should talk about the background
  • I should talk about Yioop!'s search engine
  • I should talk about the types of summarizers
  • I should talk about ROUGE in general
  • I should talk in detail about the coding I did
  • I should talk about the experiments I did
  • Then the conclusion
  • I should be ready to talk for 35 minutes and then take 15 minutes of questions
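For the "ROUGE in general" part of the talk, the core idea can be sketched: ROUGE-N scores a system summary by counting overlapping n-grams against a human reference. This is an illustrative Python sketch of ROUGE-1 recall, not the official ROUGE Perl toolkit used in the experiments:

```python
from collections import Counter

def rouge_1_recall(system_summary, reference_summary):
    """ROUGE-1 recall: overlapping unigrams / total unigrams in the reference."""
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in each side.
    overlap = sum(min(count, sys_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

# 5 of the 6 reference tokens also appear in the system summary
print(rouge_1_recall("the cat sat on the mat", "the cat lay on the mat"))
```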

April 19, 2016

  • No meeting this week

April 12, 2016

  • I turned in my paper to the English department, so there was nothing to discuss this week

April 05, 2016

  • Dr. Pollett did a quick review of my paper
  • He suggested that in the first paragraph of chapter 2 I change "Dr. Pollett" to "Dr. Christopher Pollett at San Jose State University"
  • He looked at the quotes and noticed I need to fix the opening quotes to be two backticks
  • He suggested I change Dr. Pollett's algorithm name to something about the average sentences that it compares
  • He suggested I check that "Dr." is never followed by two spaces
  • I will read the entire SJSU thesis guidelines
  • We discussed what is left to write
  • I still need to write the abstract, acknowledgements, introduction and conclusion
  • Dr. Pollett will review and send me his changes by early Friday
  • We looked at the forms we need to submit
  • I need to get the approval form to each committee member
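The two LaTeX source fixes above can be illustrated (a sketch of the standard conventions, not the actual thesis source):

```latex
% Opening quotes are two backticks; closing quotes are two apostrophes:
``context sensitive'' summarization

% Prevent LaTeX from treating the abbreviation period in "Dr." as the
% end of a sentence (which would insert wider spacing):
Dr.\ Pollett suggested the change.
```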

March 29, 2016

  • No meeting ... Spring Break

March 22, 2016

  • Dr. Pollett suggested changing the chapter titles to make it read like a story
  • The subheadings should be changed to reflect the original title
  • For example, chapter 2's title would be: Understanding the Preprocessing before Summarization
  • The aim would be: Stemming Using a Dutch Stemmer
  • The overview would be: An Overview of How Stemmers Work
  • The work performed: Writing and Testing a Dutch Stemmer, or Implementing a Dutch Stemmer
  • The results would be: Experiments with a Dutch Stemmer
  • I need to swap chapter 2 and 3
  • The title of the paper will be: Experiments with and Implementation of a Context Sensitive Text Summarizer
  • We discussed that I will be adding more content where necessary to fill in any gaps
  • I will include result files and configurations in the appendix
  • Update the Human Generated Input and System Generated Input to be ROUGE Human Generated Input and ROUGE System Generated Input
  • We discussed the defense
  • We want to get it done early so we can stay away from the end of the semester rush

March 15, 2016

  • We talked about the content I posted for the sentence compression experiment
  • Dr. Pollett was okay with all but one sentence and he suggested I reword it
  • We talked about how we will organize my final paper
  • I will have to come up with an abstract
  • I will have a one to two page introduction
  • The section headers will be: Intro, Background (consisting of an overview, summarization, ROUGE, stemming, CS 297 work, etc.), Experiments, Integrations and a two-page conclusion
  • I will work on putting the paper together and Dr. Pollett will review it next week

March 08, 2016

  • We talked about the sentence compression
  • Dr. Pollett checked in the code I wrote for sentence compression
  • We discussed how the way Yioop stores phrases may have caused sentence compression to not be as successful as it could have been
  • At this point we are ready to write the paper
  • The goal of the paper should be we found some approaches that worked and some things that did not work
  • I will start writing the paper in LaTeX so I do not run into problems at the end
  • I will look at SJSU scholar works to see more examples

March 01, 2016

  • We discussed the sentence compression method I implemented as described on 2/16/2016, using 4 items from the Back to Basics: CLASSY 2006 paper
  • Dr. Pollett suggested I change some of my multi-word variable names to conform to his standard
  • He also noted that the TextProcessor.php file needs to be updated so it uses the sentence compression algorithm
  • While looking at the TextProcessor.php file we noticed that it is only using the basic summarizer
  • I will update the TextProcessor.php file to utilize all of the summarizers just like the HtmlProcessor.php file does
  • Next I need to run the ROUGE tests, compare the results and write it up
  • I also need to create a patch as soon as all of my code is completed

February 23, 2016

  • We discussed my progress in general this week
  • I was not able to make any progress because of work I was doing for my CS 280 class
  • We agreed that I will take a few weeks off from CS 280 so we can get the rest of the CS 299 work completed

February 16, 2016

  • We discussed the resources I read about sentence compression
  • I found 8 papers on sentence compression
  • Most of them were too complex for what we are trying to do
  • One of them was simple and I will implement it
  • My changes will occur as follows
  • In the en-us tokenizer, add a sentence compression method
  • Add a static reference for compressing a sentence that calls the current sentence compression method in the tokenizer
  • Call PhraseParser::sentenceCompression in each summarizer after each sentence is added
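The kind of shallow trimming described in the CLASSY 2006 paper can be sketched. This is a simplified Python illustration; the actual implementation lives in Yioop's en-us tokenizer in PHP, and the specific rules here (lead connectives, parentheticals, age appositives) are assumptions for the example:

```python
import re

# Illustrative lead connectives to strip; the real rule set differs.
LEAD_WORDS = ("However, ", "Moreover, ", "Furthermore, ", "Also, ")

def compress_sentence(sentence):
    # Drop a sentence-initial connective adverb, re-capitalizing the rest.
    for lead in LEAD_WORDS:
        if sentence.startswith(lead):
            sentence = sentence[len(lead):]
            sentence = sentence[0].upper() + sentence[1:]
            break
    # Drop parenthetical asides and age appositives like ", 61,".
    sentence = re.sub(r"\s*\([^)]*\)", "", sentence)
    sentence = re.sub(r",\s*\d+,", ",", sentence)
    return sentence

print(compress_sentence("However, the crawler (started in 2009) stores each phrase."))
# The crawler stores each phrase.
```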

February 09, 2016

  • We discussed the state of deliverable #3 and decided it was a bust
  • I will document our experiments with the lanczos algorithm over the next few weeks
  • We found that the English department deadline for the master's thesis is April 8th
  • I will start writing the thesis by March 11th
  • We discussed my last deliverable and it will be on sentence compression
  • I will search the Nenkova paper for anything that has been done with sentence compression
  • Next week we will go over what I read and finalize what we are going to do for sentence compression
  • We are going to experiment with and without sentence compression
  • I contacted Shawn Tice by email
  • He gave me some solid feedback that I can use on the thesis paper
  • Next week we will fill out the candidacy form and submit it

December 16, 2015

  • We discussed my CS280 Proposal
  • We made a few edits and completed it
  • I will mail it to the CS department tomorrow
  • We discussed the Lanczos algorithm further
  • We confirmed what we were doing wrong
  • We looked into the alternative method
  • We need to get more clarification from the original paper written by Paige
  • I will see if I can find the paper or check it out in the library

December 11, 2015

  • We discussed the Hachey and Gong paper
  • Neither of them helps us with a better method to calculate the SVD
  • Dr. Pollett has a book with another method in it
  • Dr. Pollett will scan the section and email it to me
  • We will meet again after I read the scanned items Dr. Pollett sends me

December 07, 2015

  • We went over my findings with the current Lanczos algorithm I converted
  • I found that there is a point where the repeated dot product calculation produces values of infinity
  • These results destroy the process to summarize the content
  • Dr. Pollett and I read that this algorithm can run into this problem, and our implementation indeed exhibits the flaw
  • We decided not to pursue it any further
  • We will attempt to implement an approach (Hachey) that does not suffer from this known failure, or move on and call Lanczos a bust
  • We talked about what needs to be done for CS280
  • We wrote up a summary so I can write a proposal and turn it in to the department
  • We wrote my paragraph summarizing the semester

November 30, 2015

  • I demonstrated the Lanczos algorithm PHP code I converted from Java
  • I showed that I can produce the same result from the same input
  • We also tried to run the original summarizer on some arbitrary text
  • It did not run as expected
  • We were getting NaN results and I am going to look into it
  • We reviewed my latest patch and Dr. Pollett committed the code to the global repository
  • We talked about next semester
  • I need to post my deliverables
  • I need to write a conclusion of this semester
  • Next week we will come up with the CS280 summary for the proposal submittal on the CMS Detector GUI topic

November 23, 2015

  • I sent the drafts of the deliverables to Dr. Pollett
  • He took a quick look at my papers and said they are okay
  • I had questions about the QR decomposition method that was being used in the Lanczos algorithm I was converting
  • Dr. Pollett explained QR decomposition to me and it made sense why my conversion of the code was failing
  • Dr. Pollett recommends padding the triangular array with 0s below the diagonal
  • We are near the end of the semester so we will spend time discussing what is next in our next meeting
  • I will continue to work on converting the Lanczos algorithm and hopefully complete the conversion

November 16, 2015

  • We discussed my current status on writing the Lanczos algorithm
  • I had converted the code and am troubleshooting a bug
  • We troubleshot a little with no success
  • We discussed the write-up for deliverable 1 and 2
  • This week I will proofread what I have written and send it to Dr. Pollett for review

November 09, 2015

  • We discussed the current status of the thesis
  • I completed the updates to the patch I submitted
  • I started writing the PHP code for implementing the Lanczos algorithm

November 02, 2015

  • We discussed the current status of the thesis
  • I still need to make updates to the patch I submitted
  • I wrote a rough draft for deliverable 2
  • I looked into the Lanczos algorithm and understand what needs to be done

October 26, 2015

  • We discussed the current status of the thesis
  • I need to make updates to the patch I submitted
  • I need to resave the test input files because they have strange unicode characters in them
  • I need to remove the dependency on the simple_dom class
  • I need to write a rough draft for deliverable 2
  • For Deliverable 3, we are going to choose to implement the Lanczos algorithm instead

October 19, 2015

  • No meeting this week

October 12, 2015

  • We looked at the HTML processor and I showed that all summarizers are using the detector
  • We looked at the ROUGE results to compare against the non-detected scores
  • We looked at the DUC scores and noticed the graph-based one had the best numbers for the ROUGE-1 test
  • Next I will create a patch and write up the documentation

October 05, 2015

  • We discussed if I can create another CMS detector
  • I will try to create a CMS detector for Drupal
  • We discussed the work I have done to figure out the DUC document ROUGE tests
  • I will finish the code to get the summaries in order and run the DUC document ROUGE tests for each summarizer
  • We discussed how we can quantify and understand why the weighting is not affecting the results
  • Up to this point, tag weighting or learning weights is not going to buy us anything
  • We may try looking into weighting based on parts of speech
  • We think we know what we can quantify
  • Scores did not increase with weighting but went way up with the detector, because summarizing the important content is the silver bullet
  • The above is quantifiable
  • We discussed the coding aspect of the detectors
  • Dr. Pollett feels manually coding a detector is fine, but wonders whether code can learn a CMS detection system
  • Can we figure out a way to set up such a system so that we do not have to code any further CMS detectors
  • If I have time, I will review some Drupal and Wordpress sites to see if I can find the tags that are in the content div and count them

September 28, 2015

  • I have completed the first CMS detector
  • It is for Wordpress because the documents we are using for our sample just happen to be built by Wordpress
  • I have also written tests for the Wordpress CMS detector
  • The test checks the headers of 8 files to see if they are Wordpress or not
  • The files do not have any content in the body, only the headers
  • The tests were difficult because I had to configure composer and PhantomJS
  • We discussed how CMS framework detectors will be enabled
  • I will change the CMS detector files to have a distinct naming convention
  • Each detector in the directory will be loaded based on that distinct file name
  • We discussed that each detector will have a method that will return the important section of the html to remove extraneous data from being summarized
  • We discussed that each detector will add additional weights to content that is in certain tags based on the work we did last semester
  • The only caveat is that I will remove the a tag from the query, then weight each in increments of 1 for starters
  • I figured out how the DUC documents are constructed
  • There is a cluster of documents that are summarized as a whole
  • Then the first 100 words are extracted for the summary used in the ROUGE tests
  • When I do the DUC ROUGE test, I will remove the summaries from the other contestants and add ours
  • To get to the ROUGE tests, code needs to be written so I can run our summarizer against the data
  • The cluster of documents consists of XML files with a .scu extension
  • I wrote code to convert the .scu files to a flat text file
  • Dr. Pollett suggested I change my uppercase boolean values to lower case
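The header-based check the tests exercise can be sketched. A common heuristic (an assumption here, not necessarily the detector's exact logic) is to look for the meta generator tag or wp-content asset paths in the page head:

```python
def looks_like_wordpress(html_head):
    """Heuristic Wordpress check on a page's head section; the markers
    here are common Wordpress conventions, not Yioop's exact detector."""
    head = html_head.lower()
    return 'name="generator" content="wordpress' in head or "/wp-content/" in head

print(looks_like_wordpress('<meta name="generator" content="WordPress 4.4" />'))  # True
print(looks_like_wordpress('<meta name="generator" content="Drupal 7" />'))       # False
```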

September 21, 2015

  • We discussed the ROUGE results of the new Weighted Centroid Summarizer (WCS)
  • The results were better than my Graph Based Summarizer (GBS) but not the Centroid Based Summarizer (CBS) or Basic Summarizer (BS)
  • We discussed the next steps for the WCS
  • I will work on a framework that can detect a web page's framework
  • For example, a site could be built using Wordpress or Drupal etc
  • Then we would check to see where each framework puts its most important content and weight that content higher
  • Dr. Pollett suggested I start looking for where the getPages() method is called and insert the code there and pass its detection downstream
  • After the work is completed I will rerun the ROUGE tests
  • If I get time I will look more into the DUC data

September 14, 2015

  • We reviewed the code I wrote up to this point
  • Dr. Pollett feels the code is solid and now we want to work on adding weights
  • Dr. Pollett would like me to write code to detect a page's framework
  • For example, detect if a page was deployed via Wikipedia or Wordpress
  • After the framework is detected we would do searches in the page source
  • External configurations would be needed, because certain tags would weigh more for different frameworks
  • Each tag of importance would then get an increased weight
  • This would also help gather statistics about what percentage of web sites are really wordpress or wikipedia etcetera
  • Since the code is good up to this point, I will run the ROUGE tests for the weighted summarizer and see how the results compare to the other Yioop summarizers
  • Lastly, we looked more at the DUC data
  • We still have not found a good way to automate the extraction of the content and their summaries
  • We did find that there is some correlation between the file names of the summary and the folder of the content

September 09, 2015

  • We discussed what was left to do on the weighted centroid summarizer
  • We looked deep into the current summarizer because Dr. Pollett thought I redid what was already done in the current centroid summarizer
  • After an in depth look at the current centroid summarizer, we found out that it was not what I had written
  • It was something similar using Inverse Document Frequency (IDF) and cosine similarity
  • I am to create a vector of frequencies in each sentence and then calculate how close each sentence resembles the average sentence
  • We got the DUC documents
  • We looked at the documents briefly and chose to look at them more closely during the next meeting

August 31, 2015

  • Discussed next week's meeting day because Monday, September 7 is Labor Day
  • We will meet on Wednesday next week instead
  • We discussed getting access to the AQUAINT, TIPSTER and TREC datasets
  • Dr. Pollett ran it by the chair and he was good with it
  • Dr. Pollett filled out the forms
  • I will scan them when I get home and email them to Dr. Pollett
  • We discussed the Yioop implementation of Composer
  • Yioop uses it to keep its libraries current
  • I will be referencing its configuration when I add code to the Yioop repository
  • Since we have to wait for the datasets to be provided, we discussed working on weighting the centroid based algorithm
  • I will create a new WeightedCentroidBased summarizer
  • I need to create or verify:
  • There is a vector where each row in the vector (associative array) is the frequencies of the words in that sentence
  • There is a normalized vector; the goal is to have each value between zero and one, with the squared values summing to 1
  • To get it normalized we divide the original by the square root of the sum of the values squared for every row (sentence)
  • There is an average sentence vector: add the columns and divide by the number of rows to get the average sentence
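The three items to verify can be sketched numerically (illustrative Python; the actual implementation is PHP inside Yioop):

```python
import math

def term_frequencies(sentence):
    """Row of the vector: word frequencies for one sentence."""
    freqs = {}
    for word in sentence.lower().split():
        freqs[word] = freqs.get(word, 0) + 1
    return freqs

def normalize(freqs):
    """Divide each entry by the row's Euclidean length, so the squared
    values sum to 1 and each value lies between zero and one."""
    length = math.sqrt(sum(v * v for v in freqs.values()))
    return {word: v / length for word, v in freqs.items()}

def average_sentence(rows):
    """Add the columns (terms) across rows and divide by the row count."""
    total = {}
    for row in rows:
        for word, v in row.items():
            total[word] = total.get(word, 0) + v
    return {word: v / len(rows) for word, v in total.items()}

rows = [normalize(term_frequencies(s))
        for s in ["the cat sat", "the dog sat down"]]
avg = average_sentence(rows)
```

Closeness of each sentence to `avg` (e.g. by dot product) then ranks the sentences.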

August 25, 2015

  • We confirmed our meeting time to be Mondays at 12:00
  • We discussed that I need to download the latest Yioop git repository
  • I need to check http://www.seekquarry.com/p/Composer to read about the new composer Dr. Pollett Implemented
  • I need to start on deliverable 1
  • I need to find a large document set
  • I was thinking of using the DUC 2002 data set

May 13, 2015

  • We went over my report
  • I need to fix some typos, reword a few things, rewrite the first three sentences of the introduction and resubmit it
  • We discussed that I have confirmed my two instructors to be on my CS299 committee
  • We looked over my CS299 proposal
  • Dr. Pollett put his rubber stamp on it
  • We also talked about the Summly summarizing algorithm and that we may not be able to get a hold of it
  • I need to start filling out the CS299 paperwork and put it in Dr. Pollett's box to get the proper signatures

May 6, 2015

  • We went over the patch I created
  • I added the ability to output the summarizer results to a file
  • Dr. Pollett suggested I move the Boolean flag and output folder variable to the top of the file as constants
  • Dr. Pollett also scanned through the code and suggested I change my camel case variable names to multi-word variable names with an underscore between each word
  • We discussed my CS297 final report
  • Dr. Pollett will review it and give me some feedback next week
  • At that point, I will update the report and resubmit it
  • We discussed what is needed to begin CS299
  • By next week I need to have solicited instructors to be on my CS 299 committee
  • I need 2 instructors
  • I also need to fill out the 2 page CS299 proposal template
  • I will use the last paragraph of my final report, plus bullet points, for the CS299 proposal

Apr 29, 2015

  • We reviewed the regex again because it was not matching multiples
  • We decided to add parentheses to the inside of the anchor tags; (?s)<a.*?>(.*?the.*?)<\/a> and then use the substr_count() function to get an accurate frequency count
  • I will update the code and the test cases to match
  • We talked about the CS 297 report and the CS 299 topic
  • I need to use the following format
  • A one page intro
  • A paragraph describing each section
  • Trim what is posted to two pages for each deliverable
  • and a one page conclusion that encompasses what will be done in CS 299
  • For CS 299, we want to keep improving on summarizers
  • We want to do experiments that will generate a paper
  • The key will be to get the paper done well before graduation
  • We want to compare a few different summarizer methods, one of which is different from any that have been looked at before
  • For example, a graph summarizer, the centroid summarizer with weights and look into the one the teenager sold to Yahoo
  • We also want to do more extensive experiments like using millions of documents to basically use a larger data set
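The regex fix above (capturing inside the anchor tags, then counting occurrences of the term in each capture) has this shape; a Python analogue of the PHP approach, for illustration:

```python
import re

# Capture anchor-tag contents with a DOTALL ((?s)) lazy pattern, then
# count occurrences of the term inside each capture, mirroring
# PHP's substr_count() step.
pattern = re.compile(r"(?s)<a.*?>(.*?the.*?)</a>")
html = '<a href="#">the quick fox</a> <a href="#">over the lazy dog</a>'
captures = pattern.findall(html)
frequency = sum(capture.count("the") for capture in captures)
print(frequency)  # 2
```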

Apr 22, 2015

  • We discussed and confirmed that I had the correct regular expression to find matching words in the 6 html tags we plan on using for weighting
  • We talked about the weighting algorithm in detail because it does not seem to be contributing to new summary results
  • I convinced Dr. Pollett that I am using the weighting algorithm correctly for the graph based summarizer
  • Since the weighting algorithm has not produced new results Dr. Pollett is suggesting two experiments
  • To make the summaries shorter for example, 500 bytes
  • Use the summaries Mangesh Dahale used in his research in order to get a set with more html tags
  • Make some unit tests for the weighting algorithm to make sure it is producing the correct output

Apr 15, 2015

  • We looked into how to automate web site input to generate summaries for the regression testing
  • Dr. Pollett pointed me to the crawl_component.php file to look at the pageOptions method
  • $_REQUEST[option_type] value has to be test_options
  • $_REQUEST[TESTPAGE] value has to be the data (summary text) to test
  • $_REQUEST[page_type] = text/html
  • Dr. Pollett suggested I change the name of the basic_summarizer.php file to scrape_summarize.php
  • Dr. Pollett suggested I change the base_summarizer.php file to just summarizer.php
  • Dr. Pollett suggested I change the regexes to handle new lines
  • Dr. Pollett suggested I change the regex not to match on string like: <h1>ssfsd</h1>do not match this<h1>ssfsd</h1>
  • Dr. Pollett wants me to calculate all of the weights and then use the formula: sum over tags of (tag weight times frequency in that tag) to get the additional weight
  • My Queue server is not starting and I need to check starting the queue server from the command line
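The weighting formula Dr. Pollett described can be sketched as follows. The tag weights here are hypothetical placeholders; the actual values were to be tuned experimentally:

```python
# Hypothetical tag weights for illustration; real values were tuned by experiment.
TAG_WEIGHTS = {"h1": 3, "h2": 2, "title": 3, "b": 1, "em": 1, "a": 1}

def additional_weight(term_freq_by_tag):
    """Sum of (tag weight * frequency of the term in that tag)."""
    return sum(TAG_WEIGHTS.get(tag, 0) * freq
               for tag, freq in term_freq_by_tag.items())

# A term appearing twice in h1 tags and once in a bold tag: 3*2 + 1*1
print(additional_weight({"h1": 2, "b": 1}))  # 7
```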

Apr 8, 2015

  • We talked about the paper I found and the weighting algorithm we are going to implement
  • We plan on using the 6 categories which were described in the paper
  • I will do some regression testing
  • The plan is to start each category at 0 and increase each value until we do not see an improvement
  • Then repeat by setting each to the base value and incrementing one category at a time
  • After we get the results, I will try to use those values in combination
  • We discussed how to organize the summarizers to include the weighting algorithm
  • Since the weighting algorithm will be common to all summarizers that use term frequencies I will move the summarizers to their own folder
  • The basic summarizer is not in its own file so I will find the basic code and give it its own file
  • Then I will create a base class that has weighting method(s)
  • I need to update the few places where it loads these files by searching for the require lines
  • We also discussed a long term goal of adding ROUGE to Yioop
  • The goal would be to get Yioop to participate in text retrieval contests and make a paper out of it for the Text REtrieval Conference (TREC)

Apr 1, 2015

  • We took a look at the graph-based summarizer I wrote
  • Dr. Pollett did a few tests, made some minor modifications and committed the new code to the repository
  • We discussed my Deliverable 3 post
  • Dr. Pollett made a suggestion to add a "for example" sentence in the conclusion about the BS and CBS ROUGE results
  • We went over the two page rank fixes to some known issues
  • Issue one was when all web pages point to one page and that web page does not point out to other web sites
  • Issue two was when you have a disconnected cluster of web pages
  • After our discussion, we determined that we did not suffer from those issues and chose not to include the fixes
  • We discussed what I would do for Deliverable 4
  • We looked into the current centroid code to see where I need to make the changes
  • I believe I know where the changes need to be made
  • I have to make sure I do not slow down the algorithm, as it is already pretty slow
  • The idea is to add weights to certain terms depending on where they are in the html document
  • For example increase the word frequency for a word in a h1 tag by 3
  • I will do some research to see if there is a method to this madness solved already
  • I will check to see what are the most important spots in a html doc by tag type
  • I may try to weight by where the word appears; beginning, middle or end
  • I could also try to use the sentence length to add weight
  • The goal is to increase the frequency of those words thus increasing the weight
  • I can also look at Fall 2012 September 12 slide 4 of Dr. Pollett's CS 267 class

Mar 25, 2015

  • No meeting ... Spring Break

Mar 18, 2015

  • I completed most of the coding on the Graph Based Summarizer
  • We discussed the sentence rank algorithm part
  • After the discussion we had the code needed to complete the algorithm
  • In order to get the new Graph Based summarizer into Yioop we went over how to search the repository
  • I will use the code_tool.php script to help with that
  • For example, to find where the centroid summarizer is hooked in, search using "php code_tool.php search .. centroid" from the bin directory of the code base
  • We also discussed that after I get it into Yioop, I will perform the same summary tests as in Deliverable 1

Mar 11, 2015

  • We reviewed the presentation I made for the graph based algorithm I am working on
  • We went over the PageRank algorithm and adjacency matrix portion thoroughly
  • Dr. Pollett recommended I look at Who's #1? and Google's PageRank and Beyond by Amy Langville and Carl Meyer
  • We discussed the code I will be writing
  • I will leverage an existing stop words, sentence splitter and punctuation methods in the PhraseParser
  • There is also a frequency matrix creator that I may use in the PhraseParser
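The PageRank-over-adjacency-matrix portion we reviewed can be sketched as power iteration. This is a generic illustration on a tiny sentence graph, not the Yioop code:

```python
def page_rank(adjacency, damping=0.85, iterations=50):
    """Power iteration over an adjacency matrix: each round, every node's
    score is split evenly along its outgoing edges, plus a damping base."""
    n = len(adjacency)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = [(1 - damping) / n] * n
        for i in range(n):
            out_degree = sum(adjacency[i])
            if out_degree == 0:
                continue  # dangling node: keeps only the base share
            for j in range(n):
                if adjacency[i][j]:
                    new_ranks[j] += damping * ranks[i] / out_degree
        ranks = new_ranks
    return ranks

# Three sentences; sentence 2 is pointed to by both others, so it ranks highest.
adjacency = [[0, 0, 1], [0, 0, 1], [1, 0, 0]]
ranks = page_rank(adjacency)
print(max(range(3), key=lambda i: ranks[i]))  # 2
```

In a graph-based summarizer the adjacency entries come from sentence similarity, and the top-ranked sentences form the summary.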

May 6, 2015

  • We went over the patch I created
  • I added the ability to output the summarizer results to a file
  • Dr. Pollett suggested I move the Boolean flag and output folder variable to the top of the file as constants
  • Dr. Pollett also scanned through the code suggested I change my camel case variable names to multi-word variable names with an underscore between each word
  • We discussed my CS297 final report
  • Dr. Pollett will review it and give me some feedback next week
  • At that point, I will update the report resubmit it
  • We discussed what is needed to begin CS299
  • By next week I need to have solicited instructors to be on my CS 299 committee together
  • I need 2 instructors
  • I also need to fill out the 2 page CS299 proposal template
  • I will use the last paragraph of my final report for CS299 and bullet points

Apr 29, 2015

  • We reviewed the regex again because it was not matching multiples
  • We decided to add parenthesis to the inside of the anchor tags; (?s)<a.*?>(.*?the.*?)<\/a> and then use substr_count() function to get an accurate frequency count
  • I will update the code and the test cases to match
  • We talked about the CS 297 report and the CS 299 topic
  • I need to use the following format
  • A one page intro
  • A paragraph describing each section
  • Trim what is posted to two pages for each deliverable
  • and a one page conclusion that encompasses what will be done in CS 299
  • For CS 299, we want to keep improving on summarizers
  • We want to do experiments that will generate a paper
  • The key will be to get the paper done well before graduation
  • We want to compare a few different summarizer methods one of which is different than any that others that have been looked at
  • For example, a graph summarizer, the centroid summarizer with weights and look into the one the teenager sold to Yahoo
  • We also want to do more extensive experiments like using millions of documents to basically use a larger data set

Apr 22, 2015

  • We discussed and confirmed that I had the correct regular expression to find matching words in the 6 html tags we plan on using for weighting
  • We talked about the weighting algorithm in detail because it does not seem to be contributing to new summary results
  • I convinced Dr. Pollett that it I am using the weighting algorithm correctly for the graph based summarizer
  • Since the weighting algorithm has not produced new results Dr. Pollett is suggesting two experiments
  • To make the summaries shorter for example, 500 bytes
  • Use the summaries Mangesh Dahale used in his research in order to get a set with more html tags
  • Make some unit tests for the weighting algorithm to make sure it is producing the correct output

Apr 15, 2015

  • We looked into how to automate web site input to generate summaries for the regression testing
  • Dr. Pollet pointed me to the crawl_component.php file to look at the pageOptions method
  • $_REQUEST[option_type] value has to be test_options
  • $_REQUEST[TESTPAGE] value has to be the data (summary text) to test
  • $_REQUEST[page_type] = text/html
  • Dr. Pollett suggested I change the name of the basic_summarizer.php file to scrape_summarize.php
  • Dr. Pollett suggested I change the base_summarizer.php file to just summarizer.php
  • Dr. Pollett suggested I change the regexes to handle new lines
  • Dr. Pollett suggested I change the regex not to match on string like: <h1>ssfsd</h1>do not match this<h1>ssfsd</h1>
  • Dr. Pollett wants me to calculate all weights then and use the formula: sum of weights times frequency in that tag to get the additional weight
  • My Queue server is not starting and I need to check starting the queue server from the command line

Apr 8, 2015

  • We talked about the paper I found and the weighting algorithm we are going to implement
  • We plan on using 6 categories which was described in the paper
  • I will do some regression testing
  • The plan is to start each category at 0 and increase each value until we do not see an improvement
  • Then repeat by setting each to the base value and incrementing one category at a time
  • After we get the results, I will try to use those values in combination
  • We discussed how to organize the summarizers to include the weighting algorithm
  • Since the weighting algorithm will be common to all summarizers that use term frequencies I will move the summarizers to their own folder
  • The basic summarizer is not in its own file, so I will find the basic code and move it into its own file
  • Then I will create a base class that has weighting method(s)
  • I need to update the few places where it loads these files by searching for the require lines
  • We also discussed a long term goal of adding ROUGE to Yioop
  • The goal would be to get Yioop to participate in text retrieval contests and write a paper about it for the Text REtrieval Conference (TREC)
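
The one-category-at-a-time search plan can be sketched as a simple coordinate-style loop. This is only an illustration: the score function is a toy stand-in (the real runs would score the resulting summaries with ROUGE), and the category names are hypothetical.

```python
# Hypothetical category names; real runs would rebuild summaries and
# score them with ROUGE instead of the toy score() used here.
CATEGORIES = ["h1", "h2", "h3", "title", "b", "em"]

def tune(score, step=1, max_weight=5):
    """Start every category weight at 0, then raise one category at a
    time, keeping an increase only while the score improves."""
    weights = {c: 0 for c in CATEGORIES}
    best = score(weights)
    for cat in CATEGORIES:
        while weights[cat] + step <= max_weight:
            trial = dict(weights, **{cat: weights[cat] + step})
            trial_score = score(trial)
            if trial_score <= best:   # stop at first non-improvement
                break
            weights, best = trial, trial_score
    return weights, best

# Toy score that peaks at h1=3, title=2, just to exercise the loop
def toy_score(w):
    return -((w["h1"] - 3) ** 2) - ((w["title"] - 2) ** 2)

weights, best = tune(toy_score)
assert weights["h1"] == 3 and weights["title"] == 2
```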

Apr 1, 2015

  • We took a look at the graph-based summarizer I wrote
  • Dr. Pollett did a few tests, made some minor modifications and committed the new code to the repository
  • We discussed my Deliverable 3 post
  • Dr. Pollett suggested adding a "for example" sentence in the conclusion about the BS and CBS ROUGE results
  • We went over the two page rank fixes to some known issues
  • Issue one was when all web pages point to one page and that page does not point out to other web sites
  • Issue two was when you have a disconnected cluster of web pages
  • After our discussion, we determined that we did not suffer from those issues and chose not to include the fixes
  • We discussed what I would do for Deliverable 4
  • We looked into the current centroid code to see where I need to make the changes
  • I believe I know where the changes need to be made
  • I have to make sure I do not slow down the algorithm, as it is already pretty slow
  • The idea is to add weights to certain terms depending on where they are in the html document
  • For example, increase the word frequency for a word in an h1 tag by 3
  • I will do some research to see if someone has already worked out a method for this
  • I will check which are the most important spots in an html doc by tag type
  • I may try to weight by where the word appears; beginning, middle or end
  • I could also try to use the sentence length to add weight
  • The goal is to increase the frequency of those words thus increasing the weight
  • I can also look at Fall 2012 September 12 slide 4 of Dr. Pollett's CS 267 class
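
Although we chose not to include the fixes, for reference the standard way PageRank handles both issues is teleportation (the damping term) plus redistributing a dangling page's rank uniformly. A minimal Python sketch, not Yioop's code:

```python
def pagerank(links, damping=0.85, iters=100):
    """links[i] is the list of pages that page i links to (0..n-1)."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Teleportation term: lets rank flow between disconnected clusters
        new = [(1.0 - damping) / n] * n
        for i, outs in enumerate(links):
            if outs:
                share = damping * rank[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:
                # Dangling page with no outlinks: spread its rank
                # uniformly instead of letting it leak out of the system
                for j in range(n):
                    new[j] += damping * rank[i] / n
        rank = new
    return rank

# Issue one: pages 0 and 1 both point at page 2, which points nowhere
ranks = pagerank([[2], [2], []])
assert abs(sum(ranks) - 1.0) < 1e-9   # no rank is lost to the sink
assert ranks[2] == max(ranks)         # the sink page still ranks highest
```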

Mar 25, 2015

  • No meeting ... Spring Break

Mar 18, 2015

  • I completed most of the coding on the Graph Based Summarizer
  • We discussed the sentence rank algorithm part
  • After the discussion we had the code needed to complete the algorithm
  • In order to get the new Graph Based summarizer into Yioop we went over how to search the repository
  • I will use the code_tool.php script to help with that
  • For example, to find where the centroid summarizer is hooked in, search using "php code_tool.php search .. centroid" from the bin directory of the code base
  • We also discussed that after I get it into Yioop, I will perform the same summary tests as in Deliverable 1
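
A graph-based extractive summarizer of the kind described can be sketched in a few lines. This TextRank-style Python version is illustrative only, not the Yioop PHP implementation; the sentence splitter and term-overlap similarity are simplistic stand-ins for what PhraseParser provides.

```python
import re

def summarize(text, k=2, iters=30, d=0.85):
    """Sketch of a graph-based extractive summarizer: rank sentences by
    power iteration over a term-overlap similarity graph, keep top k."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [set(re.findall(r"\w+", s.lower())) for s in sents]
    n = len(sents)
    # Edge weight = number of terms two sentences share
    sim = [[len(words[i] & words[j]) if i != j else 0 for j in range(n)]
           for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - d) / n + d * sum(sim[j][i] * rank[j] / max(sum(sim[j]), 1)
                                      for j in range(n))
                for i in range(n)]
    top = sorted(range(n), key=lambda i: -rank[i])[:k]
    return " ".join(sents[i] for i in sorted(top))

text = "Cats chase mice. Mice fear cats. Dogs bark loudly."
# The two cat/mice sentences reinforce each other in the graph,
# so one of them beats the unconnected dog sentence.
assert summarize(text, k=1) == "Cats chase mice."
```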

Mar 11, 2015

  • We reviewed the presentation I made for the graph based algorithm I am working on
  • We went over the page rank algorithm and adjacency matrix portion thoroughly
  • Dr. Pollett recommended I look at Who's #1? and Google's PageRank and Beyond by Amy Langville and Carl Meyer
  • We discussed the code I will be writing
  • I will leverage the existing stop word, sentence splitter, and punctuation methods in the PhraseParser
  • There is also a frequency matrix creator that I may use in the PhraseParser
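
A frequency matrix of that kind can be sketched as follows. This is an illustrative Python version, not PhraseParser's actual method, and the stop word list is a placeholder:

```python
import re

def frequency_matrix(sentences, stop_words=frozenset({"the", "a", "an"})):
    """Rows are terms, columns are sentences; entry [t][s] is the count
    of term t in sentence s after stop-word removal."""
    tokenized = [[w for w in re.findall(r"\w+", s.lower()) if w not in stop_words]
                 for s in sentences]
    terms = sorted({w for sent in tokenized for w in sent})
    return terms, [[sent.count(t) for sent in tokenized] for t in terms]

terms, matrix = frequency_matrix(["The cat sat", "A cat ran fast"])
assert terms == ["cat", "fast", "ran", "sat"]
assert matrix[terms.index("cat")] == [1, 1]   # "cat" once in each sentence
```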

Mar 6, 2015

  • Dr. Pollett helped me generate the patch for the Dutch stemmer I wrote
  • I updated the bug/issue I opened to include the patch
  • Dr. Pollett gave the code his initial okay
  • I was reminded that I need to complete the contributor agreement
  • After creating the first patch, we looked at what needed to be changed to offer nl (Nederlands) as a default locale option
  • I will edit the createdb.php file and the config.php file with the appropriate changes
  • I will generate another patch and post it to the bug/issue
  • We discussed the new deliverable
  • I need to use Google Scholar to find papers that have cited "The Automatic Creation of Literature Abstracts" to get better references
  • After finding some new references, I will create a presentation (at most 8 slides) that summarizes the paper in order to demonstrate my understanding
  • We reviewed my deliverable #2 posting
  • Dr. Pollett gave me a few suggestions for referencing websites for better SEO
  • Dr. Pollett also suggested I have an ending period on my references section

Feb 25, 2015

  • We looked at my configuration.ini file for my Nl locale again
  • I was instructed to remove all single quotes from the file
  • I was instructed to remove all of the backslashes
  • We set my locale to Nl on my instance of the Yioop search engine and tested my locale
  • During the test we noticed some of the converted strings needed attention
  • We ran a search and it failed because of the missing segment() method, which I will fix
  • We discussed entering an issue into the Mantis site
  • I created an account and Dr. Pollett upgraded me so I can post patches
  • I will follow the instructions for making a patch
  • We looked at validating the html files I have been posting, e.g. bio, proposal, etc.
  • I will remove all of the validation errors on all existing pages
  • I will continue to validate any page I add or update
  • I will email Dr. Pollett the paper I found for deliverable #3

Feb 18, 2015

  • We discussed the work I have done on the stemmer so far
  • We discussed that I need to add unit tests based on the current Yioop standards
  • We looked at my locale configuration.ini file
  • The Yioop UI is not showing all of the strings translated
  • I have to quote every string and see if the Yioop UI shows all of the settings
  • We discussed the Yioop contributor agreement
  • I will read it and sign or not sign it
  • If I do sign it, I will be able to upload my code for Dr. Pollett to review and post to the production release
  • If I do not sign it I will not be able to get any code I write into the production release

Feb 11, 2015

  • We reviewed my ROUGE configuration and what it took to get meaningful results
  • We reviewed my results for deliverable 1; particularly the ROUGE results
  • We discussed what would be due for each deliverable
  • We discussed what needs to be worked on for the next deliverable
  • We looked at what code needs to be manipulated within the Yioop search engine to incorporate the new stemmer
  • We discussed how to create a new locale with the Yioop search engine
  • We discussed getting some sample code from snowball.tartarus.org

Feb 4, 2015

  • We got this site up and we went over how to view/modify its items
  • We decided on a title for the project
  • I entered my meeting time in the correct place on the wiki page
  • We reviewed the work I have done for my deliverable due next week
  • We discussed the correct approach for handling my CS297 and CS299