Project Blog:

May 17, 2016

  • We discussed that I will be getting my paper back on Friday or Monday
  • I will get the corrections in as soon as I can
  • Dr. Pollett wants to extend Yioop to support ROUGE
  • I showed Dr. Pollett how to run ROUGE
  • I will send Dr. Pollett all of my ROUGE runner scripts and my ROUGE input files
  • I will send Dr. Pollett the C# code for my ROUGE file prepper
  • I successfully completed my defense
  • This is the last meeting of the semester

May 10, 2016

  • I mentioned that I still have not heard from GAPE; I will contact them this week
  • We went over the presentation
  • We cut some slides
  • I need to move each overview down to its experiment area
  • I moved the Yioop overview slide up
  • We removed the future work bullet
  • I need to have a stronger conclusion
  • On the graph-based formula slide, try to elaborate on what the sum actually is

May 03, 2016

  • We discussed that I need to submit my paper to turnitin.com
  • The CS masters website has the information on it in the Project Guidelines page
  • We discussed that I have not heard from GAPE
  • We talked about scheduling my defense
  • I will send my report to my committee and ask them for good times, suggesting May 17th at 10:00
  • I then need to fill out the oral defense request form
  • It must have two different dates and times
  • Not two times on the same day
  • I will get my presentation completed

April 26, 2016

  • We discussed the overall goal of my thesis
  • The goal for summarizing is to get the best summary
  • My project was to look at the existing ones and see what is the best approach
  • We tried approaches, some of which had an effect and some of which did not
  • We discussed the defense presentation
  • I should talk about each experiment briefly, with a one-slide overview
  • I should talk about the background
  • I should talk about Yioop!'s search engine
  • I should talk about the types of summarizers
  • I should talk about ROUGE in general
  • I should talk in detail about the coding I did
  • I should talk about the experiments I did
  • Then the conclusion
  • I should be ready to talk for 35 minutes and then take 15 minutes of questions
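For the "ROUGE in general" part of the talk, the core idea can be sketched: ROUGE-N scores a system summary by counting overlapping n-grams against a human reference. This is an illustrative Python sketch of ROUGE-1 recall, not the official ROUGE Perl toolkit used in the experiments:

```python
from collections import Counter

def rouge_1_recall(system_summary, reference_summary):
    """ROUGE-1 recall: overlapping unigrams / total unigrams in the reference."""
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in each side.
    overlap = sum(min(count, sys_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

# 5 of the 6 reference tokens also appear in the system summary
print(rouge_1_recall("the cat sat on the mat", "the cat lay on the mat"))
```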

April 19, 2016

  • No meeting this week

April 12, 2016

  • I turned in my paper to the English department, so there was nothing to discuss this week

April 05, 2016

  • Dr. Pollett did a quick review of my paper
  • He suggested that in the first paragraph of chapter 2 I change "Dr. Pollett" to "Dr. Christopher Pollett at San Jose State University"
  • He looked at the quotes and noticed I need to fix the opening quotes to be two backticks
  • He suggested I change Dr. Pollett's algorithm name to something about the average sentences that it compares
  • He suggested I check that "Dr." is never followed by two spaces
  • I will read the entire SJSU thesis guidelines
  • We discussed what is left to write
  • I still need to write the abstract, acknowledgements, introduction and conclusion
  • Dr. Pollett will review and send me his changes by early Friday
  • We looked at the forms we need to submit
  • I need to get the approval form to each committee member
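The two LaTeX source fixes above can be illustrated (a sketch of the standard conventions, not the actual thesis source):

```latex
% Opening quotes are two backticks; closing quotes are two apostrophes:
``context sensitive'' summarization

% Prevent LaTeX from treating the abbreviation period in "Dr." as the
% end of a sentence (which would insert wider spacing):
Dr.\ Pollett suggested the change.
```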

March 29, 2016

  • No meeting ... Spring Break

March 22, 2016

  • Dr. Pollett suggested changing the chapter titles to make it read like a story
  • The subheadings should be changed to reflect the original title
  • For example, chapter 2's title would be: Understanding the Preprocessing before Summarization
  • The aim would be: Stemming Using a Dutch Stemmer
  • The overview would be: An Overview of How Stemmers Work
  • The work performed: Writing and Testing a Dutch Stemmer, or Implementing a Dutch Stemmer
  • The results would be: Experiments with a Dutch Stemmer
  • I need to swap chapter 2 and 3
  • The title of the paper will be: Experiments with and Implementation of a Context Sensitive Text Summarizer
  • We discussed that I will be adding more content where necessary to fill in any gaps
  • I will include result files and configurations in the appendix
  • Update the Human Generated Input and System Generated Input to be ROUGE Human Generated Input and ROUGE System Generated Input
  • We discussed the defense
  • We want to get it done early so we can stay away from the end of the semester rush

March 15, 2016

  • We talked about the content I posted for the sentence compression experiment
  • Dr. Pollett was okay with all but one sentence and he suggested I reword it
  • We talked about how we will organize my final paper
  • I will have to come up with an abstract
  • I will have a one to two page introduction
  • The section headers will be: Intro, Background (consisting of an overview, summarization, ROUGE, stemming, CS 297 work, etc.), Experiments, Integrations and a two-page conclusion
  • I will work on putting the paper together and Dr. Pollett will review it next week

March 08, 2016

  • We talked about the sentence compression
  • Dr. Pollett checked in the code I wrote for sentence compression
  • We discussed how the way Yioop stores phrases may have caused sentence compression to not be as successful as it could have been
  • At this point we are ready to write the paper
  • The goal of the paper should be we found some approaches that worked and some things that did not work
  • I will start writing the paper in LaTeX so I do not run into problems at the end
  • I will look at SJSU scholar works to see more examples

March 01, 2016

  • We discussed the sentence compression method I implemented as described on 2/16/2016, using 4 items from the Back to Basics: CLASSY 2006 paper
  • Dr. Pollett suggested I change some of my multi-word variable names to conform to his standard
  • He also noted that the TextProcessor.php file needs to be updated so it uses the sentence compression algorithm
  • While looking at the TextProcessor.php file we noticed that it is only using the basic summarizer
  • I will update the TextProcessor.php file to utilize all of the summarizers just like the HtmlProcessor.php file does
  • Next I need to run the ROUGE tests, compare the results and write it up
  • I also need to create a patch as soon as all of my code is completed

February 23, 2016

  • We discussed my progress in general this week
  • I was not able to make any progress because of work I was doing for my CS 280 class
  • We agreed that I will take a few weeks off from CS 280 so we can get the rest of the CS 299 work completed

February 16, 2016

  • We discussed the resources I read about sentence compression
  • I found 8 papers on sentence compression
  • Most of them were too complex for what we are trying to do
  • One of them was simple and I will implement it
  • My changes will occur as follows
  • In the en-us tokenizer, add a sentence compression method
  • Add a static reference for compressing a sentence that calls the current sentence compression method in the tokenizer
  • Call PhraseParser::sentenceCompression in each summarizer after each sentence is added
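The kind of shallow trimming described in the CLASSY 2006 paper can be sketched. This is a simplified Python illustration; the actual implementation lives in Yioop's en-us tokenizer in PHP, and the specific rules here (lead connectives, parentheticals, age appositives) are assumptions for the example:

```python
import re

# Illustrative lead connectives to strip; the real rule set differs.
LEAD_WORDS = ("However, ", "Moreover, ", "Furthermore, ", "Also, ")

def compress_sentence(sentence):
    # Drop a sentence-initial connective adverb, re-capitalizing the rest.
    for lead in LEAD_WORDS:
        if sentence.startswith(lead):
            sentence = sentence[len(lead):]
            sentence = sentence[0].upper() + sentence[1:]
            break
    # Drop parenthetical asides and age appositives like ", 61,".
    sentence = re.sub(r"\s*\([^)]*\)", "", sentence)
    sentence = re.sub(r",\s*\d+,", ",", sentence)
    return sentence

print(compress_sentence("However, the crawler (started in 2009) stores each phrase."))
# The crawler stores each phrase.
```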

February 09, 2016

  • We discussed the state of deliverable #3 and decided it was a bust
  • I will document our experiments with the lanczos algorithm over the next few weeks
  • We found that the English department deadline for the master's thesis is April 8th
  • I will start writing the thesis by March 11th
  • We discussed my last deliverable and it will be on sentence compression
  • I will search the Nenkova paper for anything that has been done with sentence compression
  • Next week we will go over what I read and finalize what we are going to do for sentence compression
  • We are going to experiment with and without sentence compression
  • I contacted Shawn Tice by email
  • He gave me some solid feedback that I can use on the thesis paper
  • Next week we will fill out the candidacy form and submit it

December 16, 2015

  • We discussed my CS280 Proposal
  • We made a few edits and completed it
  • I will mail it to the CS department tomorrow
  • We discussed the Lanczos algorithm further
  • We confirmed what we were doing wrong
  • We looked into the alternative method
  • We need to get more clarification from the original paper written by Paige
  • I will see if I can find the paper or check it out in the library

December 11, 2015

  • We discussed the Hachey and Gong paper
  • Neither of them helps us with a better method to calculate the SVD
  • Dr. Pollett has a book with another method in it
  • Dr. Pollett will scan the section and email it to me
  • We will meet again after I read the scanned items Dr. Pollett sends me

December 07, 2015

  • We went over my findings with the current Lanczos algorithm I converted
  • I found that there is a point where the repeated dot product calculation produces values of infinity
  • These results destroy the process to summarize the content
  • Dr. Pollett and I read that this algorithm can run into this problem, and our implementation indeed exhibits the flaw
  • We decided not to pursue it any further
  • We will attempt to implement an approach (Hachey) that does not suffer from this known failure, or move on and call Lanczos a bust
  • We talked about what needs to be done for CS280
  • We wrote up a summary so I can write a proposal and turn it in to the department
  • We wrote my paragraph summarizing the semester

November 30, 2015

  • I demonstrated the Lanczos algorithm PHP code I converted from Java
  • I showed that I can produce the same result from the same input
  • We also tried to run the original summarizer on some arbitrary text
  • It did not run as expected
  • We were getting NaN results and I am going to look into it
  • We reviewed my latest patch and Dr. Pollett committed the code to the global repository
  • We talked about next semester
  • I need to post my deliverables
  • I need to write a conclusion of this semester
  • Next week we will come up with the CS280 summary for the proposal submittal on the CMS Detector GUI topic

November 23, 2015

  • I sent the drafts of the deliverables to Dr. Pollett
  • He took a quick look at my papers and said they are okay
  • I had questions about the QR decomposition method that was being used in the Lanczos algorithm I was converting
  • Dr. Pollett explained QR decomposition to me and it made sense why my conversion of the code was failing
  • Dr. Pollett recommends padding the triangular array with 0s below the diagonal
  • We are near the end of the semester so we will spend time discussing what is next in our next meeting
  • I will continue to work on converting the Lanczos algorithm and hopefully complete the conversion

November 16, 2015

  • We discussed my current status on writing the Lanczos algorithm
  • I had converted the code and am troubleshooting a bug
  • We troubleshot a little with no success
  • We discussed the write-up for deliverable 1 and 2
  • This week I will proofread what I have written and send it to Dr. Pollett for review

November 09, 2015

  • We discussed the current status of the thesis
  • I completed the updates to the patch I submitted
  • I started writing the PHP code for implementing the Lanczos algorithm

November 02, 2015

  • We discussed the current status of the thesis
  • I still need to make updates to the patch I submitted
  • I wrote a rough draft for deliverable 2
  • I looked into the Lanczos algorithm and understand what needs to be done

October 26, 2015

  • We discussed the current status of the thesis
  • I need to make updates to the patch I submitted
  • I need to resave the test input files because they have strange unicode characters in them
  • I need to remove the dependency on the simple_dom class
  • I need to write a rough draft for deliverable 2
  • For Deliverable 3, we are going to choose to implement the Lanczos algorithm instead

October 19, 2015

  • No meeting this week

October 12, 2015

  • We looked at the HTML processor and I showed that all summarizers are using the detector
  • We looked at the ROUGE results to compare against the non-detected scores
  • We looked at the DUC scores and noticed the graph-based one had the best numbers for the ROUGE-1 test
  • Next I will create a patch and write up the documentation

October 05, 2015

  • We discussed if I can create another CMS detector
  • I will try to create a CMS detector for Drupal
  • We discussed the work I have done to figure out the DUC document ROUGE tests
  • I will finish the code to get the summaries in order and run the DUC document ROUGE tests for each summarizer
  • We discussed how we can quantify and understand why the weighting is not affecting the results
  • Up to this point, tag weighting or learning weights is not going to buy us anything
  • We may try looking into weighting based on parts of speech
  • We think we know what we can quantify
  • Scores did not increase with weighting but went way up with the detector, because summarizing the important content is the silver bullet
  • The above is quantifiable
  • We discussed the coding aspect of the detectors
  • Dr. Pollett feels manually coding a detector is fine, but wonders whether code can learn a CMS detection system
  • Can we figure out a way to set up such a system so that we do not have to code any further CMS detectors
  • If I have time, I will review some Drupal and Wordpress sites to see if I can find the tags that are in the content div and count them

September 28, 2015

  • I have completed the first CMS detector
  • It is for Wordpress because the documents we are using for our sample just happen to be built by Wordpress
  • I have also written tests for the Wordpress CMS detector
  • The test checks the headers of 8 files to see if they are Wordpress or not
  • The files do not have any content in the body, only the headers
  • The tests were difficult because I had to configure composer and PhantomJS
  • We discussed how CMS framework detectors will be enabled
  • I will change the CMS detector files to have a distinct naming convention
  • Each detector in the directory will be loaded based on that distinct file name
  • We discussed that each detector will have a method that will return the important section of the html to remove extraneous data from being summarized
  • We discussed that each detector will add additional weights to content that is in certain tags based on the work we did last semester
  • The only caveat is that I will remove the a tag from the query, then weight each in increments of 1 for starters
  • I figured out how the DUC documents are constructed
  • There is a cluster of documents that are summarized as a whole
  • Then the first 100 words are extracted for the summary used in the ROUGE tests
  • When I do the DUC ROUGE test, I will remove the summaries from the other contestants and add ours
  • To get to the ROUGE tests, code needs to be written so I can run our summarizer against the data
  • The cluster of documents consists of XML files with a .scu extension
  • I wrote code to convert the .scu files to a flat text file
  • Dr. Pollett suggested I change my uppercase boolean values to lower case
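The header-based check the tests exercise can be sketched. A common heuristic (an assumption here, not necessarily the detector's exact logic) is to look for the meta generator tag or wp-content asset paths in the page head:

```python
def looks_like_wordpress(html_head):
    """Heuristic Wordpress check on a page's head section; the markers
    here are common Wordpress conventions, not Yioop's exact detector."""
    head = html_head.lower()
    return 'name="generator" content="wordpress' in head or "/wp-content/" in head

print(looks_like_wordpress('<meta name="generator" content="WordPress 4.4" />'))  # True
print(looks_like_wordpress('<meta name="generator" content="Drupal 7" />'))       # False
```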

September 21, 2015

  • We discussed the ROUGE results of the new Weighted Centroid Summarizer (WCS)
  • The results were better than my Graph Based Summarizer (GBS) but not the Centroid Based Summarizer (CBS) or Basic Summarizer (BS)
  • We discussed the next steps for the WCS
  • I will work on a framework that can detect a web page's framework
  • For example, a site could be built using Wordpress or Drupal etc
  • Then we would check to see where each framework puts its most important content and weight that content higher
  • Dr. Pollett suggested I start looking for where the getPages() method is called and insert the code there and pass its detection downstream
  • After the work is completed I will rerun the ROUGE tests
  • If I get time I will look more into the DUC data

September 14, 2015

  • We reviewed the code I wrote up to this point
  • Dr. Pollett feels the code is solid and now we want to work on adding weights
  • Dr. Pollett would like me to write code to detect a page's framework
  • For example, detect if a page was deployed via Wikipedia or Wordpress
  • After the framework is detected we would do searches in the page source
  • External configurations would be needed, because certain tags would weigh more for different frameworks
  • Each tag of importance would then get an increased weight
  • This would also help gather statistics about what percentage of web sites are really wordpress or wikipedia etcetera
  • Since the code is good up to this point, I will run the ROUGE tests for the weighted summarizer and see how the results compare to the other Yioop summarizers
  • Lastly, we looked more at the DUC data
  • We still have not found a good way to automate the extraction of the content and their summaries
  • We did find that there is some correlation between the file names of the summary and the folder of the content

September 09, 2015

  • We discussed what was left to do on the weighted centroid summarizer
  • We looked deep into the current summarizer because Dr. Pollett thought I redid what was already done in the current centroid summarizer
  • After an in depth look at the current centroid summarizer, we found out that it was not what I had written
  • It was something similar using Inverse Document Frequency (IDF) and cosine similarity
  • I am to create a vector of frequencies in each sentence and then calculate how close each sentence resembles the average sentence
  • We got the DUC documents
  • We looked at the documents briefly and chose to look at them more closely during the next meeting

August 31, 2015

  • Discussed next week's meeting day because Monday, September 7 is Labor Day
  • We will meet on Wednesday next week instead
  • We discussed getting access to the AQUAINT, TIPSTER and TREC datasets
  • Dr. Pollett ran it by the chair and he was good with it
  • Dr. Pollett filled out the forms
  • I will scan them when I get home and email them to Dr. Pollett
  • We discussed the Yioop implementation of Composer
  • Yioop uses it to keep its libraries current
  • I will be referencing its configuration when I add code to the Yioop repository
  • Since we have to wait for the datasets to be provided, we discussed working on weighting the centroid based algorithm
  • I will create a new WeightedCentroidBased summarizer
  • I need to create or verify:
  • There is a vector where each row in the vector (associative array) is the frequencies of the words in that sentence
  • There is a normalized vector; the goal is to have each value between zero and one, with the squared values summing to 1
  • To get it normalized we divide the original by the square root of the sum of the values squared for every row (sentence)
  • There is an average sentence vector: add the columns and divide by the number of rows to get the average sentence
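The three items to verify can be sketched numerically (illustrative Python; the actual implementation is PHP inside Yioop):

```python
import math

def term_frequencies(sentence):
    """Row of the vector: word frequencies for one sentence."""
    freqs = {}
    for word in sentence.lower().split():
        freqs[word] = freqs.get(word, 0) + 1
    return freqs

def normalize(freqs):
    """Divide each entry by the row's Euclidean length, so the squared
    values sum to 1 and each value lies between zero and one."""
    length = math.sqrt(sum(v * v for v in freqs.values()))
    return {word: v / length for word, v in freqs.items()}

def average_sentence(rows):
    """Add the columns (terms) across rows and divide by the row count."""
    total = {}
    for row in rows:
        for word, v in row.items():
            total[word] = total.get(word, 0) + v
    return {word: v / len(rows) for word, v in total.items()}

rows = [normalize(term_frequencies(s))
        for s in ["the cat sat", "the dog sat down"]]
avg = average_sentence(rows)
```

Closeness of each sentence to `avg` (e.g. by dot product) then ranks the sentences.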

August 25, 2015

  • We confirmed our meeting time to be Mondays at 12:00
  • We discussed that I need to download the latest Yioop git repository
  • I need to check http://www.seekquarry.com/p/Composer to read about the new composer Dr. Pollett Implemented
  • I need to start on deliverable 1
  • I need to find a large document set
  • I was thinking of using the DUC 2002 data set

May 13, 2015

  • We went over my report
  • I need to fix some typos, reword a few things, rewrite the first three sentences of the introduction and resubmit it
  • We discussed that I have confirmed my two instructors to be on my CS299 committee
  • We looked over my CS299 proposal
  • Dr. Pollett put his rubber stamp on it
  • We also talked about the Summly summarizing algorithm and that we may not be able to get a hold of it
  • I need to start filling out the CS299 paperwork and put it in Dr. Pollett's box to get the proper signatures

May 6, 2015

  • We went over the patch I created
  • I added the ability to output the summarizer results to a file
  • Dr. Pollett suggested I move the Boolean flag and output folder variable to the top of the file as constants
  • Dr. Pollett also scanned through the code and suggested I change my camel case variable names to multi-word variable names with an underscore between each word
  • We discussed my CS297 final report
  • Dr. Pollett will review it and give me some feedback next week
  • At that point, I will update the report and resubmit it
  • We discussed what is needed to begin CS299
  • By next week I need to have solicited instructors to be on my CS 299 committee
  • I need 2 instructors
  • I also need to fill out the 2 page CS299 proposal template
  • I will use the last paragraph of my final report, plus bullet points, for the CS299 proposal

Apr 29, 2015

  • We reviewed the regex again because it was not matching multiples
  • We decided to add parentheses to the inside of the anchor tags; (?s)<a.*?>(.*?the.*?)<\/a> and then use the substr_count() function to get an accurate frequency count
  • I will update the code and the test cases to match
  • We talked about the CS 297 report and the CS 299 topic
  • I need to use the following format
  • A one page intro
  • A paragraph describing each section
  • Trim what is posted to two pages for each deliverable
  • and a one page conclusion that encompasses what will be done in CS 299
  • For CS 299, we want to keep improving on summarizers
  • We want to do experiments that will generate a paper
  • The key will be to get the paper done well before graduation
  • We want to compare a few different summarizer methods, one of which is different from any that have been looked at before
  • For example, a graph summarizer, the centroid summarizer with weights and look into the one the teenager sold to Yahoo
  • We also want to do more extensive experiments like using millions of documents to basically use a larger data set
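The regex fix above (capturing inside the anchor tags, then counting occurrences of the term in each capture) has this shape; a Python analogue of the PHP approach, for illustration:

```python
import re

# Capture anchor-tag contents with a DOTALL ((?s)) lazy pattern, then
# count occurrences of the term inside each capture, mirroring
# PHP's substr_count() step.
pattern = re.compile(r"(?s)<a.*?>(.*?the.*?)</a>")
html = '<a href="#">the quick fox</a> <a href="#">over the lazy dog</a>'
captures = pattern.findall(html)
frequency = sum(capture.count("the") for capture in captures)
print(frequency)  # 2
```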

Apr 22, 2015

  • We discussed and confirmed that I had the correct regular expression to find matching words in the 6 html tags we plan on using for weighting
  • We talked about the weighting algorithm in detail because it does not seem to be contributing to new summary results
  • I convinced Dr. Pollett that I am using the weighting algorithm correctly for the graph based summarizer
  • Since the weighting algorithm has not produced new results Dr. Pollett is suggesting two experiments
  • To make the summaries shorter for example, 500 bytes
  • Use the summaries Mangesh Dahale used in his research in order to get a set with more html tags
  • Make some unit tests for the weighting algorithm to make sure it is producing the correct output

Apr 15, 2015

  • We looked into how to automate web site input to generate summaries for the regression testing
  • Dr. Pollett pointed me to the crawl_component.php file to look at the pageOptions method
  • $_REQUEST[option_type] value has to be test_options
  • $_REQUEST[TESTPAGE] value has to be the data (summary text) to test
  • $_REQUEST[page_type] = text/html
  • Dr. Pollett suggested I change the name of the basic_summarizer.php file to scrape_summarize.php
  • Dr. Pollett suggested I change the base_summarizer.php file to just summarizer.php
  • Dr. Pollett suggested I change the regexes to handle new lines
  • Dr. Pollett suggested I change the regex not to match on string like: <h1>ssfsd</h1>do not match this<h1>ssfsd</h1>
  • Dr. Pollett wants me to calculate all of the weights and then use the formula: sum over tags of (tag weight times frequency in that tag) to get the additional weight
  • My Queue server is not starting and I need to check starting the queue server from the command line
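The weighting formula Dr. Pollett described can be sketched as follows. The tag weights here are hypothetical placeholders; the actual values were to be tuned experimentally:

```python
# Hypothetical tag weights for illustration; real values were tuned by experiment.
TAG_WEIGHTS = {"h1": 3, "h2": 2, "title": 3, "b": 1, "em": 1, "a": 1}

def additional_weight(term_freq_by_tag):
    """Sum of (tag weight * frequency of the term in that tag)."""
    return sum(TAG_WEIGHTS.get(tag, 0) * freq
               for tag, freq in term_freq_by_tag.items())

# A term appearing twice in h1 tags and once in a bold tag: 3*2 + 1*1
print(additional_weight({"h1": 2, "b": 1}))  # 7
```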

Apr 8, 2015

  • We talked about the paper I found and the weighting algorithm we are going to implement
  • We plan on using the 6 categories which were described in the paper
  • I will do some regression testing
  • The plan is to start each category at 0 and increase each value until we do not see an improvement
  • Then repeat by setting each to the base value and incrementing one category at a time
  • After we get the results, I will try to use those values in combination
  • We discussed how to organize the summarizers to include the weighting algorithm
  • Since the weighting algorithm will be common to all summarizers that use term frequencies I will move the summarizers to their own folder
  • The basic summarizer is not in its own file so I will find the basic code and give it its own file
  • Then I will create a base class that has weighting method(s)
  • I need to update the few places where it loads these files by searching for the require lines
  • We also discussed a long term goal of adding ROUGE to Yioop
  • The goal would be to get Yioop to participate in text retrieval contests and make a paper out of it for the Text REtrieval Conference (TREC)

Apr 1, 2015

  • We took a look at the graph-based summarizer I wrote
  • Dr. Pollett did a few tests, made some minor modifications and committed the new code to the repository
  • We discussed my Deliverable 3 post
  • Dr. Pollett made a suggestion to add a "for example" sentence in the conclusion about the BS and CBS ROUGE results
  • We went over the two page rank fixes to some known issues
  • Issue one was when all web pages point to one page and that web page does not point out to other web sites
  • Issue two was when you have a disconnected cluster of web pages
  • After our discussion, we determined that we did not suffer from those issues and chose not to include the fixes
  • We discussed what I would do for Deliverable 4
  • We looked into the current centroid code to see where I need to make the changes
  • I believe I know where the changes need to be made
  • I have to make sure I do not slow down the algorithm, as it is already pretty slow
  • The idea is to add weights to certain terms depending on where they are in the html document
  • For example increase the word frequency for a word in a h1 tag by 3
  • I will do some research to see if there is a method to this madness solved already
  • I will check to see what are the most important spots in a html doc by tag type
  • I may try to weight by where the word appears; beginning, middle or end
  • I could also try to use the sentence length to add weight
  • The goal is to increase the frequency of those words thus increasing the weight
  • I can also look at Fall 2012 September 12 slide 4 of Dr. Pollett's CS 267 class

Mar 25, 2015

  • No meeting ... Spring Break

Mar 18, 2015

  • I completed most of the coding on the Graph Based Summarizer
  • We discussed the sentence rank algorithm part
  • After the discussion we had the code needed to complete the algorithm
  • In order to get the new Graph Based summarizer into Yioop we went over how to search the repository
  • I will use the code_tool.php script to help with that
  • For example, to find where the centroid summarizer is hooked in, search using "php code_tool.php search .. centroid" from the bin directory of the code base
  • We also discussed that after I get it into Yioop, I will perform the same summary tests as in Deliverable 1

Mar 11, 2015

  • We reviewed the presentation I made for the graph based algorithm I am working on
  • We went over the PageRank algorithm and adjacency matrix portion thoroughly
  • Dr. Pollett recommended I look at Who's #1? and Google's PageRank and Beyond by Amy Langville and Carl Meyer
  • We discussed the code I will be writing
  • I will leverage an existing stop words, sentence splitter and punctuation methods in the PhraseParser
  • There is also a frequency matrix creator that I may use in the PhraseParser
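The PageRank-over-adjacency-matrix portion we reviewed can be sketched as power iteration. This is a generic illustration on a tiny sentence graph, not the Yioop code:

```python
def page_rank(adjacency, damping=0.85, iterations=50):
    """Power iteration over an adjacency matrix: each round, every node's
    score is split evenly along its outgoing edges, plus a damping base."""
    n = len(adjacency)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = [(1 - damping) / n] * n
        for i in range(n):
            out_degree = sum(adjacency[i])
            if out_degree == 0:
                continue  # dangling node: keeps only the base share
            for j in range(n):
                if adjacency[i][j]:
                    new_ranks[j] += damping * ranks[i] / out_degree
        ranks = new_ranks
    return ranks

# Three sentences; sentence 2 is pointed to by both others, so it ranks highest.
adjacency = [[0, 0, 1], [0, 0, 1], [1, 0, 0]]
ranks = page_rank(adjacency)
print(max(range(3), key=lambda i: ranks[i]))  # 2
```

In a graph-based summarizer the adjacency entries come from sentence similarity, and the top-ranked sentences form the summary.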

May 6, 2015

  • We went over the patch I created
  • I added the ability to output the summarizer results to a file
  • Dr. Pollett suggested I move the Boolean flag and output folder variable to the top of the file as constants
  • Dr. Pollett also scanned through the code suggested I change my camel case variable names to multi-word variable names with an underscore between each word
  • We discussed my CS297 final report
  • Dr. Pollett will review it and give me some feedback next week
  • At that point, I will update the report resubmit it
  • We discussed what is needed to begin CS299
  • By next week I need to have solicited instructors to be on my CS 299 committee together
  • I need 2 instructors
  • I also need to fill out the 2 page CS299 proposal template
  • I will use the last paragraph of my final report for CS299 and bullet points

Apr 29, 2015

  • We reviewed the regex again because it was not matching multiples
  • We decided to add parenthesis to the inside of the anchor tags; (?s)<a.*?>(.*?the.*?)<\/a> and then use substr_count() function to get an accurate frequency count
  • I will update the code and the test cases to match
  • We talked about the CS 297 report and the CS 299 topic
  • I need to use the following format
  • A one page intro
  • A paragraph describing each section
  • Trim what is posted to two pages for each deliverable
  • and a one page conclusion that encompasses what will be done in CS 299
  • For CS 299, we want to keep improving on summarizers
  • We want to do experiments that will generate a paper
  • The key will be to get the paper done well before graduation
  • We want to compare a few different summarizer methods one of which is different than any that others that have been looked at
  • For example, a graph summarizer, the centroid summarizer with weights and look into the one the teenager sold to Yahoo
  • We also want to do more extensive experiments like using millions of documents to basically use a larger data set

Apr 22, 2015

  • We discussed and confirmed that I had the correct regular expression to find matching words in the 6 html tags we plan on using for weighting
  • We talked about the weighting algorithm in detail because it does not seem to be contributing to new summary results
  • I convinced Dr. Pollett that it I am using the weighting algorithm correctly for the graph based summarizer
  • Since the weighting algorithm has not produced new results Dr. Pollett is suggesting two experiments
  • To make the summaries shorter for example, 500 bytes
  • Use the summaries Mangesh Dahale used in his research in order to get a set with more html tags
  • Make some unit tests for the weighting algorithm to make sure it is producing the correct output

Apr 15, 2015

  • We looked into how to automate web site input to generate summaries for the regression testing
  • Dr. Pollet pointed me to the crawl_component.php file to look at the pageOptions method
  • $_REQUEST[option_type] value has to be test_options
  • $_REQUEST[TESTPAGE] value has to be the data (summary text) to test
  • $_REQUEST[page_type] = text/html
  • Dr. Pollett suggested I change the name of the basic_summarizer.php file to scrape_summarize.php
  • Dr. Pollett suggested I change the base_summarizer.php file to just summarizer.php
  • Dr. Pollett suggested I change the regexes to handle new lines
  • Dr. Pollett suggested I change the regex not to match on string like: <h1>ssfsd</h1>do not match this<h1>ssfsd</h1>
  • Dr. Pollett wants me to calculate all weights then and use the formula: sum of weights times frequency in that tag to get the additional weight
  • My Queue server is not starting and I need to check starting the queue server from the command line

Apr 8, 2015

  • We talked about the paper I found and the weighting algorithm we are going to implement
  • We plan on using 6 categories which was described in the paper
  • I will do some regression testing
  • The plan is to start each category at 0 and increase each value until we do not see an improvement
  • Then repeat by setting each to the base value and incrementing one category at a time
  • After we get the results, I will try to use those values in combination
  • We discussed how to organize the summarizers to include the weighting algorithm
  • Since the weighting algorithm will be common to all summarizers that use term frequencies I will move the summarizers to their own folder
  • The basic summarizer is not in its own file, so I will find the basic code and move it into its own file
  • Then I will create a base class that has weighting method(s)
  • I need to update the few places where it loads these files by searching for the require lines
  • We also discussed a long term goal of adding ROUGE to Yioop
  • The goal would be to get Yioop to participate in text retrieval contests and write a paper about it for the Text REtrieval Conference (TREC)
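
The one-category-at-a-time search plan can be sketched as a simple coordinate-style loop. This is only an illustration: the score function is a toy stand-in (the real runs would score the resulting summaries with ROUGE), and the category names are hypothetical.

```python
# Hypothetical category names; real runs would rebuild summaries and
# score them with ROUGE instead of the toy score() used here.
CATEGORIES = ["h1", "h2", "h3", "title", "b", "em"]

def tune(score, step=1, max_weight=5):
    """Start every category weight at 0, then raise one category at a
    time, keeping an increase only while the score improves."""
    weights = {c: 0 for c in CATEGORIES}
    best = score(weights)
    for cat in CATEGORIES:
        while weights[cat] + step <= max_weight:
            trial = dict(weights, **{cat: weights[cat] + step})
            trial_score = score(trial)
            if trial_score <= best:   # stop at first non-improvement
                break
            weights, best = trial, trial_score
    return weights, best

# Toy score that peaks at h1=3, title=2, just to exercise the loop
def toy_score(w):
    return -((w["h1"] - 3) ** 2) - ((w["title"] - 2) ** 2)

weights, best = tune(toy_score)
assert weights["h1"] == 3 and weights["title"] == 2
```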

Apr 1, 2015

  • We took a look at the graph-based summarizer I wrote
  • Dr. Pollett did a few tests, made some minor modifications and committed the new code to the repository
  • We discussed my Deliverable 3 post
  • Dr. Pollett suggested adding a "for example" sentence in the conclusion about the BS and CBS ROUGE results
  • We went over the two page rank fixes to some known issues
  • Issue one was when all web pages point to one page and that page does not point out to other web sites
  • Issue two was when you have a disconnected cluster of web pages
  • After our discussion, we determined that we did not suffer from those issues and chose not to include the fixes
  • We discussed what I would do for Deliverable 4
  • We looked into the current centroid code to see where I need to make the changes
  • I believe I know where the changes need to be made
  • I have to make sure I do not slow down the algorithm, as it is already pretty slow
  • The idea is to add weights to certain terms depending on where they are in the html document
  • For example, increase the word frequency for a word in an h1 tag by 3
  • I will do some research to see if someone has already worked out a method for this
  • I will check which are the most important spots in an html doc by tag type
  • I may try to weight by where the word appears; beginning, middle or end
  • I could also try to use the sentence length to add weight
  • The goal is to increase the frequency of those words thus increasing the weight
  • I can also look at Fall 2012 September 12 slide 4 of Dr. Pollett's CS 267 class
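
Although we chose not to include the fixes, for reference the standard way PageRank handles both issues is teleportation (the damping term) plus redistributing a dangling page's rank uniformly. A minimal Python sketch, not Yioop's code:

```python
def pagerank(links, damping=0.85, iters=100):
    """links[i] is the list of pages that page i links to (0..n-1)."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Teleportation term: lets rank flow between disconnected clusters
        new = [(1.0 - damping) / n] * n
        for i, outs in enumerate(links):
            if outs:
                share = damping * rank[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:
                # Dangling page with no outlinks: spread its rank
                # uniformly instead of letting it leak out of the system
                for j in range(n):
                    new[j] += damping * rank[i] / n
        rank = new
    return rank

# Issue one: pages 0 and 1 both point at page 2, which points nowhere
ranks = pagerank([[2], [2], []])
assert abs(sum(ranks) - 1.0) < 1e-9   # no rank is lost to the sink
assert ranks[2] == max(ranks)         # the sink page still ranks highest
```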

Mar 25, 2015

  • No meeting ... Spring Break

Mar 18, 2015

  • I completed most of the coding on the Graph Based Summarizer
  • We discussed the sentence rank algorithm part
  • After the discussion we had the code needed to complete the algorithm
  • In order to get the new Graph Based summarizer into Yioop we went over how to search the repository
  • I will use the code_tool.php script to help with that
  • For example, to find where the centroid summarizer is hooked in, search using "php code_tool.php search .. centroid" from the bin directory of the code base
  • We also discussed that after I get it into Yioop, I will perform the same summary tests as in Deliverable 1
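
A graph-based extractive summarizer of the kind described can be sketched in a few lines. This TextRank-style Python version is illustrative only, not the Yioop PHP implementation; the sentence splitter and term-overlap similarity are simplistic stand-ins for what PhraseParser provides.

```python
import re

def summarize(text, k=2, iters=30, d=0.85):
    """Sketch of a graph-based extractive summarizer: rank sentences by
    power iteration over a term-overlap similarity graph, keep top k."""
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [set(re.findall(r"\w+", s.lower())) for s in sents]
    n = len(sents)
    # Edge weight = number of terms two sentences share
    sim = [[len(words[i] & words[j]) if i != j else 0 for j in range(n)]
           for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1 - d) / n + d * sum(sim[j][i] * rank[j] / max(sum(sim[j]), 1)
                                      for j in range(n))
                for i in range(n)]
    top = sorted(range(n), key=lambda i: -rank[i])[:k]
    return " ".join(sents[i] for i in sorted(top))

text = "Cats chase mice. Mice fear cats. Dogs bark loudly."
# The two cat/mice sentences reinforce each other in the graph,
# so one of them beats the unconnected dog sentence.
assert summarize(text, k=1) == "Cats chase mice."
```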

Mar 11, 2015

  • We reviewed the presentation I made for the graph based algorithm I am working on
  • We went over the page rank algorithm and adjacency matrix portion thoroughly
  • Dr. Pollett recommended I look at Who's #1? and Google's PageRank and Beyond by Amy Langville and Carl Meyer
  • We discussed the code I will be writing
  • I will leverage the existing stop word, sentence splitter, and punctuation methods in the PhraseParser
  • There is also a frequency matrix creator that I may use in the PhraseParser
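
A frequency matrix of that kind can be sketched as follows. This is an illustrative Python version, not PhraseParser's actual method, and the stop word list is a placeholder:

```python
import re

def frequency_matrix(sentences, stop_words=frozenset({"the", "a", "an"})):
    """Rows are terms, columns are sentences; entry [t][s] is the count
    of term t in sentence s after stop-word removal."""
    tokenized = [[w for w in re.findall(r"\w+", s.lower()) if w not in stop_words]
                 for s in sentences]
    terms = sorted({w for sent in tokenized for w in sent})
    return terms, [[sent.count(t) for sent in tokenized] for t in terms]

terms, matrix = frequency_matrix(["The cat sat", "A cat ran fast"])
assert terms == ["cat", "fast", "ran", "sat"]
assert matrix[terms.index("cat")] == [1, 1]   # "cat" once in each sentence
```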

Mar 6, 2015

  • Dr. Pollett helped me generate the patch for the Dutch stemmer I wrote
  • I updated the bug/issue I opened to include the patch
  • Dr. Pollett gave the code his initial okay
  • I was reminded that I need to complete the contributor agreement
  • After creating the first patch, we looked at what needed to be changed to offer nl (Nederlands) as a default locale option
  • I will edit the createdb.php file and the config.php file with the appropriate changes
  • I will generate another patch and post it to the bug/issue
  • We discussed the new deliverable
  • I need to use Google Scholar to find papers that have cited "The Automatic Creation of Literature Abstracts" to get better references
  • After finding some new references, I will create a presentation (at most 8 slides) that summarizes the paper in order to demonstrate my understanding
  • We reviewed my deliverable #2 posting
  • Dr. Pollett gave me a few suggestions for referencing websites for better SEO
  • Dr. Pollett also suggested I have an ending period on my references section

Feb 25, 2015

  • We looked at my configuration.ini file for my Nl locale again
  • I was instructed to remove all single quotes from the file
  • I was instructed to remove all of the backslashes
  • We set my locale to Nl on my instance of the Yioop search engine and tested my locale
  • During the test we noticed some of the converted strings needed attention
  • We ran a search and it failed because of the missing segment() method, which I will fix
  • We discussed entering an issue into the Mantis site
  • I created an account and Dr. Pollett upgraded me so I can post patches
  • I will follow the instructions for making a patch
  • We looked at validating the html files I have been posting, e.g. bio, proposal, etc.
  • I will remove all of the validation errors on all existing pages
  • I will continue to validate any page I add or update
  • I will email Dr. Pollett the paper I found for deliverable #3

Feb 18, 2015

  • We discussed the work I have done on the stemmer so far
  • We discussed that I need to add unit tests based on the current Yioop standards
  • We looked at my locale configuration.ini file
  • The Yioop UI is not showing all of the strings translated
  • I have to quote every string and see if the Yioop UI shows all of the settings
  • We discussed the Yioop contributor agreement
  • I will read it and sign or not sign it
  • If I do sign it, I will be able to upload my code for Dr. Pollett to review and post to the production release
  • If I do not sign it I will not be able to get any code I write into the production release

Feb 11, 2015

  • We reviewed my ROUGE configuration and what it took to get meaningful results
  • We reviewed my results for deliverable 1; particularly the ROUGE results
  • We discussed what would be due for each deliverable
  • We discussed what needs to be worked on for the next deliverable
  • We looked at what code needs to be manipulated within the Yioop search engine to incorporate the new stemmer
  • We discussed how to create a new locale with the Yioop search engine
  • We discussed getting some sample code from snowball.tartarus.org

Feb 4, 2015

  • We got this site up and we went over how to view/modify its items
  • We decided on a title for the project
  • I entered my meeting time in the correct place on the wiki page
  • We reviewed the work I have done for my deliverable due next week
  • We discussed the correct approach for handling my CS297 and CS299