CS297-298 Project News Feed
January 5th, 2012: Finish Page Ranking using BM25F equations.
December 6th, 2011: The Fall graduation deadline has passed. Concentrate on finishing the project as early as possible for Spring graduation.
November 29th, 2011: Try not to get sick, and finish as much as possible so that the project is done before the term ends and the defense can be scheduled in the 2nd or 3rd week of January.
November 22nd, 2011: Try to wrap up the project over the Thanksgiving holidays.
November 15th, 2011: Implement the BM25F ranking algorithm. Prepare to start writing the report.
November 1st, 2011: Finish word frequency, the inverted index, and search. Fix the word frequency concatenation issues in the index.
October 25th, 2011: Implement basic search over the index. The basic version does not necessarily use page ranking.
October 18th, 2011: Finish the stemmer and the remaining parts of the inverted index so that everything is done by the deadline.
October 11th, 2011: Use the Snowball Porter stemmer before parsing. Build the inverted index with word frequency.
September 26th, 2011: Finish the inverted index and try to merge results across the various files.
September 20th, 2011: Decide whether to run page ranking or the inverted index first. Finish the second pass of page ranking.
September 13th, 2011: Use JDBM for the on-disk B-tree implementation.
September 6th, 2011: Continue work from last week.
August 30th, 2011: Build a B-tree structure for the page rank vector [hash of URL (8 bytes), page rank (8 bytes)] and for URLs with their outgoing links. On the second pass, for each URL hash in the page rank vector, calculate the new page rank based on the outgoing links for that URL.
May 26th - August 23rd, 2011: Summer Vacation.
April 19th - May 17th, 2011: Work on generating outgoing links and other pending issues.
April 12th, 2011: Try to fix memory issues with storing the inverted index.
April 5th, 2011: Try to fix issues with storing the inverted index.
March 22nd, 2011: Keep working on storing word IDs and doc IDs, and look into the page ranking algorithm.
March 15th, 2011: Store word IDs and doc IDs, and look into the page ranking algorithm.
March 8th, 2011: Move forward with the HTML parser and use the title, h1-h6, first div, first paragraph, and anchor tags. Also look into the Divergence-From-Randomness algorithm for page ranking.
March 1st, 2011: Figure out the issue with reading HTML pages with ARCReader and move forward with the word table and doc table.
February 22nd, 2011: Continue working on GZIP reading for ARC files.
February 15th, 2011: Troubleshoot the GZIP reading module for ARC files.
February 8th, 2011: Continue ARC reading and develop an iterator and a processor to extract words.
February 1st, 2011: Create a new project and start reading ARC files.
November 30th, 2010: Prepare the report.
November 23rd, 2010: Try to run NutchWAX and start writing a draft of the report.
November 16th, 2010: Continue working on NutchWAX and prepare a presentation on inverted indexing.
November 9th, 2010: Continue work from last week and learn about basic indexing.
October 26th, 2010: Continue experimenting with NutchWAX and WERA, and try to find out about the indexing format Nutch uses.
October 19th, 2010: Explore the capabilities of NutchWAX and WERA.
October 12th, 2010: Explore the capabilities of NutchWAX.
October 5th, 2010: Continue editing the crawler to store the offset of robots.txt for the page.
September 28th, 2010: Edit the crawler to store the offset of robots.txt for the page.
September 21st, 2010: Explore the capabilities of ARCReader and profile Heritrix.
September 14th, 2010: Experiment more with Heritrix.
September 7th, 2010: Find a way to view ARC files and a way to make it work in a distributed environment.
August 31st, 2010: Decided to prepare a PowerPoint slide show for the next meeting, demonstrating how I built the Heritrix source code and showing a simple crawl result.
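
Illustration for the August 30th, 2011 entry: the two-pass page rank update over a vector of [hash of URL (8 bytes), page rank (8 bytes)] records could look roughly like the sketch below. This is not the project's code. It is written in Python for brevity, a plain dict stands in for the on-disk B-tree, and the names url_hash, pack_entry and second_pass are hypothetical. The feed does not say which hash function or damping factor was used, so truncated MD5 and 0.85 are assumptions.

```python
import hashlib
import struct


def url_hash(url: str) -> int:
    """Return an 8-byte URL key (assumption: first 8 bytes of the MD5 digest)."""
    return struct.unpack(">Q", hashlib.md5(url.encode("utf-8")).digest()[:8])[0]


def pack_entry(key: int, rank: float) -> bytes:
    """Serialize one record as 8 bytes of hash followed by 8 bytes of rank."""
    return struct.pack(">Qd", key, rank)


def second_pass(ranks, out_links, damping=0.85):
    """Compute new ranks from the current ranks and each URL's outgoing links.

    ranks:     dict mapping URL hash -> current page rank
    out_links: dict mapping URL hash -> list of target URL hashes
    """
    n = len(ranks)
    # Every page starts the pass with the teleport share.
    new_ranks = {key: (1.0 - damping) / n for key in ranks}
    # Visit keys in sorted order, the order a B-tree scan would produce.
    for src in sorted(ranks):
        targets = [t for t in out_links.get(src, []) if t in ranks]
        if targets:
            share = damping * ranks[src] / len(targets)
            for t in targets:
                new_ranks[t] += share
        else:
            # A page with no known outgoing links spreads its rank evenly.
            share = damping * ranks[src] / n
            for key in new_ranks:
                new_ranks[key] += share
    return new_ranks
```

Calling second_pass repeatedly, feeding each result back in as ranks, approximates the iterative page rank computation. The total rank stays at 1 after every pass as long as the initial ranks sum to 1.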