CS297 Proposal

An Online Version of a HITS-based Search Engine

Amith Kollam Chandranna (amithkc@gmail.com)

Advisor: Dr. Chris Pollett

Description:

This CS297 proposal aims to implement an efficient and fast "online" HITS-based search engine. HITS (Hyperlink-Induced Topic Search), also known as "hubs and authorities", is a link analysis algorithm. The original HITS algorithm requires the crawl to finish before the hub and authority scores are calculated. In this project, we plan to calculate the hub and authority scores online, i.e., at crawl time. We would also like to understand the OPIC (On-line Page Importance Calculation) algorithm, because OPIC implements an online score calculation, but for the PageRank algorithm; this understanding should carry over to the online HITS calculation. The project is proposed to be implemented in PHP and MySQL. Existing search engines (such as Nutch) will be analyzed to gain a good understanding of existing search techniques, which should help in developing more efficient algorithms for retrieving and storing data. Performance comparisons with other search engines are also proposed once the project has been implemented.
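To make the contrast concrete, the following is a minimal sketch of the original (batch) HITS computation, which needs the complete link graph before any score can be computed. It is written in Python purely for illustration (the project itself is proposed in PHP and MySQL), and the function name `hits`, the dictionary-based graph format, and the fixed iteration count are assumptions made for this example, not part of the proposal.

```python
import math


def hits(graph, iterations=50):
    """Batch HITS on an already-crawled link graph (illustrative sketch).

    graph maps each page to the list of pages it links to.
    Returns two dicts (hub, authority), each normalized to unit L2 length.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority update: sum the hub scores of all pages linking to p.
        auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        # Hub update: sum the authority scores of all pages p links to.
        hub = {p: sum(auth[d] for d in graph.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}

    return hub, auth


if __name__ == "__main__":
    # Hypothetical three-page graph: "a" and "b" both link to "c".
    hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
    print("hub:", hub)
    print("authority:", auth)
```

Because every iteration sweeps over all links, this form cannot start until crawling has finished; the online variant proposed here would instead update the scores as each page is fetched.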

Schedule:

Week 1: Jan. 25-31. Understanding the PageRank algorithm
Week 2: Feb. 1-7. Understanding the On-line Page Importance Calculation (OPIC) algorithm
Week 3: Feb. 8-14. Understanding the HITS algorithm
Week 4: Feb. 15-21. Understanding the HITS algorithm
Week 5: Feb. 22-28. Understanding the search engine implemented by Dr. Chris Pollett
Week 6: Mar. 1-7. Understanding the implementation details of other search engines (Nutch, etc.)
Week 7: Mar. 8-14. Researching ways to efficiently traverse web pages
Week 8: Mar. 15-21. Researching ways to calculate on-line web page scores efficiently and quickly
Week 9: Mar. 22-28. Deciding on an efficient framework to implement these in PHP and MySQL
Week 10: Mar. 29-Apr. 4. Setting up the required development/test environment
Week 11: Apr. 5-11. Researching the algorithms that could be implemented
Week 12: Apr. 12-18. Researching the algorithms that could be implemented
Week 13: Apr. 19-25. Efficiency comparison with other search engines
Week 14: Apr. 26-May 2. Develop a draft version of the High-level Design document
Week 15: May 3-9. Incorporate comments (if any) into the design document
Week 16: May 10-16. Deciding the project schedule for CS298

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Understanding of the PageRank, OPIC and HITS algorithms.

2. Understanding the implementation of already existing search engines.

3. Deciding the algorithms that would be implemented in CS298.

4. High-level Design document.

5. Tentative project schedule for CS298.
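As background for the first deliverable, the online idea behind OPIC can be sketched as follows: each page holds some "cash", and whenever the crawler visits a page, that cash is recorded in the page's history and then split equally among its outlinks. The Python sketch below is illustrative only. The function name `opic`, the graph format, and the visit order are assumptions for this example, and pages without outlinks are handled by spreading their cash over all pages, which is a simplification chosen here rather than a detail taken from the proposal.

```python
def opic(graph, visits):
    """Estimate page importance online, OPIC-style (illustrative sketch).

    graph maps each page to the list of pages it links to.
    visits is the order in which the crawler fetches pages.
    Returns a dict of importance estimates that sums to 1.
    """
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    cash = {p: 1.0 / len(pages) for p in pages}   # total cash stays at 1
    history = {p: 0.0 for p in pages}             # cash seen at past visits

    for page in visits:
        amount = cash[page]
        history[page] += amount
        cash[page] = 0.0
        # Simplification (assumption): a page without outlinks gives its
        # cash to every page, so that no cash is lost.
        targets = graph.get(page) or pages
        share = amount / len(targets)
        for target in targets:
            cash[target] += share

    # Estimate: accumulated history plus current cash, normalized.
    total = sum(history.values()) + sum(cash.values())
    return {p: (history[p] + cash[p]) / total for p in pages}


if __name__ == "__main__":
    # Hypothetical graph and a round-robin crawl order.
    graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
    print(opic(graph, ["a", "b", "c"] * 100))
```

The point of interest for this project is that scores are refined after every single page visit, with no separate pass over the whole graph; the online HITS calculation would need a comparable update rule for hub and authority scores.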

References:

[Lieb 2009] The Truth About Search Engine Optimization. Rebecca Lieb. Upper Saddle River, N.J.: FT Press. 2009.

[Levitin 2007] Introduction to the Design and Analysis of Algorithms. Anany Levitin. Boston: Pearson Addison-Wesley. 2007.

[Langville 2006] Google's PageRank and Beyond: The Science of Search Engine Rankings. Amy N. Langville and Carl D. Meyer. Princeton, N.J.: Princeton University Press. 2006.