Chris Pollett > Students >
Amith

    ( Print View )

    [Bio]

    [Project Blog]

    [CS297 Proposal]

    [Del 1: OPIC Algorithm implementation]

    [Del 2: SALSA Algorithm & Nutch]

    [Del 3: Nutch implementation]

    [Del 4: HITS Algorithm implementation]

    [CS297 Report - PDF]

    [CS298 Proposal]

    [CS298 Final Report - PDF]

    [CS298 Project Source Code - ZIP]

                          

























CS298 Proposal

An Online version of Hyperlink-Induced Topic Search (HITS) based search engine

Amith Kollam Chandranna (amithkc@gmail.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Tsau Young Lin (tylin@cs.sjsu.edu) and Prof Mark Stamp (stamp@cs.sjsu.edu)

Abstract:

In general, search engines perform the ranking of the web pages in an offline mode, which is after the web pages have been retrieved and stored in the database.The existing "Hyperlink-Induced Topic Search" algorithm (HITS) operates in an offline mode of page rank calculation. In this project, we will implement an online mode of page ranking for this algorithm. Performing online ranking offers several benefits. Retrieving web pages and calculating ranking of web pages can be implemented using parallel processing. Also, the disk input/output time is minimized as the ranking is done during runtime by accessing the graph matrix in the main memory. This will improve the overall efficiency of the algorithm. The project also includes creating small test models, researching related journal articles, analyzing reference guides, generating test cases, and comparing test results with other models. This will help to understand the overall processes involved in successfully implementing the modified HITS algorithm.

CS297 Results:

  • Simulated the working of the Online Page Importance Calculation (OPIC) algorithm by creating a Javascript tool.
  • Analyzed ways to implement the Power Method in HITS algorithm.
  • Simulated the working of the Hyperlink-induced Topic Search (HITS) algorithm by creating a Javascript tool.
  • Created presentation slides related to Stochastic Approach for Link-Structure Analysis (SALSA) algorithm and described the implementation and working details of Nutch search engine.
  • Implemented a working model of Nutch search engine on Windows platform.

Proposed Schedule:

Week 1: Aug.26-Sep. 2Test environment setup
Week 2 & 3: Sep.2-15Analyze efficient ways to implement Online ranking module in HITS algorithm
Week 4 & 5: Sep.16-30Implement a working model of the basic HITS algorithm in PHP
Week 6 & 7: Oct.1-13Modify the current HITS algorithm implementation to incorporate the ranking mode
Week 8 & 9: Oct.14-29Test the modified HITS algorithm
Week 10 & 11: Nov.1-10Documentation of the report work and submission to graduate office and committee.
Week 12 & 13: Nov. 11-25Project report modification based on review comments
Week 14 & 15: Nov. 26-Dec.10Defense and final presentation

Key Deliverables:

  • Software
    • Implement the original version of the HITS algorithm in PHP.
    • Implement the online ranking version of the HITS algorithm in PHP.
    • Comparison of modified HITS algorithm with the original version of HITS algorithm, PageRank, and OPIC
  • Report
    • Detailed description of software deliverables
    • Final report and presentation

Innovations and Challenges:

  • Implementing efficient data structures for HITS algorithm in PHP will be challenging. This is important in deciding the efficieny of the implementation.
  • Comparing the test results of various algorithms

References:

N. Langville, Amy., & D. Meyer, Carl. (2006). Google's PageRank and Beyond. Princeton University Press.

Nomura, Saeko., Toru Ishida, Satoshi Oyama., & Hayamizu, Tetsuo. (2004). Analysis and Improvement of HITS Algorithm for Detecting Web Communities. [Electronic version]. ACM Systems and Computers in Japan, Vol 35, Issue 13, 32 - 42.

Borodin, Allan., O. Roberts, Gareth., S. Rosenthal, Jeffrey., & Tsaparas, Panayiotis. (2005). Link analysis ranking: algorithms, theory, and experiments. [Electronic version]. ACM Transactions on Internet Technology (TOIT), Vol 5, Issue 1, 231 - 297.

Lempel, R., & Moran., S. (2001). SALSA: The Stochastic Approach for Link-Structure Analysis. [Electronic version]. ACM Transactions on Information Systems, Vol. 19, 131 - 160.

Nutch Tutorial. (n.d). Retrieved May 07, 2010 from Apache's web site: http://lucene.apache.org/nutch/tutorial.html