Chris Pollett > Students >

    ( Print View )


    [Project Blog]

    [CS297 Proposal]

    [Deliverable 1]

    [Deliverable 2]

    [Deliverable 3]

    [Deliverable 4]

    [CS297 Report - PDF]

    [CS298 Proposal]

    [CS298 Progress Report]

    [CS298 Report - PDF]

    [CS298 Presentation - PDF]


Deliverable 1 - Understanding Search Engine


The goal of this deliverable is to understand Dr.Pollett's Search Engine.

Report on the study

To run the search engine, Apache, PHP and Mysql need to be installed. The database to be maintained needs to be built first. Seed sites are provided to start the crawling. Crawling is done to populate the database with websites. Page rank algorithm is used to calculate the page ranks.

The crawler begins by creating the required tables in the database. It grabs a crawl page from the list of websites to be crawled and checks for the robots.txt file.If no robots.txt file exists, it proceeds to process for summary data. It then extracts the title, description and links from the web page. The crawler then updates the list to be crawled with links extracted. The summary of the page is then stored in the database. The process continues until either a certain number of URLs or a certain depth is reached. An adjacency matrix is maintained which represents all pages that link to a given page.

Once crawling is done, the page rank algorithm is implemented to calculate the pageranks of the pages. The adjacency matrix is converted into a stochastic matrix to preserve probabilities. An initial vector of all 1’s is chosen. The vector is multiplied with the stochastic matrix until it converges. Some adjustments to the matrix are made to handle dangling nodes and strongly connected components. Dangling nodes are the nodes with no outgoing links.They are handled by assuming that they are connected to every other node with some probability.Strongly connected components are parts of the web graph which do not have any outgoing links to the rest of the graphs. A random surfer adjustment is made to handle strongly connected components.

Presentation Slides

My Understanding of Dr.Pollett's Search Engine[PDF]