Learning to Crawl, Inverted Indices




CS267

Chris Pollett

Sep. 7, 2011

Outline

Yioop!

Configuration

Crawling

  • Crawling involves two scripts from Yioop!'s bin directory as well as the web interface.
  • Open two command shells and in each go to Yioop!'s bin directory.
  • In one type:
    php fetcher.php terminal
    
    In the other type:
    php queue_server.php terminal
    
    For this to work, the php command must be on your PATH environment variable; otherwise, give the full path to the php executable.
  • Now go back to the admin panel of Yioop! and click on Manage Crawls.
  • If you type in a name for the crawl and click Start New Crawl, Yioop! will start crawling with the current crawl settings.
  • To see and modify these settings, click the Options link next to Start New Crawl.
  • Once you have started a crawl, Yioop! keeps crawling until it either runs out of new URLs and times out, or you click Stop Crawl.
  • After you click Stop Crawl, the crawl is not ready instantly, so please be patient. Do not stop the queue_server or fetcher yet. When Yioop! has properly closed the active crawl, it will appear in the Previous Crawls list on the Manage Crawls page. At that point it should be safe to stop the fetcher and queue_server, if you want.
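
The PATH requirement for the two commands above can be checked before opening the shells. The following is a minimal sketch, not part of Yioop!: the helper names resolve_php and crawl_commands are hypothetical, and it assumes the commands are run from Yioop!'s bin directory as described in the steps.

```python
import shutil


def resolve_php(name="php"):
    """Return the full path of the php executable found on PATH.

    Raises FileNotFoundError when php is not on PATH, which is the case
    where the full path to php has to be given instead.
    """
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f"{name!r} is not on PATH; use the full path to php instead"
        )
    return path


def crawl_commands(php_path):
    """Build the two commands from the steps above, one per command shell."""
    return [
        [php_path, "fetcher.php", "terminal"],
        [php_path, "queue_server.php", "terminal"],
    ]


# Show the commands as they would be typed, using plain "php".
for cmd in crawl_commands("php"):
    print(" ".join(cmd))
```

Calling resolve_php() first and passing its result to crawl_commands gives commands that do not depend on PATH at all.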
Seeing Search Results

Nutch/Lucene/Solr

Setting Up a Crawl, Crawling

Getting Solr Running

Inverted Indices

[Figure: diagram with dictionary and posting list of an inverted index]

ADT Example: Phrase Search
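
To make the dictionary and posting list structure concrete, here is a minimal sketch of a positional inverted index and a phrase search over it. It is an illustration under assumptions, not the ADT presented in the lecture and not the Yioop! or Lucene implementation: the names build_index and phrase_search are hypothetical, tokenization is plain whitespace splitting, and each posting is a (doc_id, position) pair.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term (the dictionary) to its posting list.

    Each posting is a (doc_id, position) pair, in increasing order.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index


def phrase_search(index, phrase):
    """Return (doc_id, start_position) for every occurrence of the phrase."""
    terms = phrase.lower().split()
    if not terms:
        return []
    # Postings of the later terms are kept in sets for fast membership tests.
    later = [set(index.get(t, ())) for t in terms[1:]]
    hits = []
    for doc_id, pos in index.get(terms[0], ()):
        # The i-th later term must appear i + 1 positions after the first term.
        if all((doc_id, pos + i + 1) in s for i, s in enumerate(later)):
            hits.append((doc_id, pos))
    return hits


docs = [
    "learning to crawl the web",
    "inverted index for phrase search",
    "an inverted index maps terms to posting lists",
]
index = build_index(docs)
print(index["inverted"])                       # [(1, 0), (2, 1)]
print(phrase_search(index, "inverted index"))  # [(1, 0), (2, 1)]
```

Storing positions in the postings is what makes phrase queries possible: a document-only posting list could say that both terms occur in a document, but not that they are adjacent.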