Yioop!

Yioop! is an open-source PHP search engine, indexer, and crawler that I began developing in the Fall of 2009.
It can be downloaded from: http://www.seekquarry.com/?c=main&p=downloads
If you are going to use it this semester, I highly recommend at least skimming the online documentation for Yioop!.
To run Yioop! you need: (1) a web server that supports PHP 5.2 or greater with at least PDO sqlite support, and (2) command-line access to PHP 5.3 compiled with the cURL and GD libraries.
The versions of Apache and PHP that come built into a Mac already support this.

In Ubuntu Linux, the following sequence of commands will get you what you need:

    sudo apt-get install curl
    sudo apt-get install apache2
    sudo apt-get install php5
    sudo apt-get install php5-cli
    sudo apt-get install php5-sqlite
    sudo apt-get install php5-curl
    sudo apt-get install php5-gd

Alternatively, on Windows/Mac/Linux you could install Xampp. Just make sure to enable the php_curl.dll or php_curl.so extension in the php.ini file.

Configuration

Move the downloaded version of Yioop! under your web server's DocumentRoot.
Each Yioop! instance makes use of an auxiliary WORK_DIRECTORY. You should choose where you want this to be, make a folder, and make sure the web server has read and write access to this folder.
Now point your browser to the location on localhost where you put Yioop!. You should see a configuration screen.
If the components check on this screen reports any missing required components or other problems, make sure to fix them.
Next enter the name of the directory you created earlier and click Submit.
You might need to log in at this point. The default username is root and the password is blank.
You should now see a configure form with more fields. Make sure to fill out the Queue Server Set-up fieldset and the Crawl Robot Set-up fieldset.
You are now ready to crawl!

Crawling

Crawling involves two scripts from Yioop!'s bin directory as well as the web interface.

Open two command shells and in each go to Yioop!'s bin directory.

In one, type:

    php fetcher.php terminal

In the other, type:

    php queue_server.php terminal

Be aware that for this to work the php command has to be on your PATH environment variable; otherwise, you need to give the full path to php.

Now go back to the admin panel of Yioop! and click on Manage Crawls.

If you type in a name of a crawl and click Start New Crawl, Yioop! will start crawling with whatever the current crawl settings are.

To see and modify these settings, click the Options link next to Start New Crawl.

Once you have started a crawl, Yioop! will keep crawling until either it runs out of new URLs to crawl and times out, or you click Stop Crawl.

After you click Stop Crawl, it takes some time before your crawl is ready, so please be patient. Do not stop the queue_server or fetcher during this time. When Yioop! has properly closed the active crawl, it will appear in the Previous Crawls list on the Manage Crawls page. At this point it should be safe to stop the fetcher and queue_server, if you want.

Seeing Search Results

Pick a crawl that is listed under Previous Crawls and click on its "Set as Index" link.
Click on the Yioop! logo.
This should take you back to the Yioop! search screen.
Typing in a query on this screen should now yield results from the crawl you set as index.

Nutch/Lucene/Solr

As we said last Wednesday, Nutch is a Java-based crawler developed by Doug Cutting, Lucene is an indexing library, and Solr is a search server.
Download Nutch from http://nutch.apache.org/ and Solr from http://lucene.apache.org/solr/.
Both require Java 1.6 or greater.
The information on the next couple of slides is largely from the tutorial at http://wiki.apache.org/nutch/NutchTutorial.
Using a command shell, switch into Nutch's runtime/local folder.
From this folder you should be able to run bin/nutch and it should give you a list of command line options.
If it doesn't, type chmod +x bin/nutch to make the script executable.
Next set up your JAVA_HOME variable if it isn't already set. To do this in bash on a Mac, for example, you could type:
```
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
```

Setting Up a Crawl, Crawling

First, you should give your robot a name by editing the file conf/nutch-site.xml. For example, you could have:
```
<property>
 <name>http.agent.name</name>
 <value>My Spider</value>
</property>
```
To control which sites your crawler is allowed to crawl, edit the file conf/regex-urlfilter.txt. For example,
```
+^http://([a-z0-9]*\.)*nutch.apache.org/
```
would allow the crawler to crawl nutch.apache.org and its subdomains.
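
To get a feel for what this filter line accepts, we can try the pattern on a few urls. The sketch below uses Python's re module as a stand-in for the Java regular expressions that Nutch uses; we assume the two behave the same for this particular pattern. The sample urls and the helper name allowed are made up for illustration.

```python
import re

# Pattern copied from the filter line above, without the leading '+'
# (the '+' marks the line as an "accept" rule).
PATTERN = re.compile(r"^http://([a-z0-9]*\.)*nutch.apache.org/")

def allowed(url):
    """Return True if url matches the accept pattern (hypothetical helper)."""
    return PATTERN.search(url) is not None

# Made-up sample urls, only for illustration.
samples = [
    "http://nutch.apache.org/",
    "http://wiki.nutch.apache.org/page",
    "http://lucene.apache.org/solr/",
    "https://nutch.apache.org/",
]
for url in samples:
    print(url, allowed(url))
```

The first two urls are accepted and the last two are rejected. Notice that the dots in nutch.apache.org are not escaped, so each one matches any character; a stricter rule would write nutch\.apache\.org.
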
Next we make a directory for holding the urls which will serve as seed sites. To do this type:
```
mkdir -p urls
```
Then under this urls folder make a text file listing the start urls you would like to crawl from, one url per line.
We can now tell nutch to do a crawl starting from the urls listed in the urls folder, putting the resulting data in the crawl directory, crawling to a maximum breadth-first depth of 3, and fetching at most topN (here 5) urls at each depth:
```
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
```
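
To see how the depth and topN limits interact, here is a toy breadth-first crawl over an in-memory link graph. This is only a sketch of the idea, not how Nutch is implemented: Nutch uses its own scoring to decide which urls to fetch in each round, whereas this sketch simply takes them in discovery order. The function name toy_crawl and the link graph are made up for illustration.

```python
def toy_crawl(links, seeds, depth, top_n):
    """Toy crawl: run `depth` rounds, fetching at most `top_n` urls per round.

    links: dict mapping a url to the urls it links to (made-up data).
    Returns the list of urls fetched in each round.
    """
    fetched = set()
    pending = list(seeds)          # urls known but not yet fetched
    rounds = []
    for _ in range(depth):
        batch, rest = [], []
        for url in pending:
            if url in fetched or url in batch:
                continue           # never fetch the same url twice
            if len(batch) < top_n:
                batch.append(url)  # fetched in this round
            else:
                rest.append(url)   # left over for a later round
        if not batch:
            break                  # nothing new to fetch
        fetched.update(batch)
        rounds.append(batch)
        # Urls discovered on the pages just fetched join the pending list.
        pending = rest + [out for url in batch for out in links.get(url, [])]
    return rounds

# Hypothetical link graph: page "a" links to "b", "c", "d", and so on.
links = {"a": ["b", "c", "d"], "b": ["e"], "c": ["a", "f"], "d": []}
print(toy_crawl(links, ["a"], depth=3, top_n=2))
```

With depth 3 and topN 2, the sketch fetches a in the first round, b and c in the second, and d and e in the third; f is discovered but never fetched because the crawl stops after three rounds.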

Getting Solr Running

To use our crawl, we need to set up solr.
Download solr using the link a few slides back.
Next from a command shell switch into the example sub-folder of solr. Type:
```
java -jar start.jar
```
You should now be able to go to the pages:
```
http://localhost:8983/solr/admin
http://localhost:8983/solr/admin/stats.jsp
```
As you can see, this jar runs a simple example search server built with solr.
Stop this jar. We now need to integrate the crawl we made with solr.

To do this, we first copy nutch's schema.xml into solr's conf folder:

    cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

Restart solr.
Now we need to tell nutch to build a solr index from its crawl sub-folders, giving it the url at which solr resides. This command actually makes use of Lucene in the backend, but we don't have to know that.
```
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
```
We should now be able to use http://localhost:8983/solr/admin to see crawl results.
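
Besides the admin page, the index can also be queried over HTTP. The sketch below only builds a query url; it assumes the solr example server from these slides exposes its standard select handler at /solr/select. The query term and the helper name solr_query_url are made up for illustration.

```python
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # same host and port as above

def solr_query_url(query, rows=10):
    """Build a url for solr's select handler (hypothetical helper).

    q is the query string, rows the maximum number of results,
    and wt=json asks for the response in JSON.
    """
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/select?{params}"

print(solr_query_url("nutch"))
```

While solr is running, the printed url can be pasted into a browser, or fetched with urllib.request.urlopen, to see the matching documents.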

Inverted Indices

[Diagram: the dictionary and posting lists of an inverted index]

An inverted index provides a mapping between terms and their locations in a text collection `C`.
The two main components of such an index are: (1) a dictionary which lists terms contained in the vocabulary of the collection and (2) for each term in the dictionary, a posting list which gives the positions in the collection at which the term occurs.
The diagram above is an example of a schema-independent index because it makes no assumption about the structure of the underlying text. In particular, it gives a raw position number in the whole collection rather than assuming the collection is split into documents and giving positions within documents.
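
As a small illustration of a schema-independent index, the sketch below builds a dictionary and posting lists from a short text. It assumes terms are lower-cased, whitespace-separated tokens and that positions are numbered from 1; the function name build_index and the sample text are made up for illustration.

```python
from collections import defaultdict

def build_index(text):
    """Map each term to the sorted list of positions at which it occurs.

    Positions count tokens from 1 across the whole collection,
    ignoring any document boundaries (assumed tokenization: split on spaces).
    """
    postings = defaultdict(list)
    for position, term in enumerate(text.lower().split(), start=1):
        postings[term].append(position)
    return dict(postings)

# Made-up sample collection.
collection = "when shall we three meet again in thunder lightning or in rain"
index = build_index(collection)
print(sorted(index))   # the dictionary (vocabulary of the collection)
print(index["in"])     # the posting list for "in": [7, 11]
```
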
An inverted index can be viewed as an abstract data type with the following four methods:
- first(t) returns the first position at which the term `t` occurs in the collection.
- last(t) returns the last position at which the term `t` occurs in the collection.
- next(t, current) returns the position of the first occurrence of `t` after the position current in the collection.
- prev(t, current) returns the position of the last occurrence of `t` before the position current in the collection.
For example, in the above diagram: first("hurlyburly") = 316669, last("thunder") = 1247139, next("witch", 745451) = 745467, prev("witch", 745451) = 745429.
A sequential scan of a posting list might be implemented by applying the first function to the term and then repeatedly applying the next function until the end of the posting list is reached.
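
The four methods and the sequential scan can be sketched as follows. This is one possible implementation, using binary search over sorted posting lists; the class name InvertedIndex, the helper scan, and the use of plus and minus infinity for "no such position" are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")  # stands for "after the end of the posting list"

class InvertedIndex:
    """Schema-independent index over a token list (hypothetical class)."""

    def __init__(self, tokens):
        self.postings = {}
        for position, term in enumerate(tokens, start=1):
            self.postings.setdefault(term, []).append(position)

    def first(self, t):
        plist = self.postings.get(t, [])
        return plist[0] if plist else INF

    def last(self, t):
        plist = self.postings.get(t, [])
        return plist[-1] if plist else -INF

    def next(self, t, current):
        plist = self.postings.get(t, [])
        i = bisect_right(plist, current)   # index of first position > current
        return plist[i] if i < len(plist) else INF

    def prev(self, t, current):
        plist = self.postings.get(t, [])
        i = bisect_left(plist, current)    # number of positions < current
        return plist[i - 1] if i > 0 else -INF

def scan(index, t):
    """Sequential scan: apply first, then next until the list is exhausted."""
    positions = []
    p = index.first(t)
    while p != INF:
        positions.append(p)
        p = index.next(t, p)
    return positions

# Made-up sample collection.
tokens = "first witch when shall we three meet again second witch when".split()
idx = InvertedIndex(tokens)
print(idx.first("witch"), idx.last("witch"))        # 2 10
print(idx.next("witch", 2), idx.prev("witch", 10))  # 10 2
print(scan(idx, "witch"))                           # [2, 10]
```
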

ADT Example: Phrase Search

As an example of using the primitives of our ADT, consider the problem of phrase search.
A phrase search is a search for an exact match of a phrase in a document. For example, when we search on "first witch" in quotes, we want back only those documents that contain the phrase "first witch", not documents that contain both terms but not adjacent and in the given order.

This could be implemented using our ADT with the following pseudo-code:

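    nextPhrase(t[1], t[2], .., t[n], position)
    {
       // forward pass: find an occurrence of each term, in order, after position
       v := position
       for i := 1 to n do
         v := next(t[i], v)
       if v == infty then // infty represents after the end of the posting list
          return [infty, infty]
       // backward pass: starting from v, find the closest preceding start u
       u := v
       for i := n-1 downto 1 do
         u := prev(t[i], u)
       if (v - u == n - 1) then // the n terms are adjacent: [u, v] is a match
          return [u, v]
       else
          return nextPhrase(t[1], t[2], .., t[n], u) // otherwise retry after u
    }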
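
For those who want to experiment, the pseudo-code above can be turned into runnable Python roughly as follows. This is a sketch under some assumptions: posting lists are sorted Python lists with positions numbered from 1, the recursive call is replaced by a loop, and the names build, next_pos, prev_pos, and next_phrase as well as the sample text are made up for the example.

```python
from bisect import bisect_left, bisect_right

INF = float("inf")  # plays the role of infty in the pseudo-code

def build(tokens):
    """Map each term to its sorted list of positions (numbered from 1)."""
    postings = {}
    for pos, term in enumerate(tokens, start=1):
        postings.setdefault(term, []).append(pos)
    return postings

def next_pos(postings, t, current):
    plist = postings.get(t, [])
    i = bisect_right(plist, current)
    return plist[i] if i < len(plist) else INF

def prev_pos(postings, t, current):
    plist = postings.get(t, [])
    i = bisect_left(plist, current)
    return plist[i - 1] if i > 0 else -INF

def next_phrase(postings, terms, position):
    """Return (u, v), the first occurrence of the phrase starting after position."""
    while True:
        v = position
        for t in terms:                    # forward pass
            v = next_pos(postings, t, v)
        if v == INF:
            return (INF, INF)
        u = v
        for t in reversed(terms[:-1]):     # backward pass
            u = prev_pos(postings, t, u)
        if v - u == len(terms) - 1:        # terms are adjacent
            return (u, v)
        position = u                       # same as the recursive call

# Made-up sample collection.
text = ("first witch when shall we three meet again "
        "second witch when the hurlyburly s done "
        "first witch where the place")
p = build(text.split())
print(next_phrase(p, ["first", "witch"], 0))   # (1, 2)
print(next_phrase(p, ["first", "witch"], 1))   # (16, 17)
print(next_phrase(p, ["when", "witch"], 0))    # (inf, inf)
```

Calling next_phrase repeatedly, each time passing the start u of the previous match as the new position, enumerates all occurrences of the phrase.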

Learning to Crawl, Inverted Indices

Outline