CS267

Chris Pollett

Sep. 5, 2012

# Outline

• Getting Started With Yioop!
• HW Problem
• Getting Started with Nutch

# Yioop!

• Yioop! is an open-source PHP search engine/index/crawler created by me beginning in the Fall of 2009.
• If you are going to use it this semester, it is highly recommended to briefly read the online documentation for Yioop!.
• To run Yioop! you need: a web server that supports PHP 5.2 or greater with at least PDO SQLite support, and command-line access to PHP 5.3 compiled with the cURL and GD libraries.
• The version of Apache and PHP that come built-in to a Mac already support this.
• In Ubuntu Linux, the following sequence of commands will get you what you need:
    sudo apt-get install curl
    sudo apt-get install apache2
    sudo apt-get install php5
    sudo apt-get install php5-cli
    sudo apt-get install php5-sqlite
    sudo apt-get install php5-curl
    sudo apt-get install php5-gd

• Alternatively, on Windows/Mac/Linux you could install XAMPP. Just make sure to enable the php_curl.dll or php_curl.so extension in the php.ini file.

# Configuration

• Each Yioop! instance makes use of an auxiliary WORK_DIRECTORY. You should choose where you want this to be, make a folder, and make sure the web server has read and write access to this folder.
• Now point your browser to the location on localhost where you put Yioop!. You should see Yioop!'s configuration screen.
• If the components check reports any missing required components or other problems, make sure to fix them.
• Next enter the directory name you created earlier. Click submit.
• You might need to log in at this point. The default username is root and the password is blank.
• You should now see a configure form with more fields. Make sure to fill out the Queue Server Set-up fieldset and the Crawl Robot Set-up fieldset.
• You are now ready to crawl!

# Crawling

• Crawling involves two scripts from Yioop!'s bin directory as well as the web interface.
• These scripts can be run either from within the web interface (requires that the web server has permission to schedule jobs) or from the command line.
• From the web interface to run these scripts go to Manage Machines, add a machine with a queue server and a fetcher. Then click the "On" button for each of them.
• If you want to run these scripts from the command line, open two command shells and in each go to Yioop!'s bin directory.
• In one type:
php fetcher.php terminal

in the other type:
php queue_server.php terminal

Be aware that for this to work the php command has to be in your PATH environment variable; otherwise, you need to give the full path to php.
• Now go back to the admin panel of Yioop! and click on Manage Crawls.
• If you type in a name of a crawl and click Start New Crawl, Yioop! will start crawling with whatever the current crawl settings are.
• To see and modify these settings click the Options link next to start new crawl.
• Once you have started a crawl, Yioop! will keep crawling until it runs out of new URLs to crawl and times out, or until you click Stop Crawl.
• Closing a crawl is not instantaneous, so from the time you click Stop Crawl until your crawl is ready, please be patient and do not stop the queue_server or fetcher. When Yioop! has properly closed the active crawl, it will appear in the Previous Crawls list on the Manage Crawls page. At that point it is safe to stop the fetcher and queue_server, if you want.

# Seeing Search Results

• Pick a crawl that is listed under Previous Crawls and click on its "Set as Index" link.
• Click on the Yioop! logo.
• This should take you back to the Yioop! search screen.
• Typing in a query on this screen should now yield results from the crawl you set as index.

# Homework

Problem 1.6. For a given Markov Model, assume there is a finite n so that the current state of the model will always be known after it generates a string of length n or greater. Describe a procedure for converting such a Markov model into an nth order finite context model.

Solution. The nth order finite context model is given by the equation:
M_n(t_(n+1) | t_1 ... t_n) = M_0(t_1 ... t_(n+1)) / sum_(t' in V) M_0(t_1 ... t_n t'),
provided we have a 0th order model M_0 for sequences of n+1 terms. So the problem reduces to building such a model from a Markov model with the properties of the problem description. Let F be a Markov Model with the property that there is a finite n so that the current state of the model will always be known after it generates a string of length n or greater. Let V be our vocabulary, let S be F's set of states, and let delta(s,t) = {s'_1, ..., s'_m} be F's transition function (i.e., we can transition on the same term to a set of states). Here s and the s'_i are states and t in V. Let P(s'_1 in delta(s,t)) be the probability of the transition on t going to state s'_1. Since we have a Markov Model, sum_(t in V, s' in S) P(s' in delta(s,t)) = 1. Starting at the state s and composing delta, this property gives:
sum_(t_i in V, s' in S)P(s' in delta(delta(...delta(s,t_1)...,t_(n)), t_(n+1))) = 1
Here by composition of set-valued delta's we mean the set one gets by feeding each possible output of the previous level in the composition into the next level and taking the union of the output sets. By our requirement on F, for any state s and any sequence of terms t_1, ..., t_(n+1), there is only one s' in delta(delta(...delta(s,t_1)...,t_n), t_(n+1)). So if we fix any s in S, we can define M_0(t_1 ... t_n t_(n+1)) to be P(s' in delta(delta(...delta(s,t_1)...,t_n), t_(n+1))) for that unique s'. This gives a mapping from sequences of n+1 terms to probabilities, and the property above guarantees these probabilities sum to 1.
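The construction above can be sketched in code. The model below is hypothetical, chosen for illustration: a two-state Markov model over vocabulary {a, b} in which the emitted term determines the next state, so the current state is always known after n = 1 terms.

```python
from itertools import product

# Toy Markov model (hypothetical): emitting 'a' always lands in state 0,
# emitting 'b' in state 1, so the state is known after n = 1 terms.
V = ['a', 'b']
next_state = {'a': 0, 'b': 1}          # delta(s, t) is a singleton set here
emit = {0: {'a': 0.7, 'b': 0.3},       # probability of emitting t from state s
        1: {'a': 0.4, 'b': 0.6}}

def M0(seq, start=0):
    """0th order model: probability of generating seq from a fixed start state."""
    p, s = 1.0, start
    for t in seq:
        p *= emit[s][t]
        s = next_state[t]
    return p

def Mn(t_next, context):
    """nth order context model: M_n(t_{n+1} | t_1 ... t_n)."""
    return M0(context + [t_next]) / sum(M0(context + [t]) for t in V)

# M_0 over all length-2 sequences sums to 1, as the argument requires.
total = sum(M0(list(s)) for s in product(V, repeat=2))
print(round(total, 10))                # 1.0
# After context ['a'] the model is in state 0, so M_n recovers emit[0].
print(round(Mn('a', ['a']), 10))       # 0.7
```

Note how the context model recovers the per-state emission probabilities: once the context pins down the state, M_n(t | context) equals the transition probability from that state.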

# Nutch/Lucene/Solr

• As we said last Wednesday, Nutch is a Java-based crawler developed by Doug Cutting, Lucene is an indexing library, and Solr is a search server.
• All three require Java 1.6 or greater.
• The information on the next couple of slides is largely from the tutorial at http://wiki.apache.org/nutch/NutchTutorial.
• Currently there are two main branches of Nutch: 1.5.1 and 2.0. For Nutch 2.0, only a source distribution is available, which means you need ant set up correctly to compile it. Instead, we will consider the Nutch 1.5.1 case.
• First, download a Nutch 1.5.1 binary, unzip it and move it somewhere you know.
• Using a command shell, switch into Nutch's runtime/local folder.
• From this folder you should be able to run bin/nutch and it should give you a list of command line options.
• If it doesn't, type: chmod +x bin/nutch so that it is executable.
• Next, set up your JAVA_HOME variable if it isn't already set. To do this in bash, you could type:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home


# Setting Up a Crawl, Crawling

• First, you should give your robot a name by editing the file conf/nutch-site.xml. For example, you could have:
<property>
  <name>http.agent.name</name>
  <value>My Spider</value>
</property>

• To control what sites you allow your crawler to crawl you edit the file conf/regex-urlfilter.txt. For example,
+^http://([a-z0-9]*\.)*nutch.apache.org/

would allow the crawler to crawl nutch.apache.org and its subdomains.
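Before starting a crawl, it can be useful to sanity-check a filter line like the one above. The sketch below uses Python's re module to apply the pattern (minus the leading +, which is Nutch's marker for an include rule) to a few example URLs; the URLs are made up for illustration.

```python
import re

# The pattern from regex-urlfilter.txt, without the leading '+' that
# marks it as an include rule.  Note the unescaped dots in the host name
# match any character; writing '\.' would be stricter.
pattern = re.compile(r'^http://([a-z0-9]*\.)*nutch.apache.org/')

urls = ['http://nutch.apache.org/',
        'http://wiki.nutch.apache.org/some/page',
        'http://example.com/']
for url in urls:
    print(url, '->', bool(pattern.match(url)))
# http://nutch.apache.org/ -> True
# http://wiki.nutch.apache.org/some/page -> True
# http://example.com/ -> False
```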
• Next we make a directory for holding the urls which will serve as seed sites. To do this type:
mkdir -p urls

Then under this urls folder make a text file listing the start urls you would like to crawl from, one line per url.
• We can now tell Nutch to do a crawl starting from the urls listed in the urls folder, putting the resulting data in the crawl directory, crawling to a maximum breadth-first depth of 3, and at each depth fetching at most topN urls:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
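To build intuition for what -depth and -topN control, here is a toy breadth-first crawl in Python over a hypothetical link graph. (Nutch itself scores pages to choose which topN urls to fetch each round; the simple first-come cap below is just for illustration.)

```python
# Hypothetical link graph: page -> outlinks, standing in for the web.
links = {
    'seed': ['a', 'b', 'c'],
    'a': ['d', 'e'],
    'b': ['f'],
    'c': [], 'd': [], 'e': [], 'f': [],
}

def crawl(seeds, depth, top_n):
    """Breadth-first crawl: run `depth` rounds, fetching at most
    `top_n` not-yet-seen urls per round (the -topN cap)."""
    fetched, frontier = [], list(seeds)
    seen = set(seeds)
    for _ in range(depth):
        batch = frontier[:top_n]          # -topN: cap per round
        fetched.extend(batch)
        frontier = []
        for url in batch:
            for out in links.get(url, []):
                if out not in seen:
                    seen.add(out)
                    frontier.append(out)
    return fetched

print(crawl(['seed'], depth=3, top_n=2))  # ['seed', 'a', 'b', 'd', 'e']
```

With depth=3 and top_n=2, pages c and f are discovered but never fetched, which is exactly the trade-off the real options let you tune.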


# Getting Solr Running

• To use our crawl, we need to set up Solr: download a Solr binary release and unpack it somewhere you know.
• Next, from a command shell switch into the example sub-folder of Solr. Type:
java -jar start.jar

• You should now be able to go to the page:
http://localhost:8983/solr/admin

• So that Solr knows about the fields Nutch produces, copy Nutch's schema file into Solr's conf folder (then restart Solr):
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/

• Finally, from Nutch's runtime/local folder, index the crawl data into Solr:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*