Learning to Crawl




CS267

Chris Pollett

Sep. 5, 2012

Outline

Yioop!

Configuration

Crawling

  • Crawling involves two scripts from Yioop!'s bin directory as well as the web interface.
  • These scripts can be run either from within the web interface (this requires that the web server has permission to schedule jobs) or from the command line.
  • To run these scripts from the web interface, go to Manage Machines and add a machine with a queue server and a fetcher. Then click the "On" button for each of them.
    GUI for managing Queue Servers and Fetchers
  • If you want to run these scripts from the command line, open two command shells and in each go to Yioop!'s bin directory.
  • In one, type:
    php fetcher.php terminal
    
    In the other, type:
    php queue_server.php terminal
    
    For this to work, the php command has to be in your PATH environment variable; otherwise, you need to give the full path to php.
  • Now go back to the admin panel of Yioop! and click on Manage Crawls.
  • If you type in a name for a crawl and click Start New Crawl, Yioop! will start crawling with whatever the current crawl settings are.
  • To see and modify these settings, click the Options link next to Start New Crawl.
  • Once you have started a crawl, Yioop! will keep crawling until it runs out of new URLs to crawl and times out, or until you click Stop Crawl.
  • After you click Stop Crawl, the crawl is not ready instantaneously, so please be patient. Do not stop the queue_server or fetcher during this time. When Yioop! has properly closed the active crawl, it will appear in the Previous Crawls list on the Manage Crawls page. At this point it should be safe to stop the fetcher and queue_server, if you want.
  • Seeing Search Results

    Homework

    Problem 1.6. For a given Markov Model, assume there is a finite `n` so that the current state of the model will always be known after it generates a string of length `n` or greater. Describe a procedure for converting such a Markov model into an `n`th order finite context model.

    Solution. The `n`th order finite context model is given by the equation:
    `M_n(t_(n+1) | t_1 ... t_n) = frac(M_0(t_1 ... t_(n+1)))(sum_(t' in V)(M_0(t_1 ... t_n t')))`,
    provided we have a `0`th order model `M_0` for sequences of `n+1` terms. So the problem reduces to building such a model from a Markov model with the properties of the problem description. Let `F` be a Markov Model with the property that there is a finite `n` so that the current state of the model will always be known after it generates a string of length `n` or greater. Let `V` be our vocabulary, let `S` be `F`'s set of states, and let `delta(s,t)={s'_1, ..., s'_m}` be `F`'s transition function (i.e., we can transition on the same term to a set of states). Here `s` and `s'` are states and `t in V`. Let `P(s'_1 in delta(s,t))` be the probability of this transition going to state `s'_1`. Since we have a Markov Model, `sum_(t in V, s' in S) P(s' in delta(s,t)) = 1` for every state `s`. Starting at the state `s` and composing `delta`, this property gives:
    `sum_(t_i in V, s' in S)P(s' in delta(delta(...delta(s,t_1)...,t_(n)), t_(n+1))) = 1`
    Here by composition of set-valued `delta`'s we mean the set one gets by feeding any possible output of the previous level in the composition into the next level and taking the union of the output sets. By our requirement on `F`, for any state `s` and any sequence of terms `t_1`, ..., `t_(n+1)`, there is only one state `s'` in `delta(delta(...delta(s,t_1)...,t_(n)), t_(n+1))`. So if we fix any `s in S`, we can define `M_0(t_1 ... t_n t_(n+1))` to be the probability `P(s' in delta(delta(...delta(s,t_1)...,t_(n)), t_(n+1)))` for that unique `s'`. This gives a mapping from sequences of `n+1` terms to probabilities, and the property above guarantees that these probabilities sum to 1.
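
The construction above can be illustrated with a minimal Python sketch. It is not part of the original solution and rests on assumptions made only for illustration: the toy model `TRANS`, the vocabulary `VOCAB`, and the helper names are hypothetical, and each (state, term) pair leads to exactly one next state, which is a simplification of the set-valued `delta`. In this toy model the state is known after a single term, so `n = 1`.

```python
from itertools import product

# Hypothetical toy Markov model (an assumption, not from the slides):
# TRANS[state][term] = (probability of emitting term, next state).
# Emitting "a" always leads to state "A" and "b" to state "B", so n = 1.
TRANS = {
    "A": {"a": (0.9, "A"), "b": (0.1, "B")},
    "B": {"a": (0.5, "A"), "b": (0.5, "B")},
}
VOCAB = ["a", "b"]


def sequence_probability(trans, start, terms):
    """Probability of emitting `terms` in order when starting in `start`."""
    prob, state = 1.0, start
    for term in terms:
        step_prob, state = trans[state][term]
        prob *= step_prob
    return prob


def zeroth_order_model(trans, vocab, start, n):
    """Build M_0 over all sequences of n+1 terms for a fixed start state."""
    return {
        seq: sequence_probability(trans, start, seq)
        for seq in product(vocab, repeat=n + 1)
    }


def finite_context_model(m0, vocab):
    """M_n(t_(n+1) | t_1 ... t_n) = M_0(t_1 ... t_(n+1)) / sum over t' of M_0(t_1 ... t_n t')."""
    model = {}
    for seq, prob in m0.items():
        context = seq[:-1]
        total = sum(m0[context + (t,)] for t in vocab)
        if total > 0:  # skip contexts that can never be generated
            model[seq] = prob / total
    return model


m0 = zeroth_order_model(TRANS, VOCAB, "A", 1)
m1 = finite_context_model(m0, VOCAB)

for seq, prob in sorted(m1.items()):
    print(" ".join(seq[:-1]), "->", seq[-1], round(prob, 3))
```

In this sketch the keys of `m1` are tuples whose last entry is the predicted term and whose preceding entries are the context. For the toy model, each conditional probability coincides with the emission probability of the state reached after the context, which is what the construction is meant to recover.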

    Nutch/Lucene/Solr

    Setting Up a Crawl, Crawling

    Getting Solr Running