CS267
Chris Pollett
Sep. 5, 2012
sudo apt-get install curl
sudo apt-get install apache2
sudo apt-get install php5
sudo apt-get install php5-cli
sudo apt-get install php5-sqlite
sudo apt-get install php5-curl
sudo apt-get install php5-gd
php fetcher.php terminal
in the other type:
php queue_server.php terminal
Be aware that for this to work, the php command has to be in your PATH environment variable; otherwise, you need to give the full path to php.
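As a quick way to check the PATH requirement before launching the two processes, here is a minimal sketch. The helper name `find_command` is hypothetical and is not part of the software being installed; it only wraps the standard library's `shutil.which`.

```python
import shutil


def find_command(name):
    """Return the full path of `name` if it is found on PATH, else None.

    `find_command` is a hypothetical helper for illustration only.
    """
    return shutil.which(name)


if __name__ == "__main__":
    php_path = find_command("php")
    if php_path is None:
        # Not on PATH: the commands must be run with the full path to php.
        print("php is not on PATH; invoke it by its full path instead")
    else:
        print("php found at", php_path)
```

If the script reports that php is not on PATH, the two commands above would need to be run with the full path to the php binary in place of `php`.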
Problem 1.6. For a given Markov Model, assume there is a finite `n` so that the current state of the model will always be known after it generates a string of length `n` or greater. Describe a procedure for converting such a Markov model into an `n`th order finite context model.
Solution. The `n`th order finite context model is given by the equation:
`M_n(t_(n+1) | t_1 ... t_n) = frac(M_0(t_1 ... t_(n+1)))(sum_(t' in V)(M_0(t_1 ... t_n t')))`,
provided we have a `0`th order model `M_0` for sequences of `n+1` terms. So the problem reduces to building such a model from a Markov model with
the properties of the problem description. Let `F` be a Markov Model with the property that there is a finite `n` so that the current state of the model will always be known after it generates a string of length `n` or greater. Let `V` be our vocabulary, `S` be `F`'s set of states, and let `delta(s,t)={s'_1, ..., s'_m}` be `F`'s transition function (i.e., we can transition on the same term to a set of states). Here `s` and `s'` are states and `t in V`. Let `P(s'_1 in delta(s,t))` be the probability of this transition going to state `s'_1`. Since we have a Markov Model, for every state `s` we have `sum_(t in V, s' in S) P(s' in delta(s,t)) = 1`.
Starting at the state `s` and composing `delta`, this property gives:
`sum_(t_i in V, s' in S)P(s' in delta(delta(...delta(s,t_1)...,t_(n)), t_(n+1))) = 1`
Here by composition of set-valued `delta`'s we mean the set one gets by feeding any possible output of the previous level in the composition into the next level and taking the union of the output sets.
By our requirement on `F`, for any state `s` and any sequence of terms `t_1`, ..., `t_(n+1)`, there is only one state `s'` that the model can be in within `delta(delta(...delta(s,t_1)...,t_(n)), t_(n+1))`. So
if we fix any `s in S`, we can define `M_0(t_1 ... t_n t_(n+1))` to be the probability `P(s' in delta(delta(...delta(s,t_1)...,t_(n)), t_(n+1)))` for the one `s'`
which is in `delta(delta(...delta(s,t_1)...,t_(n)), t_(n+1))`. This gives a mapping from sequences of `n+1` terms to probabilities, and the property above guarantees that these probabilities sum to 1.
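The construction can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the two-state model, its probabilities, and the names `TRANSITIONS`, `build_m0`, and `m_n` are hypothetical. As a simplification, each term leads to a single next state here, whereas `delta` in the solution is set-valued. In this toy model the state is known after one term, so `n = 1`.

```python
from itertools import product

# Hypothetical toy Markov model: TRANSITIONS[state][term] = (next_state, probability).
# Emitting "a" always leads to state "A" and "b" to state "B", so n = 1.
TRANSITIONS = {
    "A": {"a": ("A", 0.9), "b": ("B", 0.1)},
    "B": {"a": ("A", 0.5), "b": ("B", 0.5)},
}
VOCAB = ["a", "b"]


def sequence_probability(start, terms):
    """Probability of generating `terms` when starting in state `start`."""
    state, prob = start, 1.0
    for term in terms:
        if term not in TRANSITIONS[state]:
            return 0.0
        state, p = TRANSITIONS[state][term]
        prob *= p
    return prob


def build_m0(start, n):
    """0th order model over all sequences of n+1 terms, for a fixed start state."""
    return {
        seq: sequence_probability(start, seq)
        for seq in product(VOCAB, repeat=n + 1)
    }


def m_n(m0, context, term):
    """M_n(term | context) = M_0(context term) / sum over t' of M_0(context t')."""
    denom = sum(m0[context + (t,)] for t in VOCAB)
    if denom == 0:
        return 0.0
    return m0[context + (term,)] / denom


if __name__ == "__main__":
    m0 = build_m0("A", 1)
    print(sum(m0.values()))        # total probability mass of M_0
    print(m_n(m0, ("a",), "a"))    # conditional probability of "a" after "a"
```

In this toy case the conditional probabilities computed from `M_0` recover the outgoing probabilities of the state reached after the context, which is the behaviour the construction is meant to give.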
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home
<property>
  <name>http.agent.name</name>
  <value>My Spider</value>
</property>
+^http://([a-z0-9]*\.)*nutch.apache.org/
would allow the crawler to crawl nutch.apache.org and its subdomains.
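To see which URLs this pattern accepts, here is a minimal sketch using Python's `re` module. It assumes the leading `+` is the filter's include marker rather than part of the regular expression, and that this particular pattern behaves the same in Python as in the crawler's own regex engine. The helper `is_allowed` and the sample URLs are hypothetical.

```python
import re

# Pattern copied from the filter line above, minus the leading "+"
# (assumed here to be the "include" marker, not part of the regex).
PATTERN = re.compile(r"^http://([a-z0-9]*\.)*nutch.apache.org/")


def is_allowed(url):
    """Return True if `url` matches the include pattern (hypothetical helper)."""
    return PATTERN.match(url) is not None


if __name__ == "__main__":
    # Sample URLs are made up for illustration.
    for url in [
        "http://nutch.apache.org/",
        "http://sub.nutch.apache.org/page.html",
        "http://example.com/",
    ]:
        print(url, is_allowed(url))
```

Note that the dots in `nutch.apache.org` are not escaped in the pattern, so each of them matches any single character; escaping them as `\.` would make the filter stricter.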
mkdir -p urls
Then, under this urls folder, make a text file listing the start URLs you would like to crawl from, one URL per line.
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
java -jar start.jar
http://localhost:8983/solr/admin
http://localhost:8983/solr/admin/stats.jsp
As you can see, this jar provides a simple example of a search server that can be created with Solr.
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*