Making Course Documents Accessible to Web Crawlers




CS216

Chris Pollett

Mar 12, 2010

Outline

Introduction

The Anatomy of a Web Crawler and Search Engine

Here is an example architecture of a search engine; each real engine will differ slightly.

The Queue Server, Fetcher, Indexer, and Search Web-site components of a Search Engine

Mainly Search Engines Look at Your Web-sites

Here is a typical (slightly redacted) snippet from the access log of a web server. Notice that every request comes from a crawler; there are no humans!

174.36.70.156 - - [12/Mar/2010:03:40:40 -0600] "GET /productdetail_3109.html HTTP/1.0" 
200 13821 "http://www.ucanbuyart.com/artistproducts/avadala/0/6/" 
"Mozilla/5.0 (compatible; heritrix/1.14.3 +http://www.accelobot.com)"
220.181.7.68 - - [12/Mar/2010:03:40:53 -0600] "GET / HTTP/1.1" 
200 10179 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
123.125.66.68 - - [12/Mar/2010:03:40:59 -0600] "GET / HTTP/1.1" 
200 14929 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"
66.249.71.181 - - [12/Mar/2010:03:41:09 -0600] "GET /classifieds.html HTTP/1.1" 
200 7273 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
174.36.70.156 - - [12/Mar/2010:03:41:11 -0600] "GET /productdetail_3107.html HTTP/1.0" 
200 13817 "http://www.ucanbuyart.com/artistproducts/avadala/0/6/" 
"Mozilla/5.0 (compatible; heritrix/1.14.3 +http://www.accelobot.com)"
174.36.70.156 - - [12/Mar/2010:03:41:39 -0600] "GET /productdetail_3105.html HTTP/1.0" 
200 13851 "http://www.ucanbuyart.com/artistproducts/avadala/0/6/" 
"Mozilla/5.0 (compatible; heritrix/1.14.3 +http://www.accelobot.com)"
77.88.31.246 - - [12/Mar/2010:03:42:05 -0600] "GET /productdetail_5912.html HTTP/1.1" 
200 5530 "-" "Yandex/1.01.001 (compatible; Win16; I)"
67.195.110.188 - - [12/Mar/2010:03:43:18 -0600] "GET /productdetail_5665.html HTTP/1.0" 
200 5476 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"
207.46.199.193 - - [12/Mar/2010:03:44:47 -0600] "GET /robots.txt HTTP/1.1" 
404 10820 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
64.124.148.25 - - [12/Mar/2010:03:44:54 -0600] "GET /productdetail_7657.html HTTP/1.1" 
200 5455 "-" "Mozilla/5.0 (compatible; FatBot 2.0; http://www.thefind.com/crawler)"
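
To check a log like the one above for crawler traffic yourself, a short script can pull the user-agent field out of each entry and flag the ones that look like robots. The following is a minimal sketch, not a definitive tool. It assumes each log entry sits on a single line (the snippet above is wrapped for display) and that the server writes the common "combined" log layout seen here. The names `LOG_PATTERN`, `CRAWLER_HINTS`, `parse_line`, `is_crawler`, and `count_crawlers` are hypothetical, and the hint list is simply taken from the user agents shown above, so it will miss crawlers that identify themselves differently.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical hint list: substrings seen in the user agents of the snippet above.
CRAWLER_HINTS = ("bot", "spider", "slurp", "heritrix", "yandex")


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def is_crawler(agent):
    """Guess whether a user-agent string belongs to a crawler."""
    agent = agent.lower()
    return any(hint in agent for hint in CRAWLER_HINTS)


def count_crawlers(lines):
    """Count entries flagged as 'crawler' versus 'other'; skip unparsable lines."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        counts["crawler" if is_crawler(entry["agent"]) else "other"] += 1
    return counts


# Three entries from the snippet above, each joined onto one line.
SAMPLE = [
    '220.181.7.68 - - [12/Mar/2010:03:40:53 -0600] "GET / HTTP/1.1" '
    '200 10179 "-" "Baiduspider+(+http://www.baidu.com/search/spider.htm)"',
    '66.249.71.181 - - [12/Mar/2010:03:41:09 -0600] "GET /classifieds.html HTTP/1.1" '
    '200 7273 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '207.46.199.193 - - [12/Mar/2010:03:44:47 -0600] "GET /robots.txt HTTP/1.1" '
    '404 10820 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
]

if __name__ == "__main__":
    print(count_crawlers(SAMPLE))
```

Matching on substrings of the user agent is only a heuristic, since a client can send any user-agent string it likes. It is enough, though, to get a rough picture of how much of a site's traffic is robots.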

Components of a Search Engine and Web Page Visibility

Fetcher

Media Processors

Link Extraction

The Queue Server

The Indexer

Conclusion

Hopefully, you have learned some useful things to keep in mind if you want your site to be visible on the web.