CS280 Project Blog

I'm working on improving Yioop's archive crawl performance by distributing the work to a scalable number of fetchers (full proposal). Below you'll find my planned schedule for the semester, followed by progress reports with the newest ones at the top.

Schedule
Progress Reports

Note that newer progress reports come at the top.

May 15

I've completed the majority of what I set out to do, and written up my results in a report (pdf). As the report mentions, I ended up spending a lot of time modifying the existing archive iterators so that they can save and restore their state between requests to the name server for batches of pages. In the end, the new distributed crawl process should be working for ARC, MediaWiki, and ODP RDF bundles. Web re-crawls and live web crawls should work exactly as they always have. I didn't have as much time as I wanted to test the system, and I suspect that once it's used to crawl a very large archive (like a full dump of Wikipedia) we will find that the tuning parameters need some work. In the coming weeks I may get an opportunity to test the system on several virtual machines provided by SJSU in order to try out a full Wikipedia crawl.

Finally, as part of my experiments with seeking into bzip2 files, I converted a minimal JavaScript implementation of bzip2 decompression into PHP. I ended up pursuing a much faster heuristic strategy (using a regex to match the magic number at the beginning of each block), but someone may have some use for doing bzip2 decompression in PHP, so I've posted the code here.

April 10

As per our discussion, I found a JavaScript implementation of bzip2 decompression and converted it to PHP. After verifying that the PHP implementation was capable of decompressing a file compressed with the standard bzip2 utility, I added support for serializing the PHP class that implements decompression and for restoring it at a particular offset into the compressed file. I then modified the MediaWiki archive iterator to use this class to decompress MediaWiki archives, and to actually save and restore its state partway through the file, rather than starting over at the beginning each time.

April 3

I implemented most of the new archive crawl procedure, and got data moving between the name server and a fetcher, and between a fetcher and a queue server. I still need to decide whether compressing the page data (after it's decompressed and parsed from the archive) saves enough network bandwidth to justify the cost of compressing on the name server and decompressing at the fetcher. I also ended up modifying the protocol a bit to better mirror the normal web crawl flow, where the fetcher gets a new crawl time from the name server and then gets updated crawl parameters from the current queue server. Finally, I ran into a problem working with MediaWiki archives, because PHP doesn't allow seeking on files opened with its bzip2 stream wrapper.

March 20

I fixed up old archive crawls to work with all of the various archive iterators, and submitted my changes to go out with the next Yioop! release, but otherwise made no progress. I plan to catch up over Spring break by actually making the necessary changes to send archive data over the local network.

March 13

I got old web archive crawls working again, and continued to work on setting up the new archive crawl architecture. I modified the web archive bundle iterator to take a prefix as an extra parameter, which it uses to find the appropriate archive to iterate over.
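The gist of the prefix change, as a minimal sketch (the class and directory names here are invented for illustration, not Yioop's actual ones):

    <?php
    // Minimal sketch: the prefix identifies which fetcher wrote a bundle, so
    // the iterator can locate the right archive directory for the crawl it is
    // asked to iterate over.
    class PrefixedBundleIterator
    {
        public $archive_dir;

        public function __construct($prefix, $crawl_time, $cache_dir)
        {
            // e.g. $prefix = "0-", $crawl_time = 1334012345
            $this->archive_dir = "$cache_dir/{$prefix}Archive{$crawl_time}";
            if (!is_dir($this->archive_dir)) {
                throw new Exception("No archive found at " . $this->archive_dir);
            }
        }
    }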
While trying to re-crawl my previous web crawl, the iterator ran into the following error after processing the first 5030 pages:

    0-fetcher.log:13334: Web archive saw blank line when looked for offset 123890950

I had a lot of work this past week to prepare for an upcoming project due date, and so I'm behind on specifying a common archive iterator interface and getting new crawls started. I'm planning to do what I can this week and next, but I expect to have to catch up over the break. I've determined that I definitely need to modify the

March 06

I continued to work toward implementing the new archive crawl logic. I modified the admin controller, and I also worked on fixing up the normal re-crawl process. I modified the fetcher to set its own fetcher number to 0 if it isn't given a number when invoked, and rewrote the logic where it writes files so that it always prefixes the file name with its fetcher number. So now the fetcher finds the appropriate archive directory, but it doesn't get any data, because the iterator doesn't know about fetcher prefixes. As part of building a consistent archive iterator interface, I'll be changing iterators to work with filenames instead of just crawl times, so that they don't need to be so aware of file structure. That should fix this particular problem.

February 28

I worked out a schedule for the rest of the semester and set up the project blog with my work so far and the full schedule. I tried getting an archive crawl going on a previous web crawl, but I couldn't get the fetcher to recognize the archive. So I moved on to reading the Yioop source to figure out how the list of available archives gets generated, and how an archive crawl proceeds now. The user interface will remain effectively the same for now. The only major differences happen on the backend; it should take less work on the command line to set up an archive crawl. It should suffice to create a directory for the archive in a known location on the name server.
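As a rough illustration of that directory-based setup (the folder layout, file name, and function below are invented for the sketch, not Yioop's actual ones), discovering the available archives could be as simple as scanning for a description file in each subdirectory:

    <?php
    // Sketch only: list candidate archive directories under a hypothetical
    // archives folder by looking for a small description file in each one.
    function findArchives($archives_dir)
    {
        $archives = array();
        foreach (glob($archives_dir . "/*", GLOB_ONLYDIR) as $dir) {
            $description = $dir . "/arc_description.txt";
            if (file_exists($description)) {
                $archives[basename($dir)] = trim(file_get_contents($description));
            }
        }
        return $archives;
    }

    // e.g. findArchives("/path/to/work_directory/archives");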
Outline of code changes:

For next week I need to actually make the changes to implement the new protocol for archive crawling, and get the name server sending chunks of pages for Wikimedia archives.

February 21

I drew up a diagram of a typical Yioop network configuration, another diagram of a typical web crawl from a single fetcher's perspective, and a final diagram of my proposed archive crawl, again from a single fetcher's perspective.

I also wrote some code to use the Wikimedia archive bundle iterator in order to fetch chunks of page data from a Wikimedia archive. In the process I learned that the bundle iterators don't implement a standard interface, which I'll almost certainly want to change as I add a unified interface for performing archive crawls. I found that fetching pages from an arbitrary offset in the archive file was very slow. I'll need to investigate options for speeding up reads from the archive in general, and if we want to be able to stop and restart archive crawls, then I'll also need to investigate creating an index as iteration proceeds, so that it's fast to seek to an arbitrary point.
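To make the index idea concrete, here is a minimal sketch (the record-reading callback is hypothetical; the real iterators parse their own formats):

    <?php
    // Sketch: remember the byte offset of each record as it is read, so that a
    // stopped crawl can later fseek() straight to record $n instead of
    // re-reading the archive from the beginning.
    function buildOffsetIndex($archive_path, $readNextRecord)
    {
        $index = array();               // record number => byte offset
        $fh = fopen($archive_path, "rb");
        $record_num = 0;
        while (!feof($fh)) {
            $offset = ftell($fh);
            if ($readNextRecord($fh) === false) {   // hypothetical parser callback
                break;
            }
            $index[$record_num++] = $offset;
        }
        fclose($fh);
        return $index;
    }

    // Resuming later at, say, record 5000 would then just be:
    //   fseek($fh, $index[5000]);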