CS299 Project Blog

Classification of Web Pages in Yioop with Active Learning

This page records what Dr. Pollett and I discussed at each of our weekly meetings, including the progress I made over the previous week and the plan for the next week. The entries are in reverse chronological order, with newer entries at the top; the dates are meeting days.

February 26

Plan: Wrap up the administrative interface without integrating the actual classifiers.

Added resumable iteration through crawl mixes, so that extra documents may be loaded in beyond the first batch.
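
As a rough illustration of what resumable iteration means here, the sketch below saves an offset to a small state file after each batch, so a later request picks up where the previous one stopped. It is not the Yioop implementation: the function name `next_batch`, the JSON state file, and the in-memory document list are all assumptions made for the example.

```python
import json
import os
from typing import List


def next_batch(documents: List[str], state_path: str, batch_size: int) -> List[str]:
    """Return the next batch of documents, resuming from a saved offset.

    Hypothetical helper: `documents` stands in for the results of a crawl mix,
    and `state_path` plays the role of a crawl mix iteration file.
    """
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = json.load(f)["offset"]
    batch = documents[offset:offset + batch_size]
    # Persist the new position so the next call continues after this batch.
    with open(state_path, "w") as f:
        json.dump({"offset": offset + len(batch)}, f)
    return batch
```

Deleting the state file resets the iteration to the beginning.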

Refined interface for labelling documents, and added support for displaying the previously-labelled documents and updating their labels.

When a document is labelled, the positive and negative example counts and the accuracy are updated.

Crawl mix iteration files are now cleaned up when a new crawl mix is created or a classifier is deleted.

Added Deliverable 2, a write-up about the administrative interface.

February 19

Plan: Augment the classifier management interface with the ability to iterate over existing crawls using crawl mixes, and add some basic permanence to the create and delete operations. Import one or more of the selected corpora using the archive crawl interface.

Added basic data structures and serialization so that classifier creation, editing, and deletion are remembered and reflected in the display.
Implemented a basic classification interface that lets the user specify a crawl mix and a query, and loads in documents matching the query. The user can label the documents, and the results are sent back to the server and saved. Only an initial set of documents gets loaded in; the resume capability of the crawl mix iterator isn't yet used to load in more documents.
We discussed keeping labelled documents on screen so that the user may change the labels of at least the last N documents.

February 12

Plan: Implement basic create/update/delete Admin interface.

Basic interface mostly complete using dummy data defined in PHP.
No progress on a test corpus; we resolved that I should get at least one corpus imported for next week.
We discussed an idea for keeping the queue of documents submitted for human classification full. Suppose we want to maintain N documents in the oracle queue; then we do the following:
1. Initially request 2N documents to be classified, but display only N, and keep the rest in a buffer.
2. Each time the user classifies a document, display a new one from the buffer.
3. When the buffer has K documents left, load in the next N, and go back to step 2.
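
The three steps above can be sketched roughly as follows. This is only an illustration under assumptions: the class name `OracleQueue` and the `fetch_batch` callback are hypothetical, and "K documents left" is read here as "K or fewer".

```python
from collections import deque
from typing import Callable, Deque, List


class OracleQueue:
    """Keeps about n documents on display, refilling a hidden buffer in batches."""

    def __init__(self, fetch_batch: Callable[[int], List[str]], n: int, k: int):
        # fetch_batch is a hypothetical callback returning up to `count` new documents.
        self.fetch_batch = fetch_batch
        self.n = n
        self.k = k
        initial = fetch_batch(2 * n)              # step 1: request 2N documents
        self.displayed: List[str] = initial[:n]   # show only the first N
        self.buffer: Deque[str] = deque(initial[n:])

    def classify(self, doc: str) -> None:
        # Step 2: drop the labelled document and display one from the buffer.
        self.displayed.remove(doc)
        if self.buffer:
            self.displayed.append(self.buffer.popleft())
        # Step 3: once the buffer is down to k documents, load the next N.
        if len(self.buffer) <= self.k:
            self.buffer.extend(self.fetch_batch(self.n))
```

With this shape the user never waits on a fetch unless the buffer runs dry before the refill arrives, which is what the threshold K is meant to prevent.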
We discussed a number of style issues, and decided that elements of the $data array passed to views will have all-caps keys.
We talked about programmatically creating crawl mixes, and decided that we'll want to add an ownership field to the database records for crawl mixes. This will enable us to create crawl mixes for the purpose of iterating over previous crawls during classifier training without having those crawl mixes show up in the normal management interface.
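
A minimal sketch of how an ownership field could hide programmatically created crawl mixes from the normal management interface. The table name `crawl_mix`, the `owner` column, and the owner values are assumptions for illustration, not Yioop's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per crawl mix, tagged with who created it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl_mix (id INTEGER PRIMARY KEY, name TEXT, owner TEXT)")
conn.executemany(
    "INSERT INTO crawl_mix (name, owner) VALUES (?, ?)",
    [
        ("news sites", "user"),
        ("training mix for spam classifier", "classifier"),
        ("blogs", "user"),
    ],
)


def visible_mixes(connection: sqlite3.Connection, owner: str = "user") -> list:
    """Return the names of crawl mixes belonging to the given owner."""
    rows = connection.execute(
        "SELECT name FROM crawl_mix WHERE owner = ? ORDER BY id", (owner,)
    )
    return [name for (name,) in rows]
```

The management interface would list only `visible_mixes(conn)`, while classifier training would query with the classifier owner value.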

February 5

Plan: Complete detailed plan and build test corpus.

Plan completed and put up on site.
I investigated several corpora, mostly from the UC Irvine Machine Learning Repository, but didn't actually choose one to import into Yioop. We decided that it's not important for the corpus to consist of web pages, and that any text categorization task would be reasonable.

We discussed whether it should be possible to edit classifiers while there's an ongoing crawl, and decided that it should be. When a crawl is begun (or updated), each fetcher will receive a serialized version of each classifier from the name server. When a classifier is edited it will happen on the name server, and fetchers will only see those changes if the crawl configuration is changed.
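
The snapshot behaviour described above can be illustrated with a small sketch. The `NameServer` and `Fetcher` classes, the dictionary-of-weights classifier representation, and the use of `pickle` are all assumptions chosen for the example, not the planned Yioop code.

```python
import pickle
from typing import Dict


class NameServer:
    """Holds the editable copy of each classifier (hypothetical stand-in)."""

    def __init__(self) -> None:
        self.classifiers: Dict[str, Dict[str, float]] = {}

    def edit(self, label: str, weights: Dict[str, float]) -> None:
        self.classifiers[label] = dict(weights)

    def snapshot(self) -> bytes:
        # Serialized copy sent to fetchers when a crawl begins or is updated.
        return pickle.dumps(self.classifiers)


class Fetcher:
    """Works from whatever snapshot it last received."""

    def __init__(self) -> None:
        self.classifiers: Dict[str, Dict[str, float]] = {}

    def configure(self, payload: bytes) -> None:
        self.classifiers = pickle.loads(payload)
```

Because the fetcher deserializes its own copy, later calls to `edit` on the name server have no effect on it until `configure` is called again with a fresh snapshot.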
We decided that existing corpora can be imported using the generalized archive crawl interface.
We also discussed the need for an iterator that can take a list of indexed URLs, one per line, and fetch the associated page data. This facility could be used to specify a seed training set, or (maybe) to iterate over the documents in the training set.
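
One possible shape for such an iterator is sketched below. The function name `iterate_indexed_pages` and the `lookup` callback (which stands in for however page data would be retrieved from an index) are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def iterate_indexed_pages(
    lines: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (url, page_data) for each URL in a one-per-line list.

    `lookup` is an assumed callback that returns the stored page data for a
    URL, or None if the URL is not in the index.
    """
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        page = lookup(url)
        if page is not None:
            yield url, page
```

Since it accepts any iterable of lines, the same function works on an open file or an in-memory list of URLs.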

January 29

Plan: Complete proposal and submit paperwork.

Committee formed, proposal completed, and paperwork submitted on time.
For next time: complete a detailed plan for the project. This includes the use cases, the algorithms to use, and the changes to be made to Yioop.