CS299 Project Blog
Classification of Web Pages in Yioop with Active Learning
This page records what Dr. Pollett and I discussed at each of our weekly
meetings, including the progress I made over the previous week and the plan
for the next week. The entries are in reverse chronological order, with the
newest entries at the top; the dates are meeting days.
February 26
Plan: Wrap up the administrative interface without integrating the
actual classifiers.
- Added resumable iteration through crawl mixes, so that extra documents
may be loaded in beyond the first batch.
- Refined interface for labelling documents, and added support for
displaying the previously-labelled documents and updating their
labels.
- When documents are labelled, the positive and negative example counts and
the accuracy are updated.
- Crawl mix iteration files are now cleaned up when a new crawl mix is
created or a classifier is deleted.
- Added Deliverable 2, a write-up
about the administrative interface.
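The resumable iteration and cleanup described in this entry can be illustrated with a small sketch. This is not Yioop's crawl mix iterator (Yioop is written in PHP); it is a minimal Python illustration that assumes the iterator's position is saved to a small state file between requests. The names `ResumableIterator`, `next_batch`, `cleanup`, and `state_path` are all hypothetical.

```python
import json
import os


class ResumableIterator:
    """Hands out documents in batches and remembers where it stopped.

    Hypothetical sketch: the offset is persisted to a JSON state file so
    that a later request can continue past the first batch.
    """

    def __init__(self, documents, state_path):
        self.documents = documents
        self.state_path = state_path
        self.offset = 0
        # Resume from a previously saved position, if one exists.
        if os.path.exists(state_path):
            with open(state_path) as f:
                self.offset = json.load(f)["offset"]

    def next_batch(self, size):
        """Return up to `size` documents and save the new offset."""
        batch = self.documents[self.offset:self.offset + size]
        self.offset += len(batch)
        with open(self.state_path, "w") as f:
            json.dump({"offset": self.offset}, f)
        return batch

    def cleanup(self):
        """Delete the state file, e.g. when the classifier is deleted."""
        if os.path.exists(self.state_path):
            os.remove(self.state_path)
```

Constructing a second `ResumableIterator` with the same `state_path` continues from where the first one stopped, and calling `cleanup()` makes the next iterator start again from the beginning.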
February 19
Plan: Augment the classifier management interface with the ability to
iterate over existing crawls using crawl mixes, and add some basic permanence
to the create and delete operations. Import one or more of the selected corpora
using the archive crawl interface.
- Added basic data structures and serialization so that classifier
creation, editing, and deletion are remembered and reflected in the
display.
- Implemented a basic classification interface that allows the user to
specify a crawl mix and a query, and loads in documents matching the
query. The user can label the documents, and results are sent back to
the server and saved. Only an initial set of documents gets loaded in;
the resume capability of the crawl mix iterator isn't used to load in
more documents.
- We discussed keeping labelled documents on screen so that the user may
change the labels of at least the last N documents.
February 12
Plan: Implement basic create/update/delete Admin interface.
- Basic interface mostly complete using dummy data defined in PHP.
- No progress on a test corpus; we resolved that I should get at least
one corpus imported for next week.
- We discussed an idea for keeping the queue of documents submitted
for human classification full. Suppose we want to maintain N
documents in the oracle queue; then we do the following:
  1. Initially request 2N documents to be classified, but
     display only N, and keep the rest in a buffer.
  2. Each time the user classifies a document, display a new one
     from the buffer.
  3. When the buffer has K documents left, load in the next
     N, and go back to step 2.
- We discussed a number of style issues, and decided that elements of the
$data array passed to views will have all-caps keys.
- We talked about programmatically creating crawl mixes, and decided that
we'll want to add an ownership field to the database records for crawl
mixes. This will enable us to create crawl mixes for the purpose of
iterating over previous crawls during classifier training without
having those crawl mixes show up in the normal management
interface.
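The oracle-queue idea from this meeting (request 2N, display N, refill when the buffer runs down to K) can be sketched as follows. This is a minimal illustration in Python rather than the project's PHP and JavaScript; the class name `OracleQueue`, its methods, and the choice to refill when the buffer holds K or fewer documents are assumptions made for the example. A real interface would presumably load the next N documents asynchronously.

```python
from collections import deque
from itertools import islice


class OracleQueue:
    """Keeps up to n documents on display, refilling from a buffer.

    Hypothetical sketch of the scheme discussed above; `source` stands in
    for whatever supplies candidate documents.
    """

    def __init__(self, source, n, k):
        self.source = iter(source)
        self.n = n  # number of documents to keep on display
        self.k = k  # refill threshold for the buffer
        # Initially request 2N documents, display N, and buffer the rest.
        first = list(islice(self.source, 2 * n))
        self.displayed = first[:n]
        self.buffer = deque(first[n:])

    def classify(self, doc):
        """Remove a labelled document and display the next buffered one."""
        self.displayed.remove(doc)
        if self.buffer:
            self.displayed.append(self.buffer.popleft())
        # When the buffer is down to K documents (assumed: K or fewer),
        # load the next N from the source.
        if len(self.buffer) <= self.k:
            self.buffer.extend(islice(self.source, self.n))
```

With `n = 3` and `k = 1`, the display stays at three documents as long as the source has more to give, and the buffer is topped up before it runs empty.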
February 5
Plan: Complete detailed plan and build test corpus.
- Plan completed and put up on site.
- I investigated several corpora, mostly from the UC Irvine Machine
Learning collection, but didn't actually choose one to import into
Yioop. We decided that it's not important for the corpus to consist of
web pages, and that any text categorization task would be
reasonable.
- We discussed whether it should be possible to edit classifiers while
there's an ongoing crawl, and decided that it should be. When a crawl
is begun (or updated), each fetcher will receive a serialized version
of each classifier from the name server. When a classifier is edited it
will happen on the name server, and fetchers will only see those
changes if the crawl configuration is changed.
- We decided that existing corpora can be imported using the generalized
archive crawl interface.
- We also discussed the need for an iterator that can take a list of
indexed URLs, one per line, and fetch the associated page data. This
facility could be used to specify a seed training set, or (maybe) to
iterate over the documents in the training set.
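The URL-list iterator mentioned in the last item could look roughly like the sketch below. It is written in Python purely for illustration and assumes a `lookup` callable that returns the stored page data for an indexed URL, or `None` if the URL is not in the index. Both `iterate_indexed_urls` and `lookup` are hypothetical names, not part of Yioop.

```python
def iterate_indexed_urls(lines, lookup):
    """Yield (url, page) for each listed URL that is found in the index.

    `lines` is any iterable of text lines with one URL per line;
    `lookup` is a hypothetical function mapping a URL to its page data.
    """
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        page = lookup(url)
        if page is not None:
            yield url, page
```

For example, with a dictionary standing in for the index, `list(iterate_indexed_urls(open("seed_urls.txt"), index.get))` would collect the page data for every listed URL that has been indexed; the file name here is only a placeholder.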
January 29
Plan: Complete proposal and submit paperwork.
- Committee formed, proposal completed, and paperwork submitted on
time.
- For next time, complete detailed plan for project. This includes use
cases, algorithms to use, and changes to be made to Yioop.