Chris Pollett > Students >
Tim

Prototype framework for NewsFeedBundle.

Currently, Yioop runs MediaUpdater, which aggregates several different update jobs. The job we are interested in is FeedUpdateJob, which reads a list of sources from the MEDIA_SOURCES table in the database. For each source, it parses out the necessary information and adds it to the FEED_ITEMS table. The problem with this approach is that storing feed items exclusively in the database limits how many items we can store. In contrast, the main search engine part of Yioop stores items in IndexShards, which are grouped into bundles because each shard is only meant to hold up to a certain limit. During a regular crawl, we just add whatever document or link we see and then move on. For a news crawl, however, it would be prudent to design the storage so that we access the newest items first before moving backwards in time. Since the items are stored in the database right now, it is simple to sort them by timestamp in descending order. The goal of this project is to migrate this storage into shards and bundles, hence NewsFeedBundles.
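To make the contrast concrete, here is a minimal sketch of capacity-limited shards grouped into a bundle. It is written in Python for brevity and is not Yioop's actual code: `Shard`, `Bundle`, `forward`, and `newest_first_from_table` are hypothetical names, and the real IndexShard stores far more than a list of tuples. The point is only that appending during a crawl leaves items in oldest-first order, whereas the database table can simply be sorted by timestamp.

```python
class Shard:
    """Hypothetical stand-in for a shard: holds items up to a fixed limit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def is_full(self):
        return len(self.items) >= self.capacity


class Bundle:
    """Hypothetical stand-in for a bundle: an ordered list of shards."""

    def __init__(self, shard_capacity=2):
        self.shard_capacity = shard_capacity
        self.shards = [Shard(shard_capacity)]

    def add(self, timestamp, title):
        # Start a new shard once the current one has reached its limit.
        if self.shards[-1].is_full():
            self.shards.append(Shard(self.shard_capacity))
        self.shards[-1].items.append((timestamp, title))

    def forward(self):
        # Usual traversal: shards in creation order, items in insertion order.
        for shard in self.shards:
            yield from shard.items


def newest_first_from_table(rows):
    # What the database approach gets from sorting by timestamp descending.
    return sorted(rows, key=lambda row: row[0], reverse=True)
```

With a shard capacity of 2, adding three items produces two shards, and `forward()` returns them in the order they were added, which is the opposite of what a news feed wants.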

There are currently two approaches to this. One is to design a new class that constructs the shards, and the dictionaries inside each one, in reverse order. The other is to keep the existing construction method but traverse the result backwards. The plan right now is to go with the second approach, with the expectation of making as few changes as possible to the existing code. For IndexArchiveBundle, I might need to add a flag that tells subsequent iterators to read the shards and their contents in reverse rather than in the usual order. For convenience, I will refer to bundles with this flag set as reverse bundles. When an iterator reads a reverse bundle, one way to accommodate this is to start the numbering at zero and move into the negatives the more recent an item is. That way, the iterator still goes from recent to old while retaining the existing convention of going from low to high.
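The numbering idea can be sketched as follows. This is an illustration in Python under my own assumptions, not the planned Yioop implementation: `ReverseBundle`, its `reverse` flag, and the key scheme are hypothetical. The first item added gets key 0, and each later (more recent) item gets the next lower negative key, so an iterator that walks keys from low to high sees the newest item first.

```python
from dataclasses import dataclass, field


@dataclass
class ReverseBundle:
    """Hypothetical bundle; `reverse` plays the role of the proposed flag."""

    reverse: bool = True
    _items: dict = field(default_factory=dict)
    _count: int = 0

    def add(self, item):
        # Reverse bundles number items 0, -1, -2, ... as they arrive;
        # ordinary bundles number them 0, 1, 2, ...
        key = -self._count if self.reverse else self._count
        self._items[key] = item
        self._count += 1
        return key

    def __iter__(self):
        # The iterator always goes from the lowest key to the highest.
        for key in sorted(self._items):
            yield key, self._items[key]
```

The iterator itself does not need to know about the flag. Only the key assignment changes, which fits the aim of touching as little existing traversal code as possible.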