Prototype framework for NewsFeedBundle

Currently, Yioop runs MediaUpdater, which aggregates several different update jobs. The job we are interested in is FeedUpdateJob, which reads a list of sources from the MEDIA_SOURCES table in the database. For each source, we parse out the necessary information and add it to the database in FEED_ITEMS. The problem with this approach is that storing feed items exclusively in the database limits how many items we can store.

In contrast, the main search engine part of Yioop stores items in IndexShards, which are grouped into bundles because each shard is only meant to hold items up to a certain limit. During a regular crawl, we just add whatever document or link we see and then move on. For a news crawl, however, it would be prudent to design the storage so that we access the newest items first before moving backwards in time. Since the items are stored in the database right now, it is simple to sort them by timestamp in descending order. The goal of this project is to migrate this storage into shards and bundles, hence NewsFeedBundles.
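To make the shard-and-bundle idea concrete, here is a minimal sketch in Python of a bundle made of capacity-limited shards that can be read newest-first. It is not Yioop's actual code (Yioop is written in PHP), and every name in it (FeedItem, Shard, NewsFeedBundleSketch, add_item, newest_first, shard_capacity) is hypothetical. It also assumes items arrive in roughly chronological order, so that a later shard never holds items older than an earlier shard.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class FeedItem:
    """One parsed feed entry (hypothetical shape, not Yioop's schema)."""
    title: str
    link: str
    pubdate: int  # Unix timestamp


@dataclass
class Shard:
    """A fixed-capacity container, standing in for an index shard."""
    capacity: int
    items: List[FeedItem] = field(default_factory=list)

    def is_full(self) -> bool:
        return len(self.items) >= self.capacity


class NewsFeedBundleSketch:
    """A list of shards; a new shard is started when the last one fills."""

    def __init__(self, shard_capacity: int = 3) -> None:
        self.shard_capacity = shard_capacity
        self.shards: List[Shard] = [Shard(shard_capacity)]

    def add_item(self, item: FeedItem) -> None:
        # Roll over to a fresh shard once the active one hits its limit.
        if self.shards[-1].is_full():
            self.shards.append(Shard(self.shard_capacity))
        self.shards[-1].items.append(item)

    def newest_first(self) -> Iterator[FeedItem]:
        # Walk shards from most recent to oldest, sorting within each
        # shard by timestamp in descending order. This is globally
        # ordered only under the chronological-arrival assumption.
        for shard in reversed(self.shards):
            yield from sorted(
                shard.items, key=lambda item: item.pubdate, reverse=True
            )
```

With this layout, a query for recent news only needs to open the last shard or two, rather than sorting every stored item as the database approach does.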