CS297 Proposal

Title: Concurrent Multi-crawl System in Yioop!

Frank Chan (uchanf00@gmail.com)

Advisor: Dr. Chris Pollett

Description:

Yioop is a GPLv3, open source search engine written in PHP. It can be configured either as a general-purpose search engine for the entire web or to provide search results for a specific set of URLs or domains.

Currently, a Yioop name server can run only one batch crawl of the web at a time. We will add a multi-batch scheduling system so that more than one crawl can be active at a given time. Running multiple crawls concurrently gives Yioop the ability to crawl different subsets of the web independently rather than one after another.
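As a rough illustration of the idea, the following is a minimal sketch of how several crawls could share one system, assuming a simple round-robin policy. It is written in Python for brevity rather than in Yioop's PHP, and the names `Crawl` and `round_robin_schedule` are hypothetical and not part of Yioop. The scheduling technique actually used is to be chosen during the research described in this proposal.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Crawl:
    """Hypothetical stand-in for one crawl job; not a Yioop class."""
    name: str
    pending_urls: deque = field(default_factory=deque)


def round_robin_schedule(crawls, batch_size=2):
    """Yield (crawl name, batch of URLs) pairs, giving each active crawl
    one batch per turn until every crawl has run out of URLs."""
    active = deque(c for c in crawls if c.pending_urls)
    while active:
        crawl = active.popleft()
        take = min(batch_size, len(crawl.pending_urls))
        batch = [crawl.pending_urls.popleft() for _ in range(take)]
        yield crawl.name, batch
        if crawl.pending_urls:
            # This crawl still has work, so it rejoins the back of the line.
            active.append(crawl)


# Two crawls over different (made-up) subsets of the web.
news = Crawl("news", deque(["n1", "n2", "n3"]))
docs = Crawl("docs", deque(["d1", "d2", "d3", "d4", "d5"]))
schedule = list(round_robin_schedule([news, docs]))
for name, batch in schedule:
    print(name, batch)
```

In this sketch the smaller crawl finishes without waiting for the larger one to complete, which is the behavior a single-crawl system cannot provide.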

Schedule:

Week 1 (Feb. 4-10): Read Zaharia2009, start work on Deliverable #1, write the test plan
Week 2 (Feb. 11-17): Code Deliverable #1, start test plan
Week 3 (Feb. 18-24): Complete Deliverable #1
Week 4 (Feb. 25-Mar. 3): Read Chen2010, start work on Deliverable #2, start test plan
Week 5 (Mar. 4-10): Work on Deliverable #2, should start to be able to explain how the crawl system works
Week 6 (Mar. 11-17): Complete Deliverable #2
Week 7 (Mar. 18-24): Research additional information (if needed), read Phan2010
Week 8 (Mar. 25-31): Formulate how the system will work
Week 9 (Apr. 1-7): Start the system design: create the interfaces needed
Week 10 (Apr. 8-14): Work on system design: diagram what classes need refactoring
Week 11 (Apr. 15-21): Test Deliverable #3 for regression
Week 12 (Apr. 22-28): Complete Deliverable #3, start Deliverable #4
Week 13 (Apr. 29-May 5): Complete test plan for Deliverable #4
Week 14 (May 6-12): Complete proposal for CS298
Week 15 (May 13-19): Complete Deliverable #4

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Display a list of more than one crawl job

2. Create the GUI panels to work with multiple crawls

3. Complete research on the scheduling technique to use and write the action plan

4. Refactor the crawl job structures to work with the chosen scheduling technique

References:

[Zaharia2009] Job Scheduling for Multi-User MapReduce Clusters. Zaharia, M., Borthakur, D., Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I. Technical Report EECS-2009-55, UC Berkeley. 2009.

[Chen2010] SAMR: A Self-adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment. IEEE. 2010.

[Phan2010] Real-time MapReduce Scheduling. Phan, L. T., Zhang, Z., Loo, B. T., & Lee, I. University of Pennsylvania. 2010.

[Wolf2010] FLEX: A Slot Allocation Scheduling Optimizer for MapReduce Workloads. Joel Wolf, Deepak Rajan, Kirsten Hildrum, Rohit Khandekar, Vibhore Kumar, Sujay Parekh, Kun-Lung Wu, Andrey Balmin. SpringerLink 2010