CS297 Proposal
Title: Concurrent Multi-crawl System in Yioop!
Frank Chan (uchanf00@gmail.com)
Advisor: Dr. Chris Pollett
Description:
Yioop is a GPLv3, open source search engine written in PHP. It can be configured either as a general-purpose search engine for the entire web or to provide search results for a specific set of URLs or domains.
With current Yioop functionality, a name server can run only a single batch crawl of the web at a time. We will add a multi-batch scheduling system so that more than one crawl can be started and run at a given time. Running multiple crawls concurrently gives Yioop the ability to crawl different subsets of the web asynchronously.
Schedule:
Week 1 (Feb. 4-10): Read Zaharia2009, start work on Deliverable #1, write the test plan
Week 2 (Feb. 11-17): Code Deliverable #1, start test plan
Week 3 (Feb. 18-24): Complete Deliverable #1
Week 4 (Feb. 25-Mar. 3): Read Chen2010, start work on Deliverable #2, start test plan
Week 5 (Mar. 4-10): Work on Deliverable #2; should start to be able to explain how the crawl system works
Week 6 (Mar. 11-17): Complete Deliverable #2
Week 7 (Mar. 18-24): Research additional information (if needed), read Phan2010
Week 8 (Mar. 25-31): Formulate how the system will work
Week 9 (Apr. 1-7): Start the system design: create the interfaces needed
Week 10 (Apr. 8-14): Work on system design: diagram which classes need refactoring
Week 11 (Apr. 15-21): Test Deliverable #3 for regression
Week 12 (Apr. 22-28): Complete Deliverable #3, start Deliverable #4
Week 13 (Apr. 29-May 5): Complete test plan for Deliverable #4
Week 14 (May 6-12): Complete proposal for CS298
Week 15 (May 13-19): Complete Deliverable #4
Deliverables:
The full project will be done when CS298 is completed. The following will be done by the end of CS297:
1. Display a list of more than one crawl job
2. Create the GUI panels needed to work with multiple crawls
3. Complete the research on which scheduling technique to use and write the action plan
4. Refactor the structures of the crawl jobs to work with the chosen scheduling technique
References:
[Zaharia2009] Job Scheduling for Multi-User MapReduce Clusters. Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I. Technical Report EECS-2009-55, UC Berkeley. 2009.
[Chen2010] SAMR: A Self-adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment. IEEE. 2010.
[Phan2010] Real-time MapReduce Scheduling. Phan, L. T., Zhang, Z., Loo, B. T., Lee, I. University of Pennsylvania. 2010.
[Wolf2010] FLEX: A Slot Allocation Scheduling Optimizer for MapReduce Workloads. Wolf, J., Rajan, D., Hildrum, K., Khandekar, R., Kumar, V., Parekh, S., Wu, K.-L., Balmin, A. SpringerLink. 2010.