CS297 Proposal

Title: Concurrent Multi-crawl System in Yioop!

Frank Chan (uchanf00@gmail.com)

Advisor: Dr. Chris Pollett

Description: Yioop is a GPLv3, open source, PHP search engine. It can be configured either as a general-purpose search engine for the entire web or to provide search results for a set of URLs or domains. With current Yioop functionality, a name server can run only a single batch crawl of the web at a time. We will add a multi-batch scheduling system so that more than one crawl can be running at a given time. Running multiple crawls concurrently gives Yioop the ability to crawl different subsets of the web asynchronously.

Schedule:
Deliverables: The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Display a list of more than one crawl job
2. Create the GUI panels to work with multiple crawls
3. Complete the research on a scheduling technique to use, and write the action plan
4. Refactor the structures of the crawl jobs to work with the scheduling technique

References:

[Zaharia2009] Job Scheduling for Multi-user MapReduce Clusters. Zaharia, M., Borthakur, D., Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I. Technical Report EECS-2009-55, UC Berkeley. 2009.

[Chen2010] SAMR: A Self-adaptive MapReduce Scheduling Algorithm in Heterogeneous Environment. IEEE. 2010.

[Phan2010] Real-time MapReduce Scheduling. Phan, L. T., Zhang, Z., Loo, B. T., Lee, I. University of Pennsylvania. 2010.

[Wolf2010] FLEX: A Slot Allocation Scheduling Optimizer for MapReduce Workloads. Wolf, J., Rajan, D., Hildrum, K., Khandekar, R., Kumar, V., Parekh, S., Wu, K.-L., Balmin, A. SpringerLink. 2010.