Chris Pollett >
Students >
Gargi
( Print View)
[Bio]
[Blog]
[C297 Proposal]
[Paper 1: Large-scale IRLBot crawl PDF]
[Paper 2: Distributed Crawler Architecture PDF]
[Paper 3: Scalability Challenges PDF]
[Paper 4: High Performance Priority Queues PDF]
[Deliverable 1: Yioop Ranking Mechanisms PDF]
[Deliverable 2(ii): Modifying Yioop's UI Editor]
[Deliverable 3: Modifying Yioop's queuing process]
[Deliverable 4: Yandex Signal PDF]
[CS 297 Report PDF]
[C298 Proposal]
[C298: Yandex-inspired Search Factors]
[C298: Latest Page Version in SERP]
[C298: Disjunctive Queries in Yioop Search]
[CS 298 Report PDF]
[CS 298 Report Slides PDF]
|
Deliverable#3: Modifying Yioop's queueing process to retain expected order
Abstract
Implementing a Selective Repeat ARQ mechanism in the crawling process to ensure that documents are written to the inverted index in the same order that they were scheduled to be.
Outline
- Sender and receiver arrays of window size = total number of active fetchers in the crawl
- Each fetch schedule created by a queue server is given a sequence number (initialized to 1)
- For every posted webpage data from a fetcher, the corresponding sequence number is acknowledged in the sender array
- If the incoming webpage data is ordered correctly, it is uploaded into inverted index files. If not, it is stored in the receiver array until the prior scheduled sequence numbers are received, and they are all uploaded in order
- If a schedule is not acknowledged before it times out, it is rescheduled
Implementation
Sender and receiver arrays are maintained by the FetchController [controllers/FetchController.php]
Sender array contains SenderMessage [library/SenderMessage.php] objects
SenderMessage object: {$seq_num: int, $schedule_time: int, $status: str, $reschedule_count: int}
- $seq_num: sequence number for the schedule
- $schedule_time: timestamp of when the schedule was last given to a fetcher
- $status: status of acknowledgement (scheduled "s" or acknowledged"a")
- $reschedule_count: number of times the object was rescheduled
Receiver array contains ReceiverMessage [library/ReceiverMessage.php] objects
ReceiverMessage object: {$seq_num: int, $schedule_time: int, $diff_origin: bool, $filename: str}
- $seq_num: sequence number for the schedule
- $schedule_time: timestamp of when the schedule was given to the fetcher
- $diff_origin: flag indicating whether the schedule originated from a different queue server
- $filename: path to temporary storage of webpage data (before it is uploaded)
How it works
-
Scheduling
- Queue Server produces fetch batches only if the sender window is not full
- For each fetch batch produced, the sequence number is included in meta info, as well as appended to the schedule.txt filename
- The next sequence number to be assigned is also tracked
- FetchController maintains the sender and receiver arrays
- When a fetcher requests a schedule, the FetchController finds the next sequence number to assign: either rescheduling objects that have timed out, or assigns the next sequence number (if file exists)
- Once a new SenderMessage object is created, it is given a status of "s" and current timestamp of scheduling
- Fetcher maintains the sequence number for the current schedule
- When the webpage data is being posted back to the queue server web app, the sequence number and a flag indicating whether the destination queue server is the same as the scheduling one (i.e. $diff_origin) are both attached to $post_data
-
Receiving
- Once data is posted back from the fetcher, the FetchController writes the a new ReceiverMessage object to the receiver array
- This data is arranged in the receiver based on it's $diff_origin flag
- If $diff_origin is false, the object is arranged in the receiver based on it's sequence number
- If $diff_origin is true, the object is arranged in the receiver based on it's $schedule_time field (if it doesn't fit into the current window, i.e. it was scheduled before the first current object in the sender array, it is discarded)
- If $diff_origin is false, the corresponding SenderMessage object's status is changed to "a" and it's schedule.txt file is deleted
-
Uploading
- Once the ReceiverMessage object has been created, we check if the receiver's contents are ordered
- (Iterating over the receiver) For every sequence number that matches the sender array's first scheduled sequence number, the data is uploaded (and removed from both the sender and receiver)
- Intermediary $diff_origin schedules are uploaded without comparing with the sender array
Helper modifications
-
In case the crawl is left incomplete:
Data related to the crawl is maintained in temporary files. This data consists of SlidingWindow.txt (tracking the sender/receiver arrays in FetchController), next_seq_num.txt (tracking the next sequence number to be assigned by QueueServer), and current_crawl_timestamp.txt (the current crawl timestamp). Once the crawl is stopped for any reason (manually stopping the crawl, server is stopped, or machine is turned off), the former two files and any created schedule.txt files are shifted into /cache/WindowData. current_crawl_timestamp.txt is discarded. Once the crawl is resumed, these files are moved back into /schedules and crawling is resumed. On resuming the crawl/starting a new crawl, current_crawl_timestamp.txt is written back into with the current crawl timestamp. This logic is added to saveWindowData() [library/CrawlDaemon.php].
-
To get the window size (for sender array):
The window size is dictated by the number of active fetchers. This information is obtained from models/MachineModel.php and passed to the queue server through the QueueServerMessages.txt file. MachineModel defines a function getActiveFetchersCount() that returns the number of active fetchers by unravelling the data obtained from getMachineStatuses(). This active_fetcher_count is subsequently passed through the sendStartCrawlMessage() [components/CrawlComponent.php] trace to be written into queue server message files for all instances.
-
Misc. modifications:
- In case a crawl is resumed/started while there is another ongoing crawl, the current crawl needs to be stopped correctly. The previous logic had missed setting the crawl status to CONTINUE_STATE ($info[self::STATUS] = self::CONTINUE_STATE;)
- When a fetcher is posting back to the web app, it now checks to see $found_sites for both TO_CRAWL and SEEN_URLS to decide whether to update the scheduler or not
|