Deliverable 1
My Experiments with Yioop and URL Shortening Services, and Creating a Patch for Yioop
Objective: The main objective of this deliverable was to perform basic crawls in Yioop, study how URL shortening service links are handled by Yioop, write scripts experimenting with these services and create a patch for Yioop to handle short links correctly.
Description:
Firstly, basic crawls were performed in Yioop. The following steps were followed to perform a basic crawl in Yioop and index it.
- XAMPP and Yioop were downloaded and installed.
- The Yioop folder was placed in the C:\Xampp\htdocs folder.
- The XAMPP Control Panel was opened, and Apache and MySQL were started.
- http://localhost/yioop was opened in the browser.
- Yioop was configured according to the steps given in the Yioop documentation, and the configuration settings were saved.
- The scripts fetcher.php and queue_server.php were started from the command prompt before the crawl was started.
- The crawl was given a name and started.
- Once the crawl was stopped, it showed up in the previous crawls section.
- The completed crawl was set as the index so that searches would return results from the crawled data.
- Keywords were entered and a few search results were displayed.
The experiments done can be seen in the following figures:
Figure 1: Yioop Homepage
Figure 2: Creating a new crawl
Figure 3: Set the crawl as index
Figure 4: Search results for Yahoo
Figure 5: One of the result pages
Secondly, a crawl was performed on a single bit.ly link, which was a short link for the Yahoo website. The following steps were followed to perform this crawl.
- The same steps as above were followed, with minor changes.
- Before starting the crawl, the seed sites information was updated.
- All the seed sites were removed and replaced with a single bit.ly link.
- The changes were saved and a new crawl was started.
- The completed crawl was set as the index so that searches would return results from the crawled data.
- The bit.ly link given for this crawl pointed to Yahoo.
- A search for the keyword Yahoo showed the Yahoo home page link as the first result.
- The information on the bit.ly link was then inspected, and it showed that the rank had been given to the bit.ly link instead of the original link.
The experiments done can be seen in the following figures:
Figure 6: Providing the Yahoo bit.ly link as the seed site
Figure 7: Creating a new crawl on the bit.ly link
Figure 8: Search results for Yahoo
Figure 9: Information on the bit.ly link
Thirdly, I wrote a PHP script that uses cURL functions to convert a bit.ly link, or any other short link, to its original link. The following steps were followed to run this script.
- The script was written in a text file and saved in the htdocs folder with a PHP extension.
- localhost/shorttolong.php was opened in the browser.
- A bit.ly link for the SJSU site was entered.
- The script returned the original link.
The input and output of the script are shown in the following figures:
Figure 10: Entering the short URL
Figure 11: Original URL obtained
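The PHP script itself is not reproduced here. As a rough illustration of the same idea, the following Python sketch resolves a short link by sending a HEAD request without following redirects and reading the Location header of the response, repeating for a bounded number of hops. The function names head_request and expand_short_url, the max_hops limit and the injectable fetch parameter are assumptions made for this example and are not taken from the original script.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# HTTP status codes that indicate a redirect carrying a Location header.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def head_request(url, timeout=10):
    """Send one HEAD request without following redirects.

    Returns a (status_code, location_header_or_None) tuple.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def expand_short_url(url, fetch=head_request, max_hops=5):
    """Follow redirects hop by hop and return the last URL reached.

    `fetch` is a hypothetical hook: any callable that maps a URL to a
    (status, location) tuple, so the logic can be tried without a network.
    """
    for _ in range(max_hops):
        status, location = fetch(url)
        if status not in REDIRECT_STATUSES or not location:
            return url  # no further redirect: this is the original link
        # A Location header may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return url  # hop limit reached; return the last URL seen
```

Calling expand_short_url with a short link returns the address that the link finally redirects to, or the input unchanged if the server does not redirect. The hop limit guards against redirect loops.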
Fourthly, new code was added to Yioop so that it handles short URLs correctly. The first deliverable showed how Yioop handles short URLs: the crawl was performed on the short URL rather than on the original link, and rank was given to the short link service rather than to the page the link points to. This behavior is changed in this deliverable. The process involved two steps:
- Adding new code
- Testing the code
Adding the code involved the following steps:
- Code was added to three Yioop files: crawl_constants.php, fetch_url.php and fetcher.php.
- A new constant was added to crawl_constants.php to store the Location attribute, which holds the original link that the short link service points to.
- Next, the Location information was extracted from the response for the short URL. This was done by adding code to fetch_url.php.
- Finally, code was added to fetcher.php to replace the short URL with the original link, using the extracted Location information.
- After these steps, the short URL is replaced with the original link and the crawl is performed on the original link.
Testing the code involved the following steps:
- A crawl was performed on the bit.ly link for Yahoo before the code was added.
- This crawl was set as the index, a search was performed with the keyword Yahoo, and the results were noted.
- The code was added following the steps above.
- A crawl was performed on the same link after the code was added.
- The new crawl was set as the index, a search was performed with the keyword Yahoo, and the results were noted.
- The results from the new crawl showed that the rank was assigned correctly to the main Yahoo page and not to the short link service.
- The search results also listed the main link rather than the short link, and the results obtained in the two cases were different.
- The results obtained before and after the addition of code are shown in Figures 12 to 15.
- Searches were performed on two keywords: Yahoo and IBM.
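The Location-based replacement described in the code-addition steps can be sketched as follows. This is a Python illustration under stated assumptions, not the PHP patch itself: the key names URL and LOCATION, the functions extract_location and resolve_site, and the dictionary-based page record are hypothetical stand-ins for the constant added to crawl_constants.php and for the logic added to fetch_url.php and fetcher.php.

```python
from urllib.parse import urljoin

# Hypothetical record keys; the name of the constant actually added to
# crawl_constants.php is not given in the text.
URL = "url"
LOCATION = "location"


def extract_location(raw_headers):
    """Return the value of the last Location header in a raw HTTP header
    block, or None if there is none.

    Taking the last occurrence is an assumption: if several responses are
    concatenated, the last one is treated as the most recent redirect.
    """
    location = None
    for line in raw_headers.splitlines():
        name, sep, value = line.partition(":")
        # Header names are case-insensitive; lines without ':' (such as the
        # status line) are skipped.
        if sep and name.strip().lower() == "location":
            location = value.strip()
    return location


def resolve_site(site, raw_headers):
    """Store the Location header on a page record and point its URL at the
    redirect target, so that later processing sees the original link."""
    location = extract_location(raw_headers)
    if location:
        site[LOCATION] = location
        # Resolve a possibly relative Location against the short URL.
        site[URL] = urljoin(site[URL], location)
    return site
```

With a record such as {URL: "http://bit.ly/abc"} and a 301 response whose Location header names the Yahoo home page, resolve_site rewrites the record so that the crawl and the ranking apply to the target page. A record whose response has no Location header is returned unchanged.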
Figure 12: Search results for Yahoo before addition of code
Figure 13: Search results for Yahoo after addition of code
Figure 14: Search results for IBM before addition of code
Figure 15: Search results for IBM after addition of code