Chris Pollett > Students >
Gargi

Yandex-inspired Search Factors (CS298 Deliverable#1)

Aim:
Implement two new search factor bonuses, WIKI_BONUS and NUM_SLASHES_BONUS, both inspired by Yandex, that are added to Doc Rank scores.

WIKI_BONUS:
This bonus will be added to a URL's doc rank score if it points to a Wikipedia page, i.e., if the URL hostname contains the substring 'wikipedia'.
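The hostname check described above can be sketched roughly as follows. This is only an illustration in Python, not Yioop's own code: the function name wiki_bonus is hypothetical, and the constant is set to 0.5, the value chosen later in this write-up.

```python
from urllib.parse import urlparse

# Value the write-up eventually settles on; treated here as a plain constant.
WIKI_BONUS = 0.5


def wiki_bonus(url: str) -> float:
    """Return WIKI_BONUS if the URL's hostname contains 'wikipedia', else 0.

    Hypothetical helper. Assumes the URL carries a scheme (e.g. https://);
    without one, urlparse reports no hostname and the bonus is 0.
    """
    host = urlparse(url).hostname or ""  # hostname is already lower-cased
    return WIKI_BONUS if "wikipedia" in host else 0.0
```

Only the hostname is inspected, so a URL such as https://www.apple.com/wikipedia would not receive the bonus under this sketch.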

NUM_SLASHES_BONUS:
The value of this bonus decreases as the number of slashes following the hostname in a URL increases. The idea behind this calculation is that pages closer to the "root" page are more important than those nested further away.
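As a rough illustration of the quantity this bonus depends on, the slashes after the hostname could be counted as below. The helper name count_path_slashes is hypothetical, and ignoring a single trailing slash (so that verizon.com and verizon.com/ count the same) is an assumption of this sketch, not something stated in the write-up.

```python
from urllib.parse import urlparse


def count_path_slashes(url: str) -> int:
    """Count '/' characters in the path part of a URL (hypothetical helper).

    Assumption: trailing slashes are ignored, so a bare host with or
    without a final '/' counts as 0. Query strings and fragments are not
    part of the path and are therefore not counted.
    """
    path = urlparse(url).path
    return path.rstrip("/").count("/")
```

Under these assumptions, https://verizon.com/deals has one slash and https://verizon.com/home/accessories/cables-connectors has three.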

Background:
Each set of implemented changes (various values of the search factor bonuses) was tested by crawling between 30,000 and 60,000 URLs. The final implementation was tested on a crawl of 1,868,398 URLs.
To check how each set of values fared, I tested random combinations of keywords and meta-keywords. These included google, apple, wikipedia, yahoo no:guess, verizon, weather, ebay lang:en, site:google.com, site:apple.com, site:pinterest.com.

Finding the right value for WIKI_BONUS:
I experimented with different values for WIKI_BONUS:

  • 5
  • 1
  • 0.75
  • 0.5
  • 0.25
Observations:
  • All values above 0.5 ranked Wikipedia too high.
    E.g., with values 1 and 5, https://www.wikipedia.org appeared as the third and first result, respectively, when searching for google, verizon, and weather. With 0.75, https://www.wikipedia.org appeared as the second result for the keyword apple, above all https://www.apple.com/retail URLs.

  • WIKI_BONUS = 0.5 gave the most appropriate search results for the tested keywords. All of the host URLs appeared as the first search result in the SERP, followed by seemingly more important URLs (such as us.yahoo.com, yahoo.com/plus/..., developers.apple.com, cloud.google.com, etc.), and wikipedia.org ranked higher than other (further nested) URLs (such as apple.com/am/privacy/control, barnesandnoble.com for apple).

  • Although 0.25 also gave good results, it did not boost Wikipedia pages above other, less relevant pages for all the search words used. For some, the SERP results were the same as, or very similar to, the previous implementation (no WIKI_BONUS).
    E.g., for the search word wikimedia, wikipedia.org appeared lower than quickbooks.intuit.com.


Finding the right value for NUM_SLASHES_BONUS:
I experimented with different values of NUM_SLASHES_BONUS as well as different buckets (of '/' count):

  • Bonus = 2, Buckets = {0-2, 3-4, 5-7, 8+}
  • Bonus = 1, Buckets = {0-1, 2-4, 3-6, 7+}
  • Bonus = 1, Buckets = {0, 1-2, 3-4, 5+}
  • Bonus = 0.5, Buckets = {0, 1, 2, 3+}
  • Bonus = 0.5, Buckets = {0-1, 2-4, 5-6, 7+}
  • Bonus = 0.5, Buckets = {0-2, 3-4, 5+}
Observations:
  • Any value greater than 0.5 scored CLD/root page URLs significantly higher than any nested pages, sometimes outranking nested pages that were more relevant.
    E.g., ideally, on searching for iphone, all apple.com/products/..., apple.com//support/..., apple.com//iphone-/specs, etc. URLs should be high on the list. With these bonus values, root URLs such as verizon.com, ebay.com, grammarly.com, etc. ranked significantly higher than most apple.com/... URLs.

  • The first bucketing of '/' count in the list above (again) pushed CLD URLs higher than all nested paths, even if the latter were more relevant to the keyword.

  • The second bucketing of '/' count gave better results than any of the other tested divisions. URLs closer to the root page (such as verizon.com/deals, verizon.com/home/internet, verizon.com/solutions-and-services for keyword verizon) appeared before more deeply nested URLs (such as verizon.com/home/accessories/cables-connectors).

  • The final bucketing of '/' count gave results that were nearly as good. However, there were a few cases, lower down in the SERP, where a more deeply nested URL appeared above its parent URL.
    E.g., on searching for tokyo, the second set of results placed www.apple.com/am/business/enterprise/success-stories/transportation below www.apple.com/am/business/enterprise/success-stories/transportation/tokyo-metro.
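The bucketing idea in this section can be sketched as follows. The write-up does not say how the bonus is spread across buckets, so the linear scaling here (full bonus for the shallowest bucket, none for the deepest) is purely an assumption for illustration. The function names are hypothetical, and the default bounds correspond to the tested bucketing {0-1, 2-4, 5-6, 7+}.

```python
from bisect import bisect_left

# Upper bounds of every bucket except the last, open-ended one.
# (1, 4, 6) encodes the buckets {0-1, 2-4, 5-6, 7+}.
DEFAULT_BOUNDS = (1, 4, 6)


def slash_bucket(num_slashes: int, upper_bounds=DEFAULT_BOUNDS) -> int:
    """Return the index of the bucket that contains num_slashes."""
    return bisect_left(upper_bounds, num_slashes)


def num_slashes_bonus(num_slashes: int, bonus: float = 0.5,
                      upper_bounds=DEFAULT_BOUNDS) -> float:
    """Scale the bonus down linearly with the bucket index.

    Assumption (not from the write-up): bucket 0 gets the full bonus and
    the last bucket gets none.
    """
    last = len(upper_bounds)  # index of the deepest bucket
    return bonus * (last - slash_bucket(num_slashes, upper_bounds)) / last
```

Passing upper_bounds=(2, 4) would model the {0-2, 3-4, 5+} bucketing instead, which makes it straightforward to compare the divisions listed above.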


Modifying doc_id:
The letter_code value (the 17th byte of doc_id) is modified to reflect the following information:

  • Bit 8: represents whether the URL is a CLD or not
  • Bits 4,5,6,7: represent the doc type (mapped between 0-8)
  • Bit 3: represents whether the URL points to a Wikipedia page or not
  • Bits 1,2: represent the number of slashes
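
The bit layout above can be illustrated with a small pack/unpack sketch. It assumes bits are numbered from 1 at the least significant end (so bit 8 is the high bit of the byte) and that bits 1 and 2 hold a slash bucket index between 0 and 3. The function names are hypothetical and this is not Yioop's own code.

```python
def pack_letter_code(is_cld: bool, doc_type: int,
                     is_wiki: bool, slash_bucket: int) -> int:
    """Pack the four fields into one byte (hypothetical helper).

    Assumed layout, with bit 1 as the least significant bit:
      bit 8     -> CLD flag
      bits 4-7  -> doc type (0-8)
      bit 3     -> Wikipedia flag
      bits 1-2  -> slash bucket (0-3)
    """
    if not 0 <= doc_type <= 8:
        raise ValueError("doc_type must be between 0 and 8")
    if not 0 <= slash_bucket <= 3:
        raise ValueError("slash_bucket must be between 0 and 3")
    return ((int(is_cld) << 7) | (doc_type << 3)
            | (int(is_wiki) << 2) | slash_bucket)


def unpack_letter_code(code: int):
    """Return (is_cld, doc_type, is_wiki, slash_bucket) from a packed byte."""
    return (bool(code & 0x80), (code >> 3) & 0x0F,
            bool(code & 0x04), code & 0x03)


def letter_code_of(doc_id: bytes) -> int:
    """Read the 17th byte of a doc_id, assuming it is a byte string."""
    return doc_id[16]
```

Four bits are enough for the nine doc type values (0-8), and two bits cover up to four slash buckets, which is why all four fields fit in a single byte.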