Chris Pollett > Students >
Gargi

    ( Print View)

    [Bio]

    [Blog]

    [C297 Proposal]

    [Paper 1: Large-scale IRLBot crawl PDF]

    [Paper 2: Distributed Crawler Architecture PDF]

    [Paper 3: Scalability Challenges PDF]

    [Paper 4: High Performance Priority Queues PDF]

    [Deliverable 1: Yioop Ranking Mechanisms PDF]

    [Deliverable 2(ii): Modifying Yioop's UI Editor]

    [Deliverable 3: Modifying Yioop's queuing process]

    [Deliverable 4: Yandex Signal PDF]

    [CS 297 Report PDF]

    [C298 Proposal]

    [C298: Yandex-inspired Search Factors]

    [C298: Latest Page Version in SERP]

    [C298: Disjunctive Queries in Yioop Search]

    [CS 298 Report PDF]

    [CS 298 Report Slides PDF]

Modify Search Logic in Yioop to use Disjunctive Queries (CS298 Deliverable#3)

Aim:

  • Converting the current query processing logic to create disjunctive queries instead of conjunctive ones.
  • Using a heap to maintain the top k documents seen so far while fetching results for a search query.
  • Calculating the maxScore associated with each term in the search query and deleting those that cannot make it to the top k results to improve on response time.

Changes made in this Deliverable:

  1. Disjunctive Queries
    • The modifications made to the codebase for this change are all in models/PhraseModel::getPhrasePageResults().
    • The objective of this change is to use disjunct, i.e., OR queries instead of the current conjunct (AND) logic to separate search terms.
    • For example, a search for justin trudeau lang:en safe:true would become justin lang:en safe:true OR trudeau lang:en safe:true OR justin-trudeau lang:en safe:true.
    • To supplement these changes, the input search phrase is first passed through parseWordStructDisjunctiveQuery(), which separates the string into multiple disjunct search strings, which are each processed as individual search queries and collated in UnionIterator.

    • PhraseModel::parseWordStructDisjunctiveQuery():
      • $search_terms stores all of the disjunct query strings.
      • Any text occurring between quotes (& quot;) is first extracted from the query into a separate string.
      • Further, any words separated by an ampersand (& amp;)- indicating a conjunctive query- are extracted into a separate string.
      • Then the remaining terms in the search query are checked. If any meta words are encountered, they are appended to each of the disjunct queries found.
      • Once all of the disjunct queries have been formed, the original search query is passed through PhraseParser::extractPhrases(), and the search terms in the original query are modified accordingly.
      • Finally, the search string is separated by whitespaces for terms to be treated as individual query strings.
      • PhraseModel::guessSemantics() appends any other detected meta words to the disjunct queries.

    • Once all of the disjunct phrases are created, they replace the previous $disjunct_phrases variable.
    • Each of them is passed into PhraseModel::parseWordStructConjunctiveQuery().
    • This overlapping logic (of guessing semantics and extracting phrases) is now removed from parseWordStructConjunctiveQuery().
    • A Word/IntersectIterator is created for each disjunct query as usual.


  1. Max Heap
    • The modifications made to the codebase for this change are all in library/index_bundle_iterators/UnionIterator::findDocsWithWord().
    • Reference
    • $results_heap maintains the top k documents found in a max heap, arranged by their relevance scores.
    • For each set of $docs found on every iterator->currentDocsWithWord() in the UnionIterator, those that can possibly make it into the top k documents are maintained in the heap.
    • The maximum size of the heap (i.e., k) is $results_per_block.
    • If the heap isn't full, the doc is automatically added to the heap. If full, the score of the doc is compared with the current minimum score in the heap; (only) if greater, it replaces the minimum doc.
    • On each insertion, the heap is re-heaped via UnionIterator::heapifyUp() to maintain the max heap property.
    • If the heap becomes full after the insertion of a document, we keep track of the minimum score observed in the current heap and its index to compare the next document with.
    • Once the results for each of the nested iterators is fetched, the top k docs are extracted from the heap one at a time through UnionIterator::extractMaxScoringDoc() and stored in $pages.
    • After each removal of the current top doc in the heap, UnionIterator::heapifyDown() is triggered to retain the max heap property.


  1. MaxScore
    • The modifications made to the codebase for this change are all in library/index_bundle_iterators/UnionIterator::findDocsWithWord().
    • Reference
    • The UnionIterator instance now accepts parameters for the current index name and total number of crawled documents stored in the index (N). Both of these parameters are used to calculate maxScores.
    • First, we iterate over all of the nested iterators in the current UnionIterator instance. Every nested iterator is either an IntersectIterator or a WordIterator. We make a list of all of the word_keys (i.e., the processed search terms) present on each of them in $query_terms.
    • This list is created in UnionIterator::getQueryTerms().
    • For each of the word keys, we store both the maxScore it could possibly contribute to the final set of results, as well as the index of the iterator it appears on.
    • The idea behind calculating the maxScore is that we find an upper bound on the maximum relevance score any document containing that term could have. Once the results max heap is full, we can compare that maxScore with the score of the kth-best scoring doc encountered yet. If lesser, we can safely assume that any document containing only the current query term can never make it to the top k, and thus remove that iterator entirely.
    • The total number of documents in the current index (to find the value of N to compute maxScores) is fetched from library/IndexManager::discountedNumDocsTerm(). This function has been modified to accept a flag param $discount_terms. This flag indicates whether terms should be discounted based on their generation or not, which we set to false to get the total count of occurrences of the current query term across all documents.
    • Once we have a list of query term => maxScore, we use this while inserting documents into the max heap.
    • Each time the heap is full (while iterating over $docs for each nested iterator), we compare the current kth-best score (i.e., the heap's current minimum score) with each of the maxScores while iterating over $query_terms.
    • If the maxScore of a term is not greater than the current heap-minimum, we can safely delete the word iterator associated with it from the union iterator instance.