
HW1 Solutions Page

For the book exercises, these are solutions to the ones that were graded:
[Exercise 1.4] [Problem 1.6] [Exercise 2.3]

For the second part of the homework, I accepted almost any reasonable write-up of the Yioop/Nutch installation procedure and of your queries. For the write-up explaining the order of your query results, I was expecting a description that traced the code from where the web app is first hit to where results are retrieved and scored in the index.

For Yioop, a request first goes to index.php; then search_controller.php's processRequest and processQuery are called. From there, PhraseModel's getPhrasePageResults is called. It in turn calls getSummariesByHash, which first calls getQueryIterator to get an index_bundle_iterator representing the query. It iterates over this iterator to get a batch of scored results (about 200), sorts them, and requests the summaries of the top 10, which are returned to getPhrasePageResults for snippet extraction.

The iterator built by getQueryIterator for a conjunctive query consists of one WordIterator per term. These live in a list in an IntersectIterator, which handles the conjunction and is itself a field of a GroupIterator, which groups documents and links to documents with the same hash. Each of these iterators has a findDocsWithWord method used to retrieve information about a collection of documents. Scoring is initially done in the word iterator's version of this method; these scores are then summed in different ways in the intersect and group iterators.

Exploring a little further, word iterators look up information about posting lists in their constructors by calling IndexManager to get an index by timestamp. This returns an IndexBundleArchive. The IndexDictionary on this bundle has a getWordInfo method, which is used to find an array of offsets into a sequence of IndexShard posting lists. The WordIterator keeps track of which shard it is working on and calls that shard's getPostingsSlice method to get items out of the index. getPostingsSlice also does initial scoring, using IndexShard's makeItem method, based on data from the crawl.
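
The iterator structure described above can be illustrated with a small sketch. This is not Yioop's code (Yioop is written in PHP); it is a minimal Java illustration, assuming each term's posting list is sorted by document id and that per-term scores are simply added. The names Posting, intersect, and topK are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConjunctiveQuerySketch
{
   /** One posting: a document id plus the score this term gives it. */
   static final class Posting
   {
      final int docId;
      final double score;

      Posting(int docId, double score)
      {
         this.docId = docId;
         this.score = score;
      }
   }

   /**
      Intersects posting lists that are each sorted by ascending docId.
      A document survives only if every list contains it; its score is
      the sum of the per-term scores (an assumed, simple combination).
   */
   static List<Posting> intersect(List<List<Posting>> lists)
   {
      List<Posting> result = new ArrayList<Posting>();
      if (lists.isEmpty())
      {
         return result;
      }
      int[] pos = new int[lists.size()];
      while (true)
      {
         // The candidate is the largest docId currently under any cursor.
         int target = Integer.MIN_VALUE;
         for (int i = 0; i < lists.size(); i++)
         {
            if (pos[i] >= lists.get(i).size())
            {
               return result; // one list is exhausted: no more matches
            }
            target = Math.max(target, lists.get(i).get(pos[i]).docId);
         }
         // Advance every cursor to the first docId >= target.
         boolean allMatch = true;
         double sum = 0;
         for (int i = 0; i < lists.size(); i++)
         {
            List<Posting> list = lists.get(i);
            while (pos[i] < list.size() && list.get(pos[i]).docId < target)
            {
               pos[i]++;
            }
            if (pos[i] >= list.size())
            {
               return result;
            }
            if (list.get(pos[i]).docId == target)
            {
               sum += list.get(pos[i]).score;
            }
            else
            {
               allMatch = false;
            }
         }
         if (allMatch)
         {
            result.add(new Posting(target, sum));
            for (int i = 0; i < pos.length; i++)
            {
               pos[i]++;
            }
         }
      }
   }

   /** Sorts a batch by descending score and keeps the first k entries. */
   static List<Posting> topK(List<Posting> batch, int k)
   {
      List<Posting> sorted = new ArrayList<Posting>(batch);
      sorted.sort((a, b) -> Double.compare(b.score, a.score));
      return new ArrayList<Posting>(
         sorted.subList(0, Math.min(k, sorted.size())));
   }

   public static void main(String[] args)
   {
      List<List<Posting>> lists = new ArrayList<List<Posting>>();
      lists.add(Arrays.asList(
         new Posting(1, 0.5), new Posting(3, 1.0), new Posting(7, 0.25)));
      lists.add(Arrays.asList(
         new Posting(3, 0.5), new Posting(5, 2.0), new Posting(7, 2.0)));
      for (Posting p : topK(intersect(lists), 10))
      {
         System.out.println(p.docId + " " + p.score);
      }
   }
}
```

In this sketch intersect plays the role of the conjunction step and topK the "sort the batch, keep the top 10" step; the grouping of documents and links by hash is left out.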

The third part of the homework was graded using this test file.

For the third part of this homework, I picked one of the better students' homeworks to be used as the homework solution. Each student whose homework was chosen will receive 1 bonus point after curving. If you were chosen and would rather your homework not be used as the solution, let me know and I will choose someone else's homework, and they will receive the bonus point instead. I modified the code by replacing tab characters with 3 spaces and deleting vertical stripes of *'s in comments, and I tried to make the code follow the bracing convention in the Departmental Java Coding Guidelines. The documentation could still be improved: each @param and @return should have a brief description of what the given input or output does. There is also an inner class whose methods need to be documented.
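
As an illustration of the kind of @param and @return descriptions meant here, the following is a small sketch and not part of the selected solution. The class JavadocSketch and the method hasFileName are hypothetical names made up for this example.

```java
public class JavadocSketch
{
   /**
      Checks whether the command-line arguments name an input file.

      @param args command-line arguments; args[0] is expected to hold
         the path of the file to index
      @return true if a file name was supplied, false otherwise
   */
   static boolean hasFileName(String[] args)
   {
      return args != null && args.length > 0 && args[0] != null;
   }
}
```

Each tag says what the value means to the caller, not just its Java type, which the method signature already gives.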


import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;


/**
   CS 267 -02  Homework #1
   
   Inverted Index Printer
   
   @author SJSU Students
   @version 1.0 Sep 11, 2012
 */
public class IndexPrinter
{
   /**
      This is the main method of the IndexPrinter class. 
      @param args    String[]
    */
   public static void main (String args[]) 
   {
      if(isOkToProceed(args))
      {
         try
         {
            List<String> contents = readFile(args[0]);
            Map<String, Word> sortedMap = countWords(contents);
            printResult(sortedMap);
         }
         catch(FileNotFoundException e)
         {
            System.err.println("File not found. Please input a valid file");
         }
      }
   }
    
   /**
      Checks if the user entered file is valid or not.

      @param args    String[]
      @return boolean
   */
   private static boolean isOkToProceed(String args[])
   {
      boolean value = true;
      if(args.length == 0 || null == args[0])
      {
         System.err.println("Invalid filename");
         value = false;
      }
      return value;
   }

   /**
      Processes each document: extracts the words from it, creates a Word
      object for each distinct word, and stores the total and per-document
      counts.

      @param contents List<String>
      @return Map<String, Word>
   */
   private static Map<String, Word> countWords(List<String> contents)
   {
      Map<String, Word> sortedMap = new TreeMap<String, IndexPrinter.Word>();
      int docCount = 0;
      for (String content : contents) 
      {
         // Split on any run of whitespace so that repeated spaces and
         // line breaks do not produce empty "words".
         String[] values = content.split("\\s+");
         for (String value : values)
         {
            if(sortedMap.containsKey(value.trim()))
            {
               Word word = sortedMap.get(value.trim());
               word.setTotalCount(word.getTotalCount() + 1);
               word.getDocCount()[docCount]++;
               sortedMap.put(value.trim(), word);
            } 
            else 
            {
               Word word = new IndexPrinter().new Word();
               word.setTotalCount(1);
               word.setWord(value.trim());
               word.getDocCount()[docCount] = 1;
               sortedMap.put(value.trim(), word);
            }
         }
         docCount++;
      }
      return sortedMap;
   }

   /**
      Print the counts in the following format,
      number of documents, number of total occurrences 
      (document number, number of occurrences in that particular document),...

      @param sortedMap Map<String, Word>
   */
   private static void printResult(Map<String, Word> sortedMap)
   {
      Iterator<String> iter = sortedMap.keySet().iterator();
      while(iter.hasNext())
      {
         Word value = sortedMap.get(iter.next());
         int[] docCounts = value.getDocCount();
         int numberOfDocs = 0;
         StringBuilder frequenciesToBePrinted = new StringBuilder();
         int index = 0;
         for (int i : docCounts) 
         {
            index++;
            if(i > 0)
            {
               numberOfDocs++;
               frequenciesToBePrinted.append("(");
               frequenciesToBePrinted.append(index);
               frequenciesToBePrinted.append(",");
               frequenciesToBePrinted.append(i);
               frequenciesToBePrinted.append("),");
            }
         }
         frequenciesToBePrinted.deleteCharAt(
             frequenciesToBePrinted.length()  - 1);
         System.out.println(value.getWord() + "\n " + 
             numberOfDocs + "," +
             value.getTotalCount() + "," + frequenciesToBePrinted);
      }
   }

   /**
      Reads the file, extracts the documents from it, and stores each
      document in a List object.

      @param fileName name of the file to read
      @return contents of the file
      @throws FileNotFoundException
   */
   private static List<String> readFile(String fileName) 
     throws FileNotFoundException
   {
      List<String> contents = new ArrayList<String>();
      Scanner scanner = new Scanner(new File(fileName));
      scanner.useDelimiter("\n\n");
      while(scanner.hasNext())
      {
         String text = scanner.next().trim();
         if(null != text && text.length() > 0)
         {
            contents.add(text);
         }
      }
      scanner.close(); // release the underlying file handle
      return contents;
   }

   /**
      This is an inner class to hold the Word data.
    
    */
   private class Word
   {
      private String word;
      private int totalCount = 0;
      private int[] docCount = new int[50000];
      public String getWord() 
      {
         return word;
      }
      public void setWord(String word) 
      {
         this.word = word;
      }
      public int getTotalCount() 
      {
         return totalCount;
      }
      public void setTotalCount(int totalCount) 
      {
         this.totalCount = totalCount;
      }
      public int[] getDocCount() 
      {
         return docCount;
      }
   }
}
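
To make the output format concrete, here is a condensed sketch that produces the same layout for a corpus held in a string. It is not the selected solution: the class IndexFormatSketch and the method buildIndex are hypothetical, and it swaps the fixed-size per-word count array for a sorted map of document number to count, which is an assumption made here only to keep the example short.

```java
import java.util.Map;
import java.util.TreeMap;

public class IndexFormatSketch
{
   /**
      Builds the printed index for a corpus given as one string in which
      documents are separated by blank lines.

      @param corpus the documents, separated by blank lines
      @return one entry per term: the term on its own line, then a line
         holding the number of documents, the total count, and the
         (document number, count) pairs
   */
   static String buildIndex(String corpus)
   {
      // term -> (document number -> occurrences); TreeMap keeps both sorted
      Map<String, Map<Integer, Integer>> postings =
         new TreeMap<String, Map<Integer, Integer>>();
      int docNumber = 0;
      for (String doc : corpus.split("\n\n"))
      {
         doc = doc.trim();
         if (doc.isEmpty())
         {
            continue; // skip empty chunks left by extra blank lines
         }
         docNumber++;
         for (String term : doc.split("\\s+"))
         {
            Map<Integer, Integer> counts = postings.get(term);
            if (counts == null)
            {
               counts = new TreeMap<Integer, Integer>();
               postings.put(term, counts);
            }
            Integer old = counts.get(docNumber);
            counts.put(docNumber, old == null ? 1 : old + 1);
         }
      }
      StringBuilder out = new StringBuilder();
      for (Map.Entry<String, Map<Integer, Integer>> e : postings.entrySet())
      {
         int total = 0;
         StringBuilder pairs = new StringBuilder();
         for (Map.Entry<Integer, Integer> c : e.getValue().entrySet())
         {
            total += c.getValue();
            pairs.append(",(").append(c.getKey()).append(",")
               .append(c.getValue()).append(")");
         }
         out.append(e.getKey()).append("\n ")
            .append(e.getValue().size()).append(",").append(total)
            .append(pairs).append("\n");
      }
      return out.toString();
   }

   public static void main(String[] args)
   {
      // Two documents: "a b a" and "b c".
      System.out.print(buildIndex("a b a\n\nb c\n"));
   }
}
```

For the two-document corpus in main, the entry for b is the line "b" followed by " 2,2,(1,1),(2,1)": two documents, two occurrences in total, one in each document.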
