Project Blog:

March 8, 2016: Updates.

Worked on:
  • Fixed the raw_data file naming convention to use a unique file name (fname_docscount_timestamp)
  • Changed the k-means output data directory to yioop/work_directory/k_means_data
Points to discuss:
  • Yioop web crawling does not work properly (robots.txt denial issue)
  • There is no need for an alternate m-way merge; the level-2 merging serves the purpose
Questions:
  • Question 1: While integrating the K-means code with Yioop, we have two options
    option 1: save the { doc => summary } raw data on disk, then feed it to K-means [ costs a little extra time but keeps the raw data available for later K-means runs ]
    option 2: feed the { doc => summary } in memory directly to K-means [ saves time, but we won't have access to the raw_data later ]
  • Question 2: The Labelling and Feature Selection parts are still naive (can be improved in post-project work)
TODOs:
  • TODO 1: make a patch for the changes in yioop/src/library/IndexArchiveBundle.php (the k-means call in initGenerationToAdd() and the new API getDocsSummaryOfWord())
  • TODO 2: implement a proper incremental merge in k-means (the assumption is that the first-ever k-means run sets the features, which remain the same for all subsequent k-means runs; this can be improved post-project)
  • TODO 3: use media_updater to call k_means() in initGenerationToAdd()
  • TODO 4: display the k-means output in the UI
  • Report Writing
Post-Meeting Tasks:
  • Modify the vector-averaging logic in the merge step to use a weighted average (see the sketch below)
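
A minimal sketch of the weighted-average merge, assuming centroids are dimension => value arrays that carry the count of docs they represent; the function name is illustrative, not Yioop's actual API:

    // Merge two centroids, weighting each dimension by the number of
    // docs each centroid represents, instead of a plain 50/50 average.
    function mergeCentroids(array $centroid_a, int $count_a,
        array $centroid_b, int $count_b): array
    {
        $total = $count_a + $count_b;
        $merged = [];
        // "+" on arrays takes the key union, so dimensions present in
        // either centroid are covered
        foreach (array_keys($centroid_a + $centroid_b) as $dim) {
            $merged[$dim] = (($centroid_a[$dim] ?? 0) * $count_a
                + (($centroid_b[$dim] ?? 0) * $count_b)) / $total;
        }
        return $merged;
    }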

Dec 14, 2015: Project Meeting.

Will Work on:
  • logarithmic merging (see the sketch after this list)
  • store the output in the cluster data directory
  • make a summary data level for labelling
  • use media_updater to call K-means
  • make a patch 1
  • Directory UI
  • make a patch 2
  • multi-machine mode, make a patch 3
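
A minimal sketch of logarithmic merging, assuming the standard scheme from incremental indexing (an assumption about the plan here, not existing Yioop code): keep at most one partial result per level; a new chunk that lands on an occupied level is merged and carried upward, like binary addition, so each item takes part in only O(log n) merges:

    function logMergeInsert(array &$levels, $chunk, callable $merge)
    {
        $i = 0;
        while (isset($levels[$i])) {   // level occupied: merge and carry
            $chunk = $merge($levels[$i], $chunk);
            unset($levels[$i]);
            $i++;
        }
        $levels[$i] = $chunk;          // first empty level takes the result
    }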

Nov 20, 2015: Updates.

  • Completed the Level-1 and Level-2 K-means implementation and tested it on smaller datasets of 2 files from Yioop, around 150 docs per input file
  • Question 1: While integrating the K-means code with Yioop, we have two options (see the first sketch after this list)
    option 1: save the { doc => summary } raw data on disk, then feed it to K-means [ costs a little extra time but keeps the raw data available for later K-means runs ]
    option 2: feed the { doc => summary } in memory directly to K-means [ saves time, but we won't have access to the raw_data later ]
  • Question 2: The Labelling and Feature Selection parts are still very naive; this needs discussion.
  • Question 3: In Yioop's index switch call, even though I check the condition if ((current_num_docs + add_num_docs) > num_docs_per_gen) before saving the shard, the call to getDocsSummaryOfWord() returns very few docs for the first few shards; in later index switches it returns a reasonable number of doc => summary records.
    So, should these small datasets also be considered for K-means Level-1, or should they be ignored in favor of the bigger datasets?
  • TODO 1: At the end of the K-means-level-2() function, do the following steps (see the second sketch after this list):
    step 1: zip the details and level-1 output chunks, features, and k_means_count files into the folder K-means-round#
    step 2: delete the details and level-1 output chunks, features, and k_means_count files
    This indicates that a Level-1 + Level-2 K-means round is finished and the data directories are reset for the next round
  • TODO 2: Outside of Yioop, make 6 or more chunks of default index_shard-sized { doc => summary } data; each should have roughly 35K entries
    run K-means Level-1 and Level-2 on these
    benchmark the time and quality
  • TODO 3: Integrate K-means with Yioop based on the answer to Question 1, and benchmark time and quality
  • TODO 4: Run in distributed mode
  • Report Writing
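
A minimal sketch contrasting the two options in Question 1, assuming $doc_summaries holds the { doc => summary } map available at index-switch time; run_k_means() and $raw_data_dir are hypothetical placeholders, not Yioop's API:

    // Option 1: persist the raw data first, then feed K-means from disk.
    // Slightly slower, but the raw data stays available for later runs.
    $path = "$raw_data_dir/summaries_" . count($doc_summaries) . "_" . time();
    file_put_contents($path, serialize($doc_summaries));
    run_k_means(unserialize(file_get_contents($path)));

    // Option 2: hand the in-memory map straight to K-means.
    // Faster, but the raw data is gone once this run finishes.
    run_k_means($doc_summaries);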
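And a sketch of TODO 1's round reset using PHP's ZipArchive; the file list and directory layout are illustrative assumptions:

    function resetKMeansRound(array $files, string $data_dir, int $round)
    {
        $zip = new ZipArchive();
        if ($zip->open("$data_dir/K-means-round$round.zip",
            ZipArchive::CREATE) !== true) {
            return false;
        }
        foreach ($files as $file) {       // step 1: archive the round's files
            $zip->addFile($file, basename($file));
        }
        $zip->close();
        foreach ($files as $file) {       // step 2: delete the originals,
            unlink($file);                // resetting the dirs for the next round
        }
        return true;
    }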

Nov 2, 2015: Project Meeting.

Worked on:
  • Non-yioop dataset for Level-2 K-means
  • Getting Yioop V3 running on Mac
Will Work on:
  • Get Yioop V3 running on Mac
  • Get 20 index_shards of 400 docs from Yioop
  • Run Level-2 on the 20 shards collected
  • Deadline: November 20 to complete the Level-2 code and sync it into the Yioop codebase

Sep 28, 2015: Project Meeting.

  • Discussed the K-means implementation: add an extra "unknown" dimension to avoid null vectors (see the sketch after this list)
  • TODO: update the Key Deliverables points to state the clear outcome: a scalable version of K-means for Yioop
  • TODO: use PHP 7 for expected faster K-means performance
  • Discussed the merging logic of K-means in Yioop: [ Level-1: build M separate K-means clusterings, one per index shard ] [ Level-2: build a global K-means clustering using the M Level-1 cluster sets as input ]
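
A minimal sketch of the "unknown" dimension idea, assuming document vectors are term => count arrays built over a fixed feature set; the names are illustrative:

    // Every document vector gets one extra "__unknown__" dimension that
    // is set only when the document matches none of the selected
    // features, so no vector is ever all zeros (a null vector would
    // make cosine similarity and centroid updates ill-defined).
    function toVector(array $doc_tokens, array $features): array
    {
        $vec = array_fill_keys($features, 0);
        $vec['__unknown__'] = 0;
        $matched = false;
        foreach ($doc_tokens as $token) {
            if ($token !== '__unknown__' && isset($vec[$token])) {
                $vec[$token]++;
                $matched = true;
            }
        }
        if (!$matched) {
            $vec['__unknown__'] = 1;  // guarantees a nonzero vector
        }
        return $vec;
    }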

Sep 2, 2014: Project Meeting.

  • Discussed the Naïve Bayes classifier.
  • Decided to write a binary classifier to classify emails as personal or non-personal.

Sep 9, 2014: Project Meeting.

  • Finalized proposal for CS297.
  • Showed a program to extract email bodies from an email account.
  • Discussed how to classify emails as personal or non-personal using a binary classifier.

Sep 16, 2014: Project Meeting.

  • Discussed the Naive Bayes implementation for email spam classification.
  • Got a heads-up on probability generation for various tokens.
  • Decided to use smoothing for out-of-category words and try it on 300 emails (see the sketch below).
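
A minimal sketch of the smoothing idea, assuming add-one (Laplace) smoothing over per-category token counts; the function name and parameters are illustrative, not the project's actual code:

    // Add-one (Laplace) smoothing: every token, seen or unseen, gets a
    // nonzero probability, so a single out-of-category word cannot zero
    // out the whole product of token probabilities.
    function tokenProb(string $token, array $category_counts,
        int $category_total, int $vocab_size): float
    {
        $count = $category_counts[$token] ?? 0;
        return ($count + 1) / ($category_total + $vocab_size);
    }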

Sep 23, 2014: Project Meeting.

  • No meeting (Job Fair)

Sep 30, 2014: Project Meeting.

  • Discussed how to carry out the implementation in detail.
  • Decided to have a basic implementation ready for next time

Oct 7, 2014: Project Meeting.

  • Was advised to refer to previous students' implementations of the Naive Bayes algorithm.
  • Was guided to refer to the AI notes and the book Artificial Intelligence: A Modern Approach.
  • Make a slide deck on hierarchical clustering.

Oct 15, 2014: Project Meeting.

  • Showed a demo of the Naive Bayes classifier on very small data; advised to demo it on a 30+10 email dataset
  • Presented the HAC (Hierarchical Agglomerative Clustering) slide deck
  • A basic implementation of HAC is required for the next meeting
  • Discussed the actual development module: Yioop's Crawler + Indexer section
  • Brainstormed the idea of incremental MapReduce

Oct 29, 2014: Project Meeting.

  • No progress was shown because of personal reasons.

Nov 4, 2014: Project Meeting.

  • Brainstormed the HAC implementation in terms of Data Structures
  • Decided to go with the text-vector concept to compute the similarity between two texts, i.e., cosine similarity (see the sketch after this list)
  • Was informed that there would be no meeting on Nov 11, 2014 because of Veterans Day
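
A minimal sketch of cosine similarity over term => weight text vectors (illustrative code, not the project's final implementation):

    function cosineSimilarity(array $a, array $b): float
    {
        $dot = 0.0;
        foreach ($a as $term => $weight) {   // only shared terms contribute
            if (isset($b[$term])) {
                $dot += $weight * $b[$term];
            }
        }
        $sq = function ($w) { return $w * $w; };
        $norm_a = sqrt(array_sum(array_map($sq, $a)));
        $norm_b = sqrt(array_sum(array_map($sq, $b)));
        return ($norm_a && $norm_b) ? $dot / ($norm_a * $norm_b) : 0.0;
    }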

Nov 18, 2014: Project Meeting.

  • Discussed the implementation difficulties of HAC in terms of time complexity, i.e., a priority queue vs. the naive approach (see the sketch below)
  • Clarified the implementation using a concrete example
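
A minimal sketch of the naive approach, assuming each cluster is a list of document vectors and $similarity scores two clusters (e.g., via their centroids): every merge rescans all O(n^2) pairs, giving O(n^3) overall, whereas keeping pair similarities in a priority queue makes the best-pair lookup cheap and yields roughly O(n^2 log n). Illustrative code:

    function naiveHac(array $clusters, callable $similarity, int $k): array
    {
        while (count($clusters) > $k) {
            $best = [-1, -1];
            $best_sim = -INF;
            $keys = array_keys($clusters);
            // Scan every cluster pair for the most similar one
            for ($i = 0; $i < count($keys); $i++) {
                for ($j = $i + 1; $j < count($keys); $j++) {
                    $sim = $similarity($clusters[$keys[$i]], $clusters[$keys[$j]]);
                    if ($sim > $best_sim) {
                        $best_sim = $sim;
                        $best = [$keys[$i], $keys[$j]];
                    }
                }
            }
            // Merge the most similar pair into one cluster
            $clusters[$best[0]] = array_merge($clusters[$best[0]], $clusters[$best[1]]);
            unset($clusters[$best[1]]);
        }
        return $clusters;
    }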

Nov 25, 2014: Project Meeting.

  • Implemented and demoed HAC on a small set of documents
  • Was instructed to change the code so that it works for K clusters
  • Was instructed to understand, and write a report on, the recipe plugin and manage-classifier code in Yioop's codebase
  • Was instructed to add the Naive Bayes and Hierarchical Clustering Algorithm folders to the BLOG

Dec 2, 2014: Project Meeting.

  • Complete deliverable #3, i.e., a report on understanding the Recipe plugin in Yioop
  • Add the HAC (Hierarchical Agglomerative Clustering) presentation as a PDF on the Home Page
  • Add links to the Naive Bayes Classifier and Hierarchical Agglomerative Clustering work on the Home Page
  • Add all deliverables to the Home Page
  • Add the CS297 Report (6 pages) to the Home Page

Feb 3, 2015: Project Meeting.

  • Discussed the basic roadmap of the project
  • Clone the Yioop REPO and get the local copy working for crawling and indexing
  • Understand the working of INDEXING process, especially INDEX SWITCH and SHARDS MERGING
  • Study the INDEXED data format and write a basic Hierarchical Clustering code for it
  • Plan the data format for serializing and deserializing the Hierarchy of Clusters.