Project Blog:

March 8, 2016: Updates.

Worked on:
  • Fixed the raw_data file naming convention to use a unique file name (fname_docscount_timestamp)
  • Changed the k-means output data directory to yioop/work_directory/k_means_data
Points to discuss:
  • Yioop web crawling does not work properly (robots.txt denial issue)
  • There is no need for an alternate m-way merge; the level-2 merging serves the purpose
Questions:
  • Question 1: While integrating the K-means code with Yioop, we have two options
    option 1: save the { doc => summary } raw data on disk, then feed it to K-means [ costs a little extra time but keeps the raw data available for later K-means runs ]
    option 2: feed the { doc => summary } in memory directly to K-means [ saves time, but we won't have access to the raw_data later ]
  • Question 2: The Labelling and Feature Selection parts are still naive (can be improved in post-project work)
TODOs:
  • TODO 1: make a patch for the changes in yioop/src/library/IndexArchiveBundle.php (the k-means call in initGenerationToAdd() and the new API getDocsSummaryOfWord())
  • TODO 2: implement a proper incremental merge in k-means (the assumption is that the first-ever k-means run sets the features, which remain the same for all subsequent k-means runs; this can be improved post-project)
  • TODO 3: use media_updater to call k_means() in initGenerationToAdd()
  • TODO 4: display the k-means output in the UI
  • Report Writing
Post-Meeting Tasks:
  • Modify the vector-averaging logic in the merge step to use a weighted average (see the sketch below)
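
A minimal sketch of the weighted-average merge, assuming centroids are dimension => value arrays that carry the count of docs they represent; the function name is illustrative, not Yioop's actual API:

    // Merge two centroids, weighting each dimension by the number of
    // docs each centroid represents, instead of a plain 50/50 average.
    function mergeCentroids(array $centroid_a, int $count_a,
        array $centroid_b, int $count_b): array
    {
        $total = $count_a + $count_b;
        $merged = [];
        // "+" on arrays takes the key union, so dimensions present in
        // either centroid are covered
        foreach (array_keys($centroid_a + $centroid_b) as $dim) {
            $merged[$dim] = (($centroid_a[$dim] ?? 0) * $count_a
                + (($centroid_b[$dim] ?? 0) * $count_b)) / $total;
        }
        return $merged;
    }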

Dec 14, 2015: Project Meeting.

Will Work on:
  • logarithmic merging (see the sketch after this list)
  • store the output in the cluster data directory
  • make a summary data level for labelling
  • use media_updater to call K-means
  • make a patch 1
  • Directory UI
  • make a patch 2
  • multi-machine mode, make a patch 3
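
A minimal sketch of logarithmic merging, assuming the standard scheme from incremental indexing (an assumption about the plan here, not existing Yioop code): keep at most one partial result per level; a new chunk that lands on an occupied level is merged and carried upward, like binary addition, so each item takes part in only O(log n) merges:

    function logMergeInsert(array &$levels, $chunk, callable $merge)
    {
        $i = 0;
        while (isset($levels[$i])) {   // level occupied: merge and carry
            $chunk = $merge($levels[$i], $chunk);
            unset($levels[$i]);
            $i++;
        }
        $levels[$i] = $chunk;          // first empty level takes the result
    }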

Nov 20, 2015: Updates.

  • Completed the Level-1 and Level-2 K-means implementation and tested it on smaller datasets of 2 files from Yioop, around 150 docs per input file
  • Question 1: While integrating the K-means code with Yioop, we have two options (see the first sketch after this list)
    option 1: save the { doc => summary } raw data on disk, then feed it to K-means [ costs a little extra time but keeps the raw data available for later K-means runs ]
    option 2: feed the { doc => summary } in memory directly to K-means [ saves time, but we won't have access to the raw_data later ]
  • Question 2: The Labelling and Feature Selection parts are still very naive; this needs discussion.
  • Question 3: In Yioop's index switch call, even though I check the condition if ((current_num_docs + add_num_docs) > num_docs_per_gen) before saving the shard, the call to getDocsSummaryOfWord() returns very few docs for the first few shards; in later index switches it returns a reasonable number of doc => summary records.
    So, should these small datasets also be considered for K-means Level-1, or should they be ignored in favor of the bigger datasets?
  • TODO 1: At the end of the K-means-level-2() function, do the following steps (see the second sketch after this list):
    step 1: zip the details and level-1 output chunks, features, and k_means_count files into the folder K-means-round#
    step 2: delete the details and level-1 output chunks, features, and k_means_count files
    This indicates that a Level-1 + Level-2 K-means round is finished and the data directories are reset for the next round
  • TODO 2: Outside of Yioop, make 6 or more chunks of default index_shard-sized { doc => summary } data; each should have roughly 35K entries
    run K-means Level-1 and Level-2 on these
    benchmark the time and quality
  • TODO 3: Integrate K-means with Yioop based on the answer to Question 1, and benchmark time and quality
  • TODO 4: Run in distributed mode
  • Report Writing
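
A minimal sketch contrasting the two options in Question 1, assuming $doc_summaries holds the { doc => summary } map available at index-switch time; run_k_means() and $raw_data_dir are hypothetical placeholders, not Yioop's API:

    // Option 1: persist the raw data first, then feed K-means from disk.
    // Slightly slower, but the raw data stays available for later runs.
    $path = "$raw_data_dir/summaries_" . count($doc_summaries) . "_" . time();
    file_put_contents($path, serialize($doc_summaries));
    run_k_means(unserialize(file_get_contents($path)));

    // Option 2: hand the in-memory map straight to K-means.
    // Faster, but the raw data is gone once this run finishes.
    run_k_means($doc_summaries);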
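And a sketch of TODO 1's round reset using PHP's ZipArchive; the file list and directory layout are illustrative assumptions:

    function resetKMeansRound(array $files, string $data_dir, int $round)
    {
        $zip = new ZipArchive();
        if ($zip->open("$data_dir/K-means-round$round.zip",
            ZipArchive::CREATE) !== true) {
            return false;
        }
        foreach ($files as $file) {       // step 1: archive the round's files
            $zip->addFile($file, basename($file));
        }
        $zip->close();
        foreach ($files as $file) {       // step 2: delete the originals,
            unlink($file);                // resetting the dirs for the next round
        }
        return true;
    }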

Nov 2, 2015: Project Meeting.

Worked on:
  • Non-yioop dataset for Level-2 K-means
  • Getting Yioop V3 running on Mac
Will Work on:
  • Get Yioop V3 running on Mac
  • Get 20 index_shards of 400 docs from Yioop
  • Run Level-2 on the 20 shards collected
  • Deadline: November 20 to complete the Level-2 code and sync it into the Yioop codebase

Sep 28, 2015: Project Meeting.

  • Discussed the K-means implementation: add an extra "unknown" dimension to avoid null vectors (see the sketch after this list)
  • TODO: update the Key Deliverables points to state the clear outcome: a scalable version of K-means for Yioop
  • TODO: use PHP 7 for expected faster K-means performance
  • Discussed the merging logic of K-means in Yioop: [ Level-1: build M separate K-means clusterings, one per index shard ] [ Level-2: build a global K-means clustering using the M Level-1 cluster sets as input ]
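
A minimal sketch of the "unknown" dimension idea, assuming document vectors are term => count arrays built over a fixed feature set; the names are illustrative:

    // Every document vector gets one extra "__unknown__" dimension that
    // is set only when the document matches none of the selected
    // features, so no vector is ever all zeros (a null vector would
    // make cosine similarity and centroid updates ill-defined).
    function toVector(array $doc_tokens, array $features): array
    {
        $vec = array_fill_keys($features, 0);
        $vec['__unknown__'] = 0;
        $matched = false;
        foreach ($doc_tokens as $token) {
            if ($token !== '__unknown__' && isset($vec[$token])) {
                $vec[$token]++;
                $matched = true;
            }
        }
        if (!$matched) {
            $vec['__unknown__'] = 1;  // guarantees a nonzero vector
        }
        return $vec;
    }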

Sep 2, 2014: Project Meeting.

  • Discussed the Naïve Bayes classifier.
  • Decided to write a binary classifier to classify emails as personal or non-personal.

Sep 9, 2014: Project Meeting.

  • Finalized proposal for CS297.
  • Showed a program to extract email bodies from an email account.
  • Discussed how to classify emails as personal or non-personal using a binary classifier.

Sep 16, 2014: Project Meeting.

  • Discussed the Naive Bayes implementation for email spam classification.
  • Got a heads-up on probability generation for various tokens.
  • Decided to use smoothing for out-of-category words and try it on 300 emails (see the sketch below).
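
A minimal sketch of the smoothing idea, assuming add-one (Laplace) smoothing over per-category token counts; the function name and parameters are illustrative, not the project's actual code:

    // Add-one (Laplace) smoothing: every token, seen or unseen, gets a
    // nonzero probability, so a single out-of-category word cannot zero
    // out the whole product of token probabilities.
    function tokenProb(string $token, array $category_counts,
        int $category_total, int $vocab_size): float
    {
        $count = $category_counts[$token] ?? 0;
        return ($count + 1) / ($category_total + $vocab_size);
    }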

Sep 23, 2014: Project Meeting.

  • No meeting (Job Fair)

Sep 30, 2014: Project Meeting.

  • Discussed how to carry out the implementation in detail.
  • Decided to have a basic implementation ready for next time

Oct 7, 2014: Project Meeting.

  • Was advised to refer to previous students' implementations of the Naive Bayes algorithm.
  • Was guided to refer to the AI notes and the book Artificial Intelligence: A Modern Approach.
  • Make a slide deck on hierarchical clustering.

Oct 15, 2014: Project Meeting.

  • Showed a demo of the Naive Bayes classifier on very small data; advised to demo it on a 30+10 email dataset
  • Presented the HAC (Hierarchical Agglomerative Clustering) slide deck
  • A basic implementation of HAC is required for the next meeting
  • Discussed the actual development module: Yioop's Crawler + Indexer section
  • Brainstormed the idea of incremental MapReduce

Oct 29, 2014: Project Meeting.

  • No progress was shown because of personal reasons.

Nov 4, 2014: Project Meeting.

  • Brainstormed the HAC implementation in terms of Data Structures
  • Decided to go with the text-vector concept to compute the similarity between two texts, i.e., cosine similarity (see the sketch after this list)
  • Was informed that there would be no meeting on Nov 11, 2014 because of Veterans Day
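
A minimal sketch of cosine similarity over term => weight text vectors (illustrative code, not the project's final implementation):

    function cosineSimilarity(array $a, array $b): float
    {
        $dot = 0.0;
        foreach ($a as $term => $weight) {   // only shared terms contribute
            if (isset($b[$term])) {
                $dot += $weight * $b[$term];
            }
        }
        $sq = function ($w) { return $w * $w; };
        $norm_a = sqrt(array_sum(array_map($sq, $a)));
        $norm_b = sqrt(array_sum(array_map($sq, $b)));
        return ($norm_a && $norm_b) ? $dot / ($norm_a * $norm_b) : 0.0;
    }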

Nov 18, 2014: Project Meeting.

  • Discussed the implementation difficulties of HAC in terms of time complexity, i.e., a priority queue vs. the naive approach (see the sketch below)
  • Clarified the implementation using a concrete example
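
A minimal sketch of the naive approach, assuming each cluster is a list of document vectors and $similarity scores two clusters (e.g., via their centroids): every merge rescans all O(n^2) pairs, giving O(n^3) overall, whereas keeping pair similarities in a priority queue makes the best-pair lookup cheap and yields roughly O(n^2 log n). Illustrative code:

    function naiveHac(array $clusters, callable $similarity, int $k): array
    {
        while (count($clusters) > $k) {
            $best = [-1, -1];
            $best_sim = -INF;
            $keys = array_keys($clusters);
            // Scan every cluster pair for the most similar one
            for ($i = 0; $i < count($keys); $i++) {
                for ($j = $i + 1; $j < count($keys); $j++) {
                    $sim = $similarity($clusters[$keys[$i]], $clusters[$keys[$j]]);
                    if ($sim > $best_sim) {
                        $best_sim = $sim;
                        $best = [$keys[$i], $keys[$j]];
                    }
                }
            }
            // Merge the most similar pair into one cluster
            $clusters[$best[0]] = array_merge($clusters[$best[0]], $clusters[$best[1]]);
            unset($clusters[$best[1]]);
        }
        return $clusters;
    }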

Nov 25, 2014: Project Meeting.

  • Implemented and demoed HAC on a small set of documents
  • Was instructed to change the code so that it works for K clusters
  • Was instructed to understand, and write a report on, the recipe plugin and manage-classifier code in Yioop's codebase
  • Was instructed to add the Naive Bayes and Hierarchical Clustering Algorithm folders to the BLOG

Dec 2, 2014: Project Meeting.

  • Complete deliverable #3, i.e., a report on understanding the Recipe plugin in Yioop
  • Add the HAC (Hierarchical Agglomerative Clustering) presentation as a PDF on the Home Page
  • Add links to the Naive Bayes Classifier and Hierarchical Agglomerative Clustering work on the Home Page
  • Add all deliverables to the Home Page
  • Add the CS297 Report (6 pages) to the Home Page

Feb 3, 2015: Project Meeting.

  • Discussed the basic roadmap of the project
  • Clone the Yioop REPO and get the local copy working for crawling and indexing
  • Understand the working of INDEXING process, especially INDEX SWITCH and SHARDS MERGING
  • Study the INDEXED data format and write a basic Hierarchical Clustering code for it
  • Plan the data format for serializing and deserializing the Hierarchy of Clusters.