Report On Recipe Plugin And How To Scale Out

Aim

To understand how the Recipe Plugin works and to propose a way to scale it out.

Introduction

RecipePlugin causes Yioop, during a crawl, to run the code in indexing_plugins/recipe_plugin.php. This code tries to detect pages that are food recipes, extracts these recipes separately, and clusters them by ingredient. It then adds the search meta-words ingredient: and recipe:all, which allow one to search for recipes by ingredient or to restrict results to documents containing recipes.

Implementation

This section discusses the code of the recipe plugin. The plugin detects food recipes that occur on pages during a crawl and creates "micro-documents" for the recipes it finds. These are stored in the index during the crawl under the meta-word "recipe:all". After the crawl is over, the recipe plugin's postProcessing method is called. It looks up all the documents associated with the word "recipe:all", extracts ingredients from them, and clusters the recipes by ingredient. Finally, it injects new meta-words of the form "ingredient:some_food_ingredient", where some_food_ingredient is one of the clusters. These meta-words can be used to retrieve the recipes most closely associated with a given ingredient.

RecipePlugin subclasses IndexingPlugin and implements the following four methods: getProcessors(), pageProcessing($page, $url), postProcessing($index_name), and getAdditionalMetaWords().

Description

During a web crawl, after a fetcher has downloaded a batch of web pages, it uses each page's mimetype to choose a page processor class that extracts summary data from that page. Yioop provides several such processors.

As a crawl proceeds, the recipe plugin is typically called to do further processing of a page only for some of these processors, after the processor has done its own work. A page processor has a method handle($page, $url), which Yioop calls with a string $page holding a downloaded document and a string $url giving the address it was downloaded from. This method first calls the processor's process($page, $url) method to do the initial summary extraction and then calls the pageProcessing($page, $url) method of the recipe plugin associated with that processor.

A pageProcessing($page, $url) method is expected to return an array of subdoc arrays found on the given page. Yioop inserts each resulting "micro-document" into the index under the word nameofplugin:all. After the crawl is over, Yioop calls the postProcessing($index_name) method of each indexing plugin that was in use. The recipe plugin searches the index and uses the results of these searches to inject new meta-words into the index. For Yioop to be aware of the meta-words a plugin adds, the plugin must implement the method getAdditionalMetaWords().
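The flow described above can be modelled in a few lines. The following is only a rough Python sketch, not Yioop's PHP code: the class RecipePluginSketch, the dictionary used as an index, and the "mentions ingredients" test for recognising a recipe are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the index: meta-word -> documents.
Index = Dict[str, List[dict]]


class RecipePluginSketch:
    """Toy model of an indexing plugin; not Yioop's actual PHP class."""

    name = "recipe"

    def page_processing(self, page: str, url: str) -> List[dict]:
        # Assumption: a page counts as a recipe if it mentions ingredients.
        # The real detection logic lives in recipe_plugin.php.
        if "ingredients" not in page.lower():
            return []
        return [{"url": url, "text": page}]


def handle(
    page: str,
    url: str,
    process: Callable[[str, str], dict],
    plugin: RecipePluginSketch,
    index: Index,
) -> dict:
    """Run the processor first, then the plugin, as the text describes."""
    summary = process(page, url)  # initial summary extraction
    for subdoc in plugin.page_processing(page, url):
        # Each micro-document is stored under "<plugin name>:all".
        index.setdefault(f"{plugin.name}:all", []).append(subdoc)
    return summary
```

With this model, a page that mentions ingredients ends up under the key "recipe:all", while any other page leaves the index unchanged.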

Probable Problem and Scale Out Idea

At its core, the recipe plugin relies on clustering with Kruskal's algorithm. The idea is simple: a minimum spanning tree (MST) of the given graph is first built using Kruskal's algorithm, and this tree is then used to create the required number of clusters. Clusters are formed by removing the most expensive edges from the minimum spanning tree. The number of edges to remove is one less than the required number of clusters.
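The clustering step can be illustrated with a minimal Python sketch. It assumes vertices are numbered 0 to n-1, the graph is connected, and edge weights behave like distances (a lower weight means a closer pair). The function names are hypothetical and are not taken from the plugin's source.

```python
from typing import Dict, List, Tuple

Edge = Tuple[float, int, int]  # (weight, vertex_u, vertex_v)


def find(parent: List[int], x: int) -> int:
    """Return the root of x in the union-find forest (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def kruskal_mst(n: int, edges: List[Edge]) -> List[Edge]:
    """Build an MST; the returned edges are in ascending weight order."""
    parent = list(range(n))
    mst: List[Edge] = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(parent, u), find(parent, v)
        if root_u != root_v:  # the edge joins two components, so keep it
            parent[root_u] = root_v
            mst.append((weight, u, v))
    return mst


def mst_clusters(n: int, edges: List[Edge], k: int) -> List[List[int]]:
    """Drop the k - 1 most expensive MST edges and return the components."""
    mst = kruskal_mst(n, edges)
    kept = mst[: max(len(mst) - (k - 1), 0)]
    parent = list(range(n))
    for _, u, v in kept:
        parent[find(parent, u)] = find(parent, v)
    groups: Dict[int, List[int]] = {}
    for vertex in range(n):
        groups.setdefault(find(parent, vertex), []).append(vertex)
    return sorted(groups.values())
```

For instance, with five vertices and the edges (1, 0, 1), (2, 1, 2), (10, 2, 3), (1, 3, 4) and (12, 0, 4), asking for two clusters removes the weight-10 edge from the tree and leaves the groups [0, 1, 2] and [3, 4].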

In the actual application to Yioop, the graph's vertices represent pages or documents, and the weighted edges represent how strongly pages are related in terms of common keywords. For example, an edge A-B of weight 50 means that page A and page B have 50 words in common. This becomes a serious problem when the graph grows so large that it no longer fits in memory. Building the graph for the given pages and assigning accurate weights to its edges is a time-consuming pre-processing step. After that, the MST still has to be found with Kruskal's algorithm, which slows down clustering further. We need an alternative that lets the clustering scale out efficiently. Possible options are Hierarchical Agglomerative Clustering or some form of external-memory workaround.
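To make the pre-processing cost concrete, here is a small Python sketch of building the shared-word graph. It assumes whitespace tokenisation and uses hypothetical function names; the plugin's real weighting may differ. The point it illustrates is that every pair of documents has to be compared, so n documents can produce up to n(n-1)/2 edges.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shared_word_edges(docs: Dict[str, str]) -> List[Tuple[str, str, int]]:
    """Return (doc_a, doc_b, number_of_common_words) for each related pair."""
    # Assumption: words are lower-cased and split on whitespace.
    word_sets: Dict[str, Set[str]] = {
        name: set(text.lower().split()) for name, text in docs.items()
    }
    edges = []
    # Every unordered pair is examined once: quadratic in the document count.
    for a, b in combinations(sorted(word_sets), 2):
        common = len(word_sets[a] & word_sets[b])
        if common:
            edges.append((a, b, common))
    return edges


def max_edge_count(n: int) -> int:
    """Upper bound on the number of edges for n documents."""
    return n * (n - 1) // 2
```

Under this bound, 1,000 recipe documents give at most 499,500 edges, while 1,000,000 documents give roughly 5 x 10^11, which is why holding the full graph in memory stops being practical.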