Chris Pollett > Students > Bocage

Advisor: Dr. Chris Pollett

Description:

I propose to modify the Yioop search engine so that it can detect the Content Management System (CMS) that produced the content it is crawling. A CMS is a tool “for building and maintaining web applications for many disciplines” (Mooney & Baenziger, 2007). In other words, a CMS helps organizations stand up web content quickly with virtually no web programming experience. For example, the Herald Sun, an Australian news source, uses a CMS called WordPress, and “WordPress is used by 25.4% of all the websites” (W3Techs, 2015). WordPress offers its users plugins, themes, site management functionality, and other features for publishing content with a consistent layout.

When a search engine crawls a page, it usually extracts the most important parts to store in its index. Knowing which CMS produced a page makes it easier to locate that important content, but the Yioop search engine does not currently detect the CMS behind the pages it crawls. Accurately targeting the important content will allow Yioop to produce better summaries of what it has crawled. This works because most CMS-generated web pages follow a specific pattern when rendering a page in a browser. For example, the page content, navigation content, and side-bar content are located in HTML tags that carry the same names or attributes regardless of the content inside them. Furthermore, the important content will also be located in an HTML tag that is common to all of the pages a given CMS produces.
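To make the idea concrete, here is a minimal sketch of pulling the text out of one known container element. It is written in Python with only the standard library and is not Yioop code. The class name `entry-content` is an assumption about how a WordPress theme might mark its post body, and `ContainerTextExtractor` and `extract_main_text` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContainerTextExtractor(HTMLParser):
    """Collects the text inside the first element carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0       # > 0 while the parser is inside the target element
        self.done = False    # set once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth += 1
        else:
            # Attribute values can be None, hence the "or ''" fallback.
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_main_text(html, target_class):
    """Return the text found inside the first element with target_class."""
    parser = ContainerTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Called as `extract_main_text(page, "entry-content")`, this skips navigation and side-bar text that lives outside the container. The depth counter assumes every non-void tag is explicitly closed, so pages that omit closing tags such as `</p>` would need a more tolerant parser.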

My plan for the semester is to create an abstract CMS detection framework that is flexible enough to detect virtually any CMS based on input from its users. Yioop users will configure the CMS detection settings through a new GUI Activity that I will create within the Yioop search engine. The GUI Activity will allow users to add or remove CMS detection rules without writing any code. I will also produce a before-and-after comparison of summary results and a patch that Dr. Pollett can include in a subsequent release of the Yioop search engine.
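One way such user-supplied settings could drive detection is to treat each CMS as a row of data rather than as code. The sketch below is in Python for illustration only and does not reflect Yioop's actual design. It assumes a page announces its CMS through a `<meta name="generator">` tag, which not every site emits. The `CmsRule` fields, the `detect_cms` function, and the example rule values are all hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class CmsRule:
    """One user-defined detection rule (all field names are hypothetical)."""
    name: str              # label shown to the user, e.g. "WordPress"
    generator_marker: str  # substring expected in <meta name="generator">
    content_class: str     # class of the element holding the main content


class GeneratorFinder(HTMLParser):
    """Records the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == "generator":
            self.generator = attr_map.get("content") or ""


def detect_cms(html, rules):
    """Return the first rule whose marker appears in the generator tag, else None."""
    finder = GeneratorFinder()
    finder.feed(html)
    finder.close()
    if finder.generator is None:
        return None
    generator = finder.generator.lower()
    for rule in rules:
        if rule.generator_marker.lower() in generator:
            return rule
    return None


# Example rule set, standing in for what a user might enter through the GUI.
# The content_class values are illustrative guesses, not verified markup.
RULES = [
    CmsRule("WordPress", "wordpress", "entry-content"),
    CmsRule("Drupal", "drupal", "node-content"),
]
```

Because the rules are plain data, adding or removing a CMS only changes the list passed to `detect_cms`, which mirrors the goal of managing detection without writing code.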

Deliverables:

The following will be done by the end of CS280:

1. A new GUI Activity in the Yioop search engine to allow users to manage its CMS detection settings

2. An extension of the fetcher in the Yioop search engine that incorporates the user-defined CMS detection settings

3. A short report comparing the summary results before adding CMS detection and after adding CMS detection

4. A patch to Yioop that implements my proposed changes

I will be meeting with Dr. Pollett once a week to discuss my progress and any problems that I run into.

References:

[Mooney2007] Extensible open source content management systems and frameworks: a solution for many needs of a bioinformatics group. S. D. Mooney, P. H. Baenziger. Oxford Journals, 9(1), 69-74. 2007

[W3Techs2015] Usage of content management systems for websites. W3Techs. 2015