CS297 Proposal

Automated article generation using the web

Gaurang Patel (gaurangtpatel@gmail.com)

Advisor: Dr. Chris Pollett

Description:

The web is a huge source of information, but its content is not organized. Search engines can find useful information and present it to the user as a list of links, but they do not organize information from different websites into a coherent resource such as a book. An article generation application is an intelligent mining engine that looks for web content, then combines and organizes it in a meaningful way to generate an article. For the CS297/CS298 project, we will build such an article generation tool. It will provide free articles to people based on their requirements, generating an article on a topic entered by the user from information available on the web. Information retrieval, semantic web, and information extraction approaches will be used to develop the application. Generated articles will be in an electronic format, i.e., an e-book or a web tutorial presented as web content. The articles will have well-defined sections, each covering a different aspect of the topic. Technically, the sections can be thought of as the different clusters found while searching for material on the web. Sections can have subsections, subsections can have sub-subsections, and so on down to the desired depth. The details of the topic and the desired depth of coverage may be the input parameters of the system.
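The nested section structure and the depth parameter described above can be pictured with a small data structure. The following is only a minimal sketch: the `Section` class, the `render` function, and the sample topic are hypothetical names chosen for illustration, and nothing here is the project's actual design.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """One node of the article outline (hypothetical structure)."""
    title: str
    text: str = ""
    subsections: List["Section"] = field(default_factory=list)


def render(section: Section, max_depth: int, level: int = 1) -> List[str]:
    """Return indented outline lines, descending at most max_depth levels."""
    if level > max_depth:
        return []
    lines = ["  " * (level - 1) + section.title]
    for child in section.subsections:
        lines.extend(render(child, max_depth, level + 1))
    return lines


# Sample outline; in the proposed system each section would come from a cluster.
article = Section("Web crawling", subsections=[
    Section("Crawl scheduling", subsections=[Section("Politeness policies")]),
    Section("Storing fetched pages"),
])

# The desired depth acts as an input parameter: level 3 is cut off here.
outline = render(article, max_depth=2)
print("\n".join(outline))
```

With `max_depth=2` the sub-subsection is omitted; raising the value to 3 would include it.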

Schedule:

Week 1-2: Jan. 25 - Feb. 6. Write the CS297 proposal. Read articles and research papers. Decide on a web crawler to use, for example Nutch.
Week 3: Feb. 7 - Feb. 13. Get used to the crawler and its APIs. Work on Deliverable 1. Use the crawler APIs to get data. Store the data in a database.
Week 4-5: Feb. 14 - Feb. 27. Deliverable 1 due.
Week 6-7: Feb. 28 - Mar. 13. Read articles and research papers on clustering techniques to eliminate noise, OR look at various summarizers.
Week 8-9: Mar. 14 - Mar. 28. Read articles to get ideas for the algorithm implementation. Think about the organization (merge, combine) algorithm, OR select a summarizer and learn about it.
Week 10: Mar. 29 - Apr. 5. Prepare for implementation of the algorithm, OR enhance the summarizer.
Week 11-12: Apr. 6 - Apr. 20. Try to get a basic algorithm running. Performance improvements can be handled in CS298. Deliverable 2 due.
Week 13: Apr. 20 - May 4. Web content generation. Article generation using web technologies.
Week 14: May 5 - May 12. Prepare for a sample run of the system. Deliverable 3 due. Prepare the CS297 Report.
Week 15: May 13 - May 19. Finish the CS297 Report. Deliverable 4 due.

Deliverables:

The full project will be done when CS298 is completed. The following will be done by the end of CS297:

1. Download Nutch (web crawler) and test various sample crawl scenarios.

2. Download Open Text Summarizer (http://libots.sourceforge.net/) and enhance it to meet the summarization goal, or develop an intelligent algorithm to generate an article/book.

3. Get Carrot clustering working for clustering document contents.

4. CS297 Report
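
To make the summarizing goal in deliverable 2 more concrete, here is a toy extractive summarizer that keeps the sentences whose words occur most often in the text. It is only an illustrative sketch: it is not the algorithm or API of Open Text Summarizer, and the function name `summarize`, its parameters, and the sample text are assumptions made for this example.

```python
import re
from collections import Counter
from typing import List


def summarize(text: str, max_sentences: int = 1) -> List[str]:
    """Keep the highest-scoring sentences, returned in their original order."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text.
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence: str) -> float:
        # Average frequency of the words in the sentence.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    return [s for s in sentences if s in top]


sample = (
    "Crawlers fetch pages. "
    "Crawlers store pages in a database. "
    "The weather is nice."
)
summary = summarize(sample, max_sentences=2)
print(summary)
```

On the sample text the off-topic third sentence scores lowest and is dropped. A real summarizer would also need stop-word handling and stemming, which this sketch leaves out.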

References:

[2007] Web Page Analysis: Experiments Based on Discussion and Purchase Web Patterns. Jana Kocibova, Karel Klos, Ondrej Lehecka, Milos Kudelka, Vaclav Snasel. 2007 IEEE/WIC/ACM International Conferences on Web Intelligence.

[2007] A Method for Integration of Web Applications Based on Information Extraction. Hao Han, Takehiro Tokuda. Eighth International Conference on Web Engineering.

[2007] A Novel Method for Hierarchical Clustering of Search Results. Gang Zhang, Yue Liu, Songbo Tan, Xueqi Cheng. 2007 IEEE/WIC/ACM International Conferences on Web Intelligence.