CS298 Proposal

Automated article generation using the web

Gaurang Patel (gaurangtpatel@gmail.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Cay Horstmann, Dr. Mark Stamp

Abstract:

Search engines can find useful information and present it to the user as a list of web links, but they do not organize information from different websites into a coherent resource such as an article. An article generation application is an intelligent mining engine that looks for web content, then combines and organizes the information in a meaningful way to generate an article. For the CS297/CS298 project, we will build such an article generation tool. The tool will provide free articles to people based on their requirements: it will generate an article on the topic entered by the user, using information available on the web. Generated articles will be in an electronic format, i.e., web material or a tutorial presented as web content. The articles will have well-defined sections, each covering a different aspect of the topic.

Separate web crawler and clustering modules were developed during CS297. The primary goal of the CS298 coursework is to integrate these modules into the final article generation system.

CS297 Results

  • Set up the Nutch web crawler to crawl the web.
  • Configured, built, and used OTS (Open Text Summarizer) for summarization.
  • Got Carrot2 clustering working for clustering document contents.
  • CS297 Report

Proposed Schedule

Week 1-2 (Aug 24 - Sept 06): CS298 proposal. Start working on Deliverable 1.
Week 3-4 (Sept 07 - Sept 20): Deliverable 1.a
Week 5-6 (Sept 21 - Oct 04): Deliverable 1.b
Week 7-8 (Oct 05 - Oct 18): Deliverable 1.c. Finish up with Deliverable 1.
Week 9-10 (Oct 19 - Nov 01): Start working on Deliverable 2. Read research papers, articles.
Week 11-12 (Nov 02 - Nov 15): Deliverable 2
Week 13-14 (Nov 16 - Nov 29): Finish Deliverable 2
Week 15-16 (Nov 30 - Dec 13): CS298 report
Week 17 (Dec 14 - Dec 20): Defense

Key Deliverables:

  • Software
    1. Integration of the Nutch crawler, searcher, OTS, and clustering engine. The output of this deliverable will be a ready-to-use article generation website. This deliverable comprises the following sub-deliverables:
      1. Website front-end development in PHP, HTML, and CSS.
      2. Crawl the web using Nutch so that crawled results are ready for the system.
      3. Back-end implementation. Combine the Lingo (the clustering algorithm used by Carrot2) and OTS algorithms to create article sections.
      4. Article generation website. The user will be able to enter a topic and see the generated article.
    2. Enhance the quality of the article generation engine if needed. Possible ways include a mixture of any of the following:
      1. Look for additional algorithms to integrate into the system.
      2. Look for noise reduction techniques.
  • Report
    1. Detailed report on software deliverables
    2. Final report and presentation
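
To make the back-end step in the software deliverables more concrete, the following is a minimal sketch of how crawled snippets might be grouped into sections and condensed. It is an illustration only: it does not use Lingo or OTS. Grouping is approximated by picking terms shared across snippets, and summarization by simple term-frequency sentence scoring. All names (`pick_labels`, `group_snippets`, `summarize`, `generate_article`) and the stopword list are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not taken from OTS or Carrot2).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on",
             "it", "by", "with", "as", "that", "this", "can", "be"}


def tokenize(text):
    """Return lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def pick_labels(snippets, topic, max_sections=3):
    """Stand-in for clustering: terms found in the most snippets become section labels."""
    topic_words = set(tokenize(topic))
    doc_freq = Counter()
    for snippet in snippets:
        doc_freq.update(set(tokenize(snippet)) - topic_words)
    # Rank by document frequency; break ties alphabetically so results are deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, count in ranked if count >= 2][:max_sections]


def group_snippets(snippets, labels):
    """Assign each snippet to the first label it contains, or to 'other'."""
    groups = {label: [] for label in labels}
    groups["other"] = []
    for snippet in snippets:
        words = set(tokenize(snippet))
        label = next((l for l in labels if l in words), "other")
        groups[label].append(snippet)
    return {label: docs for label, docs in groups.items() if docs}


def summarize(text, max_sentences=1):
    """Stand-in for summarization: keep the sentences with the highest term-frequency score."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))
    scored = sorted(sentences, key=lambda s: -sum(freq[w] for w in tokenize(s)))
    keep = set(scored[:max_sentences])
    return " ".join(s for s in sentences if s in keep)


def generate_article(topic, snippets, max_sections=3):
    """Return a list of (section title, section body) pairs for the topic."""
    labels = pick_labels(snippets, topic, max_sections)
    groups = group_snippets(snippets, labels)
    return [(label.title(), summarize(" ".join(docs))) for label, docs in groups.items()]
```

Calling `generate_article(topic, snippets)` with a topic string and a list of text snippets yields one titled section per group. In the proposed system, the grouping and summarizing roles would be filled by Lingo and OTS rather than by these simplified stand-ins.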

Innovations and Challenges

  • Integrating the OTS and Lingo algorithms will be challenging because they were developed in different languages.
  • Currently there are no automated article generators available, so there are no established efficiency measures against which to check the engine's quality.
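
One common way to combine tools written in different languages is to run one of them as a separate process and exchange text over standard input and output. The sketch below shows that pattern only; it is not the integration method chosen for this project. The helper name `run_external_tool` is hypothetical, and the command used here is a placeholder that uppercases its input, not a real OTS or Carrot2 invocation.

```python
import subprocess
import sys


def run_external_tool(command, text, timeout=30):
    """Send text to an external program's stdin and return what it writes to stdout.

    `command` is a list such as ["some-tool", "--some-flag"]; the actual command
    line for a given tool is an assumption the caller must supply.
    """
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # exchange str rather than bytes
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Placeholder command: a tiny Python one-liner standing in for an external tool.
PLACEHOLDER_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

With this pattern, the calling side only needs to know the tool's command line and its input and output formats, not its implementation language. The trade-off is the cost of starting a process per call, which may matter if many documents are processed.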
