CS298 Proposal

Text Summarization using SVD (Singular Value Decomposition) and Lanczos Algorithm

Youn, Kim (youn.kim@students.sjsu.edu)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Jon Pearce(pearce@cs.sjsu.edu) and Dr. Tseng, Chris(tseng@cs.sjsu.edu)

Abstract:

When browsing the Internet, one can easily get overwhelmed by the flood of text available. No one wants to waste time reading long texts on a web page. By summarizing a website, one can quickly read an accurate representation of its contents and save time by knowing what is worth reading and what is not. For this reason, an efficient text summarizer is needed that will provide a non-redundant extract from the original site. All the major search engines (Google, Yahoo, etc.) use automated text summarization and present condensed descriptions of the search results. My project also deals with text summarization. Though many summarization methods exist, I will make use of the Lanczos algorithm for the computation of a few eigenvalues and eigenvectors of a large sparse matrix and SVD (Singular Value Decomposition), taking a high-dimensional set of data and reducing it to a lower-dimensional set of data. This makes it possible to identify the best approximation of the original text by using the information acquired from the Lanczos algorithm earlier.
For CS 298, I will extend the deliverables I developed in CS 297 to create a program that generates a text summary of a web site. Since I will deal with large data sets of words and sentences from different pages in a web site, it is natural that I will use a database. Thus I will create SQL scripts that will calculate SVD using the Lanczos algorithm, and will be able to extract a summary from a web site.

CS297 Results

  • I learned how to calculate SVD, eigenvalues, and eigenvectors.
  • I learned how to convert a symmetric matrix into a tridiagonal matrix by implementing the Lanczos algorithm.
  • I learned to incorporate the Lanczos algorithm into calculating SVD and different techniques to make my implementation more stable.
  • I learned how all of the previous deliverables could be put together to reach my final objective-text summarization.

Proposed Schedule

Week 1 - 2: 08/25 - 09/08Research and improve the numerical stability about Lanczos algorithm and QR algorithm.
Week 3: 09/09 - 09/15Deliverable 1: Implement Lanczos algorithm in SQL
Week 4 - 5: 09/16 - 09/29Deliverable 2: Implement SVD in SQL
Week 6 - 7: 09/30 - 10/13Deliverable 3: Implement a text summarizer that generates a text summary of a web site
Week 8 - 9: 10/14 - 10/27Work on Deliverable 3
Week 10 - 11: 10/28 - 11/10Work on Deliverable 3
Week 12 - 13: 11/11 - 11/24Work on the CS 298 Final Report
Week 14: 11/25 - 12/01Complete and submit draft report for committee members, Prepare presentation slides
Week 15: 12/02 - 12/08Prepare presentation slides
Week 16: 12/09 - 12/16Oral Defense

Key Deliverables:

  • Software
    • Deliverable 1: Implement Lanczos algorithm in SQL
    • Deliverable 2: Implement SVD in SQL
    • Deliverable 3: Implement a text summarizer that generates a text summary of a web site
  • Report
    • Detailed report on software deliverables
    • Final report and presentation

Innovations and Challenges

  • Currently, no one has implemented calculating SVD using Lanczos algorithm in SQL and used it in text summarization.
  • Dealing with very large sets of data is a challenge.
  • Implementing SVD using Lanczos algorithm in SQL is a huge challenge.

References:

[1983] Golub, G. H. and C. F. Van Loan, Matrix Computations, The John Hopkins University Press, Baltimore, 1983.

[2002] Cullum, J., and Willoughby, R. (2002). Lanczos Algorithms for Large Symmetric Eigenvalue Computations Vol.1: Theory. SIAM, Philadelphia.