CS298 Proposal
Text Summarization using SVD (Singular Value Decomposition) and Lanczos Algorithm
Youn, Kim (youn.kim@students.sjsu.edu)
Advisor: Dr. Chris Pollett
Committee Members: Dr. Jon Pearce(pearce@cs.sjsu.edu) and Dr. Tseng, Chris(tseng@cs.sjsu.edu)
Abstract:
When browsing the Internet, one can easily get overwhelmed by the flood of text available. No one wants to waste time reading long texts on a web page.
By summarizing a website, one can quickly read an accurate representation of its contents and save time by knowing what is worth reading and what is not.
For this reason, an efficient text summarizer is needed that will provide a non-redundant extract from the original site.
All the major search engines (Google, Yahoo, etc.) use automated text summarization and present condensed descriptions of the search results.
My project also deals with text summarization. Though many summarization methods exist, I will make use of the Lanczos algorithm for the computation of a few eigenvalues and eigenvectors of a large sparse matrix and SVD (Singular Value Decomposition), taking a high-dimensional set of data and reducing it to a lower-dimensional set of data.
This makes it possible to identify the best approximation of the original text by using the information acquired from the Lanczos algorithm earlier.
For CS 298, I will extend the deliverables I developed in CS 297 to create a program that generates a text summary of a web site.
Since I will deal with large data sets of words and sentences from different pages in a web site, it is natural that I will use a database.
Thus I will create SQL scripts that will calculate SVD using the Lanczos algorithm, and will be able to extract a summary from a web site.
CS297 Results
- I learned how to calculate SVD, eigenvalues, and eigenvectors.
- I learned how to convert a symmetric matrix into a tridiagonal matrix by implementing the Lanczos algorithm.
- I learned to incorporate the Lanczos algorithm into calculating SVD and different techniques to make my implementation more stable.
- I learned how all of the previous deliverables could be put together to reach my final objective-text summarization.
Proposed Schedule
Week 1 - 2:
08/25 - 09/08 | Research and improve the numerical stability about Lanczos algorithm and QR algorithm. |
Week 3:
09/09 - 09/15 | Deliverable 1: Implement Lanczos algorithm in SQL |
Week 4 - 5:
09/16 - 09/29 | Deliverable 2: Implement SVD in SQL |
Week 6 - 7:
09/30 - 10/13 | Deliverable 3: Implement a text summarizer that generates a text summary of a web site |
Week 8 - 9:
10/14 - 10/27 | Work on Deliverable 3 |
Week 10 - 11:
10/28 - 11/10 | Work on Deliverable 3 |
Week 12 - 13:
11/11 - 11/24 | Work on the CS 298 Final Report |
Week 14:
11/25 - 12/01 | Complete and submit draft report for committee members, Prepare presentation slides |
Week 15:
12/02 - 12/08 | Prepare presentation slides |
Week 16:
12/09 - 12/16 | Oral Defense |
Key Deliverables:
- Software
- Deliverable 1: Implement Lanczos algorithm in SQL
- Deliverable 2: Implement SVD in SQL
- Deliverable 3: Implement a text summarizer that generates a text summary of a web site
- Report
- Detailed report on software deliverables
- Final report and presentation
Innovations and Challenges
- Currently, no one has implemented calculating SVD using Lanczos algorithm in SQL and used it in text summarization.
- Dealing with very large sets of data is a challenge.
- Implementing SVD using Lanczos algorithm in SQL is a huge challenge.
References:
[1983] Golub, G. H. and C. F. Van Loan, Matrix Computations, The John Hopkins University Press, Baltimore, 1983.
[2002] Cullum, J., and Willoughby, R. (2002). Lanczos Algorithms for Large Symmetric Eigenvalue Computations Vol.1: Theory. SIAM, Philadelphia.
|