CS298 Proposal

A Fast Alogorithm For Data Mining

Aarathi Raghu (aarathi99@yahoo.com)

Advisor: Dr. Chris Pollett

Committee Members: Dr. Chris Pollett, Dr. T.Y.Lin, Dr. Mark Stamp

Abstract:

Data Mining is a growing field and a variety of algorithms have been proposed. The Apriori algorithm is a commonly used algorithm in data mining. Recently, Lin and Louie[Lin2002] have proposed using bitmaps to improve the speed of this algorithm. Further, Lin, Hu, and Louie[Lin2003],have proposed a way to look at association rules as a lattice. A basis for this lattice could be used to generate any association rule for a given dataset. However, algorithms for generating a basis for this lattice are potentially exponential time. We studied modifications of the Apriori algorithm and the lattice based algorithm in CS297. Implementation of these algorithms formed the three deliverables for CS297. We have successfully completed the three deliverables- the bitmap based apriori ,the disk-based improvement of the previous implementation, and the lattice based algorithm.The goal of this project is to implement and test this lattice generating algorithm for datasets which follow a particular kind of distribution, for example, a Zipf distribution. We are hoping that for different kinds of distributions the algorithm might run faster and also be easily improved. Based on our experiments, we intend to make improvements to this algorithm in CS298. We also intend to try out a binning mechanism to make data conform more to a particular distribution and run the lattice-algorithm against it.

CS297 Results

Implemented the Apriori algorithm using bitmaps
Implemented the disk-based apriori algorithm
Implemented the lattice-based algorithm

Proposed Schedule

Week 1: 1/25-1/31	Prepare CS298 proposal
Week 2-3: 2/1-2/14	Implement the Zipf distribution
Week 4-6: 2/15-3/8	Implement the data distribution checker
Week 7-9: 3/9-3/29	Implement program to bin data into a desired distribution.Run this data against the lattice algorithm.
Week 10-12: 3/30-4/20	Start writing report
Week 13: 4/20-4/27	Submit draft to committee
Week 14-15: 4/27-5/15	Prepare for presentation
Week 16: 5/15-5/20-	Final Presentation

Key Deliverables:

Software
- Implement a program that generates a data set with the Zipf distribution
- A program which checks which one of the following distributions a data-set conforms to. The distributions we will consider are the Zipf distribution, Pareto distribution(also known as Bradford distribution), power normal distribution,normal distribution, exponential distribution, and hypergeometric distribution.
- A program that bins data to a desired distribution to run against the lattice-algorithm
- Do experiments on rows in table for fixed number of columns versus number of association rules generated.
Report
- Detailed description of the architecture of software
- Detailed description of results obtained during the course of the semester

Innovations and Challenges

All of these algorithms run in exponential time. Finding out where they run in polynomial time is a challenge.
Choose a binning mechanism so that the associations still maintain their relevance is a challenge.
Implementing a program that distinguishes between chosen statistical distributions.

References:

[Ramakrishnan] R.Ramakrishnan and J.Gehkre. Fundamentals of Database Systems.McGraw-Hill, 2002

[Molina] H.Garcia-Molina, J.Ullman, and J.Widom.Database System Implementation.Prentice-Hall, 2000

[Agarwal1994] R.Agarwal, and R. Srikant. Fast Algorithms for Mining Association Rules. Proc. Intl. Conf. on Very Large Databases. pp1522-1534.

[Lin2003] T.Y.Lin, X.T.Hu, and E.Louie.Using Attribute Value Lattice to Find Frequent Itemsets. Data Mining and Knowledge Discovery: Theory,Tools and Technology. 2003.pp 28-36.

[Lin2002] T.Y.Lin, and Eric Louie. Finding Asscoiation Rules by Granular Computing: Fast Algorithm for finding association rules. Data Mining, Rough Sets and Granular Computing. 2002.pp 23-42.