# Ronald Mak

Department of Computer Engineering
Department of Computer Science
Department of Applied Data Science
Spring Semester 2020

Office hours: TuTh 4:30-5:30 PM
Office location: ENG 250
E-mail: ron.mak@sjsu.edu
Mission Control, Jet Propulsion Laboratory (JPL)
NASA Mars Exploration Rover Mission

# DATA 220 Mathematical Methods for Data Analysis

### Assignments

| # | Assigned | Due | Assignment | Files |
|---|----------|-----|------------|-------|
| 1 | Jan 23 | Jan 30 | CSV datasets and Jupyter notebooks | Jupyter notebooks: TitanicCSV.ipynb, AirlineSafetyCSV.ipynb, BostonCrimeCSV.ipynb; CSV files: crimes-in-boston.zip |
| 2 | Jan 30 | Feb 6 | Seaborn Bar Charts of Random Values | |
| 3 | Feb 6 | Feb 13 | Analysis of a Dataset | Example analysis: TitanicAnalysis.ipynb, TitanicSurvival.csv |
| 4 | Feb 13 | Feb 20 | Combinatorics and Probability Problem Set | Solutions: Assignment4-solutions.ipyn |
| 5 | Feb 20 | Feb 27 | Probability Problem Set | Solutions: Assignment5-solutions.ipyn |
| 6 | Feb 27 | Mar 5 | The Central Limit Theorem | |
| 7 | Mar 19 | Mar 26 | Linear regression | |
| 8 | Mar 26 | Apr 9 | Multiple Regression Analysis | |
| 9 | Apr 9 | Apr 16 | Supervised and Unsupervised Machine Learning | |
| 10 | Apr 16 | Apr 23 | Text analysis | |
| 11 | Apr 25 | Apr 30 | Matrix Operations Problem Set | Solutions: Assignment11-solutions.ipyn |
| 12 | May 2 | May 7 | Polynomial Regression and Markov Chain Problem Set | Solutions: Assignment12-solutions.ipyn, PolynomialRegression.py |

### Lectures

Week Date Content
1 Jan 23 Slides: Introduction to data analytics; What is Data Science?; history of data collection; history of data analysis; Python libraries; datasets; load CSV files into dataframes; statistics and machine learning; data scientist skillset

Lab: Install Anaconda
Lab: Load CSV datasets into dataframes
Jupyter notebooks: TitanicCSV.ipynb    AirlineSafetyCSV.ipynb    BostonCrimeCSV.ipynb
CSV files: crimes-in-boston.zip
2 Jan 30 Slides: Big Data; Jupyter notebooks; IPython; lists; indexing; length; mutable; two-dimensional; unpacking; sort; search; list comprehension; list operations; tuples; NumPy arrays; Seaborn; bar chart; random values

Lab: Seaborn Bar Charts
Jupyter notebooks: AgeBarChart.ipynb    5.02-Lists.ipynb    5.03-Tuples.ipynb    5.05-Slicing.ipynb    5.06-Deletion.ipynb    5.08-Sorting.ipynb    5.09-Searching.ipynb    5.10-OtherMethods.ipynb    5.12-Comprehensions.ipynb
3 Feb 6 Slides: Statistics; sums; measures of central tendency; mean; weighted average; median; mode; measures of variability; range; percentiles; quartiles; interquartile range (IQR); variance; standard deviation; zip; data analysis with the Titanic Survival dataset

Lab: Descriptive Statistics
Jupyter notebooks: sum.ipynb    mean.ipynb    weighted_average.ipynb    median.ipynb    mode.ipynb    range.ipynb    quartiles.ipynb    IQR.ipynb    stdev.ipynb    ziptest.ipynb    TitanicAnalysis.ipynb    TitanicSurvival.csv    boxplot.png
4 Feb 13 Slides: Counting principles; factorial notation; count the complement; counting when order doesn't matter; binomial coefficients; collections that allow repetitions; permutations and combinations; uncertainty; probability: classical and relative frequency interpretations; basic probability laws; Venn diagrams

Lab: Combinatorics and probability problem set
5 Feb 20 Slides: Conditional probability; independent vs. dependent events; Bayes' Theorem; Thomas Bayes and Bayesian statistics; disease test example; Monty Hall Problem; discrete and continuous random variables; probability distributions: uniform, normal, exponential, binomial, Poisson; expected value; animated graphs

Lab: Probability problem set
Python program: RollDieDynamic.py
Jupyter notebook: PltAnimation.ipynb
6 Feb 27 Slides: Computer simulations; Monty Hall simulation program; Monty Hall with n doors and k cars; statistics; sampling; random vs. biased sampling; sampling error; estimates of the population mean; the Central Limit Theorem; sampling distribution of the sample means; number of samples; size of samples; standard error; sampling distributions of the mean, median, and standard deviation

Lab: Central Limit Theorem
Jupyter notebook: MontyHall.ipynb
7 Mar 5 Slides: Discrete vs. continuous random variables; area under the curve; normal probability distribution; standard normal distribution; standard normal distribution probabilities; confidence interval; critical values; level of significance; margin of error; small sample estimates; Student's t distribution; t confidence interval; interpretation of confidence intervals; hypothesis testing; test procedure; Type I and Type II errors; test statistic; null hypothesis rejection regions; hypothesis testing examples

Jupyter notebook: StandardNormal.ipynb
Fall 2019 midterm: Midterm-Fall2019.ipynb    Midterm-Solutions-Fall2019    TitanicSurvival.csv
8 Mar 12 Midterm

Video recording
Slides: Null vs. alternative hypothesis; small sample hypothesis tests; testing two population means with large and small samples
9 Mar 19 Video recording
Slides: Tactics for solving probability problems; midterm solutions; hypothesis testing and experiments; different significance levels; using P-values; dependent and independent variables; scatter plots; regression analysis; regression line; slope-intercept; residual values; least-squares line; coefficient of determination; correlation coefficient; correlation and causation; Assignment #7

Midterm solutions: midterm solutions
Jupyter notebooks: StandardNormal.ipynb   LeastSquaresLine.ipynb   CoeffOfDet.ipynb   Correlation.ipynb
10 Mar 26 Video recording
Slides: Python regression analysis functions; NY City temperatures example; time-series analysis; linear trend over time; moving averages; exponential smoothing; another perspective on linear regression; multiple linear regression; normal equations; home prices example; introduction to machine learning; supervised; unsupervised; steps for doing ML; time-series analysis via ML example; split the data; train the estimator; test the model; make predictions; multiple regression via ML; California housing example; underfitting and overfitting

Jupyter notebooks: NYCTemps.ipynb   TimeSeries.ipynb HomePrices.ipynb NYCTempsML.ipynb   CaliforniaHousing.ipynb
Least squares module: LeastSquaresLine.py
Dataset: ave_hi_nyc_jan_1895-2018.csv
11 Apr 9 Video recording
Slides: scikit-learn machine learning algorithms; "Big Data"; supervised ML; k-nearest neighbor ML classification algorithm with the Digits dataset; training and testing the KNN model; confusion matrix; classification report; unsupervised ML; dimensionality reduction; TSNE estimator; k-means clustering algorithm with the Iris dataset; principal component analysis; Assignment #9

Jupyter notebooks: k-NearestNeighbors.ipynb   k-DimensionalityReduction.ipynb   k-MeansClustering.ipynb
12 Apr 16 Video recording Password B6*2\$OE?
Slides: Natural language processing; NLP examples; Assignment #10; linear algebra; vectors; vector arithmetic; normalize a vector; vectors and NumPy; matrices; matrix arithmetic; matrix inverse; matrices and NumPy; Hilbert matrices; graphic transformation matrix

Jupyter notebooks: nlp.zip   matrices.zip
13 Apr 23 Video recording Password 2W!=1^O!
Slides: Linear equations; home prices example; solve a system of linear equations; equivalent systems; graphical solution; consistent and inconsistent systems; ill-conditioned systems; augmented matrix solution; row echelon form; Gaussian elimination solution; matrix representation; calculate a matrix inverse; solution using an inverse; singular matrix and system; least-squares solution with matrices; QR factorization; linearly independent; orthonormal columns; upper-triangular; Gram-Schmidt Orthonormalization Process

Jupyter notebooks: LinearEquationPlots.ipynb    MatrixInverse.ipynb    HomePricesSolve.ipynb    LeastSquaresMatrix.ipynb    MultivariateMatrix.ipynb    QR.ipynb    Splines.ipynb
14 Apr 30 Video recording Password 0F%o5CoI
Slides: Nonlinear relationships; normal equations; polynomial regression; polynomial regression with matrices and QR factorization; Markov chain; Markov steady state

Jupyter notebooks: PolynomialRegression.py    PolynomiaRegression1.ipynb    PolynomiaRegression2.ipynb    PolynomiaRegression3.ipynb    PolynomiaRegression4.ipynb    PolynomialRegressionMatrices.ipynb    MarkovChain.ipynb
15 May 7 Video recording Password 4O+Tvg1N
Slides: Eigenvalues and eigenvectors; uses; geometric interpretations; normalized eigenvectors; compute eigenvalues; compute eigenvectors; numpy.linalg.eig() function; matrix powers

Jupyter notebooks: EigenFactorization.py    MatrixPowers.py    VectorPlots.ipynb    TestEigenFactorization1.ipyn    TestEigenFactorization2.ipyn    TestEigenFactorization3.ipyn    TimeEigen.ipynb    TestMatrixPower.ipynb
Fall 2019 final exam: Final-Fall2019.ipynb    Final-Solutions-Fall2019
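
The week 14 and 15 topics (Markov steady state, `numpy.linalg.eig()`, matrix powers) fit together in a few lines. The following is a minimal sketch and is not taken from the course notebooks: the two-state transition matrix `P` and the helper name `steady_state` are made up for illustration. It finds the steady state as the eigenvector for the eigenvalue 1 and compares the result against a high power of the matrix.

```python
import numpy as np

# Hypothetical two-state transition matrix (rows sum to 1); the values are
# illustrative and do not come from the course materials.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])


def steady_state(P):
    """Return the stationary row vector pi satisfying pi @ P == pi."""
    # Left eigenvectors of P are the (right) eigenvectors of P transposed.
    eigenvalues, eigenvectors = np.linalg.eig(P.T)
    # Pick the eigenvector whose eigenvalue is closest to 1.
    k = np.argmin(np.abs(eigenvalues - 1.0))
    v = np.real(eigenvectors[:, k])
    # Rescale so the entries sum to 1 (a probability distribution).
    return v / v.sum()


pi = steady_state(P)

# Every row of a high power of P approaches the steady state.
P_50 = np.linalg.matrix_power(P, 50)

print("steady state:", pi)
print("row of P^50: ", P_50[0])
```

For this particular matrix the balance equation 0.1 a = 0.5 b gives the steady state (5/6, 1/6), which both computations should reproduce to within floating-point error.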

### Goals

Become familiar with the mathematical foundations for data analytics, including probability, statistics, and linear algebra. Obtain practical experience solving problems with the Python data analytics modules and functions.
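
As one small illustration of the kind of exercise these goals point to, here is a sketch (not taken from the course materials) that checks the Central Limit Theorem by simulation with NumPy. The exponential population, the sample size, the number of samples, and the seed are all arbitrary assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=220)  # arbitrary seed, for repeatability

sample_size = 50       # observations per sample
num_samples = 10_000   # number of samples drawn

# Population: exponential distribution with mean 2.0 (its sigma is also 2.0).
population_mean = 2.0
population_sigma = 2.0

# Each row is one sample; take the mean of each row.
samples = rng.exponential(scale=population_mean,
                          size=(num_samples, sample_size))
sample_means = samples.mean(axis=1)

# The sampling distribution of the mean has standard error sigma / sqrt(n);
# the Central Limit Theorem adds that its shape is approximately normal.
predicted_se = population_sigma / np.sqrt(sample_size)
observed_se = sample_means.std(ddof=1)

# For a normal shape, about 95% of sample means fall within 1.96 standard errors.
within = np.mean(np.abs(sample_means - population_mean) < 1.96 * predicted_se)

print(f"mean of sample means:     {sample_means.mean():.3f}")
print(f"predicted standard error: {predicted_se:.3f}")
print(f"observed standard error:  {observed_se:.3f}")
print(f"fraction within 1.96 SE:  {within:.3f}")
```

Even though the exponential population is strongly skewed, the sample means should center on the population mean with a spread close to the predicted standard error.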

### Course Learning Outcomes (CLO)

• CLO 1: Understand the probabilistic and statistical foundations of data analytics.
• CLO 2: Understand and apply linear algebra for data analytics.
• CLO 3: Develop programming skills to facilitate objectives 1 and 2.

### Prerequisites

Instructor consent.

### Required text

Introduction to Statistics & Data Analysis, 6th edition. Roxy Peck, Tom Short, and Chris Olsen. Cengage, 2019. ISBN 978-1-337-79361-2

### Recommended books

You can find most of what you’ll need online. Sample tutorials:
But if you like to read textbooks, here are some suggestions.
Each of these books has many more topics in much greater depth
than this course will cover.

• Linear Algebra: A Modern Introduction, 4th edition. David Poole. Cengage Learning, 2015. ISBN 978-1-285-46324-7
• Linear Algebra (Schaum’s Outline), 6th edition. Seymour Lipschutz and Marc Lipson. McGraw-Hill, 2018. ISBN 978-1-260-01144-9
• Data Science from Scratch: First Principles with Python, 2nd edition. Joel Grus. O’Reilly, 2019. ISBN 978-1-492-04113-9
• Python Data Science Handbook. Jake VanderPlas. O’Reilly, 2017. ISBN 978-1-491-91205-8