San Jose State University

Working in Mars Mission Control, JPL

Ronald Mak

Department of Computer Engineering
Department of Computer Science
Department of Applied Data Science
Spring Semester 2020

Office hours: TuTh 4:30-5:30 PM
Office location: ENG 250
E-mail: ron.mak@sjsu.edu
Mission Control, Jet Propulsion Laboratory (JPL)
NASA Mars Exploration Rover Mission

DATA 220 Mathematical Methods for Data Analysis


Th 6:00 - 8:45 PM, Health Building, room HB 106


Assignments

# Assigned Due Assignment
1 Jan 23 Jan 30 CSV datasets and Jupyter notebooks

Jupyter notebooks: TitanicCSV.ipynb    AirlineSafetyCSV.ipynb    BostonCrimeCSV.ipynb
CSV files: crimes-in-boston.zip
2 Jan 30 Feb 6 Seaborn Bar Charts of Random Values
3 Feb 6 Feb 13 Analysis of a Dataset

Example analysis: TitanicAnalysis.ipynb    TitanicSurvival.csv
4 Feb 13 Feb 20 Combinatorics and Probability Problem Set

Solutions: Assignment4-solutions.ipyn
5 Feb 20 Feb 27 Probability Problem Set

Solutions: Assignment5-solutions.ipyn
6 Feb 27 Mar 5 The Central Limit Theorem
7 Mar 19 Mar 26 Linear regression
8 Mar 26 Apr 9 Multiple Regression Analysis
9 Apr 9 Apr 16 Supervised and Unsupervised Machine Learning
10 Apr 16 Apr 23 Text analysis
11 Apr 25 Apr 30 Matrix Operations Problem Set

Solutions: Assignment11-solutions.ipyn
12 May 2 May 7 Polynomial Regression and Markov Chain Problem Set

Solutions: Assignment12-solutions.ipyn    PolynomialRegression.py

Lectures

Week Date Content
1 Jan 23 Slides: Introduction to data analytics; What is Data Science? history of data collection; history of data analysis; Python libraries; datasets; load CSV files into dataframes; statistics and machine learning; data scientist skillset

Lab: Install Anaconda
Lab: Load CSV datasets into dataframes
Jupyter notebooks: TitanicCSV.ipynb    AirlineSafetyCSV.ipynb    BostonCrimeCSV.ipynb
CSV files: crimes-in-boston.zip
2 Jan 30 Slides: Big Data; Jupyter notebooks; IPython; lists; indexing; length; mutable; two-dimensional; unpacking; sort; search; list comprehension; list operations; tuples; NumPy arrays; Seaborn; bar chart; random values

Lab: Seaborn Bar Charts
Jupyter notebooks: AgeBarChart.ipynb    5.02-Lists.ipynb    5.03-Tuples.ipynb    5.05-Slicing.ipynb    5.06-Deletion.ipynb    5.08-Sorting.ipynb    5.09-Searching.ipynb    5.10-OtherMethods.ipynb    5.12-Comprehensions.ipynb
3 Feb 6 Slides: Statistics; sums; measures of central tendency; mean; weighted average; median; mode; measures of variability; range; percentiles; quartiles; interquartile range (IQR); variance; standard deviation; zip; data analysis with the Titanic Survival dataset

Lab: Descriptive Statistics
Jupyter notebooks: sum.ipynb    mean.ipynb    weighted_average.ipynb    median.ipynb    mode.ipynb    range.ipynb    quartiles.ipynb    IQR.ipynb    stdev.ipynb    ziptest.ipynb    TitanicAnalysis.ipynb    TitanicSurvival.csv    boxplot.png
4 Feb 13 Slides: Counting principles; factorial notation; count the complement; counting when order doesn't matter; binomial coefficients; collections that allow repetitions; permutations and combinations; uncertainty; probability: classical and relative frequency interpretations; basic probability laws; Venn diagrams

Lab: Combinatorics and probability problem set
5 Feb 20 Slides: Conditional probability; independent vs. dependent events; Bayes' Theorem; Thomas Bayes and Bayesian statistics; disease test example; Monty Hall Problem; discrete and continuous random variables; probability distributions: uniform, normal, exponential, binomial, Poisson; expected value; animated graphs

Lab: Probability problem set
Python program: RollDieDynamic.py
Jupyter notebook: PltAnimation.ipynb
6 Feb 27 Slides: Computer simulations; Monty Hall simulation program; Monty Hall with n doors and k cars; statistics; sampling; random vs. biased sampling; sampling error; estimates of the population mean; the Central Limit Theorem; sampling distribution of the sample means; number of samples; size of samples; standard error; sampling distributions of the mean, median, and standard deviation

Lab: Central Limit Theorem
Jupyter notebook: MontyHall.ipynb
7 Mar 5 Slides: Discrete vs. continuous random variables; area under the curve; normal probability distribution; standard normal distribution; standard normal distribution probabilities; confidence interval; critical values; level of significance; margin of error; small sample estimates; Student's t distribution; t confidence interval; interpretation of confidence intervals; hypothesis testing; test procedure; Type I and Type II errors; test statistic; null hypothesis rejection regions; hypothesis testing examples

Jupyter notebook: StandardNormal.ipynb
Fall 2019 midterm: Midterm-Fall2019.ipynb    Midterm-Solutions-Fall2019    TitanicSurvival.csv
8 Mar 12 Midterm

Video recording
Slides: Null vs. alternative hypothesis; small sample hypothesis tests; testing two population means with large and small samples
9 Mar 19 Video recording
Slides: Tactics for solving probability problems; midterm solutions; hypothesis testing and experiments; different significance levels; using P-values; dependent and independent variables; scatter plots; regression analysis; regression line; slope-intercept; residual values; least-squares line; coefficient of determination; correlation coefficient; correlation and causation; Assignment #7

Midterm solutions: midterm solutions
Jupyter notebooks: StandardNormal.ipynb   LeastSquaresLine.ipynb   CoeffOfDet.ipynb   Correlation.ipynb
10 Mar 26 Video recording
Slides: Python regression analysis functions; NY City temperatures example; time-series analysis; linear trend over time; moving averages; exponential smoothing; another perspective on linear regression; multiple linear regression; normal equations; home prices example; introduction to machine learning; supervised; unsupervised; steps for doing ML; time-series analysis via ML example; split the data; train the estimator; test the model; make predictions; multiple regression via ML; California housing example; underfitting and overfitting

Jupyter notebooks: NYCTemps.ipynb   TimeSeries.ipynb   HomePrices.ipynb   NYCTempsML.ipynb   CaliforniaHousing.ipynb
Least squares module: LeastSquaresLine.py
Dataset: ave_hi_nyc_jan_1895-2018.csv
11 Apr 9 Video recording
Slides: scikit-learn machine learning algorithms; "Big Data"; supervised ML; k-nearest neighbor ML classification algorithm with the Digits dataset; training and testing the KNN model; confusion matrix; classification report; unsupervised ML; dimensionality reduction; TSNE estimator; k-means clustering algorithm with the Iris dataset; principal component analysis; Assignment #9

Jupyter notebooks: k-NearestNeighbors.ipynb   k-DimensionalityReduction.ipynb   k-MeansClustering.ipynb  
12 Apr 16 Video recording Password B6*2$OE?
Slides: Natural language processing; NLP examples; Assignment #10; linear algebra; vectors; vector arithmetic; normalize a vector; vectors and NumPy; matrices; matrix arithmetic; matrix inverse; matrices and NumPy; Hilbert matrices; graphic transformation matrix

Jupyter notebooks: nlp.zip   matrices.zip
13 Apr 23 Video recording Password 2W!=1^O!
Slides: Linear equations; home prices example; solve a system of linear equations; equivalent systems; graphical solution; consistent and inconsistent systems; ill-conditioned systems; augmented matrix solution; row echelon form; Gaussian elimination solution; matrix representation; calculate a matrix inverse; solution using an inverse; singular matrix and system; least-squares solution with matrices; QR factorization; linearly independent; orthonormal columns; upper-triangular; Gram-Schmidt Orthonormalization Process

Jupyter notebooks: LinearEquationPlots.ipynb    MatrixInverse.ipynb    HomePricesSolve.ipynb    LeastSquaresMatrix.ipynb    MultivariateMatrix.ipynb    QR.ipynb    Splines.ipynb
14 Apr 30 Video recording Password 0F%o5CoI
Slides: Nonlinear relationships; normal equations; polynomial regression; polynomial regression with matrices and QR factorization; Markov chain; Markov steady state

Jupyter notebooks: PolynomialRegression.py    PolynomiaRegression1.ipynb    PolynomiaRegression2.ipynb    PolynomiaRegression3.ipynb    PolynomiaRegression4.ipynb    PolynomialRegressionMatrices.ipynb    MarkovChain.ipynb
15 May 7 Video recording Password 4O+Tvg1N
Slides: Eigenvalues and eigenvectors; uses; geometric interpretations; normalized eigenvectors; compute eigenvalues; compute eigenvectors; numpy.linalg.eig() function; matrix powers

Jupyter notebooks: EigenFactorization.py    MatrixPowers.py    VectorPlots.ipynb    TestEigenFactorization1.ipyn    TestEigenFactorization2.ipyn    TestEigenFactorization3.ipyn    TimeEigen.ipynb    TestMatrixPower.ipynb
Fall 2019 final exam: Final-Fall2019.ipynb    Final-Solutions-Fall2019
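
As a small illustration of the Week 15 topics (the numpy.linalg.eig() function and matrix powers), here is a minimal sketch that raises a matrix to a power through its eigendecomposition and checks the result against NumPy's built-in routine. It is not taken from the course notebooks: the helper name matrix_power_via_eig and the example matrix are hypothetical, and the approach assumes the matrix is diagonalizable.

```python
import numpy as np

def matrix_power_via_eig(A, k):
    """Compute A**k as V @ diag(w**k) @ inv(V).

    Assumes A is square and diagonalizable (hypothetical helper,
    not part of the course materials).
    """
    w, V = np.linalg.eig(A)   # eigenvalues w; eigenvectors are the columns of V
    # V * w**k scales column j of V by w[j]**k, i.e. V @ diag(w**k)
    return (V * w**k) @ np.linalg.inv(V)

# Hypothetical example matrix with eigenvalues 1 and 3
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

direct = np.linalg.matrix_power(A, 5)    # repeated multiplication
via_eig = matrix_power_via_eig(A, 5)     # eigendecomposition route

print(direct)
print(np.allclose(direct, via_eig))
```

Once the eigendecomposition is known, only the eigenvalues need to be raised to the power, which is why this route is often used to reason about large powers of a matrix.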

Goals

Become familiar with the mathematical foundations for data analytics, including probability, statistics, and linear algebra. Obtain practical experience solving problems with the Python data analytics modules and functions.
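
To give a flavor of the kind of hands-on problem these goals describe, the sketch below simulates the Central Limit Theorem (a Week 6 topic) with NumPy. It is an illustrative example under stated assumptions, not one of the course notebooks: the simulated die-roll population, the sample sizes, and the seed are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=220)   # fixed seed (arbitrary) so reruns match

# Assumed population: 100,000 simulated rolls of a fair six-sided die
population = rng.integers(1, 7, size=100_000)

# Draw 2,000 random samples of size 50 (with replacement)
# and record the mean of each sample
sample_size = 50
samples = rng.choice(population, size=(2_000, sample_size))
sample_means = samples.mean(axis=1)

# Standard error predicted by the theorem: sigma / sqrt(n)
standard_error = population.std() / np.sqrt(sample_size)

print(f"population mean:      {population.mean():.3f}")
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"predicted std error:  {standard_error:.3f}")
print(f"std of sample means:  {sample_means.std():.3f}")
```

The mean of the sample means should land close to the population mean, and the spread of the sample means should be close to the predicted standard error, even though individual die rolls are uniformly (not normally) distributed.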

Course Learning Outcomes (CLO)

Prerequisites

Instructor consent.

Required text

Introduction to Statistics & Data Analysis, 6th edition
Roxy Peck, Tom Short, and Chris Olsen
Cengage, 2019
ISBN 978-1-337-79361-2

Recommended books

You can find most of what you’ll need online, including sample tutorials. But if you like to read textbooks, here are some suggestions.
Each of these books covers many more topics, in much greater depth,
than this course will.

Linear Algebra: A Modern Introduction, 4th edition
David Poole
Cengage Learning, 2015
ISBN 978-1-285-46324-7

Linear Algebra (Schaum’s Outline), 6th edition
Seymour Lipschutz and Marc Lipson
McGraw-Hill, 2018
ISBN 978-1-260-01144-9

Data Science from Scratch: First Principles with Python, 2nd edition
Joel Grus
O’Reilly, 2019
ISBN 978-1-492-04113-9

Python Data Science Handbook
Jake VanderPlas
O’Reilly, 2017
ISBN 978-1-491-91205-8