{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "###
San Jose State University
Department of Applied Data Science
\n", "#
DATA 220
Mathematical Methods for Data Analysis
\n", "###
Spring 2021
Instructor: Ron Mak
\n", "#
Assignment #10
Problem Set
SOLUTIONS
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some useful functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def calculate_means_and_sums(x_values, y_values):\n", " \"\"\"\n", " Calculate means and sums.\n", " @param x_values the independent x values.\n", " @param y_values the dependent y values.\n", " @return a list of n, sum_x, sum_y, sum_xx, sum_yy, and sum_xy.\n", " \"\"\"\n", " x = np.array(x_values)\n", " y = np.array(y_values)\n", " \n", " n = len(x)\n", " mean_x = np.mean(x)\n", " mean_y = np.mean(y)\n", " \n", " sum_x = np.sum(x)\n", " sum_y = np.sum(y)\n", " sum_xx = np.sum(x*x)\n", " sum_yy = np.sum(y*y)\n", " sum_xy = np.sum(x*y)\n", "\n", " return n, mean_x, mean_y, sum_x, sum_y, sum_xx, sum_yy, sum_xy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def calculate_slope_intercept(x_values, y_values):\n", " \"\"\"\n", " Calculate the slope and intercept of a regression line.\n", " @param x_values the independent x values.\n", " @param y_values the dependent y values.\n", " @return a list of the slope and y-intercept of the line.\n", " \"\"\"\n", " n, mean_x, mean_y, sum_x, sum_y, sum_xx, sum_yy, sum_xy = \\\n", " calculate_means_and_sums(x_values, y_values)\n", "\n", " numerator = sum_xy - n*mean_x*mean_y\n", " denominator = sum_xx - n*mean_x*mean_x\n", "\n", " m = numerator/denominator\n", " b = (mean_y - m*mean_x)\n", " \n", " return m, b # slope and intercept" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "def show_least_squares_line(title, x_label, y_label, \n", " x_values, y_values):\n", " \"\"\"\n", " @param title the chart title.\n", " @param x_label the x-axis label.\n", " @param y_label the y-axis label.\n", " @param x_values the independent x values to plot.\n", " @param y_values the dependent y values to plot.\n", " \"\"\"\n", " # First show the scatter plot.\n", " plt.scatter(x_values, y_values)\n", " \n", " # Now show the least squares line.\n", " m, b = calculate_slope_intercept(x_values, y_values)\n", " reg_line = [m*x + b for x in x_values] # regression line\n", " plt.plot(x_values, reg_line, color='red')\n", "\n", " plt.title(f'{title}, m = {m:.2f}, b = {b:.2f}')\n", " plt.xlabel(x_label)\n", " plt.ylabel(y_label)\n", "\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def calculate_y_variations(x_values, y_values):\n", " \"\"\"\n", " Calculate the variations of the y values.\n", " @param x_values the independent x values.\n", " @param y_values the dependent y values.\n", " @return a list of the residual and total sums of squares.\n", " \"\"\"\n", " n, mean_x, mean_y, sum_x, sum_y, sum_xx, sum_yy, sum_xy = \\\n", " calculate_means_and_sums(x_values, y_values)\n", "\n", " numerator = sum_xy - n*mean_x*mean_y\n", " denominator = sum_xx - n*mean_x*mean_x\n", " \n", " m = numerator/denominator\n", " b = (mean_y - m*mean_x)\n", " \n", " ss_residual = sum_yy - b*sum_y - m*sum_xy\n", " ss_total = sum_yy - (sum_y*sum_y)/n\n", " \n", " return ss_residual, ss_total" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def calculate_r2(ss_residual, ss_total):\n", " \"\"\"\n", " Calculate the coefficient of determination r2.\n", " @param ss_residual the residual 
sum of squares.\n", " @param ss_total the total sum of squares.\n", " @return the coefficient of determination.\n", " \"\"\"\n", " cofd = 1 - ss_residual/ss_total\n", " return cofd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "def calculate_r(x_values, y_values):\n", " \"\"\"\n", " Calculate the correlation coefficient r.\n", " @param x_values the independent x values.\n", " @param y_values the dependent y values.\n", " @return correlation coefficient.\n", " \"\"\"\n", " n, mean_x, mean_y, sum_x, sum_y, sum_xx, sum_yy, sum_xy = \\\n", " calculate_means_and_sums(x_values, y_values)\n", " \n", " numerator = (sum_xy - (sum_x*sum_y)/n)\n", " denominator = math.sqrt(sum_xx - (sum_x*sum_x)/n) * math.sqrt(sum_yy - (sum_y*sum_y/n))\n", " \n", " r = numerator/denominator\n", " return r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### PROBLEM 1. A news article discusses the number of promises made by political candidates and the number of promises kept after they are elected. The table below shows the promises made by the candidates X and the promises Y each kept. Create a scatter plot with a linear regression line. Calculate the correlation coefficient. What is the relationship between promises made and promised kept?\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Promises made by candidates, X20303040505060
Promises kept, Y7654321
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "made = [20, 30, 30, 40, 50, 50, 60] # promises made\n", "kept = [ 7, 6, 5, 4, 3, 2, 1] # promises kept\n", "\n", "show_least_squares_line('Promises Made vs. Promises Kept', \n", " 'Promises Made', 'Promises Kept', made, kept)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f'correlation coefficient r = {calculate_r(made, kept):.3f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a very strong negative correlation. It appears that the politicians who make the most promises keep the fewest promises. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### PROBLEM 2. Farmer Brown and Farmer Green each plant 100 acres of corn under similar conditions. Farmer Brown's yield was 84 bushels per acre with a standard deviation of 5 bushels. Farmer Green's yield was 80 bushels per acre with a standard deviation of 6 bushels. What is the 90% confidence interval for the mean difference in yield between the two farms?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're comparing two population means with two samples, each of size 100 acres. Let Farmer Brown have farm #1 and Farmer Green have farm #2. The Central Limit Theorem applies, and we use the test statistic whose formula is given in the March 18 lecture, slide #6. The critical value for z at the 90% confidence level is 1.645." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "import scipy.stats as stats\n", "\n", "# Farmer Brown\n", "n1 = 100\n", "x1_bar = 84\n", "x1_sigma = 5\n", "\n", "# Farmer Green\n", "n2 = 100\n", "x2_bar = 80\n", "x2_sigma = 6\n", "\n", "level = 90 # confidence level\n", "z_cv = -stats.norm.ppf((1 - level/100)/2) # z critical value\n", "\n", "numerator = x1_bar - x2_bar\n", "denominator = math.sqrt((x1_sigma*x1_sigma)/n1 + (x2_sigma*x2_sigma)/n2)\n", "z = numerator/denominator\n", "\n", "ci_lo = -z_cv*denominator + numerator\n", "ci_hi = z_cv*denominator + numerator\n", "\n", "print(f'The {level}% confidence interval for the mean difference ' +\n", " f'in yields is ({ci_lo:.3f}, {ci_hi:.3f})')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### PROBLEM 3. SJSU bought 36 bulbs for its classroom projectors. The manufacturer claimed that the bulbs should each last more than 800 hours. After using the bulbs, the university discovered that their mean lifetime was 816 hours with a standard deviation of 70 hours. At the 5% level of significance, should the university accept the manufacturer's claim?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let the null hypothesis H0: µ = 800 hours and the alternative hypothesis Ha: µ > 800 hours. This is therefore an upper-tailed test and the z critical value is 1.645." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# H0: µ = 800\n", "# Ha: µ > 800\n", "\n", "import math\n", "import scipy.stats as stats\n", "\n", "mu = 800\n", "n = 36\n", "x_bar = 816\n", "x_sigma = 70\n", "std_err = x_sigma/math.sqrt(n)\n", "\n", "print(f' n = {n}')\n", "print(f' x_bar = {x_bar:.2f}')\n", "print(f' x_sigma = {x_sigma:.2f}')\n", "print(f' std_err = {std_err:.2f}')\n", "\n", "level = 0.05\n", "z_cv = -stats.norm.ppf(level) # critical value for an upper-tailed test\n", "z = (x_bar - mu)/std_err # test statistic\n", "\n", "print()\n", "print(f' level = {level:.2f}')\n", "print(f'z critical value = {z_cv:.2f}')\n", "print(f' z = {z:.2f}')\n", "\n", "print()\n", "print(f'Reject H0 if z > {z_cv:.2f}')\n", "\n", "reject = z > z_cv\n", "\n", "print('Therefore, ', end = '')\n", "if reject:\n", " print('reject H0: µ = 800.')\n", "else:\n", " print('do not reject H0: We cannot accept the claim that µ > 800.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### PROBLEM 4. A widget making machine is supposed to make widgets that are 0.05 inches thick. Ten widgets are tested each day to ensure the machine is properly calibrated. Today's test averaged 0.053 inches thick with a standard deviation of 0.003. At the 5% level of significance, is the machine properly calibrated?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let the null hypothesis H0: µ = 0.05 inches and the alternative hypothesis Ha: µ ≠ 0.05 inches for a two-tailed test. Ten is a small sample size, so we must use Student's t. At the 5% level of significance and 9 degrees of freedom, the t critical value is 2.26, which we can look up in a table or use the scipy.stats.t.ppf function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# H0: µ = 0.05\n", "# Ha: µ ≠ 0.05\n", "\n", "import math\n", "import scipy.stats as stats\n", "\n", "n = 10\n", "mu = 0.05\n", "x_bar = 0.053\n", "x_sigma = 0.003\n", "std_err = x_sigma/math.sqrt(n)\n", "\n", "print(f' n = {n}')\n", "print(f' x_bar = {x_bar:.4f}')\n", "print(f' x_sigma = {x_sigma:.4f}')\n", "print(f' std_err = {std_err:.4f}')\n", "\n", "level = 0.05 # level of significance\n", "df = n - 1 # degrees of freedom\n", "\n", "t_cv = -stats.t.ppf(level/2, df) # t critical value for a two-tailed test\n", "t = (x_bar - mu)/std_err # test statistic\n", "\n", "print()\n", "print(f' level = {level:.2f}')\n", "print(f't critical value = {t_cv:.2f}')\n", "print(f' t = {t:.2f}')\n", "\n", "print()\n", "print(f'Reject H0 if (t < {-t_cv:.2f}) or (t > {t_cv:.2f})')\n", "\n", "reject = (t < -t_cv) or (t > t_cv)\n", "\n", "print('Therefore, ', end = '')\n", "if reject:\n", " print('reject H0: The machine needs to be recalibrated.')\n", "else:\n", " print('do not reject H0: We cannot conclude that the machine is not properly calibrated.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### PROBLEM 5. Farmer Brown's cows have averaged 380 pounds in weight. He selects 50 of his cows to put on a new diet to make them gain weight. After being on the new diet, the selected cows weigh an average of 390 pounds with a standard deviation of 35.2 pounds. What are the null and alternative hypotheses? What is the lowest level of significance at which we can reject the null hypothesis?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an upper-tailed test. We need the area under the curve to the right of the z test statistic. We can use the cumulative distribution function math.stats.norm.cdf (which gives us the area to the left of the test statistic) and take 1 minus the returned value. The resulting area to the right is the P-value. If we then choose any level of significance α smaller than this P-value, we cannot reject H0 at that level." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# H0: µ = 380\n", "# Ha: µ > 380\n", "\n", "import math\n", "import scipy.stats as stats\n", "from scipy.stats import norm\n", "\n", "mu = 380\n", "n = 50\n", "x_bar = 390\n", "x_sigma = 35.2\n", "std_err = x_sigma/math.sqrt(n)\n", "\n", "print(f' n = {n}')\n", "print(f' x_bar = {x_bar:.2f}')\n", "print(f'x_sigma = {x_sigma:.2f}')\n", "print(f'std_err = {std_err:.2f}')\n", "\n", "z = (x_bar - mu)/std_err # test statistic\n", "p_value = 1 - norm.cdf(z) # upper-tailed test\n", "\n", "print()\n", "print(f' z = {z:.2f}')\n", "print(f'P-value = {p_value:.4f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### PROBLEM 6. Five friends try a new health club to try to lose weight. They record how many pounds each lost and the number of weeks it took. Assume there is a linear relationship between the pounds lost and the number of weeks. What is the slope of the regression line? What is its slope? Create a scatter graph of the data and the regression line.\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Friend #:12345
Weeks:32145
Pounds:654911
" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "This is a time series. The number of weeks is the independent variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "weeks = [3, 2, 1, 4, 5]\n", "pounds = [6, 5, 4, 9, 11]\n", "\n", "show_least_squares_line('Pounds Lost vs. Weeks', 'Weeks on Diet', 'Pounds Lost', \n", " weeks, pounds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### PROBLEM 7. Use the data from Problem 6. Calculate and print:\n", "- the variation in the pounds lost that cannot be explained by the weeks on the diet\n", "- the total variation in the pounds lost\n", "- the percentage of variation that is explained by the weeks on the diet\n", "- the correlation coefficient " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variation in the pounds lost that cannot be explained by the weeks on the diet is the residual sum of squares (deviations from the regression line, or predicted weekly pounds lost). The total variation in pounds lost is the total sum of squares (deviations from the average pounds lost). The coefficient of determination r2 is the proportion of variation in y that can be attributed to the linear relationship between weeks and pounds lost. The correlation coefficient r is the square root of r2." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "ss_residual, ss_total = calculate_y_variations(weeks, pounds)\n", "\n", "print(f'unexplained variation = ss_residual = {ss_residual:8.5f}')\n", "print(f' total variation = ss_total = {ss_total:8.5f}')\n", "\n", "r2 = calculate_r2(ss_residual, ss_total)\n", "r = calculate_r(weeks, pounds)\n", "\n", "print()\n", "print(f'percentage of variation that is explained by the weeks on the diet = {r2:.1%}')\n", "print(f'correlation coefficient r = {r:.3f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### PROBLEM 8. The table below shows annual snowfall. Create a time series graph and plot the linear trend.\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| Year | Snowfall (feet) |\n", "| --- | --- |\n", "| 1993 | 9.9 |\n", "| 1994 | 22.2 |\n", "| 1995 | 11.4 |\n", "| 1996 | 14.8 |\n", "| 1997 | 19.7 |\n", "| 1998 | 14.9 |\n", "| 1999 | 15.9 |\n", "| 2000 | 13.4 |\n", "| 2001 | 12.0 |\n", "| 2002 | 7.9 |\n", "| 2003 | 12.9 |\n", "| 2004 | 16.8 |\n", "| 2005 | 11.6 |\n", "| 2006 | 14.9 |\n", "| 2007 | 13.3 |\n", "| 2008 | 20.2 |\n", "| 2009 | 14.3 |\n", "| 2010 | 20.4 |\n", "| 2011 | 13.0 |\n", "| 2012 | 10.0 |\n", "| 2013 | 15.9 |\n", "| 2014 | 17.4 |\n", "| 2015 | 18.7 |\n", "| 2016 | 14.3 |\n", "| 2017 | 16.2 |\n", "| 2018 | 21.1 |
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the time-series functions given in class for trend, moving averages, and exponential smoothing. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import statistics as stat\n", "import matplotlib.pyplot as plt\n", "\n", "def show_time_lines(title, x_label, y_label, x_values, y_values, \n", " regression='yes', mv_amt_1=0, mv_amt_2=0, \n", " alpha_1=0, alpha_2=0, \n", " size=()):\n", " \"\"\"\n", " @param title the chart title.\n", " @param x_label the x-axis label.\n", " @param y_label the y-axis label.\n", " @param x_values the independent x values to plot.\n", " @param y_values the dependent y values to plot.\n", " \"\"\"\n", " if size != ():\n", " figure = plt.figure(figsize=size)\n", "\n", " last = []\n", " plt.xticks(x_values)\n", " \n", " # First show the data line.\n", " plt.plot(x_values, y_values, linewidth=5)\n", " \n", " last_1 = []\n", " last_2 = []\n", " linear_line = []\n", " mv_avg_line_1 = []\n", " mv_avg_line_2 = []\n", " smooth_line_1 = []\n", " smooth_line_2 = []\n", "\n", " if regression == 'yes':\n", " m, b = calculate_slope_intercept(x_values, y_values)\n", " \n", " # Calculate the least squares trend line, \n", " # the simple moving average line,\n", " # and the exponentially smoothed line.\n", " for i in range(len(x_values)):\n", " x = x_values[i]\n", " y = y_values[i]\n", " \n", " # Linear line\n", " if regression == 'yes': \n", " y_hat = m*x + b\n", " linear_line.append(y_hat)\n", " \n", " # Moving average line #1\n", " if mv_amt_1 != 0:\n", " last_1.append(y)\n", " \n", " if len(last_1) == mv_amt_1:\n", " mv_avg_line_1.append(stat.mean(last_1))\n", " last_1.pop(0)\n", " \n", " # Moving average line #2\n", " if mv_amt_2 != 0:\n", " last_2.append(y)\n", " \n", " if len(last_2) == mv_amt_2:\n", " mv_avg_line_2.append(stat.mean(last_2))\n", " last_2.pop(0)\n", "\n", " # Exponentially smoothed line #1\n", " if alpha_1 != 0:\n", " if i == 0:\n", " y_smooth_1 = y\n", " else:\n", " y_smooth_1 = alpha_1*y + (1 - alpha_1)*y_smooth_1\n", " \n", " smooth_line_1.append(y_smooth_1)\n", " \n", " # Exponentially smoothed line #2\n", " if alpha_2 != 0:\n", " if i == 0:\n", " y_smooth_2 = y\n", " else:\n", " y_smooth_2 = alpha_2*y + (1 - alpha_2)*y_smooth_2\n", " \n", " smooth_line_2.append(y_smooth_2)\n", "\n", " if regression == 'yes':\n", " plt.plot(xs, linear_line, color='red')\n", " \n", " if mv_amt_1 != 0:\n", " plt.plot(xs[mv_amt_1 - 1:], mv_avg_line_1, color='lightgreen')\n", " \n", " if mv_amt_2 != 0:\n", " plt.plot(xs[mv_amt_2 - 1:], mv_avg_line_2, color='green')\n", " \n", " if alpha_1 != 0:\n", " plt.plot(xs, smooth_line_1, color='gray', linestyle='--')\n", " \n", " if alpha_2 != 0:\n", " plt.plot(xs, smooth_line_2, color='black', linestyle='--')\n", "\n", " plt.title(title)\n", " plt.xlabel(x_label)\n", " plt.ylabel(y_label)\n", "\n", " plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "xs = list(range(1993, 2019))\n", "ys = [9.9, 22.2, 11.4, 14.8, 19.7, 14.9, 15.9, 13.4, 12, 7.9, 12.9, \n", " 16.8, 11.6, 14.9, 13.3, 20.2, 14.3, 20.4, 13, 10, 15.9, 17.4, \n", " 18.7, 14.3, 16.2, 21.1]\n", "\n", "show_time_lines('Annual Snowfall', 'Years', 'Feet of Snow', xs, ys,\n", " size=(15, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### PROBLEM 9. Using the table from Problem 8, plot the 3- and 5-year moving averages." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Higher numbers of days in moving averages should dampen data fluctuations more. Light green is 3 days and dark green is 5 days." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "show_time_lines('Annual Snowfall', 'Years', 'Feet of Snow', xs, ys,\n", " size=(15, 10), regression='no',\n", " mv_amt_1=3, mv_amt_2=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "#### PROBLEM 10. Using the table from Problem 8, exponentially smooth the data using alpha values 0.3 and 0.1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lower alpha values for exponential smoothing should smooth data fluctuations more. Light gray is alpha 0.3 and dark gray is alpha 0.1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "show_time_lines('Annual Snowfall', 'Years', 'Feet of Snow', xs, ys,\n", " size=(15, 10), regression='no',\n", " alpha_1=0.3, alpha_2=0.1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }