{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis of the Titanic Survival Dataset \n", "### Load the dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "titanic = pd.read_csv('TitanicSurvival.csv') \n", "titanic.columns = ['name', 'survived', 'sex', 'age', 'passengerClass']\n", "\n", "pd.set_option('display.precision', 2) # format for floating-point values\n", "titanic.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titanic.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Age counts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "passenger_count = len(titanic.age)\n", "\n", "# Good ages are the non-missing (non-NaN) values.\n", "good_ages = [age for age in titanic.age if not math.isnan(age)]\n", "good_ages_count = len(good_ages)\n", "\n", "mean_age = sum(good_ages)/good_ages_count\n", "\n", "print(f'count of all ages = {passenger_count}')\n", "print(f'count of good ages = {good_ages_count}')\n", "print(f' mean of good ages = {mean_age:.2f}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sorted_good_ages = sorted(good_ages)\n", "print(f'len(sorted(good_ages)) = {len(sorted_good_ages)}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "min_age = sorted_good_ages[0]\n", "max_age = sorted_good_ages[-1]\n", "\n", "print(f'min of ages = {min_age:.2f}')\n", "print(f'max of ages = {max_age:.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Descriptive age statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titanic.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "titanic.hist() # passenger age only" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "## Age mean, median, variance, and standard deviation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use the statistics module" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import statistics as stat\n", "\n", "print(f'stat.mean = {stat.mean(good_ages):.2f}')\n", "print(f'stat.median = {stat.median(good_ages):.2f}')\n", "print(f'stat.pvariance = {stat.pvariance(good_ages):.2f}')\n", "print(f'stat.pstdev = {stat.pstdev(good_ages):.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compute from the definitions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mid = good_ages_count//2\n", "\n", "# Odd count: take the middle value; even count: average the two middle values.\n", "if good_ages_count%2 == 1:\n", "    median_age = sorted_good_ages[mid]\n", "else:\n", "    median_age = (sorted_good_ages[mid] + sorted_good_ages[mid - 1])/2\n", "\n", "print(f'median of good ages = {median_age:.2f}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sum_of_squares = 0\n", "\n", "for age in good_ages:\n", "    sum_of_squares += (age - mean_age)**2\n", "\n", "# Population variance (divide by n), matching stat.pvariance above.\n", "variance = sum_of_squares/good_ages_count\n", "stdev = math.sqrt(variance)\n", "\n", "print(f'sum sqrs of ages = {sum_of_squares:.2f}')\n", "print(f'variance of ages = {variance:.2f}')\n", "print(f' std dev of ages = {stdev:.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ages in each passenger class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ziplist = list(zip(titanic.age, titanic.passengerClass))\n", "\n", "for i in range(10):\n", "    print(f'{i:5}: {ziplist[i]}')\n", "\n", "print(' ...')\n", "\n", "for i in range(len(ziplist) - 10, len(ziplist)):\n", "    print(f'{i:5}: {ziplist[i]}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def 
ages_in_class(klass):\n", "    \"\"\"\n", "    Return the list of ages of passengers in a given class.\n", "    @param klass the given class.\n", "    @return the list of ages.\n", "    \"\"\"\n", "    # Count only good ages.\n", "    return [age for age, kls in zip(titanic.age, titanic.passengerClass)\n", "            if not math.isnan(age) and (kls == klass)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ages_1st_class = ages_in_class('1st')\n", "ages_2nd_class = ages_in_class('2nd')\n", "ages_3rd_class = ages_in_class('3rd')\n", "\n", "print(f'Found {len(ages_1st_class)} good ages in 1st class')\n", "print(f'Found {len(ages_2nd_class)} good ages in 2nd class')\n", "print(f'Found {len(ages_3rd_class)} good ages in 3rd class')\n", "\n", "total_ages = len(ages_1st_class) + len(ages_2nd_class) + len(ages_3rd_class)\n", "\n", "print()\n", "print(f'{total_ages} total ages found')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Survival rate of each passenger class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def status_in_class(klass, status):\n", "    \"\"\"\n", "    Compute the number of passengers in a given passenger class\n", "    and how many of them survived or perished.\n", "    @param klass the given class.\n", "    @param status either 'yes' for survived, or 'no' if perished\n", "    @return a tuple containing the number of passengers\n", "            and the number who survived or perished\n", "    \"\"\"\n", "\n", "    class_count = 0\n", "    status_count = 0\n", "\n", "    for srv, kls in zip(titanic.survived, titanic.passengerClass):\n", "        if kls == klass: # the class that we want?\n", "            class_count += 1\n", "\n", "            if srv == status: # survived or perished in that class?\n", "                status_count += 1\n", "\n", "    return (class_count, status_count)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "count_1st, survived_1st = status_in_class('1st', 'yes')\n", 
"count_2nd, survived_2nd = status_in_class('2nd', 'yes')\n", "count_3rd, survived_3rd = status_in_class('3rd', 'yes')\n", "\n", "# We already have the counts in each class.\n", "_, perished_1st = status_in_class('1st', 'no')\n", "_, perished_2nd = status_in_class('2nd', 'no')\n", "_, perished_3rd = status_in_class('3rd', 'no')\n", "\n", "total_passengers = len(titanic.survived)\n", "total_survived = survived_1st + survived_2nd + survived_3rd\n", "total_perished = perished_1st + perished_2nd + perished_3rd\n", "\n", "print(f'Out of {total_passengers} total passengers, ' +\n", " f'{total_survived} survived = {int(100*total_survived/total_passengers)}% ' +\n", " f'and {total_perished} perished = {int(100*total_perished/total_passengers)}%')\n", "print()\n", "\n", "pct_survived_1st = int(100*survived_1st/count_1st)\n", "pct_survived_2nd = int(100*survived_2nd/count_2nd)\n", "pct_survived_3rd = int(100*survived_3rd/count_3rd)\n", "\n", "print(f'Out of {count_1st} passengers in 1st class, ' +\n", " f'{survived_1st} survived = {pct_survived_1st}%')\n", "print(f'Out of {count_2nd} passengers in 2nd class, ' +\n", " f'{survived_2nd} survived = {pct_survived_2nd}%')\n", "print(f'Out of {count_3rd} passengers in 3rd class, ' +\n", " f'{survived_3rd} survived = {pct_survived_3rd}%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Charts: Passengers in each class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Custom pie chart code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "print('Proportion of passengers by class')\n", "\n", "df_count_pie = pd.DataFrame({'Class' : ['1st', '2nd', '3rd'], \n", " 'Counts' : [count_1st, count_2nd, count_3rd]})\n", "df_count_pie.Counts.groupby(df_count_pie.Class).sum().plot(kind='pie')\n", "plt.axis('equal')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 
Pie chart function" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def show_pie_chart(title, wedge_labels, wedge_values):\n", "    \"\"\"\n", "    Display a pie chart.\n", "    @param title the chart title.\n", "    @param wedge_labels the labels for the wedges\n", "    @param wedge_values the values for the wedges\n", "    \"\"\"\n", "\n", "    print(title)\n", "\n", "    df_count_pie = pd.DataFrame({'Class' : wedge_labels,\n", "                                 'Counts' : wedge_values})\n", "    df_count_pie.Counts.groupby(df_count_pie.Class).sum().plot(kind='pie')\n", "    plt.axis('equal')\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "classes = ['1st', '2nd', '3rd']\n", "counts = [count_1st, count_2nd, count_3rd]\n", "\n", "show_pie_chart('Proportion of passengers by class', classes, counts)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Custom bar chart code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "classes = ['1st', '2nd', '3rd']\n", "counts = [count_1st, count_2nd, count_3rd]\n", "\n", "sns.set_style('whitegrid') # white background with gray grid lines\n", "axes = sns.barplot(x=classes, y=counts, palette='bright')\n", "axes.set_title('Count of passengers in each class')\n", "axes.set(xlabel='Passenger Class', ylabel='Count')\n", "\n", "# Scale the y-axis by 10% to make room for text above the bars.\n", "axes.set_ylim(top=1.10*max(counts))\n", "\n", "# Display the count above each patch (bar).\n", "for bar, count in zip(axes.patches, counts):\n", "    text_x = bar.get_x() + bar.get_width()/2\n", "    text_y = bar.get_height()\n", "    text = f'{count}'\n", "    axes.text(text_x, text_y, text,\n", "              fontsize=11, ha='center', va='bottom')\n", "\n", "plt.show() # display the chart" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Bar chart function" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "def show_bar_chart(title, x_labels, y_label, x_values, y_values, bar_topper):\n", "    \"\"\"\n", "    Display a bar chart.\n", "    @param title the chart title.\n", "    @param x_labels the label for the x axis\n", "    @param y_label the label for the y axis\n", "    @param x_values the x values to plot\n", "    @param y_values the y values to plot\n", "    @param bar_topper the text above each bar\n", "    \"\"\"\n", "\n", "    sns.set_style('whitegrid') # white background with gray grid lines\n", "    axes = sns.barplot(x=x_values, y=y_values, palette='bright')\n", "    axes.set_title(title)\n", "    axes.set(xlabel=x_labels, ylabel=y_label)\n", "\n", "    # Scale the y-axis by 10% to make room for text above the bars.\n", "    axes.set_ylim(top=1.10*max(y_values))\n", "\n", "    # Display the topper value above each patch (bar).\n", "    for bar, topper in zip(axes.patches, bar_topper):\n", "        text_x = bar.get_x() + bar.get_width() / 2.0\n", "        text_y = bar.get_height()\n", "        axes.text(text_x, text_y, topper,\n", "                  fontsize=11, ha='center', va='bottom')\n", "\n", "    plt.show() # display the chart" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "classes = ['1st', '2nd', '3rd']\n", "counts = [count_1st, count_2nd, count_3rd]\n", "toppers = [f'{count_1st}', f'{count_2nd}', f'{count_3rd}']\n", "\n", "show_bar_chart('Count of passengers in each class',\n", "               'Passenger Class', 'Count',\n", "               classes, counts, toppers)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "total_passengers = len(titanic.name)\n", "pct_total_1st = 100*count_1st/total_passengers\n", "pct_total_2nd = 100*count_2nd/total_passengers\n", "pct_total_3rd = 100*count_3rd/total_passengers\n", "\n", "classes = ['1st', '2nd', '3rd']\n", "pcts = [int(pct_total_1st), int(pct_total_2nd), int(pct_total_3rd)]\n", "toppers = [f'{pcts[0]}%', f'{pcts[1]}%', 
f'{pcts[2]}%']\n", "\n", "show_bar_chart('Percentage of passengers by class',\n", "               'Passenger Class', 'Percentage',\n", "               classes, pcts, toppers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Charts: Survival rates in each passenger class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prop_1st = survived_1st/total_survived\n", "prop_2nd = survived_2nd/total_survived\n", "prop_3rd = survived_3rd/total_survived\n", "\n", "classes = ['1st', '2nd', '3rd']\n", "proportions = [prop_1st, prop_2nd, prop_3rd]\n", "toppers = [f'{prop_1st:.2f}', f'{prop_2nd:.2f}', f'{prop_3rd:.2f}']\n", "\n", "show_bar_chart('Proportion of survivors by class',\n", "               'Passenger Class', 'Proportion',\n", "               classes, proportions, toppers)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prop_1st = survived_1st/total_survived\n", "prop_2nd = survived_2nd/total_survived\n", "prop_3rd = survived_3rd/total_survived\n", "\n", "classes = ['1st', '2nd', '3rd']\n", "proportions = [prop_1st, prop_2nd, prop_3rd]\n", "\n", "show_pie_chart('Proportion of survivors by class', classes, proportions)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "classes = ['1st', '2nd', '3rd']\n", "pcts = [int(pct_survived_1st), int(pct_survived_2nd), int(pct_survived_3rd)]\n", "toppers = [f'{pcts[0]}%', f'{pcts[1]}%', f'{pcts[2]}%']\n", "\n", "show_bar_chart('Percentage of survivors in each class',\n", "               'Passenger Class', 'Percentage',\n", "               classes, pcts, toppers)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Age quartiles and interquartile range (IQR) by passenger class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def age_iqr_for_class(klass):\n", "    \"\"\"\n", "    Compute the count, min, max, 1st, 2nd, and 3rd quartiles,\n", "    and the interquartile range 
for ages in a given class.\n", "    @param klass the given class.\n", "    @return a tuple containing count, min, max, 1st, 2nd, and 3rd quartiles,\n", "            and the IQR.\n", "    \"\"\"\n", "\n", "    ages = ages_in_class(klass)\n", "    count = len(ages)\n", "    min_age = min(ages)\n", "    max_age = max(ages)\n", "\n", "    q1 = np.percentile(ages, 25)\n", "    q2 = np.percentile(ages, 50)\n", "    q3 = np.percentile(ages, 75)\n", "    iqr = q3 - q1\n", "\n", "    return (count, min_age, max_age, q1, q2, q3, iqr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1st class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "count_1st, min_1st, max_1st, q1_1st, q2_1st, q3_1st, iqr_1st = age_iqr_for_class('1st')\n", "\n", "print(f'count_1st = {count_1st}')\n", "print(f'min_1st = {min_1st:.2f}')\n", "print(f'max_1st = {max_1st}')\n", "print(f'q1_1st = {q1_1st}')\n", "print(f'q2_1st = {q2_1st}')\n", "print(f'q3_1st = {q3_1st}')\n", "print(f'iqr_1st = {iqr_1st}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2nd class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "count_2nd, min_2nd, max_2nd, q1_2nd, q2_2nd, q3_2nd, iqr_2nd = age_iqr_for_class('2nd')\n", "\n", "print(f'count_2nd = {count_2nd}')\n", "print(f'min_2nd = {min_2nd:.2f}')\n", "print(f'max_2nd = {max_2nd}')\n", "print(f'q1_2nd = {q1_2nd}')\n", "print(f'q2_2nd = {q2_2nd}')\n", "print(f'q3_2nd = {q3_2nd}')\n", "print(f'iqr_2nd = {iqr_2nd}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3rd class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "count_3rd, min_3rd, max_3rd, q1_3rd, q2_3rd, q3_3rd, iqr_3rd = age_iqr_for_class('3rd')\n", "\n", "print(f'count_3rd = {count_3rd}')\n", "print(f'min_3rd = {min_3rd:.2f}')\n", "print(f'max_3rd = {max_3rd}')\n", "print(f'q1_3rd = {q1_3rd}')\n", "print(f'q2_3rd = {q2_3rd}')\n", "print(f'q3_3rd = 
{q3_3rd}')\n", "print(f'iqr_3rd = {iqr_3rd}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Box chart of age ranges within each passenger class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import statistics\n", "\n", "print(f'min(ages_1st_class) = {min(ages_1st_class):.1f}')\n", "print(f'max(ages_1st_class) = {max(ages_1st_class):.1f}')\n", "print()\n", "print(f'min(ages_2nd_class) = {min(ages_2nd_class):.1f}')\n", "print(f'max(ages_2nd_class) = {max(ages_2nd_class):.1f}')\n", "print()\n", "print(f'min(ages_3rd_class) = {min(ages_3rd_class):.1f}')\n", "print(f'max(ages_3rd_class) = {max(ages_3rd_class):.1f}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "\n", "print('Box chart of age ranges by class')\n", "boxplot = sns.boxplot(y='age', x='passengerClass', data=titanic)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Explanation of box charts\n", "\n", "![Boxplot](boxplot.png)\n", "\n", "https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"An outlier is an observation that is numerically distant from the rest of the data. When reviewing a boxplot, an outlier is defined as a data point that is located outside the fences (“whiskers”) of the boxplot (e.g: outside 1.5 times the interquartile range above the upper quartile and below the lower quartile).\"\n", "\n", "https://www.r-statistics.com/2011/01/how-to-label-all-the-outliers-in-a-boxplot/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Insightful conclusions? What decisions to make?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Further analyses: survival rates in each passenger class by age and by gender, etc." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }