{ "cells": [ { "cell_type": "markdown", "id": "30f923ef-bc73-4a5b-9359-20ef6e734bb3", "metadata": {}, "source": [ "# Replace missing ages\n", "#### We analyze the passenger survival data from the sinking of the Titanic to discover how survival, sex, age, and passenger class are correlated. Was it \"women and children first\" into the lifeboats? The CVS file `TitanicSurvivors.csv` contains one record per passenger, and each passenger has a name (maximum 31 characters) and three attributes: survived (yes or no), sex (male or female), age, and passenger class (1st, 2nd, or 3rd). The data is sorted first by passenger class and then by name. The first few lines of the file:\n", "```\n", "\"\",\"survived\",\"sex\",\"age\",\"passengerClass\"\n", "\"Allen, Miss. Elisabeth Walton\",\"yes\",\"female\",29,\"1st\"\n", "\"Allison, Master. Hudson Trevor\",\"yes\",\"male\",0.916700006,\"1st\"\n", "\"Allison, Miss. Helen Loraine\",\"no\",\"female\",2,\"1st\"\n", "\"Allison, Mr. Hudson Joshua Crei\",\"no\",\"male\",30,\"1st\"\n", "\"Allison, Mrs. Hudson J C (Bessi\",\"no\",\"female\",25,\"1st\"\n", "\"Anderson, Mr. Harry\",\"yes\",\"male\",48,\"1st\"\n", "\"Andrews, Miss. Kornelia Theodos\",\"yes\",\"female\",63,\"1st\"\n", "\"Andrews, Mr. Thomas Jr\",\"no\",\"male\",39,\"1st\"\n", "\"Appleton, Mrs. Edward Dale (Cha\",\"yes\",\"female\",53,\"1st\"\n", "\"Artagaveytia, Mr. Ramon\",\"no\",\"male\",71,\"1st\"\n", "\"Astor, Col. John Jacob\",\"no\",\"male\",47,\"1st\"\n", "\"Astor, Mrs. John Jacob (Madelei\",\"yes\",\"female\",18,\"1st\"\n", "\"Aubart, Mme. Leontine Pauline\",\"yes\",\"female\",24,\"1st\"\n", "\"Barber, Miss. Ellen Nellie\",\"yes\",\"female\",26,\"1st\"\n", "\"Barkworth, Mr. Algernon Henry W\",\"yes\",\"male\",80,\"1st\"\n", "\"Baumann, Mr. John D\",\"no\",\"male\",NA,\"1st\"\n", "\"Baxter, Mr. Quigg Edmond\",\"no\",\"male\",24,\"1st\"\n", "\"Baxter, Mrs. James (Helene DeLa\",\"yes\",\"female\",50,\"1st\"\n", "```\n", "#### The name column does not have a header, but the passenger names are irrelevant to the analysis. Babies have fractional ages (for example, the first Allison). Not all ages were recorded, and missing ages were entered as `NA` (for example, Baumann).\n", "#### Because our analysis includes the passengers' ages, what should we do about the missing ages? We can either\n", "- #### Throw out the records with missing ages. But that can cause problems from losing other information about those passengers.\n", "- #### Replace each missing ages with a placeholder value. But doing so can introduce bias into the analysis.\n", "\n", "# We decide to replace each missing age with an ***average age***. \n", "#### What is the best average to use?\n", "#### We should replace each missing age by an average age that is ***most representative*** of the known ages in its subgroup -- the average age of passengers with similar attribute values. However, we should not replace a number of missing ages by an average of fewer known ages. In other words, out of 10 ages, it should be OK to replace 2 missing ages by the average of the 8 known ages, but not replace 8 missing ages by the average of the 2 known ages.\n", "#### We must ***explore the data*** to find the best average to use." ] }, { "cell_type": "code", "execution_count": null, "id": "3084addd-7a32-43f9-b136-aea64cbe47da", "metadata": {}, "outputs": [], "source": [ "import csv\n", "import numpy as np\n", "import pandas as pd\n", "from pandas import DataFrame" ] }, { "cell_type": "markdown", "id": "fea42893-487b-47d9-afa4-6a179386125f", "metadata": {}, "source": [ "## Global constants" ] }, { "cell_type": "code", "execution_count": null, "id": "f96df271-491d-42c6-a0c6-44ae4777741b", "metadata": {}, "outputs": [], "source": [ "CSV_FILE = 'TitanicSurvival.csv'\n", "\n", "CLASS_COUNT = 4\n", "CLASS_1 = 0\n", "CLASS_2 = 1\n", "CLASS_3 = 2\n", "\n", "SEX_COUNT = 2\n", "SEX_FEMALE = 0\n", "SEX_MALE = 1\n", "\n", "SURVIVED_COUNT = 2\n", "SURVIVED_NO = 0\n", "SURVIVED_YES = 1\n", "\n", "AGE_COUNT = 2\n", "AGE_NA = 0\n", "AGE_OK = 1" ] }, { "cell_type": "markdown", "id": "dcf339d6-3292-4a6d-bf03-386a5adb3dc2", "metadata": {}, "source": [ "## Read the Titanic Survival CSV data file\n", "#### To aid the analysis, transform each record field:\n", "- survived\n", " - 0 = no\n", " - 1 = yes\n", "- sex\n", " - 0 = female\n", " - 1 = male\n", "- age\n", " - -1 = NA\n", "- passenger class\n", " - 1 = 1st\n", " - 2 = 2nd\n", " - 3 = 3rd\n", "#### The lists `Survived`, `Sex`, `Age`, and `Pclass` will contain values from all the records." ] }, { "cell_type": "code", "execution_count": null, "id": "5eb665fc-3298-4c7e-a022-67dc0d1be01f", "metadata": {}, "outputs": [], "source": [ "def read_data(filename):\n", " \"\"\"\n", " Read the Titanic survival data. Only use rows\n", " where age is not NA. Transform sex to 0 for female\n", " and 1 for male. Transform passenger class to 1, 2, or 3.\n", " Transform survived to 0 for no and 1 for yes.\n", " Return lists Pclass, Sex, Age, and Survived.\n", " \"\"\"\n", " first = True\n", " i = 0\n", "\n", " Name = []\n", " Pclass = []\n", " Sex = []\n", " Survived = []\n", " Age = []\n", "\n", " with open(filename, newline='') as csv_file:\n", " data = csv.reader(csv_file, delimiter=',', quotechar='\"')\n", "\n", " for row in data:\n", " \n", " # Skip the first (header) row.\n", " if not first:\n", " name, survived, sex, age, pclass = row\n", " \n", " age = -1 if age== 'NA' else float(age)\n", "\n", " survived = 0 if survived == 'no' else \\\n", " 1\n", " sex = 0 if sex == 'female' else \\\n", " 1\n", " pclass = 1 if pclass == '1st' else \\\n", " 2 if pclass == '2nd' else \\\n", " 3\n", "\n", " Age.append(age)\n", " Sex.append(sex)\n", " Survived.append(survived)\n", " Pclass.append(pclass)\n", "\n", " first = False\n", " \n", " return Survived, Sex, Age, Pclass" ] }, { "cell_type": "code", "execution_count": null, "id": "86f7376b-b1a5-4b24-a1c6-a3498aa9b40b", "metadata": {}, "outputs": [], "source": [ "Survived, Sex, Age, Pclass = read_data(CSV_FILE)" ] }, { "cell_type": "markdown", "id": "780000f4-a384-4b4a-b5b7-f44f881e1695", "metadata": {}, "source": [ "## Count ages\n", "#### Count how many ages are missing or not. If we consider survived, sex, age (missing or not), and passenger class attributes to be dimensions, we can represent the data as a 4-dimensional ***hypercube*** `age_counts` and use the transformed field values to index into it. The hypercube will allow us to easily sum along each attribute." ] }, { "cell_type": "code", "execution_count": null, "id": "390c4285-59b2-4572-9392-ac3b968c143a", "metadata": {}, "outputs": [], "source": [ "def count_ages(Pclass, Sex, Survived, Age):\n", " \"\"\"\n", " Create a 4-dimensional hypercube to count ages in\n", " each of the class, sex, survived, and age dimensions.\n", " Return the hypercube.\n", " \"\"\"\n", " multidimensional_list = \\\n", " [0]*CLASS_COUNT*SEX_COUNT*SURVIVED_COUNT*AGE_COUNT\n", "\n", " hypercube = np.array(multidimensional_list)\\\n", " .reshape(CLASS_COUNT, SEX_COUNT, SURVIVED_COUNT, AGE_COUNT)\n", " \n", " for klass, sex, survived, age in zip(Pclass, Sex, Survived, Age):\n", " if age < 0:\n", " hypercube[klass][sex][survived][AGE_NA] += 1\n", " else:\n", " hypercube[klass][sex][survived][AGE_OK] += 1\n", " \n", " return hypercube" ] }, { "cell_type": "code", "execution_count": null, "id": "bfc40f7a-4381-4732-bbe3-5b636611217e", "metadata": {}, "outputs": [], "source": [ "age_counts = count_ages(Pclass, Sex, Survived, Age)\n", "print('\\nHYPERCUBE OF AGE COUNTS')\n", "print(age_counts)" ] }, { "cell_type": "markdown", "id": "c4577685-78cf-42a5-a39a-ab2dcf6de1b3", "metadata": {}, "source": [ "## Age counts by passenger class\n", "#### In a dataframe, show the counts of missing ages and their percentages by passenger class." ] }, { "cell_type": "code", "execution_count": null, "id": "e7d1fdc1-00d6-4f2a-9ef9-1ccb671d135e", "metadata": {}, "outputs": [], "source": [ "def print_age_counts_by_class(age_counts):\n", " \"\"\"\n", " Display the counts and percentages of missing ages\n", " in each class.\n", " \"\"\"\n", " table = []\n", " \n", " # Overall\n", " all_count = np.sum(age_counts[:, :, :, :])\n", " all_na_count = np.sum(age_counts[:, :, :, AGE_NA])\n", " all_na_pct = 100 * all_na_count/all_count\n", " \n", " table.append(['all', all_count, all_na_count, all_na_pct])\n", " \n", " # By passenger class\n", " for klass in range(1, CLASS_COUNT):\n", " class_count = np.sum(age_counts[klass, :, :, :])\n", " class_na_count = np.sum(age_counts[klass, :, :, AGE_NA])\n", " class_na_pct = 100 * class_na_count/class_count\n", " \n", " table.append([klass, class_count, class_na_count, class_na_pct])\n", " \n", " df = DataFrame(table)\n", " df.columns = ['class', 'count', 'NAs', '% NAs']\n", " display(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "00d27738-6bc3-4ccf-a23e-f0911bca98e7", "metadata": {}, "outputs": [], "source": [ "print('\\n% MISSING AGES BY CLASS')\n", "print_age_counts_by_class(age_counts)" ] }, { "cell_type": "markdown", "id": "4d97831b-1bd7-44dd-a7b2-8f5228cba938", "metadata": {}, "source": [ "## Age counts by sex by passenger class\n", "#### In a dataframe, show the counts of missing ages and their percentages by sex within each passenger class." ] }, { "cell_type": "code", "execution_count": null, "id": "e6fd4a6c-9b19-460a-9e9d-7603be27d8b0", "metadata": {}, "outputs": [], "source": [ "def print_age_counts_by_sex(age_counts):\n", " \"\"\"\n", " Display the counts and percentages of missing ages\n", " in the subgroups by sex by class.\n", " \"\"\"\n", " table = []\n", " \n", " # By passenger class\n", " for klass in range(1, CLASS_COUNT):\n", " \n", " # Female NA\n", " female_count = np.sum(age_counts[klass, SEX_FEMALE, :, :])\n", " female_na_count = np.sum(age_counts[klass, SEX_FEMALE, :, AGE_NA])\n", " female_na_pct = 100 * female_na_count/female_count\n", " \n", " # Male NA\n", " male_count = np.sum(age_counts[klass, SEX_MALE, :, :])\n", " male_na_count = np.sum(age_counts[klass, SEX_MALE, :, AGE_NA])\n", " male_na_pct = 100 * male_na_count/male_count\n", "\n", " table.append([klass, female_count, female_na_count, female_na_pct,\n", " male_count, male_na_count, male_na_pct])\n", "\n", " df = DataFrame(table)\n", " df.columns = ['class', 'female count', 'female NAs', 'female % NAs',\n", " 'male count', 'male NAs', 'male % NAs']\n", " display(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "e439991e-bef7-4254-9ca8-f26f220729ae", "metadata": {}, "outputs": [], "source": [ "print('\\n% MISSING AGES BY SEX BY CLASS')\n", "print_age_counts_by_sex(age_counts)" ] }, { "cell_type": "markdown", "id": "a349a80e-63a8-48d3-ba76-c4341129a028", "metadata": {}, "source": [ "## Age counts by survival by sex by passenger class\n", "#### In a dataframe, show the counts of missing ages and their percentages by survival by sex within each passenger class." ] }, { "cell_type": "code", "execution_count": null, "id": "d40a8463-a8dd-45f1-b63a-37a59f9fc903", "metadata": {}, "outputs": [], "source": [ "def print_age_counts_by_survived(age_counts):\n", " \"\"\"\n", " Display the counts and percentages of missing ages\n", " in the subgroups by survived by sex by class.\n", " \"\"\"\n", " table = []\n", " \n", " # By passenger class\n", " for klass in range(1, CLASS_COUNT):\n", " \n", " #Female perished NA\n", " female_perished_count = np.sum(age_counts[klass, SEX_FEMALE, SURVIVED_NO, :])\n", " female_perished_na_count = np.sum(age_counts[klass, SEX_FEMALE, SURVIVED_NO, AGE_NA])\n", " female_perished_na_pct = 100 * female_perished_na_count/female_perished_count\n", " \n", " # Male perished NA\n", " male_perished_count = np.sum(age_counts[klass, SEX_MALE, SURVIVED_NO, :])\n", " male_perished_na_count = np.sum(age_counts[klass, SEX_MALE, SURVIVED_NO, AGE_NA])\n", " male_perished_na_pct = 100 * male_perished_na_count/male_perished_count\n", " \n", " table.append([klass, 'no', female_perished_count, \n", " female_perished_na_count, \n", " female_perished_na_pct,\n", " male_perished_count, \n", " male_perished_na_count, \n", " male_perished_na_pct])\n", " \n", " # Female survived NA\n", " female_survived_count = np.sum(age_counts[klass, SEX_FEMALE, SURVIVED_YES, :])\n", " female_survived_na_count = np.sum(age_counts[klass, SEX_FEMALE, SURVIVED_YES, AGE_NA])\n", " female_survived_na_pct = 100 * female_survived_na_count/female_survived_count\n", " \n", " # Male survived NA\n", " male_survived_count = np.sum(age_counts[klass, SEX_MALE, SURVIVED_YES, :])\n", " male_survived_na_count = np.sum(age_counts[klass, SEX_MALE, SURVIVED_YES, AGE_NA])\n", " male_survived_na_pct = 100 * male_survived_na_count/male_survived_count\n", " \n", " table.append([klass, 'yes', female_survived_count, \n", " female_survived_na_count, \n", " female_survived_na_pct,\n", " male_survived_count, \n", " male_survived_na_count, \n", " male_survived_na_pct])\n", " \n", " df = DataFrame(table)\n", " df.columns = ['class', 'survived', 'female count', 'female NAs', 'female % NAs',\n", " 'male count', 'male NAs', 'male % NAs']\n", " display(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "a4d2832c-860b-49aa-b106-932ef16bbf0f", "metadata": {}, "outputs": [], "source": [ "print('\\n% MISSING AGES BY SURVIVAL BY SEX BY CLASS')\n", "print_age_counts_by_survived(age_counts)" ] }, { "cell_type": "markdown", "id": "38af63e9-175b-4e06-a65f-c3f9efa2ca20", "metadata": {}, "source": [ "### The above tables show that it should be OK to replace each missing age by the average of known ages in its smallest subgroup. The percentage of missing ages in each subgroup should be significantly less than 50%. We see that the replacements in 3rd class (> 30% and > 21%) will be more probematical than in the other classes." ] }, { "cell_type": "markdown", "id": "5dd11163-7a6f-4ecf-9953-45b5d577225d", "metadata": {}, "source": [ "## Skip rows where age is `NA`\n", "#### Compute some baseline statistics with the original data but without the missing ages." ] }, { "cell_type": "code", "execution_count": null, "id": "0795857e-1ad8-4b18-908b-cacdf42a2f1d", "metadata": {}, "outputs": [], "source": [ "def skip_na(Survived, Sex, Age, Pclass):\n", " \"\"\"\n", " Return copies of the Survived, Sex, Age, and Pclass lists\n", " without the rows where age is NA.\n", " \"\"\"\n", " Survived_no_na = []\n", " Sex_no_na = []\n", " Age_no_na = []\n", " Pclass_no_na = []\n", " \n", " for survived, sex, age, pclass in zip(Survived, Sex, Age, Pclass):\n", " if age >= 0:\n", " Survived_no_na.append(survived)\n", " Sex_no_na.append(sex)\n", " Age_no_na.append(age)\n", " Pclass_no_na.append(pclass)\n", " \n", " return Survived_no_na, Sex_no_na, Age_no_na, Pclass_no_na" ] }, { "cell_type": "code", "execution_count": null, "id": "545e245a-8f6f-4779-a857-fa3e2119ec4b", "metadata": {}, "outputs": [], "source": [ "Survived_no_na, Sex_no_na, Age_no_na, Pclass_no_na = \\\n", " skip_na(Survived, Sex, Age, Pclass)" ] }, { "cell_type": "markdown", "id": "8dc21c6b-1cea-43e4-b08a-3b44d0ebf4ba", "metadata": {}, "source": [ "### Cache of ages\n", "#### Dictionary `averages` will cache age counts in each dimension." ] }, { "cell_type": "code", "execution_count": null, "id": "4b9a32f3-ea0f-4e1a-a3b9-59ebbc6da61a", "metadata": {}, "outputs": [], "source": [ "averages = {}" ] }, { "cell_type": "markdown", "id": "5cbdbc2e-6343-43f9-925c-ad0eb6875cc4", "metadata": {}, "source": [ "## Average ages by passenger class\n", "#### Calculate the average age of each class." ] }, { "cell_type": "code", "execution_count": null, "id": "a9c2f137-f8c2-4c3f-b35c-8bae3cb96095", "metadata": {}, "outputs": [], "source": [ "def average_age_class(target_class, classes, ages, averages):\n", " \"\"\"\n", " Calculate the average age of the target passenger class \n", " using the classes and ages lists and return the average.\n", " \"\"\"\n", " if target_class == 0:\n", " avg = sum(ages)/len(ages) # overall\n", " else:\n", " count = 0\n", " total = 0\n", "\n", " # By target passenger class\n", " for klass, age in zip(classes, ages):\n", " if klass == target_class: \n", " count += 1\n", " total += age\n", "\n", " avg = total/count\n", " \n", " return avg" ] }, { "cell_type": "markdown", "id": "e3548a31-b96b-41e3-9550-a2515df393bb", "metadata": {}, "source": [ "#### Print the average age of each class." ] }, { "cell_type": "code", "execution_count": null, "id": "ad1e01e5-83b7-4976-b980-264d5b693ad2", "metadata": {}, "outputs": [], "source": [ "def print_average_age_class(classes, ages, averages):\n", " \"\"\"\n", " Print the average age of each passenger class using\n", " the classes and ages lists and the averages dictionary.\n", " \"\"\"\n", " row = []\n", " \n", " for klass in range(CLASS_COUNT):\n", " row.append(average_age_class(klass, classes, ages, \n", " averages))\n", "\n", " df = DataFrame([row])\n", " df.columns = ['all', '1st', '2nd', '3rd']\n", " display(df)" ] }, { "cell_type": "markdown", "id": "26a97f36-96d6-4375-85f5-340625e27827", "metadata": {}, "source": [ "### Our baseline is the average ages by class without the missing ages." ] }, { "cell_type": "code", "execution_count": null, "id": "ab3d4f89-b67a-4f5a-af8a-87560ee07fde", "metadata": {}, "outputs": [], "source": [ "print('\\nAVERAGE AGE BY CLASS'\n", " '\\nWITHOUT MISSING AGES')\n", "print_average_age_class(Pclass_no_na, Age_no_na, averages)" ] }, { "cell_type": "markdown", "id": "3bfda7c2-b205-43cf-a9ca-cb53e188574a", "metadata": {}, "source": [ "#### Replace the missing age in each class." ] }, { "cell_type": "code", "execution_count": null, "id": "7e8cc7f2-cf3d-43a7-b8e2-d77a48c760b8", "metadata": {}, "outputs": [], "source": [ "def replace_by_class(classes, ages, averages):\n", " \"\"\"\n", " Replace each missing age by the passenger class average using\n", " the passenger classes and ages lists and the averages dictionary.\n", " \"\"\"\n", " clean = []\n", " \n", " for klass, age in zip(classes, ages):\n", " clean.append(age if age >= 0 \\\n", " else average_age_class(klass, classes, \n", " ages, averages))\n", "\n", " return clean" ] }, { "cell_type": "code", "execution_count": null, "id": "c25a5215-532c-4d13-a6aa-dbf3631c86e3", "metadata": {}, "outputs": [], "source": [ "print('\\nAVERAGE AGE BY CLASS'\n", " '\\nAFTER REPLACEMENTS WITH CLASS AVERAGES')\n", "Age_clean_class = replace_by_class(Pclass, Age, averages)\n", "print_average_age_class(Pclass, Age_clean_class, averages)" ] }, { "cell_type": "markdown", "id": "e732e73a-0300-406f-a0f5-cdaa6d6867ae", "metadata": {}, "source": [ "## Average age of each sex subgroup\n", "#### Calculate the average age of each sex subgroup in each class." ] }, { "cell_type": "code", "execution_count": null, "id": "875d411e-d16a-4cdf-bb04-cd56c069a8a7", "metadata": {}, "outputs": [], "source": [ "def average_age_sex(target_class, target_sex, \n", " classes, sexes, ages, averages):\n", " \"\"\"\n", " Retrieve the average age of the target sex within \n", " the target passenger class from the averages dictionary \n", " if it's in there. Otherwise, calculate the average using \n", " the classes, sexes, and ages lists and enter it into \n", " the dictionary. Return the average.\n", " \"\"\"\n", " key = f'{target_class}-{target_sex}'\n", " \n", " if key in averages:\n", " return averages[key]\n", " \n", " if target_class == 0:\n", " avg = sum(ages)/len(ages) # overall\n", " else:\n", " count = 0\n", " total = 0\n", " \n", " # By target passenger class and sex\n", " for klass, sex, age in zip(classes, sexes, ages):\n", " if (klass == target_class) \\\n", " and (sex == target_sex):\n", " count += 1\n", " total += age\n", "\n", " avg = total/count\n", " \n", " averages[key] = avg\n", " return avg" ] }, { "cell_type": "markdown", "id": "6e47dede-c3f9-4727-9e80-b282a3fa7399", "metadata": {}, "source": [ "#### Print the average age of each sex subgroup." ] }, { "cell_type": "code", "execution_count": null, "id": "1abda232-e407-420b-bdbd-3948bbc61eac", "metadata": {}, "outputs": [], "source": [ "def print_average_age_sex(classes, sexes, ages, averages):\n", " \"\"\"\n", " Print the average age of each sex in each passenger \n", " class using the classes, sexes, and ages lists and \n", " the averages dictionary.\n", " \"\"\"\n", " table = []\n", " \n", " for klass in range(CLASS_COUNT):\n", " table.append(['all' if klass == 0 else klass,\n", " average_age_sex(klass, SEX_FEMALE, \n", " classes, sexes, \n", " ages, averages),\n", " average_age_sex(klass, SEX_MALE, \n", " classes, sexes, \n", " ages, averages)])\n", "\n", " df = DataFrame(table)\n", " df.columns=['class', 'female average', 'male average']\n", " display(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "21cf1a33-bee8-40f8-97d9-a27b6ee9fe31", "metadata": {}, "outputs": [], "source": [ "print('\\nAVERAGE AGES BY SEX BY CLASS'\n", " '\\nWITHOUT MISSING AGES')\n", "print_average_age_sex(Pclass_no_na, Sex_no_na, Age_no_na, averages)" ] }, { "cell_type": "markdown", "id": "dcdc4a16-d44a-4df6-a51b-82c752e46f1b", "metadata": {}, "source": [ "#### Replace the missing age in each sex subgroup." ] }, { "cell_type": "code", "execution_count": null, "id": "f83bffbc-f08a-409c-9119-86cce046a0e6", "metadata": {}, "outputs": [], "source": [ "def replace_by_sex(classes, sexes, ages, averages):\n", " \"\"\"\n", " Replace each missing age by the average age in each sex \n", " subgroup within each class using the classes, sexes, and \n", " ages lists and the averages dictionary.\n", " \"\"\"\n", " clean = []\n", " \n", " for klass, sex, age in zip(classes, sexes, ages):\n", " clean.append(age if age >= 0 \\\n", " else average_age_sex(klass, sex, \n", " classes, sexes, \n", " ages, averages))\n", "\n", " return clean" ] }, { "cell_type": "code", "execution_count": null, "id": "7d90cd8f-0269-4448-995b-0cb62aebb892", "metadata": {}, "outputs": [], "source": [ "print('\\nAVERAGE AGES BY SEX BY CLASS'\n", " '\\nAFTER REPLACEMENTS WITH SEX SUBGROUP AVERAGES')\n", "Age_clean_sex = replace_by_sex(Pclass, Sex, Age, averages)\n", "print_average_age_class(Pclass, Age_clean_sex, averages)" ] }, { "cell_type": "markdown", "id": "491fc0ad-c1c6-4e8d-9f4f-46c015119352", "metadata": {}, "source": [ "## Average age of each survived subgroup\n", "#### Calculate the average age of each survived subgroup in each sex subgroup in each passenger class. by survival status by sex by passenger class." ] }, { "cell_type": "code", "execution_count": null, "id": "37fe3426-82c5-4bd4-b3f9-429ee8073867", "metadata": {}, "outputs": [], "source": [ "def average_age_survived(target_class, target_sex, target_survived,\n", " classes, sexes, surviveds, ages, averages):\n", " \"\"\"\n", " Retrieve the average age of the target survived subgroup\n", " in the target sex subgroup in the target passenger class \n", " from the averages dictionary if it's in there. Otherwise, \n", " calculate the average using the classes, sexes, surviveds, \n", " and ages lists and enter it into the dictionary. Return \n", " the average.\n", " \"\"\"\n", " key = f'{target_class}-{target_sex}-{target_survived}'\n", " \n", " if key in averages:\n", " return averages[key]\n", " \n", " if target_class == 0:\n", " avg = sum(ages)/len(ages) # overall\n", " else:\n", " count = 0\n", " total = 0\n", " \n", " # By target passenger class, sex, and survival\n", " for klass, sex, survived, age \\\n", " in zip(classes, sexes, surviveds, ages):\n", " if (klass == target_class) \\\n", " and (sex == target_sex) \\\n", " and (survived == target_survived):\n", " count += 1\n", " total += age\n", "\n", " avg = total/count\n", " \n", " averages[key] = avg\n", " return avg" ] }, { "cell_type": "markdown", "id": "d4be2d3f-07d2-49a1-bc25-2c473292363f", "metadata": {}, "source": [ "#### Print the average age of each survived subgroup." ] }, { "cell_type": "code", "execution_count": null, "id": "8643f969-71af-4578-ba42-97f8d7284033", "metadata": {}, "outputs": [], "source": [ "def print_average_age_survived(classes, sexes, surviveds, ages, averages):\n", " \"\"\"\n", " Print the average age of each survived subgroup in each \n", " sex subgroup in each passenger class using the classes, \n", " sexes, surviveds, and ages lists and the averages dictionary.\n", " \"\"\"\n", " table = []\n", "\n", " for klass in range(1, CLASS_COUNT):\n", " table.append([klass,\n", " average_age_survived(klass, SEX_FEMALE, SURVIVED_NO,\n", " classes, sexes, surviveds, ages, \n", " averages),\n", " average_age_survived(klass, SEX_MALE, SURVIVED_NO,\n", " classes, sexes, surviveds, ages, \n", " averages),\n", " average_age_survived(klass, SEX_FEMALE, SURVIVED_YES,\n", " classes, sexes, surviveds, ages, \n", " averages),\n", " average_age_survived(klass, SEX_MALE, SURVIVED_YES,\n", " classes, sexes, surviveds, ages, \n", " averages)])\n", "\n", " df = DataFrame(table)\n", " df.columns=['class', 'female perished', 'male perished', \n", " 'female survived', 'male survived']\n", " display(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "e698d55b-2741-45d3-9dce-13f447dc797a", "metadata": {}, "outputs": [], "source": [ "print('\\nAVERAGE AGES BY SURVIVED BY SEX BY CLASS'\n", " '\\nWITHOUT MISSING AGES') \n", "print_average_age_survived(Pclass_no_na, Sex_no_na, \n", " Survived_no_na, Age_no_na, averages)" ] }, { "cell_type": "markdown", "id": "145f2d6b-df92-4e14-baea-4099438a58d1", "metadata": {}, "source": [ "#### Replace the missing ages in each survived subgroup." ] }, { "cell_type": "code", "execution_count": null, "id": "8a276abf-3ced-411b-9c73-0574ade0a9da", "metadata": {}, "outputs": [], "source": [ "def replace_by_survived(classes, sexes, surviveds, ages, averages):\n", " \"\"\"\n", " Replace each missing age by the average of the survived subggroup\n", " in each survival subgroup in each sex within each class using \n", " the classes, sexes, and ages lists and the averages dictionary.\n", " \"\"\"\n", " clean = []\n", " \n", " for klass, sex, survived, age \\\n", " in zip(classes, sexes, surviveds, ages):\n", " clean.append(age if age >= 0 \\\n", " else average_age_survived(klass, sex, \n", " survived, classes, \n", " sexes, surviveds,\n", " ages, averages))\n", "\n", " return clean" ] }, { "cell_type": "code", "execution_count": null, "id": "cfaee26a-f545-4032-9444-c35076f545f8", "metadata": {}, "outputs": [], "source": [ "print('\\nAVERAGE AGES BY SURVIVED BY SEX BY CLASS'\n", " '\\nAFTER REPLACEMENTS WITH SURVIVED SUBGROUP AVERAGES')\n", "Age_clean_survived = replace_by_survived(Pclass, Sex, Survived, Age, averages)\n", "print_average_age_class(Pclass, Age_clean_survived, averages)" ] }, { "cell_type": "markdown", "id": "af1aa4a6-81cc-4e99-a40a-2037d418e2f2", "metadata": {}, "source": [ "## Correlation matrix" ] }, { "cell_type": "code", "execution_count": null, "id": "ac704ead-47e9-4dbe-a7ed-6fbe46e78855", "metadata": {}, "outputs": [], "source": [ "def print_correlation_matrix(classes, sexes, surviveds, ages):\n", " \"\"\"\n", " Display the pairwise correlation matrix.\n", " \"\"\"\n", " M = np.array([classes, sexes, surviveds, ages]).T\n", "\n", " df = DataFrame(M)\n", " df.columns = ['Class', 'Sex', 'Survived', 'Age']\n", "\n", " display(df.corr(method='pearson', min_periods=1))" ] }, { "cell_type": "code", "execution_count": null, "id": "775def39-78a9-42a2-8291-e4a4bf71e6ba", "metadata": {}, "outputs": [], "source": [ "print('\\nPAIRWISE CORRELATION MATRIX WITHOUT MISSING AGES')\n", "print_correlation_matrix(Pclass_no_na, Sex_no_na, \n", " Survived_no_na, Age_no_na)\n", "\n", "print('\\nPAIRWISE CORRELATION MATRIX WITH REPLACED AGES')\n", "print_correlation_matrix(Pclass, Sex, Survived, Age_clean_survived)" ] }, { "cell_type": "markdown", "id": "b8d695fb-5dbb-498c-b08f-bf54a9297950", "metadata": {}, "source": [ "## Main" ] }, { "cell_type": "code", "execution_count": null, "id": "98a78203-f238-452b-87ca-f9266b2e7826", "metadata": {}, "outputs": [], "source": [ "def main():\n", " Survived, Sex, Age, Pclass = read_data(CSV_FILE)\n", " \n", " Survived_no_na, Sex_no_na, Age_no_na, Pclass_no_na = \\\n", " skip_na(Survived, Sex, Age, Pclass)\n", " \n", " age_counts = count_ages(Pclass, Sex, Survived, Age)\n", " print('\\nHYPERCUBE OF AGE COUNTS')\n", " print(age_counts)\n", "\n", " print('\\n% MISSING AGES BY CLASS')\n", " print_age_counts_by_class(age_counts)\n", " \n", " print('\\n% MISSING AGES BY SEX BY CLASS')\n", " print_age_counts_by_sex(age_counts)\n", " \n", " print('\\n% MISSING AGES BY SURVIVAL BY SEX BY CLASS')\n", " print_age_counts_by_survived(age_counts)\n", " \n", " averages = {}\n", " \n", " print('\\nAVERAGE AGE BY CLASS'\n", " '\\nWITHOUT MISSING AGES')\n", " print_average_age_class(Pclass_no_na, Age_no_na, averages)\n", " \n", " print('\\nAVERAGE AGE BY CLASS'\n", " '\\nAFTER REPLACEMENTS WITH CLASS AVERAGES')\n", " Age_clean_class = replace_by_class(Pclass, Age, averages)\n", " print_average_age_class(Pclass, Age_clean_class, averages)\n", " \n", " print('\\nAVERAGE AGES BY SEX BY CLASS'\n", " '\\nWITHOUT MISSING AGES')\n", " print_average_age_sex(Pclass_no_na, Sex_no_na, Age_no_na, averages)\n", " \n", " print('\\nAVERAGE AGES BY SEX BY CLASS'\n", " '\\nAFTER REPLACEMENTS WITH SEX SUBGROUP AVERAGES')\n", " Age_clean_sex = replace_by_sex(Pclass, Sex, Age, averages)\n", " print_average_age_class(Pclass, Age_clean_sex, averages)\n", " \n", " print('\\nAVERAGE AGES BY SURVIVED BY SEX BY CLASS'\n", " '\\nWITHOUT MISSING AGES') \n", " print_average_age_survived(Pclass_no_na, Sex_no_na, \n", " Survived_no_na, Age_no_na, averages)\n", "\n", " print('\\nAVERAGE AGES BY SURVIVED BY SEX BY CLASS'\n", " '\\nAFTER REPLACEMENTS WITH SURVIVED SUBGROUP AVERAGES')\n", " Age_clean_survived = replace_by_survived(Pclass, Sex, Survived, \n", " Age, averages)\n", " print_average_age_class(Pclass, Age_clean_survived, averages)\n", " \n", " print('\\nPAIRWISE CORRELATION MATRIX WITHOUT MISSING AGES')\n", " print_correlation_matrix(Pclass_no_na, Sex_no_na, \n", " Survived_no_na, Age_no_na)\n", " \n", " print('\\nPAIRWISE CORRELATION MATRIX WITH REPLACED AGES')\n", " print_correlation_matrix(Pclass, Sex, Survived, Age_clean_survived)" ] }, { "cell_type": "code", "execution_count": null, "id": "6faf3c17-8312-4376-8fc2-bf5e1897cf82", "metadata": {}, "outputs": [], "source": [ "main()" ] }, { "cell_type": "markdown", "id": "009bf935-0211-488e-b372-c2dd7d352609", "metadata": {}, "source": [ "## Conclusion\n", "#### As shown by the two pairwise correlation matrices, the age attribute correlates **slightly better** with each of the other attributes if we replace each missing person's age by the survived subgroup's average age. Age is the worse predictor of survival. Age is much better at predicting a person's passenger class. A person's sex is the best predictor of survival." ] }, { "cell_type": "markdown", "id": "03826693-ef12-4054-9051-f578bc450660", "metadata": {}, "source": [ "#### (c) 2023 by Ronald Mak" ] }, { "cell_type": "code", "execution_count": null, "id": "fe2ba894-3dbf-40ed-b42c-7af64e2fe7a7", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }