{ "cells": [ { "cell_type": "markdown", "id": "20ba0a44-4a8a-4312-ba42-c229c4b9cce3", "metadata": {}, "source": [ "# Replace missing values with averages\n", "#### Replace missing iages with subgroup averages entirely in the database. This greatly reduces network traffic and latencies between the client Python code and the remote database server." ] }, { "cell_type": "code", "execution_count": null, "id": "9d15681a-9b96-4e3d-b25a-5b3fcf4088d9", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pandas import DataFrame\n", "from DATA225utils import make_connection, dataframe_query" ] }, { "cell_type": "code", "execution_count": null, "id": "9f0a4808-a345-4bf8-89af-78ec706b0be2", "metadata": {}, "outputs": [], "source": [ "conn = make_connection(config_file = 'titanic.ini')\n", "cursor = conn.cursor()" ] }, { "cell_type": "markdown", "id": "eaae1c38-900a-4a9b-a5a7-6200c19393ed", "metadata": {}, "source": [ "## Age counts by class\n", "#### In a dataframe, show the counts of missing ages and their percentages by class." ] }, { "cell_type": "code", "execution_count": null, "id": "0bf12222-908a-4c22-8e49-b3096b7cd66b", "metadata": {}, "outputs": [], "source": [ "def print_age_counts_by_class():\n", " _, df = dataframe_query(conn, \n", " \"\"\"\n", " SELECT class, \n", " COUNT(*) AS class_count,\n", " COUNT(IF (age = 0, 1, NULL)) AS NA_count,\n", " 100*COUNT(IF (age = 0, 1, NULL))/COUNT(*) AS NA_pct\n", " FROM passengers\n", " GROUP BY class\n", " ORDER BY class\n", " \"\"\"\n", " )\n", "\n", " display(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "598e9e50-3e92-4980-b636-78bd954d82b6", "metadata": {}, "outputs": [], "source": [ "print('\\n% MISSING AGES BY CLASS')\n", "print_age_counts_by_class()" ] }, { "cell_type": "markdown", "id": "415da889-c7f7-4f41-a5cb-2385b6b0d090", "metadata": {}, "source": [ "## Age counts by sex by class\n", "#### In a dataframe, show the counts of missing ages and their percentages by sex by class." ] }, { "cell_type": "code", "execution_count": null, "id": "5721fcf9-102c-48f8-854c-eedff56f3480", "metadata": {}, "outputs": [], "source": [ "def print_age_counts_by_sex():\n", " _, df = dataframe_query(conn, \n", " \"\"\"\n", " SELECT class,\n", " COUNT(IF (sex = 'female', 1, NULL)) AS female_count,\n", " COUNT(IF (sex = 'female' AND age = 0, 1, NULL)) AS female_NA_count,\n", " 100*COUNT(IF (age = 0 AND sex = 'female', 1, NULL))\n", " /COUNT(IF (sex = 'female', 1, NULL)) AS female_NA_pct,\n", " COUNT(IF (sex = 'male', 1, NULL)) AS male_count,\n", " COUNT(IF (sex = 'male' AND age = 0, 1, NULL)) AS male_NA_count,\n", " 100*COUNT(IF (age = 0 AND sex = 'male', 1, NULL))\n", " /COUNT(IF (sex = 'male', 1, NULL)) AS male_NA_pct\n", " FROM passengers\n", " GROUP BY class\n", " ORDER BY class\n", " \"\"\"\n", " )\n", "\n", " display(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "b4ad3cfb-aa65-41d9-8a94-3e4373238bd1", "metadata": {}, "outputs": [], "source": [ "print('\\n% MISSING AGES BY SEX BY CLASS')\n", "print_age_counts_by_sex()" ] }, { "cell_type": "markdown", "id": "47af5aae-8b16-4747-a956-ca6671fa77d8", "metadata": {}, "source": [ "## Age counts by survived by sex by class\n", "#### In a dataframe, show the counts of missing ages and their percentages by surveved by sex by class." ] }, { "cell_type": "code", "execution_count": null, "id": "2484fa92-15db-4305-b31b-a7deb5a51c5a", "metadata": {}, "outputs": [], "source": [ "def print_age_counts_by_survived():\n", " _, df = dataframe_query(conn, \n", " \"\"\"\n", " SELECT class,\n", " survived,\n", " COUNT(IF (sex = 'female', 1, NULL)) AS female_count,\n", " COUNT(IF (sex = 'female' AND age = 0, 1, NULL)) AS female_NA_count,\n", " 100*COUNT(IF (age = 0 AND sex = 'female', 1, NULL))\n", " /COUNT(IF (sex = 'female', 1, NULL)) AS female_NA_pct,\n", " COUNT(IF (sex = 'male', 1, NULL)) AS male_count,\n", " COUNT(IF (sex = 'male' AND age = 0, 1, NULL)) AS male_NA_count,\n", " 100*COUNT(IF (age = 0 AND sex = 'male', 1, NULL))\n", " /COUNT(IF (sex = 'male', 1, NULL)) AS male_NA_pct\n", " FROM passengers\n", " GROUP BY class, survived\n", " ORDER BY class, survived\n", " \"\"\"\n", " )\n", "\n", " display(df)" ] }, { "cell_type": "code", "execution_count": null, "id": "7a999e37-472b-4b32-bc32-d1372049685a", "metadata": {}, "outputs": [], "source": [ "print('\\n% MISSING AGES BY SURVIVAL BY SEX BY CLASS')\n", "print_age_counts_by_survived()" ] }, { "cell_type": "markdown", "id": "e2e5082e-c645-406a-b1ba-b1ef97dab6b1", "metadata": {}, "source": [ "## Average age in each class" ] }, { "cell_type": "code", "execution_count": null, "id": "0d8cbb69-941e-4212-b018-f9efc708b74a", "metadata": {}, "outputs": [], "source": [ "def print_average_age_class():\n", " _, df = dataframe_query(conn, \n", " \"\"\"\n", " SELECT class, AVG(age)\n", " FROM passengers\n", " WHERE age > 0\n", " GROUP BY class\n", " ORDER BY class\n", " \"\"\"\n", " )\n", "\n", " display(df)" ] }, { "cell_type": "markdown", "id": "37b48c8d-6b02-4728-acb0-5a1442c87fa3", "metadata": {}, "source": [ "## A stored function to return average ages\n", "#### Given a passenger class, sex of the passenger, and survival status, return the average age of the passengers in that class. A `DETERMINISTIC` function always returns the same result for the same parameters values." ] }, { "cell_type": "code", "execution_count": null, "id": "8b109019-afd1-4f4e-9321-741ed31dfaf1", "metadata": {}, "outputs": [], "source": [ "def create_stored_function():\n", " cursor.execute('DROP FUNCTION IF EXISTS average_age')\n", "\n", " sql = ( \"\"\"\n", " CREATE FUNCTION average_age(class_parm VARCHAR(3), \n", " sex_parm VARCHAR(6), \n", " survived_parm VARCHAR(3))\n", " RETURNS DOUBLE\n", " DETERMINISTIC\n", "\n", " BEGIN\n", " DECLARE average DOUBLE;\n", "\n", " SET average = (\n", " SELECT AVG(age) AS average\n", " FROM passengers\n", " WHERE age > 0\n", " AND class = class_parm\n", " AND sex = sex_parm\n", " AND survived = survived_parm\n", " GROUP BY class, sex, survived\n", " );\n", "\n", " RETURN average;\n", " END\n", " \"\"\"\n", " )\n", "\n", " cursor.execute(sql)" ] }, { "cell_type": "code", "execution_count": null, "id": "60c740ae-8f16-494d-91ea-0bd353390999", "metadata": {}, "outputs": [], "source": [ "create_stored_function()" ] }, { "cell_type": "markdown", "id": "74132856-e1b5-407d-9802-be18d5288869", "metadata": {}, "source": [ "## Average age of each survived subgroup" ] }, { "cell_type": "code", "execution_count": null, "id": "b0d614b2-29bd-4400-8f7e-aa78f6e7da55", "metadata": {}, "outputs": [], "source": [ "def print_average_age_survived():\n", " table = []\n", "\n", " for klass in ['1st', '2nd', '3rd']:\n", " row = [klass]\n", "\n", " cursor.execute(f\"SELECT average_age('{klass}', 'female', 'no')\")\n", " row.append(cursor.fetchone()[0])\n", "\n", " cursor.execute(f\"SELECT average_age('{klass}', 'male', 'no')\")\n", " row.append(cursor.fetchone()[0])\n", "\n", " cursor.execute(f\"SELECT average_age('{klass}', 'female', 'yes')\")\n", " row.append(cursor.fetchone()[0])\n", "\n", " cursor.execute(f\"SELECT average_age('{klass}', 'male', 'yes')\")\n", " row.append(cursor.fetchone()[0])\n", "\n", " table.append(row)\n", "\n", " df = DataFrame(table)\n", " df.columns=['class', 'female perished', 'male perished', \n", " 'female survived', 'male survived']\n", "\n", " display(df)" ] }, { "cell_type": "markdown", "id": "e5236a29-ff25-445a-8c3f-5b01f7f00fef", "metadata": {}, "source": [ "## Replace the missing ages in each survived subgroup" ] }, { "cell_type": "code", "execution_count": null, "id": "030f2aaa-218b-4d68-9d85-0e8721541538", "metadata": {}, "outputs": [], "source": [ "def replace_missing_ages():\n", " cursor.execute('SET SQL_SAFE_UPDATES = 0')\n", "\n", " cursor.execute( \n", " \"\"\"\n", " UPDATE passengers\n", " SET age = average_age(class, sex, survived)\n", " WHERE age = 0\n", " \"\"\"\n", " )\n", "\n", " conn.commit()" ] }, { "cell_type": "markdown", "id": "208ef25c-565d-4e16-8afb-bfc6b3c86571", "metadata": {}, "source": [ "## Main" ] }, { "cell_type": "code", "execution_count": null, "id": "f66c5555-ed6e-43fc-ac3a-a1546f074abe", "metadata": {}, "outputs": [], "source": [ "print('\\n% MISSING AGES BY CLASS')\n", "print_age_counts_by_class()\n", "\n", "print('\\n% MISSING AGES BY SEX BY CLASS')\n", "print_age_counts_by_sex()\n", "\n", "print('\\n% MISSING AGES BY SURVIVAL BY SEX BY CLASS')\n", "print_age_counts_by_survived()\n", "\n", "create_stored_function()\n", "\n", "print('\\nAVERAGE AGES BY SURVIVED BY SEX BY CLASS'\n", " '\\nWITHOUT MISSING AGES') \n", "print_average_age_survived()\n", "\n", "replace_missing_ages()\n", "\n", "print('\\nAVERAGE AGES BY SURVIVED BY SEX BY CLASS'\n", " '\\nAFTER REPLACEMENTS WITH SURVIVED SUBGROUP AVERAGES')\n", "print_average_age_class()" ] }, { "cell_type": "code", "execution_count": null, "id": "0a3dd0b2-d483-4c07-b120-94227889b3f0", "metadata": {}, "outputs": [], "source": [ "cursor.close()\n", "conn.close()" ] }, { "cell_type": "markdown", "id": "e7d7d142-9a6b-4289-a4d5-5b351243f7e0", "metadata": {}, "source": [ "#### (c) 2023 by Ronald Mak" ] }, { "cell_type": "code", "execution_count": null, "id": "548e1684-ce16-4a0a-91be-fc00adc8dd2a", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 5 }