{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "eb60459c-b47e-4599-a0d5-5154c40bdee9",
   "metadata": {},
   "source": [
    "### <center>San Jose State University<br>Department of Applied Data Science<br><br>**DATA 200<br>Computational Programming for Data Analytics**<br><br>Spring 2024<br>Instructor: Ron Mak</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c63b741-cb26-4340-9ca0-d8b89ec8a397",
   "metadata": {},
   "source": [
    "# Plot earthquake data with the `pandas` module"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a65c9883-8c96-4da4-a206-e1b142cb270e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d4054aa-4404-449f-97ef-1875e47fa2f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes = pd.read_csv('earthquakes.csv')\n",
    "quakes.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f4c7f0ea-9c73-4b37-95cf-bcdcce174e35",
   "metadata": {},
   "source": [
    "## Cumulative distribution function (CDF)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "442fb435-f41d-425c-8b62-50956c13d44c",
   "metadata": {},
   "source": [
    "#### Suppose that we want to know the probability of a value of interest being less than or equal to a particular value. This is the **cumulative disribution function (CDF)**. Using the `statsmodels` package, we can estimate the CDF giving us the **empirical cumulative distribution function (ECDF)**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e22b4e8-aac5-4f77-b1ff-5d467624e58a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.distributions.empirical_distribution import ECDF\n",
    "\n",
    "ecdf = ECDF(quakes.query('magType == \"ml\"').mag)\n",
    "plt.plot(ecdf.x, ecdf.y)\n",
    "\n",
    "plt.xlabel('mag')\n",
    "plt.ylabel('cumulative probability')\n",
    "plt.title('ECDF of earthquake magnitude with magType ml')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25d26686-af7a-4947-ac67-adfaa6fc6046",
   "metadata": {},
   "source": [
    "#### For example, can make it clear that the probability of getting an earthquake with magnitude less than or equal to 3  using the `ml` scale is 98%:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cce670b7-bbe1-4489-a59f-b31f49531df3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from statsmodels.distributions.empirical_distribution import ECDF\n",
    "\n",
    "ecdf = ECDF(quakes.query('magType == \"ml\"').mag)\n",
    "plt.plot(ecdf.x, ecdf.y)\n",
    "\n",
    "plt.xlabel('mag')\n",
    "plt.ylabel('cumulative probability')\n",
    "\n",
    "# Add reference lines for interpreting the ECDF for mag <= 3.\n",
    "plt.plot(\n",
    "    [3, 3], [0, .98], '--k', \n",
    "    [-1.5, 3], [0.98, 0.98], '--k', alpha=0.4\n",
    ")\n",
    "\n",
    "# Set axis ranges.\n",
    "plt.ylim(0, None)\n",
    "plt.xlim(-1.25, None)\n",
    "\n",
    "plt.title('P(mag <= 3) = 98%')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02a574d1-502b-4756-bab5-2f22cf8ead8d",
   "metadata": {},
   "source": [
    "## `DataFrame groupby()`method\n",
    "#### The `groupby()` method collect identical data into groups and perform aggregate functions on the grouped data. See [Pandas groupby() Explained With Examples](https://sparkbyexamples.com/pandas/pandas-groupby-explained-with-examples/)\n",
    "#### For example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a976a706-fd49-4993-a49f-786a483755b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "technologies = (\n",
    "    {\n",
    "        'Courses':  [\"Spark\" ,\"PySpark\", \"Hadoop\", \"Python\",\n",
    "                     \"Pandas\", \"Hadoop\", \"Spark\", \"Python\", \"NA\"],\n",
    "        'Fee':      [22000, 25000, 23000, 24000, \n",
    "                     26000, 25000, 25000, 22000, 1500],\n",
    "        'Duration': ['30 days', '50 days', '55 days', '40 days',\n",
    "                     '60 days', '35 days', '30 days', '50 days',\n",
    "                     '40 days'],\n",
    "        'Discount': [1000, 2300, 1000, 1200, \n",
    "                     2500, None, 1400, 1600, 0]\n",
    "    }\n",
    ")\n",
    "\n",
    "df = pd.DataFrame(technologies)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d0d0db5-9fe4-4372-8e5a-ef4d242fb955",
   "metadata": {},
   "source": [
    "#### The `sum()` method calculates the sum of each group:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "582dc281-8c27-40f8-9377-2cda6534172b",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_1 = df.groupby(['Courses']).sum()\n",
    "df_1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22622955-9516-436b-8872-66f5a1cac4e3",
   "metadata": {},
   "source": [
    "#### We can group by multiple columns: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25e16ed2-873a-4442-b1c4-1988af705a0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_2 = df.groupby(['Courses', 'Duration']).sum()\n",
    "df_2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d0c47fb-78f8-42b2-9f17-61558d5ea449",
   "metadata": {},
   "source": [
    "## Box plots"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "484b16ba-f20f-4105-a859-142630e5e995",
   "metadata": {},
   "source": [
    "#### Use `groupby()` to see the distribution of magnitudes across the different measurement methods for earthquakes by making a box plot of each group:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73fa01e3-4a17-4a3f-ac04-2546c551919a",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes[['mag', 'magType']].groupby('magType').boxplot(\n",
    "    figsize=(15, 8), subplots=False\n",
    ")\n",
    "\n",
    "plt.title('Earthquake Magnitude Box Plots by magType')\n",
    "plt.ylabel('magnitude')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89717940-a58d-470f-9e96-0c8c3fdeb3b7",
   "metadata": {},
   "source": [
    "## Bar charts"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "accb0f6f-4069-4910-b4b1-7afb557663d3",
   "metadata": {},
   "source": [
    "#### Dataframe method `value_counts()` returns a `Series` object containing counts in descending order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fbb69b28-832b-496a-858f-8a5058d87515",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes.parsed_place.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "88d607b0-0a68-4076-80ce-1a471c5f1678",
   "metadata": {},
   "source": [
    "#### Pass keyword argument `kind=bar` for vertical bars and `kind=barh` for horizontal bars."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "32662d8b-bd50-4ae8-b010-a625b6734791",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes.magType.value_counts().plot(\n",
    "    kind='bar', \n",
    "    title='Earthquakes Recorded per magType', \n",
    "    rot=0\n",
    ")\n",
    "\n",
    "plt.xlabel('magType')\n",
    "plt.ylabel('earthquakes')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5f2cdc2-efe1-4900-8eca-6300c3f89f50",
   "metadata": {},
   "source": [
    "### Plot the top 15 values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e74c781c-dbad-4ad3-9162-3db9e20b1f81",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes.parsed_place.value_counts().iloc[14]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8f5daf15-87c9-44db-b6d6-43859c926423",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes.parsed_place.value_counts().iloc[14::-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f95d2596-df50-41dc-8fee-5b504874376d",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes.parsed_place.value_counts().iloc[14::-1,].plot(\n",
    "    kind='barh', figsize=(10, 8),\n",
    "    title='Top 15 Places for Earthquakes '\n",
    "          '(September 18, 2018 - October 13, 2018)'\n",
    ")\n",
    "\n",
    "plt.xlabel('earthquakes')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c8ff841-7b9f-46fc-9b25-ef62dead6a11",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes.groupby('parsed_place').tsunami.sum() \\\n",
    "    .sort_values()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7bda6fb1-f40a-4935-a33e-77300b3b5a0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes.groupby('parsed_place').tsunami.sum() \\\n",
    "    .sort_values().iloc[-10:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ecb076c-34d9-49f8-b24a-07209a3769e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes.groupby('parsed_place').tsunami.sum() \\\n",
    "    .sort_values().iloc[-10:].plot(\n",
    "        kind='barh', figsize=(10,6), \n",
    "        title='Top 10 Places for Tsunamis '\n",
    "              '(September 18, 2018 - October 13, 2018)'\n",
    ")\n",
    "\n",
    "plt.xlabel('tsunamis')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e2c2da5-8db8-4eb8-896a-c8b5a7c6996f",
   "metadata": {},
   "source": [
    "## Lambda function"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e54dc33d-2eaa-4f5f-8dc0-43833b895831",
   "metadata": {},
   "source": [
    "#### A ***lambda function*** is an unnamed function that is applied to each element of a series. For example, the lambda function below converts each time value, represented by parameter `x`.\n",
    "#### To display all the rows of a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35bd12bd-da41-4b92-86cd-14f4741b6d29",
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option('display.max_rows', None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ed5a5ce-18fc-4892-9055-8ced5b317e55",
   "metadata": {},
   "outputs": [],
   "source": [
    "quakes.query('parsed_place == \"Indonesia\"') \\\n",
    "    .assign(\n",
    "        time=lambda x: pd.to_datetime(x.time, unit='ms'),\n",
    "        earthquake=1\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f9d6fde-406c-433f-833a-395d39557114",
   "metadata": {},
   "source": [
    "## More bar charts"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86ad8cdb-bf23-4033-b932-5e6d93066c5a",
   "metadata": {},
   "source": [
    "#### The `resample()` function for time series data rearranges the data by a different time unit (daily in the example below) for frequency calculations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b16ec39-98e5-43f5-a0a3-cb90a8db98d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "indonesia_quakes = quakes.query('parsed_place == \"Indonesia\"') \\\n",
    "    .assign(\n",
    "        time=lambda x: pd.to_datetime(x.time, unit='ms'),\n",
    "        earthquake=1\n",
    "    ).set_index('time').resample('1D').sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a659e07c-132f-4954-8392-08b38bf5032c",
   "metadata": {},
   "source": [
    "#### Format the datetimes in the index for the x-axis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3cf0aaba-dcd7-4f4b-a07e-07b72e7f8383",
   "metadata": {},
   "outputs": [],
   "source": [
    "indonesia_quakes.index = indonesia_quakes.index.strftime('%b\\n%d')\n",
    "indonesia_quakes.index"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54f0bd38-0bbd-42c7-a3f7-274c8309a5f8",
   "metadata": {},
   "source": [
    "#### Plot the number of earthquakes and tsunamis together, day by day."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14b3a612-4b04-45b5-9acb-1f2cc5065d4f",
   "metadata": {},
   "outputs": [],
   "source": [
    "indonesia_quakes.plot(\n",
    "    y=['earthquake', 'tsunami'], kind='bar', figsize=(15, 7), \n",
    "    rot=0, label=['earthquakes', 'tsunamis'], \n",
    "    title='Earthquakes and Tsunamis in Indonesia '\n",
    "          '(September 18, 2018 - October 13, 2018)'\n",
    ")\n",
    "\n",
    "plt.xlabel('date')\n",
    "plt.ylabel('count')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2cbf710a-118e-43d2-8cde-ee9d6242cb00",
   "metadata": {},
   "source": [
    "## Stacked bars"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b94c667-c7a4-4a61-adcb-c28de0cdac8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "pivoted_table = quakes.assign(\n",
    "    mag_bin=lambda x: np.floor(x.mag)\n",
    ").pivot_table(\n",
    "    index='mag_bin', columns='magType', \n",
    "    values='mag', aggfunc='count'\n",
    ")\n",
    "\n",
    "pivoted_table.plot.bar(\n",
    "    stacked=True, rot=0, ylabel='earthquakes', \n",
    "    title='Earthquakes by integer magnitude and magType'\n",
    ")\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3ade5c2b-baf0-4334-885c-ae11a7b8f983",
   "metadata": {},
   "outputs": [],
   "source": [
    "normalized_pivoted_table = pivoted_table.fillna(0) \\\n",
    "    .apply(lambda x: x/x.sum(), axis=1)\n",
    "\n",
    "ax = normalized_pivoted_table.plot.bar(\n",
    "    stacked=True, rot=0, figsize=(10, 5),\n",
    "    title='Percentage of earthquakes by integer magnitude for each magType'\n",
    ")\n",
    "\n",
    "# Move legend to the right of the plot.\n",
    "ax.legend(bbox_to_anchor=(1, 0.8)) \n",
    "plt.ylabel('percentage')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ff24c8b-e331-4153-aa76-83a669322819",
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b03ea2df-b878-4ef0-85af-331d2830da73",
   "metadata": {},
   "source": [
    "#### Adapted from ***Hands-On Data Analysis with Pandas, second edition***, by Stephanie Molin, Packt 2021, ISBN 978-1-80056-345-2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a58347b5-17a5-4ce8-87c6-126d0da8f8bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Additional material (c) 2024 by Ronald Mak"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}