{ "cells": [ { "cell_type": "markdown", "id": "eb60459c-b47e-4599-a0d5-5154c40bdee9", "metadata": {}, "source": [ "###
San Jose State University
Department of Applied Data Science

**DATA 200
Computational Programming for Data Analytics**

Spring 2024
Instructor: Ron Mak
" ] }, { "cell_type": "markdown", "id": "3c63b741-cb26-4340-9ca0-d8b89ec8a397", "metadata": {}, "source": [ "# Plot earthquake data with the `pandas` module" ] }, { "cell_type": "code", "execution_count": null, "id": "a65c9883-8c96-4da4-a206-e1b142cb270e", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "id": "7d4054aa-4404-449f-97ef-1875e47fa2f9", "metadata": {}, "outputs": [], "source": [ "quakes = pd.read_csv('earthquakes.csv')\n", "quakes.head()" ] }, { "cell_type": "markdown", "id": "f4c7f0ea-9c73-4b37-95cf-bcdcce174e35", "metadata": {}, "source": [ "## Cumulative distribution function (CDF)" ] }, { "cell_type": "markdown", "id": "442fb435-f41d-425c-8b62-50956c13d44c", "metadata": {}, "source": [ "#### Suppose that we want to know the probability of a value of interest being less than or equal to a particular value. This is the **cumulative disribution function (CDF)**. Using the `statsmodels` package, we can estimate the CDF giving us the **empirical cumulative distribution function (ECDF)**:" ] }, { "cell_type": "code", "execution_count": null, "id": "3e22b4e8-aac5-4f77-b1ff-5d467624e58a", "metadata": {}, "outputs": [], "source": [ "from statsmodels.distributions.empirical_distribution import ECDF\n", "\n", "ecdf = ECDF(quakes.query('magType == \"ml\"').mag)\n", "plt.plot(ecdf.x, ecdf.y)\n", "\n", "plt.xlabel('mag')\n", "plt.ylabel('cumulative probability')\n", "plt.title('ECDF of earthquake magnitude with magType ml')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "25d26686-af7a-4947-ac67-adfaa6fc6046", "metadata": {}, "source": [ "#### For example, can make it clear that the probability of getting an earthquake with magnitude less than or equal to 3 using the `ml` scale is 98%:" ] }, { "cell_type": "code", "execution_count": null, "id": "cce670b7-bbe1-4489-a59f-b31f49531df3", "metadata": {}, "outputs": [], "source": [ "from statsmodels.distributions.empirical_distribution import ECDF\n", "\n", "ecdf = ECDF(quakes.query('magType == \"ml\"').mag)\n", "plt.plot(ecdf.x, ecdf.y)\n", "\n", "plt.xlabel('mag')\n", "plt.ylabel('cumulative probability')\n", "\n", "# Add reference lines for interpreting the ECDF for mag <= 3.\n", "plt.plot(\n", " [3, 3], [0, .98], '--k', \n", " [-1.5, 3], [0.98, 0.98], '--k', alpha=0.4\n", ")\n", "\n", "# Set axis ranges.\n", "plt.ylim(0, None)\n", "plt.xlim(-1.25, None)\n", "\n", "plt.title('P(mag <= 3) = 98%')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "02a574d1-502b-4756-bab5-2f22cf8ead8d", "metadata": {}, "source": [ "## `DataFrame groupby()`method\n", "#### The `groupby()` method collect identical data into groups and perform aggregate functions on the grouped data. See [Pandas groupby() Explained With Examples](https://sparkbyexamples.com/pandas/pandas-groupby-explained-with-examples/)\n", "#### For example:" ] }, { "cell_type": "code", "execution_count": null, "id": "a976a706-fd49-4993-a49f-786a483755b7", "metadata": {}, "outputs": [], "source": [ "technologies = (\n", " {\n", " 'Courses': [\"Spark\" ,\"PySpark\", \"Hadoop\", \"Python\",\n", " \"Pandas\", \"Hadoop\", \"Spark\", \"Python\", \"NA\"],\n", " 'Fee': [22000, 25000, 23000, 24000, \n", " 26000, 25000, 25000, 22000, 1500],\n", " 'Duration': ['30 days', '50 days', '55 days', '40 days',\n", " '60 days', '35 days', '30 days', '50 days',\n", " '40 days'],\n", " 'Discount': [1000, 2300, 1000, 1200, \n", " 2500, None, 1400, 1600, 0]\n", " }\n", ")\n", "\n", "df = pd.DataFrame(technologies)\n", "df" ] }, { "cell_type": "markdown", "id": "5d0d0db5-9fe4-4372-8e5a-ef4d242fb955", "metadata": {}, "source": [ "#### The `sum()` method calculates the sum of each group:" ] }, { "cell_type": "code", "execution_count": null, "id": "582dc281-8c27-40f8-9377-2cda6534172b", "metadata": {}, "outputs": [], "source": [ "df_1 = df.groupby(['Courses']).sum()\n", "df_1" ] }, { "cell_type": "markdown", "id": "22622955-9516-436b-8872-66f5a1cac4e3", "metadata": {}, "source": [ "#### We can group by multiple columns: " ] }, { "cell_type": "code", "execution_count": null, "id": "25e16ed2-873a-4442-b1c4-1988af705a0e", "metadata": {}, "outputs": [], "source": [ "df_2 = df.groupby(['Courses', 'Duration']).sum()\n", "df_2" ] }, { "cell_type": "markdown", "id": "4d0c47fb-78f8-42b2-9f17-61558d5ea449", "metadata": {}, "source": [ "## Box plots" ] }, { "cell_type": "markdown", "id": "484b16ba-f20f-4105-a859-142630e5e995", "metadata": {}, "source": [ "#### Use `groupby()` to see the distribution of magnitudes across the different measurement methods for earthquakes by making a box plot of each group:" ] }, { "cell_type": "code", "execution_count": null, "id": "73fa01e3-4a17-4a3f-ac04-2546c551919a", "metadata": {}, "outputs": [], "source": [ "quakes[['mag', 'magType']].groupby('magType').boxplot(\n", " figsize=(15, 8), subplots=False\n", ")\n", "\n", "plt.title('Earthquake Magnitude Box Plots by magType')\n", "plt.ylabel('magnitude')" ] }, { "cell_type": "markdown", "id": "89717940-a58d-470f-9e96-0c8c3fdeb3b7", "metadata": {}, "source": [ "## Bar charts" ] }, { "cell_type": "markdown", "id": "accb0f6f-4069-4910-b4b1-7afb557663d3", "metadata": {}, "source": [ "#### Dataframe method `value_counts()` returns a `Series` object containing counts in descending order." ] }, { "cell_type": "code", "execution_count": null, "id": "fbb69b28-832b-496a-858f-8a5058d87515", "metadata": {}, "outputs": [], "source": [ "quakes.parsed_place.value_counts()" ] }, { "cell_type": "markdown", "id": "88d607b0-0a68-4076-80ce-1a471c5f1678", "metadata": {}, "source": [ "#### Pass keyword argument `kind=bar` for vertical bars and `kind=barh` for horizontal bars." ] }, { "cell_type": "code", "execution_count": null, "id": "32662d8b-bd50-4ae8-b010-a625b6734791", "metadata": {}, "outputs": [], "source": [ "quakes.magType.value_counts().plot(\n", " kind='bar', \n", " title='Earthquakes Recorded per magType', \n", " rot=0\n", ")\n", "\n", "plt.xlabel('magType')\n", "plt.ylabel('earthquakes')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e5f2cdc2-efe1-4900-8eca-6300c3f89f50", "metadata": {}, "source": [ "### Plot the top 15 values." ] }, { "cell_type": "code", "execution_count": null, "id": "e74c781c-dbad-4ad3-9162-3db9e20b1f81", "metadata": {}, "outputs": [], "source": [ "quakes.parsed_place.value_counts().iloc[14]" ] }, { "cell_type": "code", "execution_count": null, "id": "8f5daf15-87c9-44db-b6d6-43859c926423", "metadata": {}, "outputs": [], "source": [ "quakes.parsed_place.value_counts().iloc[14::-1]" ] }, { "cell_type": "code", "execution_count": null, "id": "f95d2596-df50-41dc-8fee-5b504874376d", "metadata": {}, "outputs": [], "source": [ "quakes.parsed_place.value_counts().iloc[14::-1,].plot(\n", " kind='barh', figsize=(10, 8),\n", " title='Top 15 Places for Earthquakes '\n", " '(September 18, 2018 - October 13, 2018)'\n", ")\n", "\n", "plt.xlabel('earthquakes')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "2c8ff841-7b9f-46fc-9b25-ef62dead6a11", "metadata": {}, "outputs": [], "source": [ "quakes.groupby('parsed_place').tsunami.sum() \\\n", " .sort_values()" ] }, { "cell_type": "code", "execution_count": null, "id": "7bda6fb1-f40a-4935-a33e-77300b3b5a0a", "metadata": {}, "outputs": [], "source": [ "quakes.groupby('parsed_place').tsunami.sum() \\\n", " .sort_values().iloc[-10:]" ] }, { "cell_type": "code", "execution_count": null, "id": "0ecb076c-34d9-49f8-b24a-07209a3769e0", "metadata": {}, "outputs": [], "source": [ "quakes.groupby('parsed_place').tsunami.sum() \\\n", " .sort_values().iloc[-10:].plot(\n", " kind='barh', figsize=(10,6), \n", " title='Top 10 Places for Tsunamis '\n", " '(September 18, 2018 - October 13, 2018)'\n", ")\n", "\n", "plt.xlabel('tsunamis')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "3e2c2da5-8db8-4eb8-896a-c8b5a7c6996f", "metadata": {}, "source": [ "## Lambda function" ] }, { "cell_type": "markdown", "id": "e54dc33d-2eaa-4f5f-8dc0-43833b895831", "metadata": {}, "source": [ "#### A ***lambda function*** is an unnamed function that is applied to each element of a series. For example, the lambda function below converts each time value, represented by parameter `x`.\n", "#### To display all the rows of a dataframe:" ] }, { "cell_type": "code", "execution_count": null, "id": "35bd12bd-da41-4b92-86cd-14f4741b6d29", "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_rows', None)" ] }, { "cell_type": "code", "execution_count": null, "id": "6ed5a5ce-18fc-4892-9055-8ced5b317e55", "metadata": {}, "outputs": [], "source": [ "quakes.query('parsed_place == \"Indonesia\"') \\\n", " .assign(\n", " time=lambda x: pd.to_datetime(x.time, unit='ms'),\n", " earthquake=1\n", " )" ] }, { "cell_type": "markdown", "id": "7f9d6fde-406c-433f-833a-395d39557114", "metadata": {}, "source": [ "## More bar charts" ] }, { "cell_type": "markdown", "id": "86ad8cdb-bf23-4033-b932-5e6d93066c5a", "metadata": {}, "source": [ "#### The `resample()` function for time series data rearranges the data by a different time unit (daily in the example below) for frequency calculations." ] }, { "cell_type": "code", "execution_count": null, "id": "3b16ec39-98e5-43f5-a0a3-cb90a8db98d5", "metadata": {}, "outputs": [], "source": [ "indonesia_quakes = quakes.query('parsed_place == \"Indonesia\"') \\\n", " .assign(\n", " time=lambda x: pd.to_datetime(x.time, unit='ms'),\n", " earthquake=1\n", " ).set_index('time').resample('1D').sum()" ] }, { "cell_type": "markdown", "id": "a659e07c-132f-4954-8392-08b38bf5032c", "metadata": {}, "source": [ "#### Format the datetimes in the index for the x-axis." ] }, { "cell_type": "code", "execution_count": null, "id": "3cf0aaba-dcd7-4f4b-a07e-07b72e7f8383", "metadata": {}, "outputs": [], "source": [ "indonesia_quakes.index = indonesia_quakes.index.strftime('%b\\n%d')\n", "indonesia_quakes.index" ] }, { "cell_type": "markdown", "id": "54f0bd38-0bbd-42c7-a3f7-274c8309a5f8", "metadata": {}, "source": [ "#### Plot the number of earthquakes and tsunamis together, day by day." ] }, { "cell_type": "code", "execution_count": null, "id": "14b3a612-4b04-45b5-9acb-1f2cc5065d4f", "metadata": {}, "outputs": [], "source": [ "indonesia_quakes.plot(\n", " y=['earthquake', 'tsunami'], kind='bar', figsize=(15, 7), \n", " rot=0, label=['earthquakes', 'tsunamis'], \n", " title='Earthquakes and Tsunamis in Indonesia '\n", " '(September 18, 2018 - October 13, 2018)'\n", ")\n", "\n", "plt.xlabel('date')\n", "plt.ylabel('count')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "2cbf710a-118e-43d2-8cde-ee9d6242cb00", "metadata": {}, "source": [ "## Stacked bars" ] }, { "cell_type": "code", "execution_count": null, "id": "4b94c667-c7a4-4a61-adcb-c28de0cdac8a", "metadata": {}, "outputs": [], "source": [ "pivoted_table = quakes.assign(\n", " mag_bin=lambda x: np.floor(x.mag)\n", ").pivot_table(\n", " index='mag_bin', columns='magType', \n", " values='mag', aggfunc='count'\n", ")\n", "\n", "pivoted_table.plot.bar(\n", " stacked=True, rot=0, ylabel='earthquakes', \n", " title='Earthquakes by integer magnitude and magType'\n", ")\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "3ade5c2b-baf0-4334-885c-ae11a7b8f983", "metadata": {}, "outputs": [], "source": [ "normalized_pivoted_table = pivoted_table.fillna(0) \\\n", " .apply(lambda x: x/x.sum(), axis=1)\n", "\n", "ax = normalized_pivoted_table.plot.bar(\n", " stacked=True, rot=0, figsize=(10, 5),\n", " title='Percentage of earthquakes by integer magnitude for each magType'\n", ")\n", "\n", "# Move legend to the right of the plot.\n", "ax.legend(bbox_to_anchor=(1, 0.8)) \n", "plt.ylabel('percentage')\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "7ff24c8b-e331-4153-aa76-83a669322819", "metadata": {}, "outputs": [], "source": [ "plt.close()" ] }, { "cell_type": "markdown", "id": "b03ea2df-b878-4ef0-85af-331d2830da73", "metadata": {}, "source": [ "#### Adapted from ***Hands-On Data Analysis with Pandas, second edition***, by Stephanie Molin, Packt 2021, ISBN 978-1-80056-345-2" ] }, { "cell_type": "code", "execution_count": null, "id": "a58347b5-17a5-4ce8-87c6-126d0da8f8bd", "metadata": {}, "outputs": [], "source": [ "# Additional material (c) 2024 by Ronald Mak" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }