San Jose State University Department of Applied Data Science
**DATA 200 Computational Programming for Data Analytics**
Spring 2024 Instructor: Ron Mak
"
]
},
{
"cell_type": "markdown",
"id": "51ffc2fa-af0f-4ab5-bce5-e0fcc762b7a9",
"metadata": {},
"source": [
"# `pandas`"
]
},
{
"cell_type": "markdown",
"id": "c853e87b-fab2-446c-a986-cff52c57ccf7",
"metadata": {},
"source": [
"#### `numpy` arrays are optimized for homogeneous numeric data that's accessed via integer indices. But \"Big Data\" applications must support mixed datatypes, custom indexing, missing data, and data that's not structured consistently.\n",
"#### `pandas` is a third-party, open-source package (it is not part of the Python Standard Library and must be installed separately) that offers data structures and methods to manipulate different types of data. They are easy to use and highly optimized for performance."
]
},
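{
"cell_type": "markdown",
"id": "sketch-mixed-data-note",
"metadata": {},
"source": [
"#### The following is a minimal sketch of those requirements, not a definitive recipe. The sample values and the names `temps` and `people` are made up for illustration. It uses the two `pandas` collections, `Series` and `DataFrame`, to show custom index labels, a missing value, and columns with different datatypes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-mixed-data-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Illustrative data only: a Series with custom string labels and one missing value.\n",
"temps = pd.Series([68.5, np.nan, 72.1], index=['Mon', 'Tue', 'Wed'])\n",
"\n",
"print(temps['Wed'])        # access by label instead of by integer position\n",
"print(temps.isna().sum())  # count the missing values\n",
"print(temps.mean())        # missing values are skipped by default\n",
"\n",
"# Illustrative data only: a DataFrame whose columns hold different datatypes.\n",
"people = pd.DataFrame({'name': ['Ann', 'Bob'],\n",
"                       'age': [30, 25],\n",
"                       'member': [True, False]})\n",
"print(people.dtypes)       # one datatype per column"
]
},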
{
"cell_type": "markdown",
"id": "a8ef7426-bf4b-4884-95eb-ca87f43e0dda",
"metadata": {},
"source": [
"#### `pandas` has two key data collections, `Series` and `DataFrame`. Both are based on `numpy` arrays. Many `Series` and `DataFrame` operations can take `numpy` arrays as arguments, and many `numpy` operations can take `Series` and `DataFrame` arguments."
]
},
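{
"cell_type": "markdown",
"id": "sketch-numpy-interop-note",
"metadata": {},
"source": [
"#### As a small sketch of this interoperability (the name `s` and its values are made up for illustration): a `numpy` function applied to a `Series` returns a `Series` with the same index, and a `Series` method can take a `numpy` array as its argument."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-numpy-interop-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Illustrative data only.\n",
"s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])\n",
"\n",
"# A numpy function applied to a Series returns a Series and keeps the index.\n",
"roots = np.sqrt(s)\n",
"print(roots)\n",
"\n",
"# A Series method that takes a numpy array of the same length as its argument.\n",
"total = s.add(np.array([10.0, 20.0, 30.0]))\n",
"print(total)\n",
"\n",
"# The values of a Series are available as a numpy array.\n",
"print(type(s.to_numpy()))"
]
},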
{
"cell_type": "markdown",
"id": "f376642f-1a5e-45d4-86d3-b7cf78c66d65",
"metadata": {},
"source": [
"#### The original developer of `pandas` derived its name from \"panel data\" when he was working with data for measurements over time, such as stock prices and historical temperature readings."
]
},
{
"cell_type": "markdown",
"id": "7ee4a57e-4b08-47c8-85e3-89a9e31a2a33",
"metadata": {},
"source": [
"## Advantages of `pandas` over `numpy`"
]
},
{
"cell_type": "markdown",
"id": "35d7b998-b622-46d1-a788-abcdd46e34b2",
"metadata": {},
"source": [
"#### **Higher level of abstraction.** `pandas` offers a simpler API for developers by abstracting away some of the complex concepts.\n",
"#### **Less intuition required.** Many `pandas` methods are very powerful yet demand less intuition from developers.\n",
"#### **Faster processing.** Some `DataFrame` operations can be much faster, depending on the data and their structure.\n",
"#### **Designed for \"Big Data\".** A `DataFrame` is ideal for operating on large datasets."
]
},
{
"cell_type": "markdown",
"id": "522db182-b5e9-4211-bc0a-c97218356dcb",
"metadata": {},
"source": [
"## Disadvantages of `pandas` compared to `numpy`"
]
},
{
"cell_type": "markdown",
"id": "da0ae4c8-c9c8-458f-9092-22f8011ff9f6",
"metadata": {},
"source": [
"#### **Less applicable.** The higher level of abstraction can make `pandas` less applicable. Some operations can become more complex.\n",
"#### **More memory and disk space.** `pandas` dataframes require more memory and disk space than `numpy` arrays.\n",
"#### **Performance problems.** Heavy joins can cause performance and memory usage problems.\n",
"#### **Hidden complexity.** The simple API can hide complexity from programmers and result in inefficient code."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b4630b19-b1ed-45d5-848b-a6c94372f01a",
"metadata": {},
"outputs": [],
"source": [
"# (C) Copyright 2023 by Ronald Mak"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3affb551-5ad6-42bf-818a-d1faeb420d3f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}