{ "cells": [ { "cell_type": "markdown", "id": "042469fa-ab68-47f1-93dd-22d1160b8ac7", "metadata": {}, "source": [ "###
San Jose State University
Department of Applied Data Science

**DATA 200
Computational Programming for Data Analytics**

Spring 2024
Instructor: Ron Mak
" ] }, { "cell_type": "markdown", "id": "51ffc2fa-af0f-4ab5-bce5-e0fcc762b7a9", "metadata": {}, "source": [ "# `pandas`" ] }, { "cell_type": "markdown", "id": "c853e87b-fab2-446c-a986-cff52c57ccf7", "metadata": {}, "source": [ "#### `numpy` arrays are optimized for homogeneous numeric data that's accessed via integer indices. But \"Big Data\" applications must support mixed data types, custom indexing, missing data, and data that's not structured consistently.\n", "#### `pandas` is a third-party, open-source library (it is not part of the Python Standard Library and must be installed separately) that offers data structures and methods to manipulate different types of data. These are easy to use and highly optimized for performance." ] }, { "cell_type": "markdown", "id": "a8ef7426-bf4b-4884-95eb-ca87f43e0dda", "metadata": {}, "source": [ "#### `pandas` has two key data collections, `Series` and `DataFrame`. Both are built on `numpy` arrays. Many `Series` and `DataFrame` operations can take `numpy` arrays as arguments, and many `numpy` operations can take `Series` and `DataFrame` arguments." ] }, { "cell_type": "markdown", "id": "f376642f-1a5e-45d4-86d3-b7cf78c66d65", "metadata": {}, "source": [ "#### The original developer of `pandas` derived its name from \"panel data\" when he was working with data for measurements over time, such as stock prices and historical temperature readings." 
] }, { "cell_type": "markdown", "id": "7ee4a57e-4b08-47c8-85e3-89a9e31a2a33", "metadata": {}, "source": [ "## Advantages of `pandas` over `numpy`" ] }, { "cell_type": "markdown", "id": "35d7b998-b622-46d1-a788-abcdd46e34b2", "metadata": {}, "source": [ "#### **Higher level of abstraction.** `pandas` offers a simpler API for developers by abstracting away some of the complex concepts.\n", "#### **Less intuition.** Many `pandas` methods are very powerful yet require less intuition from developers.\n", "#### **Faster processing.** Some `DataFrame` operations can be much faster, depending on the data and their structure.\n", "#### **Designed for \"Big Data\".** A `DataFrame` is well suited to operating on large datasets that fit in memory." ] }, { "cell_type": "markdown", "id": "522db182-b5e9-4211-bc0a-c97218356dcb", "metadata": {}, "source": [ "## Disadvantages of `pandas` compared to `numpy`" ] }, { "cell_type": "markdown", "id": "da0ae4c8-c9c8-458f-9092-22f8011ff9f6", "metadata": {}, "source": [ "#### **Less applicable.** The higher level of abstraction can make `pandas` less applicable to some problems. Some operations can become more complex.\n", "#### **More memory and disk space.** `pandas` dataframes typically require more memory and disk space than `numpy` arrays.\n", "#### **Performance problems.** Heavy joins can cause performance and memory usage problems.\n", "#### **Hidden complexity.** The simple API can hide complexity from programmers and result in inefficient code." 
] }, { "cell_type": "code", "execution_count": null, "id": "b4630b19-b1ed-45d5-848b-a6c94372f01a", "metadata": {}, "outputs": [], "source": [ "# (C) Copyright 2023 by Ronald Mak" ] }, { "cell_type": "code", "execution_count": null, "id": "3affb551-5ad6-42bf-818a-d1faeb420d3f", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }