{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c88bbad0-7979-4dda-b499-391682c733ab",
   "metadata": {},
   "source": [
    "### <center>San Jose State University<br>Department of Applied Data Science<br><br>**DATA 200<br>Computational Programming for Data Analytics**<br><br>Spring 2024<br>Instructor: Ron Mak<br><br>**Assignment #7<br>SJSU Administrators and Professors**<br><br>Assigned: March 14, 2024<br>Due: March 21 at 5:30 PM<br><br>100 points maximum<br>Individual work only!</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69082b3f-00ee-4ba6-a5f1-210d3368d1e5",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "#### Surveys have shown that after gettng a problem to analyze, data analysts spend 80% of their time doing _data wrangling_ and only 20% doing actual _analysis_. This assignment will give you some experience with both."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8316d05-f8b4-4b53-813a-0d91b3892880",
   "metadata": {},
   "source": [
    "#### **The ability to extract useful data for analysis from unstructured text, such as social media posts, is an important skill for machine learning.**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47c9a7b0-c118-4c1e-b726-93ac622e85f3",
   "metadata": {},
   "source": [
    "## Data source and data wrangling"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72982af4-0574-4ac2-a390-fc221c859dcb",
   "metadata": {},
   "source": [
    "#### Your data source will be the SJSU webpage __[Faculty & Administration](https://catalog.sjsu.edu/content.php?catoid=14&navoid=5115)__</ul>."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa962336-9e92-4a2c-a769-e342ecc69afd",
   "metadata": {
    "tags": []
   },
   "source": [
    "#### Data wrangling activities:\n",
    "- #### Identify a source that will provide a useful dataset.\n",
    "- #### Access and download the dataset.\n",
    "- #### Clean the data, which can be done either manually before you load the data, or your program can do cleaning as it reads the data: \n",
    "    - #### Filter out irrelevant data.\n",
    "    - #### Handle with missing, incorrect, or outlier data (ignore that data?).\n",
    "    - #### Correct typos.\n",
    "- #### Format the data to suit your analysis program.\n",
    "- #### Load the data for analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "668c4652-91ec-4b3a-86a4-e6fededae05c",
   "metadata": {},
   "source": [
    "## Data analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aeaa6fa6-6571-474d-83b0-7f10a9aa5d00",
   "metadata": {},
   "source": [
    "#### In this assignment, you will analyze data to answer these questions about San Jose State University:\n",
    "- #### How many professors does the university have?\n",
    "- #### How many administrators (non-professor employees) does it have?\n",
    "- #### What are the different administrator titles and how many are there?\n",
    "- #### How many administrators share each title?\n",
    "- #### What is the ratio of professors to administrators?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "705e7c97-9e03-43de-ba1a-02c8b41fa23c",
   "metadata": {},
   "source": [
    "## Analysis procedure and results"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5e524b5-0518-4f4f-a63e-336a1ffe88d7",
   "metadata": {},
   "source": [
    "#### Copy and paste the text from the webpage into a plain text file. The first few lines should be:\n",
    "```\n",
    "A\n",
    "Abousalem, Mohamed - Vice President, Research and Innovation (2019)\n",
    "\n",
    "               BS, Alexandria University, Egypt\n",
    "\n",
    "               MBA, Santa Clara University\n",
    "\n",
    "               MS and PhD, University of Calgary, Canada\n",
    "\n",
    "Abramson, Tzvia - Professor, Biological Sciences (2005)\n",
    "\n",
    "               BS and PhD, Ben Gurion University, Israel\n",
    "\n",
    "Abri, Faranak - Assistant Professor, Computer Science (2022)\n",
    "\n",
    "               PhD, Texas Tech University\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7707e944-0f62-478a-b206-12a40cf4cb4d",
   "metadata": {},
   "source": [
    "#### Assuming you named the file `employees.txt`, the following Python code will read the file line by line. Each line will be a Python string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c93f1192-fdf5-40a4-8270-7cceb7ec4184",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('employees.txt', mode='r') as admins:\n",
    "    for line in admins:\n",
    "        # Do something more with the line\n",
    "        # than simply printing it.\n",
    "        print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf4716b7-a0e6-4310-ac71-cb0ac92156a0",
   "metadata": {},
   "source": [
    "#### The rest of your program should do all of the following:\n",
    "- #### Count the number of administrators and professors.\n",
    "- #### Extract all the administrator titles, such as \"Vice President, Research and Innovation\"\n",
    "- #### Tally how many administrators have each unique title.\n",
    "- #### Print a table that shows the titles in alphabetical order and the number of administrators who have each title.\n",
    "- #### At the end, print the number of professors, the number of administrators, and the ratio of professors to administrators."
   ]
  },
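  {
   "cell_type": "markdown",
   "id": "sketch-title-tally-md",
   "metadata": {},
   "source": [
    "#### As a minimal sketch of the tally, table, and ratio steps above, the next cell counts a small hard-coded list of titles with `collections.Counter` and prints them in alphabetical order. The names `sample_titles` and `professor_count`, the column width, and all sample values other than the first title are made-up assumptions for illustration. Your program should build its data from `employees.txt` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-title-tally-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Hypothetical sample data for illustration only.\n",
    "sample_titles = [\n",
    "    'Vice President, Research and Innovation',\n",
    "    'Dean, College of Science',\n",
    "    'Associate Dean',\n",
    "    'Dean, College of Science',\n",
    "]\n",
    "professor_count = 3  # made-up number\n",
    "\n",
    "# Counter maps each unique title to the number of times it occurs.\n",
    "title_counts = Counter(sample_titles)\n",
    "\n",
    "# Print the titles alphabetically with their counts.\n",
    "print('Title'.ljust(45) + 'Count'.rjust(5))\n",
    "for title in sorted(title_counts):\n",
    "    print(title.ljust(45) + str(title_counts[title]).rjust(5))\n",
    "\n",
    "# Every tallied title belongs to one administrator.\n",
    "admin_count = sum(title_counts.values())\n",
    "print()\n",
    "print(f'Professors: {professor_count}')\n",
    "print(f'Administrators: {admin_count}')\n",
    "print(f'Ratio: {professor_count / admin_count:.2f}')"
   ]
  },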
  {
   "cell_type": "markdown",
   "id": "e8fc510f-0ca8-4f70-b5b3-d5443eea9905",
   "metadata": {},
   "source": [
    "#### Use the `re` module and **regular expressions** in your program. You can also use other string operations."
   ]
  },
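  {
   "cell_type": "markdown",
   "id": "sketch-regex-match-md",
   "metadata": {},
   "source": [
    "#### A minimal sketch of one possible regular expression, tried on lines from the sample text shown earlier. The pattern, the names `EMPLOYEE_PATTERN` and `sample_lines`, and the rule that a title containing the word \"Professor\" marks a professor are all assumptions for illustration. The real file will likely contain lines that this pattern does not handle, so treat it as a starting point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-regex-match-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Assumed line layout: 'Last, First - Title (year)'.\n",
    "EMPLOYEE_PATTERN = re.compile(\n",
    "    r'^(?P<name>.+?)\\s+-\\s+(?P<title>.+?)\\s*\\((?P<year>\\d{4})\\)\\s*$'\n",
    ")\n",
    "\n",
    "sample_lines = [\n",
    "    'Abousalem, Mohamed - Vice President, Research and Innovation (2019)',\n",
    "    '               BS, Alexandria University, Egypt',\n",
    "    'Abramson, Tzvia - Professor, Biological Sciences (2005)',\n",
    "    'Abri, Faranak - Assistant Professor, Computer Science (2022)',\n",
    "]\n",
    "\n",
    "for line in sample_lines:\n",
    "    match = EMPLOYEE_PATTERN.match(line)\n",
    "    if match is None:\n",
    "        continue  # Not an employee line (for example, a degree line).\n",
    "    title = match.group('title')\n",
    "    # Assumed rule: the word 'Professor' in the title means a professor.\n",
    "    is_professor = re.search(r'\\bProfessor\\b', title) is not None\n",
    "    kind = 'professor' if is_professor else 'administrator'\n",
    "    print(f'{kind:<15}{title}')"
   ]
  },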
  {
   "cell_type": "markdown",
   "id": "2c844e46-ecdf-40ae-ae8f-e029e2ee38df",
   "metadata": {},
   "source": [
    "## Data cleaning"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6842156b-c34b-41f4-beb0-20fd0feff38d",
   "metadata": {},
   "source": [
    "#### Cleaning this data will be a challenge!\n",
    "- #### Is the data formatted consistently for your program to be able to extract the titles with your regular expressions?\n",
    "- #### Are there any (possibly invisible) non-ASCII characters embedded in the text that can trip up your program?\n",
    "- #### Are there any typos that can affect your results. For example, two titles may actually be the same but one has a misspelled word, or a line may be punctuated incorrectly.\n",
    "#### Your program should flag as errors any lines that don't match your regular expression. See what the errors are, and then modify your program to handle the errors as it reads the text. You may have to do some cleaning operations manually on the text file before your program can analyze it."
   ]
  },
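  {
   "cell_type": "markdown",
   "id": "sketch-non-ascii-check-md",
   "metadata": {},
   "source": [
    "#### A minimal sketch of one cleaning check, assuming a hypothetical helper named `find_non_ascii` and a made-up sample line. It reports each non-ASCII character with its position and Unicode name, then replaces two characters that often hide in copied web text (the no-break space and the en dash) with ASCII equivalents. Which characters actually appear in your file is something you need to discover; the `REPLACEMENTS` table is only an assumed starting point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-non-ascii-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import unicodedata\n",
    "\n",
    "def find_non_ascii(line):\n",
    "    '''Return (index, character, Unicode name) for each non-ASCII character.'''\n",
    "    return [(index, ch, unicodedata.name(ch, 'UNKNOWN'))\n",
    "            for index, ch in enumerate(line) if ord(ch) > 127]\n",
    "\n",
    "# Made-up line with a no-break space (U+00A0) and an en dash (U+2013).\n",
    "dirty = 'Doe,\\u00a0Jane \\u2013 Dean (2020)'\n",
    "\n",
    "for index, ch, name in find_non_ascii(dirty):\n",
    "    print(index, repr(ch), name)\n",
    "\n",
    "# Assumed replacements; extend this table as you discover more characters.\n",
    "REPLACEMENTS = str.maketrans({'\\u00a0': ' ', '\\u2013': '-'})\n",
    "clean = dirty.translate(REPLACEMENTS)\n",
    "print(clean)\n",
    "print(find_non_ascii(clean))  # an empty list means only ASCII remains"
   ]
  },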
  {
   "cell_type": "markdown",
   "id": "57f90532-382b-405d-96da-622c5a4bca8e",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Grading rubric"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f603810-3e1b-43e7-b75b-fa89dba53258",
   "metadata": {
    "tags": []
   },
   "source": [
    "| Criteria | Max points |\n",
    "| --- | :-: |\n",
    "| Textfile successfully read. | 5 |\n",
    "| Administrator titles successfully extracted. | 30 |\n",
    "| Use of regular expression(s). | 30 |\n",
    "| Frequencies of each title calculated. | 20 |\n",
    "| Alphabetized table of titles and frequencies. | 10 |\n",
    "| Count of titles, count of professors and administrators, and the professor-to-administrator ratio. | 5 |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75eea22d-3fba-4cf2-97fc-162e5b669b29",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
