{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <center>San Jose State University<br>Department of Applied Data Science<br><br>**DATA 200<br>Computational Programming for Data Analytics**<br><br>Spring 2024<br>Instructor: Ron Mak</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8.12.1 The `re` Module and Function `fullmatch()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### An important task of machine learning is extracting critcal data from text that we want to analyze. We discover a common _pattern_ in the critical data, and we use that pattern to search the text for the data. The Python `re` module supports **regular expressions**. A regular expression is a string that represents a pattern, and it contains special **metacharacters** that enable powerful searches. Regular expressions allow you to extract data from unstructured text such as social media posts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Matching Literal Characters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### We can use the regular expression function `fullmatch()` to test whether the _entire string_ value of its second argument matches the value of the regular expression pattern  of its first argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = '02215'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(pattern, '02215') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(pattern, '51220') else 'No match'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Metacharacters, Character Classes and Quantifiers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Regular expressions typically contain characters that are treated as metacharacters:\n",
    "| Regular Expression Metacharacters |\n",
    "| :-: |\n",
    "|`[]  {}  ()  \\  *  +  -  ^  $  ?  .` |\n",
    "#### For example, in a regular expression, `\\d{5}` is a pattern. The `\\d` is is a **character class** that represents a _single_ digit character. The **quantifier** `{5}` says to match five consecutive digits. It's shorthand for `\\d\\d\\d\\d\\d`.\n",
    "#### An important metacharacter is the dot `.` which matches _any_ single character.\n",
    "#### If a metacharacter must appear in a regular expression as itself and not as a metacharacter, quote the character with `\\` (back slash). For example, `\\.` can only match a period or a decimal point in a regular expression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Valid' if re.fullmatch(r'\\d{5}', '02215') else 'Invalid'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "'Valid' if re.fullmatch(r'\\d{5}', '9876') else 'Invalid'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other Predefined Character Classes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Character class | Matches ... |\n",
    "| :-: | --- |\n",
    "| `\\d` | Any digit character `'0'` through `'9'`. |\n",
    "| `\\D` | Any character that is _not_ a digit. |\n",
    "| `\\s` | Any whitespace character such as spaces, tabs, and newlines. |\n",
    "| `\\S` | Any character that is _not_ a whitespace character. |\n",
    "| `\\w` | Any word (alphanumeric) character, including `_`. |\n",
    "| `\\W` | Any character that is _not_ a word character."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Custom Character Classes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Square brackets `[]` define a **custom character class** that matches a _single_ character. Examples:\n",
    "- `[aeiou]` matches any lower-case vowel letter.\n",
    "- `[aeiouAEIOU]` matches any lower- or upper-case vowel letter.\n",
    "- `[A-Z]` matches any upper-case letter.\n",
    "- `[a-z]` matches any lower-case letter.\n",
    "- `[a-zA-Z_]` matches any lower- or upper-case letter or the underscore.\n",
    "#### The quantifier `*` says to match _zero or more_ occurrences of the preceding subpattern. Therefore, \n",
    "```\n",
    "[A-Z][a-z]*\n",
    "```\n",
    "#### matches an upper-case letter followed by zero or more lower-case letters. `.*` matches a run of any combination of zero or more characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Valid' if re.fullmatch('[A-Z][a-z]*', 'Wally') else 'Invalid'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Valid' if re.fullmatch('[A-Z][a-z]*', 'eva') else 'Invalid'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### When a custom character class starts with the caret `^` metacharacter, it says to match a character that is _not_ in the class. For example, `[^a-z]` matches any character that is _not_ a lower-case letter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch('[^a-z]', 'A') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch('[^a-z]', 'a') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch('[*+$]', '*') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch('[*+$]', '!') else 'No match'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### * vs. + Quantifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The quantifier `+` says to match _at least one_ occurrence of the preceding subpattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Valid' if re.fullmatch('[A-Z][a-z]+', 'Wally') else 'Invalid'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Valid' if re.fullmatch('[A-Z][a-z]+', 'E') else 'Invalid'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Greedy Quantifiers\n",
    "#### Both `*` and `+` are **greedy**. They specify matching as many characters as possible -- the longest possible substring."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(r'[A-Z]\\w*', 'BethAnn') else 'No match'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other Quantifiers\n",
    "#### Quantifier `?` matches _zero or one_ occurences of the preceding subpattern. In other words, it means \"optional\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch('labell?ed', 'labelled') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch('labell?ed', 'labeled') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch('labell?ed', 'labellled') else 'No match'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The quantifier `{`_n_`,}` says to match _at least n occurrences_."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(r'\\d{3,}', '123') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(r'\\d{3,}', '1234567890') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(r'\\d{3,}', '12') else 'No match'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The quantifier `{`_n_`,`_m_`}` says to match _between n and m occurrences_, inclusive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(r'\\d{3,6}', '123') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(r'\\d{3,6}', '123456') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(r'\\d{3,6}', '1234567') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'Match' if re.fullmatch(r'\\d{3,6}', '12') else 'No match'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "##########################################################################\n",
    "# (C) Copyright 2019 by Deitel & Associates, Inc. and                    #\n",
    "# Pearson Education, Inc. All Rights Reserved.                           #\n",
    "#                                                                        #\n",
    "# DISCLAIMER: The authors and publisher of this book have used their     #\n",
    "# best efforts in preparing the book. These efforts include the          #\n",
    "# development, research, and testing of the theories and programs        #\n",
    "# to determine their effectiveness. The authors and publisher make       #\n",
    "# no warranty of any kind, expressed or implied, with regard to these    #\n",
    "# programs or to the documentation contained in these books. The authors #\n",
    "# and publisher shall not be liable in any event for incidental or       #\n",
    "# consequential damages in connection with, or arising out of, the       #\n",
    "# furnishing, performance, or use of these programs.                     #\n",
    "##########################################################################\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Additional material (C) Copyright 2024 by Ronald Mak"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}