{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "###
San Jose State University
Department of Applied Data Science

**DATA 200
Computational Programming for Data Analytics**

Spring 2024
Instructor: Ron Mak
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8.12.1 The `re` Module and Function `fullmatch()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### An important task in machine learning is extracting critical data from text that we want to analyze. We discover a common _pattern_ in the critical data, and we use that pattern to search the text for the data. The Python `re` module supports **regular expressions**. A regular expression is a string that represents a pattern, and it contains special **metacharacters** that enable powerful searches. Regular expressions allow you to extract data from unstructured text such as social media posts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matching Literal Characters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The `re` function `fullmatch()` tests whether the _entire string_ in its second argument matches the regular expression pattern in its first argument. It returns a match object if the string matches and `None` otherwise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pattern = '02215'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(pattern, '02215') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(pattern, '51220') else 'No match'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metacharacters, Character Classes and Quantifiers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Regular expressions typically contain characters that are treated as metacharacters:\n", "| Regular Expression Metacharacters |\n", "| :-: |\n", "|`[] {} () \\ * + - ^ $ ? 
.` |\n", "#### For example, in a regular expression, `\\d{5}` is a pattern. The `\\d` is a **character class** that represents a _single_ digit character. The **quantifier** `{5}` says to match five consecutive digits, so `\\d{5}` is shorthand for `\\d\\d\\d\\d\\d`.\n", "#### An important metacharacter is the dot `.` which matches _any_ single character except, by default, a newline.\n", "#### If a metacharacter must appear in a regular expression as itself and not as a metacharacter, escape the character with a preceding `\\` (backslash). For example, `\\.` matches only a literal period or decimal point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Valid' if re.fullmatch(r'\\d{5}', '02215') else 'Invalid'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "'Valid' if re.fullmatch(r'\\d{5}', '9876') else 'Invalid'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other Predefined Character Classes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Character class | Matches ... |\n", "| :-: | --- |\n", "| `\\d` | Any digit character `'0'` through `'9'`. |\n", "| `\\D` | Any character that is _not_ a digit. |\n", "| `\\s` | Any whitespace character such as spaces, tabs, and newlines. |\n", "| `\\S` | Any character that is _not_ a whitespace character. |\n", "| `\\w` | Any word (alphanumeric) character, including `_`. |\n", "| `\\W` | Any character that is _not_ a word character. |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Custom Character Classes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Square brackets `[]` define a **custom character class** that matches a _single_ character. 
Examples:\n", "- `[aeiou]` matches any lower-case vowel letter.\n", "- `[aeiouAEIOU]` matches any lower- or upper-case vowel letter.\n", "- `[A-Z]` matches any upper-case letter.\n", "- `[a-z]` matches any lower-case letter.\n", "- `[a-zA-Z_]` matches any lower- or upper-case letter or the underscore.\n", "#### The quantifier `*` says to match _zero or more_ occurrences of the preceding subpattern. Therefore, \n", "```\n", "[A-Z][a-z]*\n", "```\n", "#### matches an upper-case letter followed by zero or more lower-case letters. `.*` matches a run of any combination of zero or more characters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Valid' if re.fullmatch('[A-Z][a-z]*', 'Wally') else 'Invalid'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Valid' if re.fullmatch('[A-Z][a-z]*', 'eva') else 'Invalid'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### When a custom character class starts with the caret `^` metacharacter, it says to match a character that is _not_ in the class. For example, `[^a-z]` matches any character that is _not_ a lower-case letter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch('[^a-z]', 'A') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch('[^a-z]', 'a') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch('[*+$]', '*') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch('[*+$]', '!') else 'No match'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### * vs. 
+ Quantifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The quantifier `+` says to match _at least one_ occurrence of the preceding subpattern." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Valid' if re.fullmatch('[A-Z][a-z]+', 'Wally') else 'Invalid'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Valid' if re.fullmatch('[A-Z][a-z]+', 'E') else 'Invalid'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Greedy Quantifiers\n", "#### Both `*` and `+` are **greedy**. They match as many characters as possible -- the longest possible substring." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(r'[A-Z]\\w*', 'BethAnn') else 'No match'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other Quantifiers\n", "#### The quantifier `?` matches _zero or one_ occurrence of the preceding subpattern. In other words, it means \"optional\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch('labell?ed', 'labelled') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch('labell?ed', 'labeled') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch('labell?ed', 'labellled') else 'No match'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The quantifier `{`_n_`,}` says to match _at least n occurrences_ of the preceding subpattern."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(r'\\d{3,}', '123') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(r'\\d{3,}', '1234567890') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(r'\\d{3,}', '12') else 'No match'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The quantifier `{`_n_`,`_m_`}` says to match _between n and m occurrences_, inclusive." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(r'\\d{3,6}', '123') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(r'\\d{3,6}', '123456') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(r'\\d{3,6}', '1234567') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'Match' if re.fullmatch(r'\\d{3,6}', '12') else 'No match'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##########################################################################\n", "# (C) Copyright 2019 by Deitel & Associates, Inc. and #\n", "# Pearson Education, Inc. All Rights Reserved. #\n", "# #\n", "# DISCLAIMER: The authors and publisher of this book have used their #\n", "# best efforts in preparing the book. These efforts include the #\n", "# development, research, and testing of the theories and programs #\n", "# to determine their effectiveness. 
The authors and publisher make #\n", "# no warranty of any kind, expressed or implied, with regard to these #\n", "# programs or to the documentation contained in these books. The authors #\n", "# and publisher shall not be liable in any event for incidental or #\n", "# consequential damages in connection with, or arising out of, the #\n", "# furnishing, performance, or use of these programs. #\n", "##########################################################################\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Additional material (C) Copyright 2024 by Ronald Mak" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }