San Jose State University Department of Applied Data Science
**DATA 200 Computational Programming for Data Analytics**
Spring 2023 Instructor: Ron Mak
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8.12.1 The `re` Module and Function `fullmatch()`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### An important task in machine learning is extracting critical data from text that we want to analyze. We discover a common _pattern_ in the critical data, and we use that pattern to search the text for the data. The Python `re` module supports **regular expressions**. A regular expression is a string that represents a pattern, and it contains special **metacharacters** that enable powerful searches. Regular expressions allow you to extract data from unstructured text such as social media posts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Matching Literal Characters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The `re` module function `fullmatch()` tests whether the _entire_ string passed as its second argument matches the regular expression pattern passed as its first argument. It returns a match object if the string matches and `None` otherwise."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pattern = '02215'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch(pattern, '02215') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch(pattern, '51220') else 'No match'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Metacharacters, Character Classes and Quantifiers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Regular expressions typically contain characters that are treated as metacharacters:\n",
"| Regular Expression Metacharacters |\n",
"| :-: |\n",
"|`[] {} () \\ * + - ^ $ ? .` |\n",
"#### For example, `\\d{5}` is a pattern. The `\\d` is a **character class** that represents a _single_ digit character. The **quantifier** `{5}` says to match five consecutive digits. It's shorthand for `\\d\\d\\d\\d\\d`.\n",
"#### An important metacharacter is the dot `.`, which matches _any_ single character except, by default, a newline.\n",
"#### If a metacharacter must appear in a regular expression as itself and not as a metacharacter, escape the character with `\\` (backslash). For example, `\\.` matches only a literal period (or decimal point)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Valid' if re.fullmatch(r'\\d{5}', '02215') else 'Invalid'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"'Valid' if re.fullmatch(r'\\d{5}', '9876') else 'Invalid'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other Predefined Character Classes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Character class | Matches ... |\n",
"| :-: | --- |\n",
"| `\\d` | Any digit character `'0'` through `'9'`. |\n",
"| `\\D` | Any character that is _not_ a digit. |\n",
"| `\\s` | Any whitespace character such as spaces, tabs, and newlines. |\n",
"| `\\S` | Any character that is _not_ a whitespace character. |\n",
"| `\\w` | Any word (alphanumeric) character, including `_`. |\n",
"| `\\W` | Any character that is _not_ a word character. |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Custom Character Classes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Square brackets `[]` define a **custom character class** that matches a _single_ character. Examples:\n",
"- `[aeiou]` matches any lower-case vowel letter.\n",
"- `[aeiouAEIOU]` matches any lower- or upper-case vowel letter.\n",
"- `[A-Z]` matches any upper-case letter.\n",
"- `[a-z]` matches any lower-case letter.\n",
"- `[a-zA-Z_]` matches any lower- or upper-case letter or the underscore.\n",
"#### The quantifier `*` says to match _zero or more_ occurrences of the preceding subpattern. Therefore, `[A-Z][a-z]*` matches an upper-case letter followed by zero or more lower-case letters. Similarly, `.*` matches any run of zero or more characters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Valid' if re.fullmatch('[A-Z][a-z]*', 'Wally') else 'Invalid'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Valid' if re.fullmatch('[A-Z][a-z]*', 'eva') else 'Invalid'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### When a custom character class starts with the caret `^` metacharacter, it says to match a character that is _not_ in the class. For example, `[^a-z]` matches any character that is _not_ a lower-case letter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch('[^a-z]', 'A') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch('[^a-z]', 'a') else 'No match'"
]
},
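{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Inside a custom character class, metacharacters such as * + $ are treated as literal characters.\n",
"'Match' if re.fullmatch('[*+$]', '*') else 'No match'"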
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch('[*+$]', '!') else 'No match'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `*` vs. `+` Quantifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The quantifier `+` says to match _at least one_ occurrence of the preceding subpattern."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Valid' if re.fullmatch('[A-Z][a-z]+', 'Wally') else 'Invalid'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Valid' if re.fullmatch('[A-Z][a-z]+', 'E') else 'Invalid'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Greedy Quantifiers\n",
"#### Both `*` and `+` are **greedy**: each matches as many characters as possible, that is, the longest possible substring."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# [A-Z] matches 'B'; the greedy \\w* then consumes the rest of the string, 'ethAnn'.\n",
"'Match' if re.fullmatch(r'[A-Z]\\w*', 'BethAnn') else 'No match'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other Quantifiers\n",
"#### The quantifier `?` matches _zero or one_ occurrence of the preceding subpattern. In other words, it means \"optional\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch('labell?ed', 'labelled') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch('labell?ed', 'labeled') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch('labell?ed', 'labellled') else 'No match'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The quantifier `{`_n_`,}` says to match _at least n occurrences_."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch(r'\\d{3,}', '123') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch(r'\\d{3,}', '1234567890') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch(r'\\d{3,}', '12') else 'No match'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The quantifier `{`_n_`,`_m_`}` says to match _between n and m occurrences_, inclusive."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch(r'\\d{3,6}', '123') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch(r'\\d{3,6}', '123456') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch(r'\\d{3,6}', '1234567') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"'Match' if re.fullmatch(r'\\d{3,6}', '12') else 'No match'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"##########################################################################\n",
"# (C) Copyright 2019 by Deitel & Associates, Inc. and #\n",
"# Pearson Education, Inc. All Rights Reserved. #\n",
"# #\n",
"# DISCLAIMER: The authors and publisher of this book have used their #\n",
"# best efforts in preparing the book. These efforts include the #\n",
"# development, research, and testing of the theories and programs #\n",
"# to determine their effectiveness. The authors and publisher make #\n",
"# no warranty of any kind, expressed or implied, with regard to these #\n",
"# programs or to the documentation contained in these books. The authors #\n",
"# and publisher shall not be liable in any event for incidental or #\n",
"# consequential damages in connection with, or arising out of, the #\n",
"# furnishing, performance, or use of these programs. #\n",
"##########################################################################\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Additional material (C) Copyright 2023 by Ronald Mak"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 4
}