### <center>San Jose State University<br>Department of Applied Data Science<br><br>**DATA 200<br>Computational Programming for Data Analytics**<br><br>Spring 2024<br>Instructor: Ron Mak</center>

## 8.12.1 The `re` Module and Function `fullmatch()`

#### An important task of machine learning is extracting critical data from text that we want to analyze. We discover a common _pattern_ in the critical data, and we use that pattern to search the text for the data. The Python `re` module supports **regular expressions**. A regular expression is a string that represents a pattern, and it contains special **metacharacters** that enable powerful searches. Regular expressions allow you to extract data from unstructured text such as social media posts.

In [None]:
import re

### Matching Literal Characters

#### We can use the regular expression function `fullmatch()` to test whether the _entire_ string passed as its second argument matches the regular expression pattern passed as its first argument.

In [None]:
pattern = '02215'

In [None]:
'Match' if re.fullmatch(pattern, '02215') else 'No match'

In [None]:
'Match' if re.fullmatch(pattern, '51220') else 'No match'
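
#### The conditional expressions work because `fullmatch()` returns a match object, which is truthy, when the whole string matches and `None` otherwise. The following is a minimal sketch of that behavior; the variable names `hit` and `miss` and the longer test string are made up for illustration.

```python
import re

pattern = '02215'

# On success, fullmatch() returns a match object, which is truthy.
hit = re.fullmatch(pattern, '02215')

# On failure it returns None. The entire string must match, so the
# extra characters after '02215' make this call fail.
miss = re.fullmatch(pattern, '02215-1234')

print(hit is not None, miss)
```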

### Metacharacters, Character Classes and Quantifiers

#### Regular expressions typically contain characters that are treated as metacharacters:
| Regular Expression Metacharacters |
| :-: |
|`[]  {}  ()  \  *  +  -  ^  $  ?  .` |
#### For example, in a regular expression, `\d{5}` is a pattern. The `\d` is a **character class** that represents a _single_ digit character. The **quantifier** `{5}` says to match five consecutive digits. It's shorthand for `\d\d\d\d\d`.
#### An important metacharacter is the dot `.` which matches _any_ single character.
#### If a metacharacter must appear in a regular expression as itself and not as a metacharacter, escape the character with `\` (backslash). For example, `\.` matches only a literal period (or decimal point), not any character.

In [None]:
'Valid' if re.fullmatch(r'\d{5}', '02215') else 'Invalid'

In [None]:
'Valid' if re.fullmatch(r'\d{5}', '9876') else 'Invalid'
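
#### The difference between the dot metacharacter and an escaped dot can be checked directly. This is an illustrative sketch; the test strings `'3x5'` and `'3.5'` and the variable names are assumptions chosen for the example.

```python
import re

# An unescaped dot matches any single character
# (by default, any character except a newline).
dot_letter = bool(re.fullmatch(r'\d.\d', '3x5'))
dot_period = bool(re.fullmatch(r'\d.\d', '3.5'))

# An escaped dot matches only a literal period.
escaped_letter = bool(re.fullmatch(r'\d\.\d', '3x5'))
escaped_period = bool(re.fullmatch(r'\d\.\d', '3.5'))

print(dot_letter, dot_period, escaped_letter, escaped_period)
```

#### Only the escaped pattern rejects `'3x5'`, which is why decimal points in a pattern are normally written as `\.`.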

### Other Predefined Character Classes

| Character class | Matches ... |
| :-: | --- |
| `\d` | Any digit character `'0'` through `'9'`. |
| `\D` | Any character that is _not_ a digit. |
| `\s` | Any whitespace character such as spaces, tabs, and newlines. |
| `\S` | Any character that is _not_ a whitespace character. |
| `\w` | Any word (alphanumeric) character, including `_`. |
| `\W` | Any character that is _not_ a word character. |
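
#### Each class in the table can be tried against a single character. The sketch below pairs every class with one sample character that it should match; the sample characters and the names `samples`, `results`, and `both` are chosen only for illustration.

```python
import re

# One sample character per predefined character class.
samples = {
    r'\d': '7',    # digit
    r'\D': 'x',    # not a digit
    r'\s': '\t',   # whitespace (a tab)
    r'\S': 'x',    # not whitespace
    r'\w': '_',    # word character (the underscore counts)
    r'\W': '!',    # not a word character
}

results = {cls: bool(re.fullmatch(cls, ch)) for cls, ch in samples.items()}
print(results)

# A class and its uppercase complement never match the same character.
both = [ch for ch in 'a7_ !\t'
        if re.fullmatch(r'\w', ch) and re.fullmatch(r'\W', ch)]
print(both)
```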

### Custom Character Classes

#### Square brackets `[]` define a **custom character class** that matches a _single_ character. Examples:
- `[aeiou]` matches any lower-case vowel letter.
- `[aeiouAEIOU]` matches any lower- or upper-case vowel letter.
- `[A-Z]` matches any upper-case letter.
- `[a-z]` matches any lower-case letter.
- `[a-zA-Z_]` matches any lower- or upper-case letter or the underscore.
#### The quantifier `*` says to match _zero or more_ occurrences of the preceding subpattern. Therefore, 
```
[A-Z][a-z]*
```
#### matches an upper-case letter followed by zero or more lower-case letters. `.*` matches a run of any combination of zero or more characters.

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]*', 'Wally') else 'Invalid'

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]*', 'eva') else 'Invalid'
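
#### The `.*` subpattern is often combined with literal text to say "anything may appear here". The following is a small sketch under that assumption; the file names and variable names are hypothetical.

```python
import re

# On its own, '.*' matches any string without newlines,
# including the empty string.
empty_ok = bool(re.fullmatch('.*', ''))
anything_ok = bool(re.fullmatch('.*', 'Hello, world! 123'))

# Between literals, '.*' allows arbitrary text in the middle.
# The final dot is escaped so that it matches only a period.
csv_ok = bool(re.fullmatch(r'data.*\.csv', 'data_2024.csv'))
csv_bare = bool(re.fullmatch(r'data.*\.csv', 'data.csv'))
csv_bad = bool(re.fullmatch(r'data.*\.csv', 'data_2024.txt'))

print(empty_ok, anything_ok, csv_ok, csv_bare, csv_bad)
```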

#### When a custom character class starts with the caret `^` metacharacter, it says to match a character that is _not_ in the class. For example, `[^a-z]` matches any character that is _not_ a lower-case letter.

In [None]:
'Match' if re.fullmatch('[^a-z]', 'A') else 'No match'

In [None]:
'Match' if re.fullmatch('[^a-z]', 'a') else 'No match'

In [None]:
'Match' if re.fullmatch('[*+$]', '*') else 'No match'

In [None]:
'Match' if re.fullmatch('[*+$]', '!') else 'No match'
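
#### In the two cells with `[*+$]`, the characters `*`, `+`, and `$` stand for themselves, because these metacharacters are treated as literal characters inside a custom character class. A brief sketch contrasting `+` outside and inside square brackets; the test strings and variable names are made up for illustration.

```python
import re

# Outside a class, '+' is a quantifier: 'a+' means one or more 'a'.
quantifier_run = bool(re.fullmatch('a+', 'aaa'))
quantifier_literal = bool(re.fullmatch('a+', 'a+'))

# Inside a class, '+' is an ordinary character: 'a[+]' means
# the letter 'a' followed by a plus sign.
class_literal = bool(re.fullmatch('a[+]', 'a+'))
class_run = bool(re.fullmatch('a[+]', 'aaa'))

print(quantifier_run, quantifier_literal, class_literal, class_run)
```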

### `*` vs. `+` Quantifier

#### The quantifier `+` says to match _at least one_ occurrence of the preceding subpattern.

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]+', 'Wally') else 'Invalid'

In [None]:
'Valid' if re.fullmatch('[A-Z][a-z]+', 'E') else 'Invalid'
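
#### Running both quantifiers over the same inputs makes the difference visible side by side. This is a sketch that reuses the patterns from the cells in this notebook; the list `names` and the dictionaries `star` and `plus` are illustrative names.

```python
import re

names = ['Wally', 'E', 'eva']

# '*' allows zero lower-case letters after the capital.
star = {n: bool(re.fullmatch('[A-Z][a-z]*', n)) for n in names}

# '+' requires at least one lower-case letter after the capital.
plus = {n: bool(re.fullmatch('[A-Z][a-z]+', n)) for n in names}

print(star)
print(plus)
```

#### The two patterns disagree only on `'E'`, the one input with no lower-case letters after the capital.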

### Greedy Quantifiers
#### Both `*` and `+` are **greedy**. They specify matching as many characters as possible -- the longest possible substring.

In [None]:
'Match' if re.fullmatch(r'[A-Z]\w*', 'BethAnn') else 'No match'
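
#### Greediness is easier to see with a function that is allowed to stop before the end of the string. The sketch below uses `re.match()`, which is part of the `re` module but not covered in this section; it matches only at the start of the string and does not require the whole string to match. The sample strings are made up for illustration.

```python
import re

# '\w*' keeps consuming word characters for as long as it can,
# and stops only at the space, which is not a word character.
name = re.match(r'[A-Z]\w*', 'BethAnn Smith')
print(name.group())

# '\d+' could be satisfied by a single digit, but being greedy
# it takes the whole run of digits.
digits = re.match(r'\d+', '12345abc')
print(digits.group())
```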

### Other Quantifiers
#### The quantifier `?` matches _zero or one_ occurrence of the preceding subpattern. In other words, it makes that subpattern optional.

In [None]:
'Match' if re.fullmatch('labell?ed', 'labelled') else 'No match'

In [None]:
'Match' if re.fullmatch('labell?ed', 'labeled') else 'No match'

In [None]:
'Match' if re.fullmatch('labell?ed', 'labellled') else 'No match'

#### The quantifier `{`_n_`,}` says to match _at least n occurrences_.

In [None]:
'Match' if re.fullmatch(r'\d{3,}', '123') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,}', '1234567890') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,}', '12') else 'No match'

#### The quantifier `{`_n_`,`_m_`}` says to match _between n and m occurrences_, inclusive.

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '123') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '123456') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '1234567') else 'No match'

In [None]:
'Match' if re.fullmatch(r'\d{3,6}', '12') else 'No match'
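
#### The counted quantifiers can be compared by testing digit strings of every length from 1 to 8 and recording which lengths each pattern accepts. This is an illustrative sketch; the helper list `digit_strings` and the result names are assumptions for the example.

```python
import re

# Digit strings of length 1 through 8: '1', '11', '111', ...
digit_strings = ['1' * n for n in range(1, 9)]

# {5} accepts exactly five digits.
exactly_5 = [len(s) for s in digit_strings if re.fullmatch(r'\d{5}', s)]

# {3,} accepts three or more digits, with no upper bound.
at_least_3 = [len(s) for s in digit_strings if re.fullmatch(r'\d{3,}', s)]

# {3,6} accepts three to six digits, inclusive.
three_to_six = [len(s) for s in digit_strings if re.fullmatch(r'\d{3,6}', s)]

print(exactly_5)
print(at_least_3)
print(three_to_six)
```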

In [None]:
##########################################################################
# (C) Copyright 2019 by Deitel & Associates, Inc. and                    #
# Pearson Education, Inc. All Rights Reserved.                           #
#                                                                        #
# DISCLAIMER: The authors and publisher of this book have used their     #
# best efforts in preparing the book. These efforts include the          #
# development, research, and testing of the theories and programs        #
# to determine their effectiveness. The authors and publisher make       #
# no warranty of any kind, expressed or implied, with regard to these    #
# programs or to the documentation contained in these books. The authors #
# and publisher shall not be liable in any event for incidental or       #
# consequential damages in connection with, or arising out of, the       #
# furnishing, performance, or use of these programs.                     #
##########################################################################


In [None]:
# Additional material (C) Copyright 2024 by Ronald Mak