{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "###
San Jose State University
Department of Applied Data Science

**DATA 200
Computational Programming for Data Analytics**

Spring 2023
Instructor: Ron Mak
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8.12.3 Other Search Functions; Accessing Matches" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Function `search()` — Finding the First Match Anywhere in a String" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The regular expression function `search()` finds the first occurence of a substring match and returns a **match object** whose `group()` function returms the matching substring. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = re.search('Python', 'Python is fun')\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.group() if result else 'not found'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result2 = re.search('fun!', 'Python is fun')\n", "result2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result2.group() if result2 else 'not found'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ignoring Case with the Optional flags Keyword Argument" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Set the keyword parameter `flag` to `INGNORECASE` to ignore case during regular expression matches." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result3 = re.search('Sam', 'BILL SAM WHITE', flags=re.IGNORECASE)\n", "result3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result3.group() if result3 else 'not found'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metacharacters that Restrict Matches to the Beginning or End of a String" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The `^` metacharacter at the beginning of a regular expression (and not inside the square brackets of a custom character class) is an anchor that restricts matches only to the _beginning_ of a string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = re.search('^Python', 'Python is fun')\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.group() if result else 'not found'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = re.search('^fun', 'Python is fun')\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.group() if result else 'not found'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The `$` metacharacter at the end of a regular expression is an anchor that restricts matches only to the end of a string." 
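] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Both anchors can appear in the same pattern, in which case the pattern has to account for the entire string. The following is a small sketch under that reading; the variable names are illustrative and not part of the original examples.

```python
import re

text = 'Python is fun'  # sample string from this section

# Each anchor alone matches: the string starts with 'Python' and ends with 'fun'.
starts_with = re.search('^Python', text) is not None
ends_with = re.search('fun$', text) is not None

# With both anchors, the pattern must cover the whole string.
exact_word = re.search('^Python$', text) is not None
exact_line = re.search('^Python is fun$', text) is not None

print(starts_with, ends_with, exact_word, exact_line)  # True True False True
```

Here `^Python$` fails because the string continues after the word `Python`."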
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = re.search('Python$', 'Python is fun')\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.group() if result else 'not found'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = re.search('fun$', 'Python is fun')\n", "result" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.group() if result else 'not found'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Functions `findall()` and `finditer()`: Finding All Matches in a String" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The regular expression function `findall()` returns a list of all the matching substrings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "re.findall(r'\\d{3}-\\d{3}-\\d{4}', contact)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The regular expression function `finditer()` is similar, except that it returns a **lazy iterator** that supplies one **match object** per match, on demand. Call `group()` on each match object to get the matching substring. This is useful when there are many matches and memory usage is a concern." 
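] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A hedged side-by-side sketch of the two functions: `findall()` builds a list of strings, while the objects produced by `finditer()` also record where each match was found. The pattern uses `[0-9]` as an ASCII-only stand-in for the digit class in the `findall()` cell above (an assumption made to keep this example free of backslashes), and the names `numbers` and `found` are illustrative.

```python
import re

contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'

# Assumption: [0-9] stands in for the digit class used elsewhere in this section.
pattern = '[0-9]{3}-[0-9]{3}-[0-9]{4}'

# findall() builds the complete list of matching strings at once.
numbers = re.findall(pattern, contact)

# finditer() yields match objects one at a time; each records its own position.
found = [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, contact)]

print(numbers)
for number, start, end in found:
    # Slicing the original string with the reported span recovers the match.
    print(number, contact[start:end] == number)
```

The list comprehension consumes the whole iterator, so it gives up the memory benefit; it is used here only to make the results easy to inspect."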
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "re.finditer(r'\\d{3}-\\d{3}-\\d{4}', contact)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for phone in re.finditer(r'\\d{3}-\\d{3}-\\d{4}', contact):\n", "    print(phone.group())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Capturing Substrings in a Match" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use the parentheses metacharacters to \"capture\" matching substrings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text = 'Charlie Cyan, e-mail: demo1@deitel.com'\n", "\n", "pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\\w+@\\w+\\.\\w{3})'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = re.search(pattern, text)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The match object's `groups()` method returns a tuple of all the captured substrings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.groups()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The match object's `group()` method, called with no arguments, returns the _entire_ match as a single string." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.group()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Pass an index value as an argument to `group()` to access individual captured substrings. Unlike list indexing, captured substrings are numbered from 1; index 0 refers to the entire match." 
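] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### As a hedged sketch of group numbering: `group(0)` is the entire match, and passing several indexes to `group()` returns a tuple. The pattern here is a simplified, backslash-free variant of the one in this section (an assumption made for this example), and the names `whole`, `name`, `email`, and `both` are illustrative.

```python
import re

text = 'Charlie Cyan, e-mail: demo1@deitel.com'

# Assumption: a simplified pattern with explicit character classes.
pattern = '([A-Z][a-z]+ [A-Z][a-z]+), e-mail: ([a-z0-9]+@[a-z]+[.][a-z]{3})'

result = re.search(pattern, text)

whole = result.group(0)    # index 0 is the entire match, same as group()
name = result.group(1)     # first parenthesized group
email = result.group(2)    # second parenthesized group
both = result.group(1, 2)  # several indexes at once give a tuple

print(whole)  # Charlie Cyan, e-mail: demo1@deitel.com
print(both)   # ('Charlie Cyan', 'demo1@deitel.com')
```

In this example `both` holds the same values as `result.groups()`, because the pattern has exactly two groups."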
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.group(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.group(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "##########################################################################\n", "# (C) Copyright 2019 by Deitel & Associates, Inc. and #\n", "# Pearson Education, Inc. All Rights Reserved. #\n", "# #\n", "# DISCLAIMER: The authors and publisher of this book have used their #\n", "# best efforts in preparing the book. These efforts include the #\n", "# development, research, and testing of the theories and programs #\n", "# to determine their effectiveness. The authors and publisher make #\n", "# no warranty of any kind, expressed or implied, with regard to these #\n", "# programs or to the documentation contained in these books. The authors #\n", "# and publisher shall not be liable in any event for incidental or #\n", "# consequential damages in connection with, or arising out of, the #\n", "# furnishing, performance, or use of these programs. #\n", "##########################################################################\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Additional material (C) Copyright 2023 by Ronald Mak" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }