Data and Process Integration, Search




CS257

Chris Pollett

Nov 23, 2020

Outline

Introduction

Approaches to Process Management

Orchestration Pattern

Choreography Pattern

Combining Data and Process Integration

Data and Process Integration

Data-related Infrastructure Services

Quiz

Which of the following is true?

  1. An Outrigger table is only used during the ETL process of Data Warehousing.
  2. Data Lake and Data Mart are synonyms for Data Warehouse.
  3. Data propagation is the data integration design pattern whereby updates are propagated, either synchronously or asynchronously, from a source system to a target system.

Searching Unstructured Data and Enterprise Search

Principles of Full Text Search

What search engines do not do...

Steps a search engine typically performs

  1. Downloads as much of the web as it can before ever serving the results of any search query.
  2. Extracts words from the text on the pages it has downloaded.
  3. Creates a big index associating each word found with a list of documents containing that word.
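The word extraction and indexing steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine does it: the names `extract_words` and `build_index`, the in-memory `pages` dictionary, and the simple lowercase alphanumeric tokenizer are all assumptions made for the example.

```python
import re
from collections import defaultdict


def extract_words(text):
    """Lowercase the text and return its alphanumeric words (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the sorted list of document ids containing it.

    `pages` is a hypothetical dict {doc_id: page_text} standing in for
    the pages a crawler has already downloaded.
    """
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in extract_words(text):
            index[word].add(doc_id)
    # Sorted lists make later intersection and inspection predictable.
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}


if __name__ == "__main__":
    pages = {1: "The cat sat", 2: "the dog sat down", 3: "Cat and dog"}
    for word, doc_ids in sorted(build_index(pages).items()):
        print(word, doc_ids)
```

A real index would also store positions and frequencies and would live on disk; the dictionary of lists here only shows the word-to-documents association.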

At this point, the search engine is ready to handle queries. To handle a query, it might:

  1. Look up each word in the query in the word-document index.
  2. Intersect the lists of documents found for each word to produce a list of documents, each of which contains all of the query words.
  3. Group related documents.
  4. Try to order the documents by how relevant they seem to be to the query.
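The query steps above might look roughly like the following sketch. The index layout `{word: {doc_id: term_count}}`, the function name `search`, and the scoring rule (sum of term counts) are assumptions chosen to keep the example short; the grouping step is left out, and real engines use far more elaborate relevance measures.

```python
def search(query, index):
    """Return ids of documents containing every query word, best match first.

    `index` is assumed to map word -> {doc_id: term_count}.
    """
    words = query.lower().split()
    if not words:
        return []
    # Step 1: look up each query word in the word-document index.
    postings = [index.get(word, {}) for word in words]
    # Step 2: intersect the document lists so every word is present.
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)

    # Step 4 (stand-in): score by total occurrences of the query words.
    def score(doc_id):
        return sum(posting[doc_id] for posting in postings)

    # Ties are broken by doc id so the ordering is deterministic.
    return sorted(matching, key=lambda doc_id: (-score(doc_id), doc_id))


if __name__ == "__main__":
    index = {
        "web": {"a": 3, "b": 1, "c": 2},
        "search": {"a": 1, "b": 4},
        "engine": {"b": 1, "c": 1},
    }
    print(search("web search", index))
```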

A Diagram of Search Engine Parts

The fetcher, indexer, and web components of a search engine

You might have several queue servers, each responsible for a range of URL hash values. Similarly, one would typically have several indexers.
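One way to picture the assignment of URLs to queue servers is the sketch below. It assumes a 32-bit hash taken from the MD5 digest of the URL and equal-sized contiguous ranges per server; the slides do not say which hash function or range layout is used, so `url_hash` and `queue_server_for` are hypothetical names and choices.

```python
import hashlib

HASH_SPACE = 2 ** 32  # assumed size of the hash space


def url_hash(url):
    """Return a stable 32-bit hash of the URL (MD5 is an assumed choice)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def queue_server_for(url, num_servers):
    """Return the index of the server whose hash range contains the URL.

    Server i is responsible for hash values in
    [i * HASH_SPACE / num_servers, (i + 1) * HASH_SPACE / num_servers).
    """
    return url_hash(url) * num_servers // HASH_SPACE


if __name__ == "__main__":
    for url in ["https://example.com/", "https://example.org/a"]:
        print(url, "-> queue server", queue_server_for(url, 4))
```

Because the hash depends only on the URL, every fetcher that discovers the same URL sends it to the same queue server, which lets that server alone decide whether the URL has been seen before.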

Downloading the Web

Maintaining what to crawl next.

How to decide what to crawl next.

How to keep track of what we've already seen.

Preprocessing and Indexing

Search Engine Algorithms

More on PageRank

Tweaks on the basic matrix