Introduction

Before the break we were looking at:
- Data integration - the general set of techniques that allow us to create a unified view of data from different sources.
- Process Integration - the integration and harmonization of business processes in an organization.
We looked at design and implementation patterns for each:
- For data integration, we considered: Data consolidation, Data federation, Data propagation, Changed Data Capture, Data Virtualization, Data service composition via Service Oriented Architectures.
- For process integration, we considered: the orchestration pattern where we assume a single centralized executable business process and the choreography pattern where the business process participants coordinate there own collaborations
We then looked at how Process Integration and Data Integration can be viewed in terms of software layers and looked at individual software services which might appear in these different layers: Workflow Services, Activity Services, Data Services.
We said data services related to providing unified access to semi-structured and unstructured data often need different techniques than for structured data.
This led us to start discussing search engines as an interface into unstructured data. We looked at underlying search algorithms and data structures for search such as the OPIC algorithm for web crawling and inverted indexes for full-text indexing.
Today, we begin by narrowing our focus to search in the enterprise setting.

Enterprise Search

Enterprise search - the practice of making the content stemming from various distributed data sources (databases, but also and files) in an organization searchable.
The technologies used in enterprise search are closely related to those of web base search and a number of Apache based projects are frequently used in this domain.
Last day, we said that for the web one typically needs a crawler to locate and download resources to index.
There are open source projects for this (besides my own search engine Yioop used in this class). Perhaps the most popular is Apache Nutch developed by Doug Cutting in 1999.
In the enterprise space, finding the content is less of an issue than indexing it.
For indexing, the Lucene project (originally combined with Nutch) is often used.
Lucene can ingest a variety of document formats: HTML, PDF, Word, etc in a highly configurable way and output an inverted index that supports query expressions such as keyword search for index text fields, Boolean Operators, wildcard matching, proximity matching, range searches, etc.
Lucene provides API access to its inverted index -- it doesn't provide a User Interface -- the Solr project provides this facility.
Elasticsearch is a popular enterprise search engine built on top of Lucene that provides APIs to do distributed searches, grouping, and aggregation queries across multiple Lucene indexes on different machines.
Elasticsearch can be combined with ETL tool Logstash for loading data into a backend servers (from whence it can be indexed by Lucene) and with an analytics tool such as Kibana for visualizing groupings and aggregations.

Data Quality and Master Data Management

Data integration is closely related to data quality (fitness for use to a context).
Data quality can be measured across a number of different dimensions:
- Accuracy How correctly are the data values stored? (Are names of customers spelt correctly, etc.)
- Completeness - How much of the required data or metadata is captured? (For example, are all customer's dates of birth recorded?)
- Consistency - How well handled redundant and duplicate values for the same concept? (Names of cities and states in addresses should be written the same way.)
- Accessibility - How easy is it to retrieve the data?
- Timeliness - How up-to-date for a given task is the data at hand ?
Data integration can aid in improving data quality, but might also hamper it.
For example, ETL transformations and cleansing operations should yield higher quality data, but also their presence might make people lazy and avoid spending on improving the original sources (DBMSs, legacy systems) or to reduce and simplify the number of these sources.
Master data management (MDM) comprises a series of processes, policies, standards, and tools to help organizations define and provide a single point of reference for all data that are "mastered".
I.e., this is the job of deciding and managing the data integration process and ensuring data quality.
This involves also thinking and implementing things through to avoid spaghetti soup half-integrated data silos.

Data Governance

As part of master data management, large organizations often have system-wide data governance initiatives to handle concerns related to data quality and integration.
Below are some notable data governance standards and frameworks:
- Total Data Quality Management (TDQM) - this is a four cycle process to improve data quality, as illustrated by the figure above, by definition (of quality dimensions), measurement (of percentages of data of given quality across each dimension), analysis (of causes of lack of data quality), and improvement (specific actions such as random input audits to be taken to fix lack of data quality)
- Capability Maturity Model Integration (CMMI) is a business process improvement model developed at Carnegie Mellon and required by many US government contracts. It categorizes maturity of a business process into one of five levels: (Level 1) Performed - reactive, data are collected as needed with little discipline and modified to need, (Level 2) Managed - some processes are in place to control quality and monitor data, (Level 3) Defined - Data policies in place to ensure data meets specific needs and data quality is predictable, (Level 4) Measured - fully managed policies and specs given for governing quality of data, and (Level 5) Optimized - (4) + organization is continuously working to improve its data quality initiatives.
- Data Management Body of Knowledge (DMBOK) this is a list of best practices for data quality management, metadata management, data warehousing, data integration, and data governance overseen by DAMA International (the Data Management Association).
- Control Objectives for Information and Related Technology (COBIT) is a more general framework for mapping business goals to IT goals created by the ISACA (Information Systems Audit and Control Association). It splits this mapping into five parts: establish the framework of IT governance as related to business goals, process descriptions, control objectives, management guidelines, and maturity models.
- Information Technology Infrastructure Library is a set of detailed practices for IT service management that focuses on aligning IT services with the needs and requirements of business.

Quiz

Which of the following is true?

In a process execution language, a data dependency is a dependency where execution of Service A must complete before execution of Service B.
The online page importance algorithm can be used to say what URL a web crawler should crawl next.
To compute a conjunctive query, one should return all the documents that have at least one term from the query.

Big Data

Big Data usually refers to datasets with volumes beyond the ability of common tools to store, manage or process in a reasonable amount of time.
The term was coined by John R Mashey from SGI (which became part of Rackable then part of HP Enterprise) in 1998.
Five characteristics that scope what is big data are:
- Volume - the amount of data, also referred to the data "at rest".
- Velocity - the speed at which data come in and go out, data "in motion".
- Variety - the range of data types and sources that are used, data in its "many forms".
- Veracity - the amount of data "in doubt" due to data inconsistency and incompleteness, to ambiguities present in the data, as well as latency or certain data points that might be derived from estimates or approximations.
- Value - the economic value of the data as quantified using the total cost of ownership (TCO) and return on investment (ROI).

Examples of Big Data

Some example scenarios with large volumes of data are: social graph data from sites like Facebook, Twitter, LinkedIn, Weibo, and WeChat including associated user preferences, data from internet of things (IoT) devices, self-driving car data such as from Tesla Autopilot, real-time data from mobile phones, airplane engines, traffic moniters, etc.
Variety can be illustrated by considering beyond employee name or employee date of birth what other information a business might keep about its employees: images, fingerprints, tweets, emails, Facebook pages, MRI scans, sensor data, GPS data, and so on... a lot of these being unstructured.
Velocity is often illustrated in terms of real-time analytic systems for things like: online trades, Youtube traffic, credit card swipes, phone calls, etc.
For some big data, such as internet posts we mitth have only a slight ability to quantify quality or trustworthiness of the data. I.e., the veracity of the data.
Just because we can store the data doesn't necessarily mean we can cost effectively analyze it, so one needs to consider trades offs between what we remember and analyze versus the return on that investment.

Hadoop

Hadoop is an open-source software framework used for distributed storage and processing of big datasets.
More specifically, Hadoop is an open source reimplementation of the Google File System and Map Reduce created in 2003 by Doug Cutting and Mike Cafarella then at Yahoo.
Currently, it is an Apache project.
The reason why Hadoop is useful in Big Data situations is that systems using it can be built from normal, commodity hardware, instead of requiring specialized, expensive machines.
Further, it is designed with the fundamental assumption that hardware failures are common occurrences and should be gracefully handled.
To give a flavor of an coding with Hadoop we present the Map Reduce algorithm for word counting that we considered earlier in terms of Hadoop.
You can use apt, brew, choco to get an installation of Hadoop suitable for single machine development or testing. For example, I did
```
brew install hadoop
```
to install Hadoop on my laptop.
From the command line one runs Hadoop with lines like:
```
hadoop some-command args
```
For example, if some command is jar then it runs the map-reduce job given by the arguments.
Without any arguments hadoop gives the list of possibilities for some-command, one interesting subset of which is fs, which indicates a command for the hadoop distributed file system.
Hadoop has as one of its components YARN (Yet Another Resource Negotiator) which is used to manage and locate resources which might be distributed across several machines. YARN can be run from the command line using the command yarn.

Word Count as a Hadoop Map Reduce job

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
    /* This inner class is used for our Map job
       First two params to Mapper are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCMapper extends Mapper<Object,Text,Text,IntWritable>
    {
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException
        {
            // normalize document case, get rid of non word chars
            String document = value.toString().toLowerCase()
                .replaceAll("[^a-z\\s]", "");
            String[] words = document.split(" ");
            for (String word : words) {
                Text textWord = new Text(word);
                IntWritable one = new IntWritable(1);
                context.write(textWord, one);
            }
        }
    }
    /* This inner class is used for our Reducer job
       First two params to Reducer are types of KeyIn ValueIn
       last two are types of KeyOut, ValueOut
     */
    public static class WCReducer extends Reducer<Text, IntWritable,
        Text, IntWritable>
    {
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException
        {
            int sum = 0;
            IntWritable result = new IntWritable();
            for (IntWritable val: values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main (String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WCMapper.class);
        job.setReducerClass(WCReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        Path outputDir = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputDir);
        FileSystem fs = FileSystem.get(conf);
        fs.delete(outputDir, true);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compiling and Executing WordCount.java on Hadoop

To compile WordCount.java we need to make sure we have all the relevant classpaths for the Hadoop classes we want.

To do this we use YARN:

javac -classpath `yarn classpath` -d . WordCount.java

We then make a jar file out of our compiled program:

jar -cf WordCount.jar WordCount.class 'WordCount$WCMapper.class' 'WordCount$WCReducer.class'

We can then run the program on hadoop:

hadoop jar WordCount.jar WordCount /Users/cpollett/tmp.txt /Users/cpollett/output

Here tmp.txt is the file we want to count the words for, /Users/cpollett/output is where we are outputting the results in HDFS format for the job.
After running the job, we could see what files were in the output folder with:
```
hadoop fs -ls /Users/cpollett/output/
```
Finally, if we want to see the counts, we could type:
```
hadoop fs -cat /Users/cpollett/output/part-r-00000
```

In my case, this outputs:

2018-12-05 11:28:01,201 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
	1
foo	2
la	3

Databases on top of Hadoop

The Map Reduce framework in a distributed setting can have a high learning curve, so almost from day one people started to try to develop more SQL like interface for Hadoop such as: HBase (basic above Map Reduce), Pig (more SQL-like and mapping to map reduce under the hood).
There have also been attempts to build data warehousing systems on top of Hadoop such as Hive.
We won't go into any detail on these systems, but to give you at least a flavor of what HBase looks like here are some example commands
On my Mac, I got hbase, by doing brew install hbase, then started the server with the command
```
start-hbase.sh
```
To get a shell to experiment with, I typed the command:
```
hbase shell
```

Here are some example hbase commands:

//Making a table user with columns name and email
// In general, columns have the format 
// families:qualifiers such as name:first, name:last 

create 'users', 'name', 'email'

//list the tables we have in our system
list
// or for a particular
list 'users'

// to see details about how the table is contructed one can do:
describe 'users'

//To add data to a column of a particular user, I could do
put 'users', 'seppe', 'name:last', 'vanden Broucke'
//or
put 'users', 'seppe', 'email', 'seppe.vandenbroucke@kuleuven.be'

// To see what's stored I can do:
get 'users', 'seppe'

// To delete some column info can do:
delete 'users', 'seppe', 'name:first'

// hbase supports versioning so can have expression like:
alter 'users', {NAME => 'email', VERSIONS => 3}
get 'users', 'seppe', {COLUMNS => ['email'], VERSIONS => 2}

// To get rid of everything related to a particular row can do:
deleteall 'users', 'seppe'

//to get rid of a table
disable 'users'
drop 'users'

Apache Spark

The Apache Spark project is a simplified programming paradigm to make it easier to analyse big data quicker.
It was orginally developed at UC Berkeley in 2014 (Zaharia, et al) and can be thought of as an alternative to traditional Map Reduce frameworks.
It makes use of a Resilient Distributed Dataset (RDD) data structure which is essentially a read-only, partitioned collection of records.
RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs
We can manipulate RDDs via sequences of operations such as: map, filter, and join.
Outputs in such a sequence of operations need not be materialized (different from Hadoop), but the lineage of any RDD is tracked and allows it to be materialized if needed.

Apache Spark Example

The next example is taken from te original paper on Spark paper.
Suppose an operator wants to search terabytes of logs in the Hadoop filesystem (HDFS) to find the cause of a web service error.

Using Spark, the operator loads the error messages from the logs into RAM across a set of nodes and queries them interactively:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
errors.persist() // make the errors available in RAM

To see the number of errors, we can do:
```
errors.count()
```

You can also perform further transformations to get more details about the errors:

// find MySQL errors
errors.filter(_.contains("MySQL")).count()

// Return the time fields of errors mentioning
// HDFS as an array (assuming time is field
// number 3 in a tab-separated format):
errors.filter(_.contains("HDFS"))
   .map(_.split('\t')(3))
   .collect()

Enterprise Search, Data Governance, Big Data

Outline