Components to be Learned
We have described several agents so far this semester. The components of these agents that might be improved by learning include:
- The mapping from conditions on the current state to actions
- The means to infer relevant properties of the world from the percept sequence
- Information about the way the world evolves and about the results of possible actions the agent can take.
- Utility information indicating the desirability of world states
- Action-value information indicating the desirability of actions
- Goals that describe classes of states whose achievement maximizes the agent's utility.
The book gives examples of the above in terms of a taxi driver agent.
Feedback to learn from
There are three main types of feedback that correspond to the three main types of learning:
- In unsupervised learning the agent learns patterns in the input even though no explicit feedback is supplied. A common task in this area is clustering: detecting potentially useful clusters of input examples.
- In reinforcement learning the agent learns from a series of reinforcements -- rewards or punishments. For example, the lack of a tip at the end of the journey gives the taxi agent an indication that it did something wrong. Winning a chess game gives an agent an indication that it did something right.
- In supervised learning the agent observes some example input-output pairs and learns a function that maps from input to output.
In addition to the above, there are also approaches such as semi-supervised learning, in which we are given a few labeled examples and must make what we can of a large collection of unlabeled examples.
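As a minimal sketch of supervised learning, consider fitting a one-parameter linear hypothesis h(x) = w*x from a set of input-output pairs by least squares. (This toy hypothesis class and the function names below are illustrative assumptions, not taken from the book.)

```python
def fit_linear(examples):
    """Learn the weight w for the hypothesis h(x) = w * x from
    (input, output) pairs by minimizing squared error.
    The least-squares solution is w = sum(x*y) / sum(x*x)."""
    sxy = sum(x * y for x, y in examples)
    sxx = sum(x * x for x, _ in examples)
    return sxy / sxx

# Example pairs generated by an unknown target function f(x) = 2x.
examples = [(1, 2), (2, 4), (3, 6)]
w = fit_linear(examples)

def h(x):
    """The learned hypothesis: maps a new input to a predicted output."""
    return w * x
```

Here the "feedback" is the correct output supplied with each input; the agent generalizes from the three observed pairs to predict outputs for inputs it has never seen.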