Introduction

We introduced the complexity class `NP`.
`NP` was the class of decision problems that have a polynomial time computable verifier.
We then said a problem `D` was hard for `NP` if any problem in `NP` could be polynomial time reduced to `D`.
We said `D` was `NP`-complete if it was hard for `NP` and also happened to be in `NP`.
So far, though, we have not shown that any problem is actually `NP`-complete.
We start today by rectifying this situation.

`NP`-complete problems

Theorem.
`mbox(TMSAT) := {langle alpha, w, 1^n, 1^t rangle |`
`mbox(there exists a ) u in {0,1}^n mbox( such that ) M_alpha mbox( accepts ) langle w, u rangle mbox( in at most ) t mbox( steps) }`
is Karp-complete for `NP`.

Proof. We first show `mbox(TMSAT)` is in `NP`. A nondeterministic algorithm to recognize this language is as follows: on input `langle alpha, w, 1^n, 1^t rangle`, nondeterministically guess a string `u` of length `n`, then simulate `M_alpha` on `langle w, u rangle` for at most `t` steps and accept if it accepts. Because `n` and `t` are given in unary, both the guess and the simulation take time polynomial in the length of the input. To see this language is hard for `NP`, suppose `L` is an `NP` language. Then there are polynomials `p` and `q` and a verifier `M` such that `x in L` iff there is a `u in {0,1}^(p(|x|))` satisfying `M(x,u) = 1`, and `M` runs in time `q(m)` on inputs of total length `m`. To reduce `L` to `mbox(TMSAT)`, we simply map every string `x in {0, 1}^star` to `langle lfloor M rfloor, x, 1^(p(|x|)), 1^(q(m)) rangle`, where `lfloor M rfloor` is the encoding of `M` and `m = |x| + p(|x|)`. This mapping is computable in polynomial time and
`langle lfloor M rfloor, x, 1^(p(|x|)), 1^(q(m)) rangle in mbox(TMSAT) iff `
`exists u in {0,1}^(p(|x|)) mbox( such that ) M mbox( accepts ) langle x, u rangle mbox( in at most ) q(m) mbox( steps ) iff`
`x in L`.

In-class Exercise

Below is a problem from machine learning. Argue that it is in `NP`. It is actually `NP`-complete (Brucker 1978), but you don't have to show completeness.

Clustering: Given a finite set `X`, a distance function `d(x,y)` which returns nonnegative integers for any inputs `x,y in X`, and two positive integers `k` and `B`, is there a partition of `X` into disjoint sets `X_1, ..., X_k` such that, for `1 \leq i \leq k` and all pairs `x,y in X_i`, `d(x, y) \leq B`?

Post your solution to the Feb 22 Class Thread.

Boolean Formulas

TMSAT is not a problem that people are asked to solve every day.
We next want to look at the first problem that was actually shown to be `NP`-complete: SAT.
This example comes from propositional logic and is perhaps more natural. To introduce it we need the notion of a Boolean formula.
Given a set of variables `u_1`, ..., `u_n` whose values can be `0` (false) or `1` (true), a boolean formula is either just one of these variables or is built from these variables using AND (`^^`), OR (`vv`), or NOT (`neg`).
A truth assignment `nu:[1, .. n] -> {0,1}` gives a value of `0` or `1` to each of the variables. So `nu(1) = 1`, also written `nu(u_1) = 1`, says that variable `u_1` has the value `1`. A truth assignment `nu` for variables can be extended in the natural way to a function `bar{nu}` which gives a `0` or `1` value to any boolean formula.
For example, under the assignment `nu(u_1) = 1`, `nu(u_2) = 0`, `nu(u_3) = 1`, `bar{nu}(u_1 ^^ u_2)` evaluates to `0`, and `bar{nu}((u_1 ^^ u_2) vv u_3)` evaluates to `1`.
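
The extension of `nu` to `bar{nu}` is a recursion on the structure of the formula. Below is a minimal Python sketch of that recursion. The nested-tuple encoding of formulas and the name `evaluate` are assumptions made for illustration; they are not notation from these notes.

```python
# Hypothetical encoding (an assumption, not from the notes):
#   ("var", i)     the variable u_i
#   ("not", f)     NOT f
#   ("and", f, g)  f AND g
#   ("or", f, g)   f OR g

def evaluate(formula, nu):
    """Extend the assignment nu (a dict i -> 0/1) to a whole formula."""
    op = formula[0]
    if op == "var":
        return nu[formula[1]]
    if op == "not":
        return 1 - evaluate(formula[1], nu)
    if op == "and":
        return evaluate(formula[1], nu) & evaluate(formula[2], nu)
    if op == "or":
        return evaluate(formula[1], nu) | evaluate(formula[2], nu)
    raise ValueError(f"unknown connective: {op!r}")


# The assignment nu(u_1) = 1, nu(u_2) = 0, nu(u_3) = 1 from the example.
nu = {1: 1, 2: 0, 3: 1}
u1_and_u2 = ("and", ("var", 1), ("var", 2))
print(evaluate(u1_and_u2, nu))                      # 0
print(evaluate(("or", u1_and_u2, ("var", 3)), nu))  # 1
```

Each connective is visited once, so the evaluation takes time linear in the size of the formula.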

Satisfiability, Unsatisfiability, and Validity

A formula is said to be satisfiable if some assignment to its input variables makes the formula output `1`.
Otherwise, the formula is said to be unsatisfiable.
A formula is said to be valid if all assignments to its input variables make the formula output `1`.
For example, `(u_1 ^^ u_2) vv u_3` is satisfiable, `(u_1 ^^ neg u_1)` is unsatisfiable, and `(u_1 vv neg u_1)` is valid.
Notice if a formula is unsatisfiable then its negation is valid. One automated theorem proving technique is to take a statement and try to see if its negation has a formal refutation (i.e., its negation can be proven unsatisfiable).
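
For small formulas, the three notions can be checked by brute force over all `2^n` assignments. The sketch below assumes formulas are given as Python functions on bits; the helper name `classify` is hypothetical. It takes exponential time, so it illustrates the definitions rather than offering a practical algorithm.

```python
from itertools import product


def classify(formula, num_vars):
    """Try every assignment and report the strongest property that holds."""
    values = [formula(*bits) for bits in product((0, 1), repeat=num_vars)]
    if all(values):
        return "valid"          # every assignment outputs 1
    if any(values):
        return "satisfiable"    # some, but not all, assignments output 1
    return "unsatisfiable"      # no assignment outputs 1


f1 = lambda u1, u2, u3: (u1 & u2) | u3   # (u_1 AND u_2) OR u_3
f2 = lambda u1: u1 & (1 - u1)            # u_1 AND NOT u_1
f3 = lambda u1: u1 | (1 - u1)            # u_1 OR NOT u_1

print(classify(f1, 3))  # satisfiable
print(classify(f2, 1))  # unsatisfiable
print(classify(f3, 1))  # valid

# The negation of an unsatisfiable formula is valid.
print(classify(lambda u1: 1 - f2(u1), 1))  # valid
```

A valid formula is also satisfiable; `classify` simply reports the stronger of the two.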

Conjunctive Normal Form

A literal, `l_i`, is used to mean either a variable `u_i` or its negation `neg u_i`. We often write `neg u_i` as `bar u_i`.
A formula is said to be in conjunctive normal form (CNF) if it is an AND of ORs of literals.
For example, `(u_1 vv bar u_2 vv u_3) ^^ (u_2 vv bar u_3 vv u_4) ^^ (bar u_1 vv u_2 vv bar u_4)` is in CNF.
We often write CNF formulas as `^^_i(vv_j l_(i_j))`, where each `l_(i_j)` is a literal.
The disjunctions `vv_j l_(i_j)` are called clauses.
Clauses are sometimes written as sets. So a clause like `(u_2 vv bar u_3 vv u_4)` would be written as `{u_2, bar u_3, u_4}`. A formula such as `(u_1 vv bar u_2 vv u_3) ^^ (u_2 vv bar u_3 vv u_4) ^^ (bar u_1 vv u_2 vv bar u_4)` would be written as a set of clause sets,
`{{u_1, bar u_2, u_3}, {u_2, bar u_3, u_4}, {bar u_1, u_2 , bar u_4}}`.
A kCNF formula is a CNF formula in which all clauses have at most `k` literals.
We denote by SAT the language of all satisfiable CNF formulas, and by 3SAT the language of all satisfiable 3CNF formulas.

The Cook-Levin Theorem (1971, 1973)

Theorem.
(1) SAT is `NP`-complete.
(2) 3SAT is `NP`-complete.

Proof. First notice that, given a truth assignment, we can check in polynomial time whether each clause in a CNF formula is satisfied. So both SAT and 3SAT are in `NP`, and it suffices to show they are hard for `NP`. We will prove this over the next couple of slides.
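
The membership half of the proof amounts to a verifier that takes a CNF formula and a candidate assignment as the certificate. Here is a sketch of such a verifier, assuming a signed-integer encoding of literals (`i` for `u_i`, `-i` for `bar u_i`, in the style of the DIMACS convention); the encoding and the function names are illustrative assumptions.

```python
def satisfies(cnf, nu):
    """Return True iff the assignment nu (dict i -> 0/1) satisfies every clause."""
    def literal_value(lit):
        value = nu[abs(lit)]
        return value if lit > 0 else 1 - value

    # A clause needs at least one true literal; the formula needs every clause.
    return all(any(literal_value(lit) == 1 for lit in clause) for clause in cnf)


def is_kcnf(cnf, k):
    """Return True iff every clause has at most k literals."""
    return all(len(clause) <= k for clause in cnf)


# (u_1 OR NOT u_2 OR u_3) AND (u_2 OR NOT u_3 OR u_4) AND (NOT u_1 OR u_2 OR NOT u_4)
cnf = [{1, -2, 3}, {2, -3, 4}, {-1, 2, -4}]

print(is_kcnf(cnf, 3))                             # True
print(satisfies(cnf, {1: 1, 2: 1, 3: 1, 4: 0}))    # True
print(satisfies(cnf, {1: 1, 2: 0, 3: 0, 4: 1}))    # False
```

The verifier looks at each literal once, so its running time is linear in the length of the formula.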

CNFs are universal

Claim. For every Boolean function `f:{0, 1}^l -> {0,1}`, there is an `l`-variable CNF formula `phi` of size at most `l2^l` such that `phi(u) = f(u)` for every truth assignment `u in {0, 1}^l`. Here the size of a CNF formula is defined to be the number of `^^`'s/`vv`'s appearing in it.

Proof. For each `v in {0,1}^l` we make a clause `C_v(u_1, .., u_l)` where `u_i` appears negated in the clause if bit `i` of `v` is `1`; otherwise it appears un-negated. Notice this clause has `l` literals and hence `l - 1` ORs. Also notice `C_v(v) = 0` and `C_v(u) = 1` for `u ne v`. Using these `C_v`'s we can define a CNF formula for `f` as:
`phi = ^^_(v:f(v) = 0) C_v(u_1, .. u_l)`.
As there are at most `2^l` strings `v` which make `f(v)=0`, there are at most `2^l` clauses, each contributing `l - 1` ORs plus at most one AND, so the total size of this CNF will be at most `l 2^l`.
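
The construction in the proof can be sketched directly in Python. The sketch assumes `f` is given as a Python function on `l` bits and encodes literals as signed integers (`i` for `u_i`, `-i` for its negation); the names `cnf_from_function` and `evaluate_cnf` are hypothetical.

```python
from itertools import product


def cnf_from_function(f, l):
    """Build the clause C_v for every v in {0,1}^l with f(v) = 0."""
    clauses = []
    for v in product((0, 1), repeat=l):
        if f(*v) == 0:
            # u_i is negated exactly when bit i of v is 1, so C_v(v) = 0.
            clauses.append(frozenset(-i if bit else i
                                     for i, bit in enumerate(v, start=1)))
    return clauses


def evaluate_cnf(clauses, u):
    """Evaluate the AND of the clauses on the bit tuple u."""
    return int(all(any((u[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
                   for clause in clauses))


xor = lambda a, b: a ^ b
clauses = cnf_from_function(xor, 2)
print([sorted(c) for c in clauses])  # [[1, 2], [-2, -1]]

# phi agrees with f on every assignment.
print(all(evaluate_cnf(clauses, u) == xor(*u)
          for u in product((0, 1), repeat=2)))  # True
```

For XOR the two clauses are `u_1 vv u_2` and `bar u_1 vv bar u_2`, one for each input on which the function is `0`.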

Example of Converting to CNF

Consider the formula `(A wedge B) vee (bar(A) wedge bar(B)) vee (A wedge bar(B))`. This expresses `A ge B`.
We could express `A = max(A, B, C)` as `(A ge B) wedge (A ge C)`.
This is an AND of ORs of ANDs as written, so it is not in CNF.
Its truth table is:

A	B	C	A = max(A, B, C)
1	1	1	1
1	1	0	1
1	0	1	1
1	0	0	1
0	1	1	0
0	1	0	0
0	0	1	0
0	0	0	1

To make a CNF, we look at the three false rows. For each such row we build a clause: if the column for a variable `X` has a 0 in it we take `X`, and if it has a 1 we take `bar(X)`. So the row 0, 1, 0 becomes `A vee bar(B) vee C`.
This clause is false on exactly that row, so it asserts that the row didn't happen.
So the CNF for the whole formula is:
`(A vee bar(B) vee bar(C)) wedge (A vee bar(B) vee C) wedge (A vee B vee bar(C))`
As a collection of clauses we would write:
`{{A, bar(B), bar(C)}, {A, bar(B), C}, {A, B, bar(C)}}.`
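
As a sanity check, the following sketch compares the original AND of ORs of ANDs with the three-clause CNF on all eight rows. The function names are chosen here for illustration only.

```python
from itertools import product


def ge(x, y):
    """x >= y on bits: (x AND y) OR (NOT x AND NOT y) OR (x AND NOT y)."""
    return (x & y) | ((1 - x) & (1 - y)) | (x & (1 - y))


def original(a, b, c):
    """A = max(A, B, C), written as (A >= B) AND (A >= C)."""
    return ge(a, b) & ge(a, c)


def cnf(a, b, c):
    """The three clauses read off from the false rows of the truth table."""
    nb, nc = 1 - b, 1 - c
    return (a | nb | nc) & (a | nb | c) & (a | b | nc)


rows = list(product((0, 1), repeat=3))
false_rows = [row for row in rows if original(*row) == 0]

print(false_rows)                                        # [(0, 0, 1), (0, 1, 0), (0, 1, 1)]
print(all(original(*row) == cnf(*row) for row in rows))  # True
```

The two formulas agree on every row, and the false rows are exactly the three rows that produced the clauses.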

The proof of the `NP`-hardness of SAT will be continued next day.

`NP`-completeness

Outline