Introduction to NP-Completeness

Most of the algorithms we have studied in this class run in polynomial time or some randomized variant of polynomial time.
That is, on all inputs of length `n`, the algorithms we've considered run in time at most `O(n^k)` for some fixed `k`.
We'll start looking today at some problems for which it is unknown if such efficient algorithms exist.
To do this, we first make formal what we mean by polynomial time, and then consider problems which might not be solvable in polynomial time.

Abstract Problems

Let's fix a framework for describing problems and reasoning about their runtimes.
Define an abstract problem `Q` to consist of a set of instances `I` and a set of solutions `S`.
For example, for SHORTEST-PATH the instances might be triples consisting of a graph and two vertices. A solution might be a sequence of vertices forming a shortest path between those two vertices in the graph.
We will be interested in a subclass of problems called decision problems, where the answers are always yes (1) or no (0).
For example, does there exist a path of length at most `k` between the two vertices?
Given a way to solve the decision problem, it is usually straightforward to solve the associated optimization problem by binary search.
For example, we might first ask: is there a path of length at most 1? If there is, we are done; if not, we double our choice of `k` and ask again. At some point we will find a `k` such that there is a path of length at most `2k`, but not of length at most `k`. We can then continue our binary search between `k` and `2k`.
Here optimization problems are problems where we want to find a largest or smallest value.
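
The doubling-then-binary-search idea above can be sketched in Python. This is a minimal sketch, not a definitive implementation: the name `smallest_k` and the black-box parameter `decide` are assumptions made for illustration, and `decide` is assumed to be monotone (if it answers yes for `k`, it answers yes for every larger value) and to answer yes for some `k`.

```python
from typing import Callable


def smallest_k(decide: Callable[[int], bool]) -> int:
    """Return the least k >= 1 for which decide(k) is True.

    Assumes decide is monotone and is True for some k (hypothetical oracle).
    """
    # Doubling phase: find some hi for which the answer is yes.
    hi = 1
    while not decide(hi):
        hi *= 2
    # The answer now lies in the range (hi // 2, hi]; hi // 2 is 0 when hi == 1.
    lo = hi // 2
    # Invariant: decide(hi) is True, and decide(lo) is False or lo == 0.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if decide(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

If the optimal value is `k`, both phases make `O(log k)` calls to `decide`, so a polynomial-time decision procedure gives a polynomial-time way to find the optimal value whenever `k` is at most exponential in the input length.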

Encodings

An encoding of a set `S` of abstract objects is a mapping from `S` to binary strings.
For example, one can encode the natural numbers `{0, 1, 2, ...}` as the binary strings `{0, 1, 10, ...}`.
One can encode legal English sentences using ASCII, etc.
A computer algorithm "solves" some abstract decision problem by going from an encoding of a problem instance as an input to `0` or `1` as output.
We call a problem whose instance set is the set of binary strings a concrete problem.
We say an algorithm solves the problem in `O(T(n))` time if when provided a problem instance `i in I` of length `n = |i|`, the algorithm can produce the solution using `O(T(n))` steps.
A concrete problem is called polynomial time decidable if there is an algorithm that solves it which on all instances of length `n` runs in time `O(n^k)` for some fixed `k`.
We write `P` for the class of all such decision problems.
Similarly, we can define the class of polynomial-time computable functions `f:{0,1}^star -> {0,1}^star`.

Formal Languages

In order to study decision problems it is useful to have an understanding of formal languages.
An alphabet `Sigma` is a finite set of symbols.
A language is a set of strings over the symbols in an alphabet.
Some common ways to create new languages from old ones are union, concatenation, and star.
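
To make these operations concrete, here is a small Python sketch on finite languages represented as sets of strings. The function names are assumptions for illustration. Since the star of a language is infinite in general, `star_up_to` only lists the members of the star up to a chosen length bound.

```python
from typing import Set


def union(a: Set[str], b: Set[str]) -> Set[str]:
    """Strings that belong to a or to b."""
    return a | b


def concat(a: Set[str], b: Set[str]) -> Set[str]:
    """Strings xy with x in a and y in b."""
    return {x + y for x in a for y in b}


def star_up_to(a: Set[str], max_len: int) -> Set[str]:
    """Strings of length at most max_len formed by concatenating
    zero or more strings from a (a bounded view of the star of a)."""
    result = {""}       # zero concatenations gives the empty string
    frontier = {""}     # strings found in the previous round
    while frontier:
        longer = {x + y for x in frontier for y in a if len(x + y) <= max_len}
        frontier = longer - result
        result |= frontier
    return result
```

For instance, `star_up_to({"01"}, 4)` contains the empty string, `01`, and `0101`.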

Polynomial-Time Verification

We now look at algorithms which can verify membership in languages.
As an example...
Call an undirected graph `G` Hamiltonian if it contains a Hamiltonian cycle, that is, a simple cycle which contains each vertex of `G`.
Let HAM-CYCLE `= { langle G rangle | G mbox( is a Hamiltonian graph )}`
How might one decide this problem? One could try each possible permutation of vertices. Let `m` be the number of vertices of the graph. Typically, `m = Omega(sqrt(|langle G rangle|))`. There are `m!` many permutations. So this algorithm would have exponential runtime.
On the other hand, consider the language
`H = {langle G, P rangle | P mbox( is a Hamiltonian cycle in ) G}`.
This language has a polynomial time decision algorithm. Further, the size of `P` is polynomial in the size of `G`, so we could rewrite HAM-CYCLE as:
`{ langle G rangle | exists P, |P| le |G| and langle G, P rangle in H}`
`H` can be viewed as verifying HAM-CYCLE in polynomial time.
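
A polynomial time decision procedure for `H` might look like the following Python sketch. The representation is an assumption made for illustration: vertices are numbered `0..n-1`, `edges` is a list of unordered pairs, and the candidate `cycle` is a list of vertices. The graph is assumed to be simple, so a Hamiltonian cycle needs at least three vertices.

```python
from typing import List, Sequence, Tuple


def verifies_ham_cycle(n: int, edges: Sequence[Tuple[int, int]], cycle: List[int]) -> bool:
    """Return True iff cycle visits each of the vertices 0..n-1 exactly once
    and consecutive vertices (wrapping around) are joined by an edge."""
    # The candidate must be a permutation of the vertex set.
    if n < 3 or len(cycle) != n or set(cycle) != set(range(n)):
        return False
    # Store edges as unordered pairs so (u, v) and (v, u) are the same edge.
    edge_set = {frozenset(e) for e in edges}
    return all(
        frozenset((cycle[i], cycle[(i + 1) % n])) in edge_set
        for i in range(n)
    )
```

The check does a constant number of set operations per vertex of the candidate, so it runs in time polynomial in the size of the graph, in contrast to trying all `m!` orderings.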

The complexity class NP

We are now ready to define the complexity class `NP`.
We say a language `L` belongs to `NP` if there exists a two-input polynomial-time algorithm `A` and a constant `c` such that
`L= {x in {0,1}^star : exists y, |y| = O(|x|^c) mbox( and ) A(x,y)=1}`
I.e., it is the class of languages that have polynomial time verification algorithms. So HAM-CYCLE `in NP`.
It is not hard to see `P subseteq NP`, but it is unknown if `P=NP`.
In fact, there is a million dollar prize for anyone who can solve this problem.
Given a complexity class `C`, let `co-C` denote the class of languages whose complement is in `C`.
One can see `P subseteq NP cap co-NP`, but it is unknown if equality holds.

In-Class Exercise

Let `EXP` be the class of languages decided by deterministic algorithms on inputs of length `n` in time `O(2^{n^k})` for some fixed constant `k > 0`.

Prove `NP subseteq EXP`. Argue there is a language not in `EXP`, but which can be solved in time `O(2^{2^n})`.

Post your solution to the Apr 17 In-Class Exercise Thread.

Polynomial-Time Reducibility

There is some evidence to show that `P=NP` is unlikely.
Further, many problems have been shown to be in `NP`.
So it is useful to be able to classify which `NP` problems are easy and which are hard.
To do this, we say a language `L_1` is polynomial-time reducible to language `L_2`, written `L_1 le_P L_2`, if there exists a polynomial time computable function `f:{0,1}^star -> {0,1}^star` such that for all `x in {0,1}^star`, `x in L_1` iff `f(x) in L_2`.

Lemma. If `L_1`, `L_2` are languages such that `L_1 le_P L_2` and `L_2` is in `P`, then `L_1` is in `P`.

Proof. Let `A(y)` decide `L_2` in time `O(p(|y|))`. Let `f(x)` be a `O(q(|x|))`-time reduction from `L_1` to `L_2`. Here `p` and `q` are polynomials. Since `f` runs for `O(q(|x|))` steps, its output has length `|f(x)| = O(q(|x|))`. Then `B(x)`, which first computes `f(x)` and then runs `A(f(x))`, runs in `O(q(|x|) + p(q(|x|)))` time and decides `L_1`, since `x in L_1` iff `f(x) in L_2`. So `B` runs in polynomial time.
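
The construction of `B` in the proof is just function composition. The Python sketch below illustrates it on a toy pair of languages chosen for this example only (they are not from the lecture): `L_2` is the set of binary strings with an even number of 1s, `L_1` is the set of binary strings with an even number of 0s, and the reduction `f` flips every bit.

```python
from typing import Callable


def compose(f: Callable[[str], str], decide_l2: Callable[[str], bool]) -> Callable[[str], bool]:
    """Build B from a reduction f and a decider A for L_2: B(x) = A(f(x))."""
    def decide_l1(x: str) -> bool:
        return decide_l2(f(x))
    return decide_l1


def even_ones(y: str) -> bool:
    """Toy decider A for L_2: binary strings with an even number of 1s."""
    return y.count("1") % 2 == 0


def flip_bits(x: str) -> str:
    """Toy reduction f: x has an even number of 0s iff f(x) has an even number of 1s."""
    return "".join("1" if ch == "0" else "0" for ch in x)


# B decides L_1 (an even number of 0s) without examining L_1 directly.
even_zeros = compose(flip_bits, even_ones)
```

Both `flip_bits` and `even_ones` run in linear time, so the composed decider does too, matching the bound in the proof.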

NP-completeness

The `P` languages in `NP` are the easy languages.
In contrast, a language `L` is called `NP`-complete if
1. `L in NP`, and
2. `L' le_P L` for every `L' in NP`.
A language which satisfies (2) but not necessarily (1) is called `NP`-hard.
Let `NPC` denote the class of `NP`-complete languages.

Theorem. If any `NP`-complete language is in `P`, then `P=NP`.

Proof. This follows from the lemma on the last slide.

A First NP-complete problem

Let CIRCUIT-SAT be the language:
`{langle C rangle | C` is an AND, OR, NOT circuit computing a 0-1 function which on some truth assignment to its input variables outputs 1`}`

Theorem. CIRCUIT-SAT is in NP.

Proof. Consider the following algorithm `A(langle C rangle, langle a rangle)`. First, `A` checks if `langle C rangle` is in the format of a circuit and `langle a rangle` is in the format for an assignment; if not, `A` rejects. Otherwise, it labels each of the inputs to `langle C rangle` with its value according to `langle a rangle`. Then it loops over the combinational elements in `langle C rangle`, doing the following until there is no change:

1. Check if the current element has not been assigned a value, but all of its children have been assigned values.
2. If so, calculate the value of the node based on its gate type and the values of its children.

By the `i`th iteration the nodes of depth `i` will have values, so at most `|langle C rangle|` iterations are needed. Each iteration involves less than quadratic work. So in `O((|langle C rangle|)^3)` time this algorithm labels the root of the circuit with its output value on this assignment, and `A` outputs this value. Finally, CIRCUIT-SAT is the language `{langle C rangle in {0,1}^star : exists langle a rangle, |langle a rangle| le |langle C rangle| mbox( and ) A(langle C rangle, langle a rangle) = 1}`.
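
The verifier `A` from this proof can be sketched in Python as follows. The circuit representation is an assumption made for illustration: `gates` maps a gate name to a pair `(kind, inputs)` with `kind` one of `"INPUT"`, `"AND"`, `"OR"`, `"NOT"`, `output` names the root gate, and `assignment` maps input names to truth values. Format checking is reduced to a few arity checks.

```python
from typing import Dict, List, Tuple


def eval_circuit(gates: Dict[str, Tuple[str, List[str]]], output: str,
                 assignment: Dict[str, bool]) -> bool:
    """Return the value of the output gate under the assignment,
    or False if the circuit or the assignment is malformed."""
    values: Dict[str, bool] = {}
    # Label each circuit input with its value from the assignment.
    for name, (kind, _) in gates.items():
        if kind == "INPUT":
            if name not in assignment:
                return False  # assignment does not cover this input: reject
            values[name] = bool(assignment[name])
    # Repeat passes over the gates until a pass assigns nothing new.
    changed = True
    while changed:
        changed = False
        for name, (kind, inputs) in gates.items():
            # Skip gates already labeled or with an unlabeled child.
            if name in values or any(i not in values for i in inputs):
                continue
            args = [values[i] for i in inputs]
            if kind in ("AND", "OR") and len(args) >= 1:
                values[name] = all(args) if kind == "AND" else any(args)
            elif kind == "NOT" and len(args) == 1:
                values[name] = not args[0]
            else:
                return False  # unknown gate type or wrong number of inputs
            changed = True
    # An output that never received a value counts as a rejection.
    return values.get(output, False)
```

Each pass labels at least one new gate or stops the loop, so there are at most as many productive passes as gates, which matches the polynomial bound in the proof.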

Cook's Theorem

Theorem. CIRCUIT-SAT is NP-hard.

Proof. Let `L` be a language in `NP`, and let `A(x,y)` verify the language in time `O(|x|^c)`. The algorithm `A` runs on some kind of computational hardware. If that hardware is in a given configuration `c_i` then its control determines in the next time step what its next configuration `c_(i+1)` will be. We assume that this mapping can be computed by some AND, OR, NOT circuit `M` implementing the computer hardware. Using this circuit `M`, we build an AND, OR, NOT circuit `langle C(y) rangle` which is split into main layers with the following properties:

1. The output of `C` at main layer 1 codes `c_0`, the configuration of the hardware at the start of the computation of `A(x,y)`. Here the values of `x` are hard-coded based on the instance `x` which we are trying to check is in `L`. `y` is not hard-coded, and boolean variables are used to represent it.
2. For each `i`, the output of `C` at main layer `i + 1` corresponds to the configuration obtained from main layer `i` by computing according to `M`.
3. The output of `C` is the value extracted from the final configuration of `A` after `O(|x|^c)` steps.

Since there are polynomially many main layers each separated by polynomial-sized circuits, this whole circuit will be polynomial-size and can be constructed from `x` in polynomial time. Some setting of the boolean variables for `y` makes the circuit output 1 if and only if `A(x,y) = 1` for some `y`, that is, if and only if `x` is in `L`, as desired.

NP-completeness Proofs

In general, most `NP`-completeness proofs will make use of the following lemma:

Lemma. If some `NP`-complete language reduces to a language `L`, then `L` is `NP`-hard. If `L` is further in `NP` then `L` will be NP-complete.

Proof. Just compose the reductions.

Some NP-complete Problems

Let SAT = `{langle F rangle | langle F rangle` is a satisfiable boolean formula `}`
Let 3SAT = `{langle F rangle | langle F rangle` is a satisfiable CNF formula where each clause has at most three literals `}`.

Theorem. Both SAT and 3SAT are `NP`-complete.

Proof. First, both languages are in `NP` by the same argument that showed CIRCUIT-SAT is in `NP`. For hardness, we reduce CIRCUIT-SAT to each of them. Given an instance `langle C rangle` of CIRCUIT-SAT, let gate `i` be coded as `langle i, type, j, k rangle`. Here type is AND, OR, NOT, or input, and `j, k < i` are gates which are inputs to this gate. A `0` for `j` or `k` means that argument is not used. Let the `c_i`'s be new variables other than the input variables `x_j`. Recall the symbol `<=>` is true exactly when both its boolean inputs have the same value. For each gate we create a boolean formula either of the form `c_i <=> (c_j mbox( type ) c_k)`, where type is replaced with AND or OR; or of the form `c_i <=> (mbox( type ) c_j)` in the case of NOT or an input (in the input case type is nothing). The SAT formula we output on input `langle C rangle` is the conjunction of all such defining formulas conjoined with `c_w`, where `w` is the last gate in the circuit. The idea is that if `c_w` is true, then the right-hand side of its defining formula `c_w <=>...` must be true, and this propagates back to some setting of the inputs which makes the circuit output 1. By rewriting each of the formulas `c_i <=> (c_j mbox( type ) c_k)` in 3CNF, we can make this whole formula into 3CNF. We can pad clauses with fewer than 3 literals with dummy variables to make all clauses the same size.
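
The gate-by-gate translation in this proof can be sketched in Python. Several choices here are assumptions made for illustration rather than part of the lecture: gates are tuples `(i, type, j, k)` following the coding above, literals are nonzero integers in the style of the DIMACS CNF convention (`-v` is the negation of variable `v`), the variable of an input gate is used directly as the input variable, and the clauses chosen for each gate are one standard way of writing `c_i <=> ...` in CNF. Padding of short clauses is omitted. The brute-force `satisfiable` function is only there to check small examples and takes exponential time.

```python
from itertools import product
from typing import List, Tuple

Gate = Tuple[int, str, int, int]  # (i, type, j, k); 0 means "argument unused"


def circuit_to_cnf(gates: List[Gate]) -> List[List[int]]:
    """Translate a circuit into CNF clauses with at most three literals each."""
    clauses: List[List[int]] = []
    for i, kind, j, k in gates:
        if kind == "AND":    # c_i <=> (c_j AND c_k)
            clauses += [[-i, j], [-i, k], [i, -j, -k]]
        elif kind == "OR":   # c_i <=> (c_j OR c_k)
            clauses += [[i, -j], [i, -k], [-i, j, k]]
        elif kind == "NOT":  # c_i <=> (NOT c_j)
            clauses += [[i, j], [-i, -j]]
        elif kind != "INPUT":
            raise ValueError(f"unknown gate type: {kind}")
    # Finally require the last gate, the output of the circuit, to be true.
    clauses.append([gates[-1][0]])
    return clauses


def satisfiable(clauses: List[List[int]], num_vars: int) -> bool:
    """Brute-force check over all assignments (exponential; small examples only)."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False
```

On the circuit `NOT (x_1 AND x_2)`, coded as four gates, the resulting formula is satisfiable; on `x_1 AND (NOT x_1)` it is not, which agrees with the circuits themselves.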

NP, NP-Completeness, Reductions

Outline