Introduction

Last week, we were talking about the classes `P` and `NP` and about `NP`-completeness.
We said a language is in `P` if it has a polynomial time decision procedure (i.e., an algorithm that runs in `p`-time in the size of its input and outputs yes if the input is in the language and no if not).
We said a language `L` is in `NP` if it has a polynomial time verification algorithm `A(x,y)` and a polynomial `q` such that `x in L` iff there exists `y` such that `|y| leq q(|x|)` and `A(x,y)` outputs 1.
We said a language `L` is `p`-time reducible to another language `L'`, written `L leq_p L'`, if there is a polynomial time computable map `f` from strings to strings such that `x in L` iff `f(x) in L'`.
A language `L` is said to be `NP`-complete if it is in `NP` and any other language in `NP` is `p`-time reducible to `L`.
We exhibited several `NP`-complete languages last week: CIRCUIT-SAT, SAT, 3SAT, CLIQUE, VERTEX-COVER, HAM-CYCLE, TSP.
Today, we exhibit one more before moving on to approximation algorithms: SUBSET-SUM.

SUBSET-SUM

In the subset-sum problem we are given a finite set `S subset NN` and a target `t in NN`. We then ask: is there a subset `S' subseteq S` whose elements sum to `t`?
For example, if `S = {1,2,7, 14, 49}` and `t=16`, then the subset `S' = {2, 14}` is a solution.
Formally,
SUBSET-SUM = `{langle S, t rangle | exists S' subseteq S, t= sum_(s in S') s}`.
We assume in the framing of this problem that we are encoding the numbers in binary.
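To make the problem statement concrete, here is a minimal brute-force sketch in Python. The function name `subset_sum_brute_force` is ours, not part of the lecture. It simply tries every subset, so it takes time exponential in `|S|` and is only meant to illustrate what a solution looks like.

```python
from itertools import combinations


def subset_sum_brute_force(S, t):
    """Return a subset of S whose elements sum to t, or None if there is none.

    Tries all 2^|S| subsets, smallest first, so this is exponential time.
    """
    items = sorted(S)
    for size in range(len(items) + 1):
        for candidate in combinations(items, size):
            if sum(candidate) == t:
                return set(candidate)
    return None


# The example from the text: S = {1, 2, 7, 14, 49}, t = 16.
solution = subset_sum_brute_force({1, 2, 7, 14, 49}, 16)
print(solution)
```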

NP-Completeness of SUBSET-SUM

Theorem. SUBSET-SUM is `NP`-complete.

Proof. To see SUBSET-SUM is in `NP`, notice that if we are given an instance `langle S, t rangle` of subset sum and an encoding `langle S' rangle` of a candidate set of integers, then by a linear scan for each element of `S'` we can check if it is in `S`. Further, by another scan of `S'` we can compute the sum of the elements in `S'` and then check that it equals `t`. This whole procedure takes at most `O((|langle S, t rangle| + |langle S' rangle|)^2)` time and so is a polynomial time verification procedure for SUBSET-SUM.
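The verification procedure just described might be sketched as follows. This is an illustration only: the name `verify_subset_sum` and the choice to pass `S` and the certificate as Python lists are our assumptions, not part of the lecture.

```python
def verify_subset_sum(S, t, certificate):
    """Return True iff certificate lists distinct elements of S that sum to t."""
    cert = list(certificate)
    # A set has no repeated elements, so reject certificates with repeats.
    if len(set(cert)) != len(cert):
        return False
    # Linear scan of S for each element of the certificate.
    for x in cert:
        if x not in S:
            return False
    # One more scan to compute the sum and compare it with the target.
    return sum(cert) == t


print(verify_subset_sum([1, 2, 7, 14, 49], 16, [2, 14]))
```

With `S` given as a list, each membership test is a linear scan, which matches the quadratic bound in the proof.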

To show SUBSET-SUM is `NP`-hard, i.e., any language in `NP` reduces to it, it suffices to reduce 3SAT to SUBSET-SUM, as we already showed 3SAT is `NP`-complete. Suppose `phi(x_1, ..., x_n)` is an instance of 3SAT with clauses `C_1, ..., C_k`. WLOG, we can assume each clause has exactly three distinct literals, no clause has both a literal and its negation, and each variable appears in at least one clause.

The reduction creates two numbers in set `S` for each `x_i` and two numbers in `S` for each `C_j`. Numbers will be created in base 10, where each number contains `n + k` digits and each digit corresponds to either one variable or one clause. Proof continues next slide...

NP-Completeness of SUBSET-SUM cont'd

We construct `S` and `t` by labeling each digit position by either a variable or a clause (the picture on the slide shows the resulting table of digits). The least significant `k` digits are labeled by clauses, and the most significant `n` digits are labeled by variables. In the example in the picture, `phi = C_1 ^^ C_2 ^^ C_3 ^^ C_4`, where `C_1 = (x_1 vv neg x_2 vv neg x_3)`, `C_2 = (neg x_1 vv neg x_2 vv neg x_3)`, `C_3 = (neg x_1 vv neg x_2 vv x_3)`, and `C_4 = (x_1 vv x_2 vv x_3)`.

The target `t` has a `1` in each digit labeled by a variable and a `4` in each digit labeled by a clause.
For each variable `x_i`, there are two integers `v_i` and `v'_i` in `S`. Each has a `1` in the digit labeled by `x_i` and `0`'s in the other variable digits. If literal `x_i` appears in clause `C_j`, then the digit labeled by `C_j` in `v_i` contains a `1`. If literal `neg x_i` appears in clause `C_j`, then the digit labeled by `C_j` in `v'_i` contains a `1`. All other digits are `0`.
For each clause `C_j`, there are two integers `s_j` and `s'_j` in `S`. Each has `0`'s in all digits other than the one labeled by `C_j`. For `s_j`, there is a `1` in the `C_j` digit, and `s'_j` has a `2` in this digit.
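The construction of `S` and `t` can be sketched in Python as below. The representation is our assumption: a clause is a list of nonzero integers, where `i` stands for the literal `x_i` and `-i` for `neg x_i`, and the function name `reduce_3sat_to_subset_sum` is hypothetical. The digit for `x_1` is taken as the most significant and the digit for `C_k` as the least significant.

```python
def reduce_3sat_to_subset_sum(n, clauses):
    """Build the SUBSET-SUM numbers for a 3SAT formula with n variables.

    Returns (v, v_prime, s, s_prime, t); the set S is the union of the
    four lists and t is the target.
    """
    k = len(clauses)

    def var_digit(i):      # place value of the digit labeled x_i (1-based)
        return 10 ** (k + n - i)

    def clause_digit(j):   # place value of the digit labeled C_j (1-based)
        return 10 ** (k - j)

    v, v_prime = [], []
    for i in range(1, n + 1):
        pos = var_digit(i)
        neg = var_digit(i)
        for j, clause in enumerate(clauses, start=1):
            if i in clause:        # literal x_i appears in C_j
                pos += clause_digit(j)
            if -i in clause:       # literal neg x_i appears in C_j
                neg += clause_digit(j)
        v.append(pos)
        v_prime.append(neg)

    s = [clause_digit(j) for j in range(1, k + 1)]
    s_prime = [2 * clause_digit(j) for j in range(1, k + 1)]
    t = sum(var_digit(i) for i in range(1, n + 1)) + 4 * sum(s)
    return v, v_prime, s, s_prime, t


# The four-clause example formula from the text.
phi = [[1, -2, -3], [-1, -2, -3], [-1, -2, 3], [1, 2, 3]]
v, v_prime, s, s_prime, t = reduce_3sat_to_subset_sum(3, phi)
print(v, v_prime, s, s_prime, t)
```

Using Python integers sidesteps any explicit digit arrays; a digit-by-digit version would follow the same loops.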

NP-Completeness of SUBSET-SUM Cont'd Some More

The sum of the digits in any one digit position is at most `6` (in a clause position, three `1`'s from literals plus `1` from `s_j` and `2` from `s'_j`), so we don't have to worry about carries when we add numbers in `S`. `S` contains `2n + 2k` values, each with `n + k` digits, and the time to produce each digit is polynomial in `n + k`. Each digit of the target can be computed in constant time, so the whole reduction from `phi` to the `S` and `t` described above is `p`-time.

Suppose `phi` is satisfiable and fix a satisfying assignment. If `x_i = 1` in this assignment, include `v_i` in `S'`; otherwise include `v'_i` in `S'`. The sum of the `x_i` digit position over `S'` will be `1`, as we include exactly one of `v_i` and `v'_i` and all other numbers in `S` have `0` in the `x_i` digit position.

If we sum a `C_j` digit position over the elements so far added to `S'`, we get `1`, `2`, or `3`, depending on how many literals of this clause the assignment makes true (at least one, since the assignment satisfies `C_j`). We can then add `s_j`, `s'_j`, or both to `S'` to bring this sum up to `4`. Hence, we have shown there exists an `S'` which achieves the target.

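Conversely, suppose we have constructed `S` and `t` as above, and there is an `S'` that achieves the target sum. Since there are no carries, each digit column must hit its target separately. To satisfy the `x_i` digit columns we must have exactly one of `v_i` or `v'_i` in `S'`, from which we get an assignment for `phi`: set `x_i = 1` if `v_i in S'` and `x_i = 0` if `v'_i in S'`. In each `C_j` column, `s_j` and `s'_j` together contribute at most `3`, so to reach `4` at least one chosen `v_i` or `v'_i` has a `1` there, i.e., some literal of `C_j` is true under the assignment. So this is a satisfying assignment.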
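The forward direction of this argument can be checked mechanically. The sketch below turns a satisfying assignment into the subset `S'`, under the same assumed clause encoding (`i` for `x_i`, `-i` for `neg x_i`) and digit layout (`x_1` most significant, `C_k` least significant); the function name `subset_from_assignment` is hypothetical.

```python
def subset_from_assignment(n, clauses, assignment):
    """Return the list S' built from a satisfying assignment.

    assignment maps each variable index 1..n to True or False.
    """
    k = len(clauses)
    chosen = []
    # Pick v_i if x_i is true, otherwise v'_i.
    for i in range(1, n + 1):
        literal = i if assignment[i] else -i
        number = 10 ** (k + n - i)
        for j, clause in enumerate(clauses, start=1):
            if literal in clause:
                number += 10 ** (k - j)
        chosen.append(number)
    # Top up every clause column to 4 with s_j (value 1) and s'_j (value 2).
    for j, clause in enumerate(clauses, start=1):
        true_literals = sum(
            1 for lit in clause if assignment[abs(lit)] == (lit > 0)
        )
        if true_literals == 0:
            raise ValueError("assignment does not satisfy clause %d" % j)
        slack = 4 - true_literals          # 1, 2, or 3
        if slack in (1, 3):
            chosen.append(10 ** (k - j))       # s_j
        if slack in (2, 3):
            chosen.append(2 * 10 ** (k - j))   # s'_j
    return chosen


phi = [[1, -2, -3], [-1, -2, -3], [-1, -2, 3], [1, 2, 3]]
subset = subset_from_assignment(3, phi, {1: False, 2: False, 3: True})
print(subset, sum(subset))
```

For the example formula the target is `1114444`, and the assignment `x_1 = 0, x_2 = 0, x_3 = 1` yields a subset with exactly that sum.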

Quiz

Which of the following is true?

CIRCUIT-SAT for circuits using NAND gates rather than AND, OR, NOT gates is no longer NP-complete.
Our reduction from last week of CLIQUE to VERTEX-COVER introduced a quadratic blow-up in the size of the output VERTEX-COVER graph as compared to the input instance of CLIQUE.
Our reduction from VERTEX-COVER to HAM-CYCLE made use of widgets with 12 nodes that we used for every edge in the original graph.

Approximation Algorithms, Performance Ratios

Since it seems hard to find exact solutions to the optimization problems associated with a given `NP`-complete problem, it is natural to ask if one can get approximate solutions in polynomial time.
We say an algorithm for a problem has an approximation ratio of `r(n)`, if for any input of size `n`, the cost `C` of the solution produced by the algorithm is within a factor of `r(n)` of the cost `C^star` of the optimal solution. That is, `max(C/C^star, C^star/C) le r(n)`.
We call an algorithm that achieves an `r(n)`-approximation ratio an `r(n)`-approximation algorithm.
Some `NP`-complete problems have a trade-off between the approximation ratio and the run time.
An approximation scheme for an optimization problem is an algorithm that takes as input both an instance of the problem and a constant `epsilon > 0` and then, for that fixed `epsilon`, behaves as a `(1 + epsilon)`-approximation algorithm on the instance.
If for any fixed `epsilon > 0` the approximation scheme runs in time polynomial in the instance size `n`, then it is called a polynomial time approximation scheme.
We say that an approximation scheme is a fully polynomial time approximation scheme if its run time is polynomial in both `1/epsilon` and the instance size `n`. For example, the scheme might have a running time of `O((1/epsilon)^2n^3)`.
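As a small illustration of the ratio `max(C/C^star, C^star/C)`, the helper below computes it for a single instance. The name `approximation_ratio` is ours. Taking the maximum of the two quotients makes the same definition work for minimization problems (where `C ge C^star`) and maximization problems (where `C le C^star`).

```python
def approximation_ratio(cost, optimal_cost):
    """Return max(C / C*, C* / C) for positive costs C and C*."""
    return max(cost / optimal_cost, optimal_cost / cost)


# Minimization: a cover of size 6 when the optimum has size 3.
print(approximation_ratio(6, 3))
# Maximization: a solution of value 3 when the optimum has value 6.
print(approximation_ratio(3, 6))
```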

The Vertex Cover Problem

The optimization problem associated with VERTEX-COVER is to find a minimum-size vertex cover of an input graph `G`.
The following algorithm takes a graph `G` and outputs a vertex cover whose size is at most twice the optimal.

APPROX-VERTEX-COVER(G)
1 C=∅
2 E'= E[G]
3 while E' ≠ ∅
4    let {u, v} be an arbitrary edge of E'
5    C = C ∪ {u, v}
6    Remove from E' every edge incident with either u or v
7 return C.
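A minimal Python sketch of this procedure follows. The function name and the edge-list input are our assumptions. Instead of physically deleting edges from `E'`, it scans the edges once and skips any edge that already has an endpoint in `C`; those are exactly the edges the pseudocode would have removed. It also returns the list of picked edges, which is convenient when checking the analysis.

```python
def approx_vertex_cover(edges):
    """Return (cover, picked) for an undirected graph given as an edge list.

    cover is a vertex cover; picked is the list of edges chosen in the loop.
    """
    cover = set()
    picked = []
    for u, v in edges:
        # An edge touching the cover counts as already removed from E'.
        if u not in cover and v not in cover:
            cover.update((u, v))
            picked.append((u, v))
    return cover, picked


# A small hypothetical graph on vertices a..g.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e"),
         ("d", "e"), ("d", "f"), ("d", "g"), ("e", "f")]
cover, picked = approx_vertex_cover(edges)
print(sorted(cover), picked)
```

On this graph the sketch returns a cover of size 6, while `{b, d, e}` is a cover of size 3, so the factor of 2 can actually be reached.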

Analysis of APPROX-VERTEX-COVER

Theorem. APPROX-VERTEX-COVER is a p-time 2-approximation algorithm.

Proof. First, the algorithm runs in time `O(|V| +|E|)`, as we delete two vertices and at least one edge each time through the loop.

The set `C` returned by the algorithm is a vertex cover, since each edge that is removed is covered by some vertex in `C`, and the loop continues until no edges are left.

To see that the cover returned is at most twice the optimal, let `A` denote the set of edges which were picked in line 4. In order to cover the edges in `A`, any vertex cover (including the optimal `C^star`) must include at least one endpoint of each edge in `A`. No two edges in `A` share an endpoint, so no two edges from `A` are covered by the same vertex from `C^star`. So `|C^star | ge |A|`. On the other hand `|C| = 2|A|`. Hence `|C| = 2|A| le 2|C^star|`.

Approximating the Traveling Salesman Problem

The optimization problem associated with TSP is to find a tour of least cost.
Here is a 2-approximation algorithm for this problem when the triangle inequality holds on the distances between cities.

APPROX-TSP-TOUR(G, c)
1. Select a vertex r to be a root vertex
2. Compute a minimum spanning tree T for G from root r using Prim's algorithm
3. Let L be the list of vertices visited in a pre-order tree walk of T
4. return the Hamiltonian cycle H that visits the vertices in order L.
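The four steps can be sketched in Python as follows. To keep the example self-contained we assume the cities are points in the plane and the cost is Euclidean distance (which satisfies the triangle inequality); this input format and the names `approx_tsp_tour` and `tour_cost` are our assumptions. The spanning tree is built with a simple array-based version of Prim's algorithm rather than a priority queue.

```python
import math


def approx_tsp_tour(points, root=0):
    """Return the vertex order of a tour: MST from root, then pre-order walk."""
    n = len(points)
    best = [math.inf] * n        # cheapest known edge into the tree
    parent = [None] * n
    in_tree = [False] * n
    children = [[] for _ in range(n)]
    best[root] = 0.0

    # Prim's algorithm on the complete graph, O(n^2) array version.
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]),
                key=best.__getitem__)
        in_tree[u] = True
        if parent[u] is not None:
            children[parent[u]].append(u)
        for v in range(n):
            d = math.dist(points[u], points[v])
            if not in_tree[v] and d < best[v]:
                best[v] = d
                parent[v] = u

    # Pre-order walk: list a vertex, then walk each of its subtrees.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(reversed(children[u]))
    return order


def tour_cost(points, order):
    """Cost of the cycle visiting points in the given order and returning."""
    n = len(order)
    return sum(math.dist(points[order[i]], points[order[(i + 1) % n]])
               for i in range(n))


square = [(0, 0), (0, 1), (1, 1), (1, 0)]
order = approx_tsp_tour(square)
print(order, tour_cost(square, order))
```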

Subroutines used by our algorithm

Recall that in a pre-order traversal of a tree starting from some node, we visit the current node first and then recursively visit each of its children.
Recall that Prim's algorithm constructs a minimum spanning tree by growing a tree so far, denoted `A`, which at the start of the algorithm is the empty tree.
We maintain a priority queue of all the vertices not in `A`.
The priority, `v.key`, for a vertex `v` in the queue is the least weight of any edge connecting `v` with `A`. If no such edge exists then it is `infty`.
Let `v.pi` be the parent of `v` in the tree. Rather than explicitly have an `A` we use this parent structure to get the tree when the algorithm terminates.

Here is the pseudo-code:

MST-PRIM(G, w, r) // r is a starting node to grow the tree from
for each u in G.V
   u.key = infty
   u.pi = NIL
r.key = 0 // r.pi stays NIL since the root has no parent
Q = MAKE-QUEUE(G.V) // will have all vertices
while Q ≠ ∅
    u = EXTRACT-MIN(Q)
    for each v in G.adj[u]
       // compare the edge weight alone; adding u.key would give Dijkstra's rule
       if v in Q and w(u, v) < v.key
           v.pi = u
           v.key = w(u, v) // call appropriate DECREASE-KEY
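Here is a runnable Python sketch of Prim's algorithm. Python's `heapq` module has no DECREASE-KEY operation, so this version swaps in a common workaround: it pushes a new heap entry whenever a key improves and skips stale entries when they are popped. The adjacency-dictionary input format and the name `mst_prim` are our assumptions.

```python
import heapq
import math


def mst_prim(adj, r):
    """Return the parent map pi of a minimum spanning tree grown from r.

    adj maps each vertex to a dict {neighbor: edge weight} and must describe
    a connected undirected graph. Vertices must be mutually comparable,
    since ties on weight are broken by comparing vertices in the heap.
    """
    key = {u: math.inf for u in adj}
    pi = {u: None for u in adj}
    key[r] = 0
    in_tree = set()
    heap = [(0, r)]
    while heap:
        _, u = heapq.heappop(heap)
        if u in in_tree:
            continue              # stale entry left over from an older key
        in_tree.add(u)
        for v, weight in adj[u].items():
            # Prim's rule: compare the edge weight alone with v's key.
            if v not in in_tree and weight < key[v]:
                key[v] = weight
                pi[v] = u
                heapq.heappush(heap, (weight, v))
    return pi


graph = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 2, "d": 5},
    "c": {"a": 4, "b": 2, "d": 1},
    "d": {"b": 5, "c": 1},
}
print(mst_prim(graph, "a"))
```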

Analysis of APPROX-TSP-TOUR

Theorem. APPROX-TSP-TOUR is a p-time 2-approximation algorithm for TSP with triangle-inequality holding on the cost function.

Proof. The minimum spanning tree algorithm runs in time `O(|V|^2)`. The remaining steps take at most `O(|G|)` time.

Let `H^star` denote an optimal tour of the vertices. Since we can obtain a spanning tree from any tour by deleting an edge, and edge costs are nonnegative, we have `c(T) le c(H^star)` where `T` is our minimum spanning tree. A full walk `F` of `T` lists the vertices when they are first visited and also whenever they are returned to after a visit to a subtree. The full walk traverses every edge of `T` exactly twice, so `c(F) = 2c(T) le 2c(H^star)`. A full walk is typically not a tour since it lists some vertices more than once.

On the other hand, the `H` returned by the algorithm is a tour and satisfies `c(H) le c(F)`, since it is obtained by deleting repeated vertices from the full walk and since the triangle inequality holds. We use the triangle inequality as follows: if we have a sequence `a b c` in the full walk and delete `b`, the walk goes directly from `a` to `c`, and `c(a, c) le c(a, b) + c(b, c)`, so the cost does not rise. Hence `c(H) le 2c(H^star)`.

Subset-Sum, Start Approximation Algorithms
