CS577 Introduction to algorithms notes

CS 577 SECTION 1 
PRE-MIDTERM REVIEW GUIDE
Fall 2019

1. Useful math

Asymptotic relations: O, Omega, Theta, ~, o

Run times can be polynomial, exponential, or in between
Pseudopolynomial algorithms (= poly time for inputs in unary)

Mathematical induction (for correctness proofs AND alg analysis)

Summing by levels to estimate solution to a recurrence

2. Divide and conquer, recursion

Hallmark: Rapid reduction in problem size, usually by a constant
factor.

Exponentiation
Mergesort -- "oblivious" generation of subproblems
Inversion counting -- example of piggybacking
Closest pair in the plane -- subproblems depend on data
Efficient integer arithmetic (works for polynomials too)
"School" algorithms
Karatsuba multiplication
Newton iteration (reduces division to multiplication)
Strassen matrix multiplication
Fast Fourier transform as a remainder tree
Applications: polynomial and integer multiplication
convolutions
listing sums and inner products

3. Greedy algorithms

Hallmark: Irrevocable choices, based on (usually numerical) priority

Review of graph theory (BFS and DFS, analogous to marching thru an array)
Problems solvable in linear time: components, 2-coloring, cycle detection,
independent cycles
Dijkstra single source shortest path with heaps
Kruskal minimum spanning tree with union-find trees (exchange principle)
Unit-weight interval scheduling (greedy stays ahead)
EDD rule to minimize maximum tardiness (correctness uses inversions)

4. Dynamic programming

Hallmark: Systematic solution of subproblems

Bellman equations and the Principle of Optimality (different for
each problem)

Fermat least-time principle (basically shortest path)
Weighted interval scheduling
Segmented least squares

DP applied to computation: Fibonacci numbers, subset sum problem
CS 577 SECTION 1
POST-MIDTERM REVIEW GUIDE
FALL 2019

Dynamic Programming Examples

Math ideas: induction on structures (arrays, trees, etc.)
polynomial vs. pseudopolynomial algorithms

Fibonacci numbers
Subset sum
Knapsack
Vertex cover for trees
Longest increasing subsequence and patience sorting
Edit distance
Bellman-Ford shortest path algorithm (allows for negative edges)

Randomization

Probability ideas:
Independence (probabilities multiply)
Expected value (similar to an integral)
Linearity of expectation
Waiting time for success
Conditional probability

Contention resolution
Global min-cut
Quicksort and selection by rank
Rabin-Karp string matching

Network Flow

Linear and integer programming (just the setup, no algorithms)
Flow network model: Flows and cuts, flow value <= cut capacity
Failure of greedy algorithm
Residual graphs
Ford-Fulkerson procedure
pseudopolynomial if capacities are integral
Max-flow / Min-cut theorem (strong duality)
Applications
Unit capacity networks and bipartite matching
Baseball elimination
Project selection

Lower Bound Techniques

Decision trees and leaf-counting ("information theory") bounds
Adversary arguments

NP-Completeness

Decision problem classes: P, NP (defined using certificates)
Polynomial-time reductions (multiple queries allowed)
NP-complete: i) in NP, ii) all other NP problems reduce to it
if ii) only, problem is called NP-hard
Cook-Levin theorem: Boolean formula satisfiability is NP-complete
Satisfiability variations: unrestricted formulas, circuits,
conjunctive normal form (CNF)
Graph problems:
Vertex Cover (covering)
Clique and Independent Set (packing)
3-dimensional Matching
Subset Sum

Dynamic programming

Subset sum

Given a set of non-negative integers and a target sum, determine whether there is a subset of the given set whose elements sum to exactly the target.

```python
def subset_sum(x_list, target_sum):
    # a target of 0 is trivially achievable (empty subset)
    if target_sum == 0:
        return True

    # create a len(x_list) x target_sum matrix
    res = [[0 for j in range(target_sum)] for i in range(len(x_list))]
    # res[i][j-1] records whether value j is achievable from some subset of x_list[:i+1]

    # initial condition: using only the first element
    for j in range(1, target_sum + 1):
        if x_list[0] == j:
            res[0][j - 1] = 1

    # the Bellman equation can be written as
    for i in range(1, len(x_list)):
        for j in range(1, target_sum + 1):
            if x_list[i] > j:
                # the current element is larger than the current target: we cannot select it
                res[i][j - 1] = res[i - 1][j - 1]
            elif x_list[i] == j:
                # the element alone achieves the target
                res[i][j - 1] = 1
            else:
                # the element is smaller than the target: either achieve j without it,
                # or include it and achieve the remainder j - x_list[i]
                if res[i - 1][j - 1] == 1:
                    res[i][j - 1] = 1
                else:
                    res[i][j - 1] = res[i - 1][j - 1 - x_list[i]]

    # see the dynamic programming table
    print(res)

    # output the result, True or False
    return res[len(x_list) - 1][target_sum - 1] == 1

# test the function
print(subset_sum([1, 2, 3, 4], 10))
```

Knapsack problem

Given weights and values of n items, put these items in a knapsack of capacity W so as to maximize the total value in the knapsack. In other words, given two integer arrays val[0..n-1] and wt[0..n-1] representing the values and weights of the n items, and an integer W representing the knapsack capacity, find the maximum-value subset of val[] such that the sum of the weights of this subset is at most W.


```python
def knapsack(weights, values, max_weight):
    # create a len(weights) x max_weight matrix; row i is item i,
    # column j is the current capacity j + 1
    res = [[0 for j in range(max_weight)] for i in range(len(weights))]
    for j in range(max_weight):
        if weights[0] <= j + 1:
            res[0][j] = values[0]
    for i in range(1, len(weights)):
        for j in range(max_weight):
            if weights[i] > j + 1:
                # item does not fit: inherit the best value without it
                res[i][j] = res[i - 1][j]
            elif weights[i] == j + 1:
                # item exactly fills the capacity: take it if it beats the best so far
                res[i][j] = max(values[i], res[i - 1][j])
            else:
                # either skip the item, or take it and fill the remaining capacity
                res[i][j] = max(res[i - 1][j], res[i - 1][j - weights[i]] + values[i])
    return res[-1][-1]

print(knapsack([5, 4, 6, 2], [3, 2, 6, 1], 16))
```

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34



### Vertex cover for trees

The problem: find a minimum-size set of vertices that covers all the edges of a tree.

For each vertex $i$, let $dp[i][0]$ be the minimum value for the subtree rooted at this node when we put this vertex into the final cover, and let $dp[i][1]$ be the minimum value when we do not put this vertex into the final cover. The Bellman equations are then
$$
dp[i][0] = 1 + \sum_{parent[u]=i} \min(dp[u][0],\, dp[u][1])
$$

$$
dp[i][1] = \sum_{parent[u]=i} dp[u][0]
$$

The first formula says that when we take the root node into the cover, each of its children may freely decide whether or not to enter the cover.

The second formula says that when we do not take the root node into the cover, all of its children must enter the cover (otherwise the edges to them would be uncovered).
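A minimal sketch of this DP in Python, assuming the tree is given as a `children` adjacency dict (a hypothetical input format); the recursion mirrors the two equations above:

```python
def tree_vertex_cover(children, root):
    # returns the minimum size of a vertex cover of the tree rooted at `root`
    def solve(v):
        take, skip = 1, 0   # take = dp[v][0] (v in cover), skip = dp[v][1]
        for u in children.get(v, []):
            u_take, u_skip = solve(u)
            take += min(u_take, u_skip)  # child is free when v is taken
            skip += u_take               # child is forced in when v is skipped
        return take, skip
    return min(solve(root))

# path a-b-c-d: {b, c} covers all three edges, so the answer is 2
children = {'a': ['b'], 'b': ['c'], 'c': ['d']}
print(tree_vertex_cover(children, 'a'))  # 2
```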



### Longest increasing subsequence

```python
def longest_increasing_subsequence(x_list):
    # res[i] = length of the longest increasing subsequence ending at x_list[i]
    res = [1 for i in range(len(x_list))]
    for i in range(len(x_list)):
        for j in range(i):
            if x_list[i] > x_list[j]:
                res[i] = max(res[i], res[j] + 1)
    return max(res)

print(longest_increasing_subsequence([1, 5, 2, 3, 7, 6]))  # 4
```
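The review guide pairs LIS with patience sorting: the quadratic DP above drops to $O(n \log n)$ if we keep, for each length, the smallest possible tail of an increasing subsequence of that length and binary-search for the pile to extend. A sketch:

```python
import bisect

def lis_patience(x_list):
    # tails[k] = smallest tail of an increasing subsequence of length k + 1
    tails = []
    for x in x_list:
        k = bisect.bisect_left(tails, x)  # leftmost pile whose top is >= x
        if k == len(tails):
            tails.append(x)   # x extends the longest subsequence so far
        else:
            tails[k] = x      # x becomes a smaller tail for length k + 1
    return len(tails)

print(lis_patience([1, 5, 2, 3, 7, 6]))  # 4
```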

Randomized algorithm

Randomized median finding

The problem is to find the median of a list of n elements.

The natural approach is to select a pivot element p, compute its rank, partition the list into two sublists, restrict our search to one of them, discard the other, and proceed recursively.

The worst case occurs when we discard very few elements in each recursive call; this can lead to $cn^2$ comparisons for some constant $c$.

The randomized approach selects the pivot uniformly at random among all elements in the list. In expectation this algorithm performs at most $cn$ comparisons for some appropriate constant $c$.

Let $f_{n}$ be the number of comparisons this algorithm performs on an input of n elements. We want to compute $E(f_n)$.

Let $l$ be the random variable giving the size of the sublist of elements smaller than $p$; in the worst case, the list we keep has size $\max(l, n-1-l)$.

Let $k$ be a random variable denoting the length of the list we keep after the first pivot. Since both $l=0$ and $l=n-1$ result in keeping a sublist of length $k=n-1$, the probability that the kept list has length $n-1$ is $P(k=n-1)=\frac{2}{n}$. More generally, we have that $\mathbb{P}(k=i)=\frac{2}{n}$ for $i=\frac{n}{2}, \frac{n}{2}+1, \cdots, n-2, n-1$.

To compute $E(f_n)$, we introduce the event $A_k$ that the sublist we keep after the first level has size $k$. These events are disjoint. So $f_{n}=f_{n}\left(I_{A_{n / 2}}+I_{A_{n / 2+1}}+\cdots+I_{A_{n-1}}\right)$, and $\mathbb{E}\left(f_{n}\right)=\mathbb{E}\left(f_{n} I_{A_{n / 2}}\right)+\mathbb{E}\left(f_{n} I_{A_{n / 2+1}}\right)+\cdots+\mathbb{E}\left(f_{n} I_{A_{n-1}}\right)$

For one term $E(f_nI_{A_k})$, we have
$$
\begin{aligned} \mathbb{E}\left(f_{n} I_{A_{k}}\right) &=\mathbb{E}\left(\left(n-1+f_{k}\right) I_{A_{k}}\right)=\mathbb{E}\left((n-1) I_{A_{k}}\right)+\mathbb{E}\left(f_{k} I_{A_{k}}\right) \\ &=(n-1) \mathbb{E}\left(I_{A_{k}}\right)+\mathbb{E}\left(f_{k}\right) \mathbb{E}\left(I_{A_{k}}\right) \\ &=\left((n-1)+\mathbb{E}\left(f_{k}\right)\right) \mathbb{E}\left(I_{A_{k}}\right) \end{aligned}
$$
The first line holds because each time we choose a pivot, it is compared against all $n-1$ remaining elements, and the number of comparisons still to be done is $f_k$; that is, $f_n = n - 1 + f_k$.

The second line holds because $f_k$ and $A_k$ are independent: $f_k$ depends only on the pivot choices from the second step onward, while $A_k$ is determined by the first pivot choice.

Using $\mathbb{E}\left(I_{A_{k}}\right)=\mathbb{P}\left(A_{k}\right)$ and summing over k, we get
$$
\begin{aligned} \mathbb{E}\left(f_{n}\right) &=\sum_{k=n / 2}^{n-1}\left(n-1+\mathbb{E}\left(f_{k}\right)\right) \mathbb{P}\left(A_{k}\right) \\ &=\left((n-1) \sum_{k=n / 2}^{n-1} \mathbb{P}\left(A_{k}\right)\right)+\sum_{k=n / 2}^{n-1} \mathbb{E}\left(f_{k}\right) \mathbb{P}\left(A_{k}\right) \\ &=(n-1)+\frac{2}{n} \sum_{k=n / 2}^{n-1} \mathbb{E}\left(f_{k}\right) \end{aligned}
$$

We prove by induction that $\mathbb{E}\left(f_{n}\right) \leq c n$: assume $\mathbb{E}\left(f_{m}\right) \leq c m$ for all $m<n$.
$$
\begin{aligned} \mathbb{E}\left(f_{n}\right) &=(n-1)+\frac{2}{n} \sum_{k=n / 2}^{n-1} \mathbb{E}\left(f_{k}\right) \\ & \leq(n-1)+\frac{2}{n} c \sum_{k=n / 2}^{n-1} k \\ &<(n-1)+c \frac{3}{4} n \\ &<\left(1+\frac{3 c}{4}\right) n \\ & \leq c n \end{aligned}
$$
provided that we choose $c\geq4$. This proves by induction that $\mathbb{E}\left(f_{n}\right) \leq 4 n$.
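A minimal sketch of this randomized selection procedure (quickselect); `rank` is 0-based, so the median corresponds to `rank = len(x_list) // 2`:

```python
import random

def quickselect(x_list, rank):
    # returns the element of 0-based rank `rank` (rank 0 = minimum)
    pivot = random.choice(x_list)          # pivot chosen uniformly at random
    smaller = [x for x in x_list if x < pivot]
    equal = [x for x in x_list if x == pivot]
    larger = [x for x in x_list if x > pivot]
    if rank < len(smaller):
        return quickselect(smaller, rank)  # answer is in the smaller side
    if rank < len(smaller) + len(equal):
        return pivot                       # the pivot itself is the answer
    return quickselect(larger, rank - len(smaller) - len(equal))

data = [9, 1, 8, 2, 7, 3, 6, 4, 5]
print(quickselect(data, len(data) // 2))  # median: 5
```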

Randomized quick sort

Basic quicksort always takes the first or the last element as the pivot.

The randomized version selects the pivot uniformly at random from the list.

Let $f(j)$ be the number of comparisons quicksort takes to sort $j$ items; if the pivot's rank leaves $k$ elements on the smaller side, we then get

$$
f(n)=n-1+f(k)+f(n-k-1)
$$
The best case for quicksort occurs when the pivot always falls in the exact middle of the list, giving
$$
f(n) \leq 2 f(n / 2)+n-1
$$

If we let $f(n)$ be the random variable giving the number of comparisons the algorithm makes on an input of size $n$, then, conditioning on a pivot that leaves $k$ elements on the smaller side, we have:
$$
\mathbb{E}(f(n))=n-1+\mathbb{E}(f(k))+\mathbb{E}(f(n-k-1))
$$

Let T(n) be the expected number of comparisons, we have:
$$
T(n)=(n-1)+\frac{1}{n} \sum_{i=0}^{n-1}(T(i)+T(n-i-1))
$$

$$
T(n)=(n-1)+\frac{2}{n} \sum_{i=1}^{n-1} T(i)
$$

Now, we can solve this by the “guess and prove inductively” method. In order to do this, we first need a good guess, and we guess the form $c n \ln n$ for some constant $c$. Once we’ve made our guess, we need to evaluate the resulting summation. One of the easiest ways of doing this is to upper-bound the sum by an integral. In particular, if $f(x)$ is an increasing function, then:
$$
\sum_{i=1}^{n-1} f(i) \leq \int_{1}^{n} f(x) d x
$$
In our case, we will be using the fact that $\int(c x \ln x) d x=(c / 2) x^{2} \ln x-c x^{2} / 4$ .

Now, we are guessing that $T(i) \leq c i \ln i \text { for } i \leq n-1$. This guess works for the base case T(1) = 0. By induction we have:
$$
\begin{aligned} T(n) & \leq(n-1)+\frac{2}{n} \sum_{i=1}^{n-1}(c i \ln i) \\ & \leq(n-1)+\frac{2}{n} \int_{1}^{n}(c x \ln x) d x \\ & \leq(n-1)+\frac{2}{n}\left((c / 2) n^{2} \ln n-c n^{2} / 4+c / 4\right) \\ & \leq c n \ln n, \text { for } c=2 \end{aligned}
$$
In terms of the number of comparisons it makes, randomized quicksort is equivalent to randomly shuffling the input and then handing it off to basic quicksort. So we have also proven that basic quicksort has $O(n \log n)$ average-case running time.
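A minimal sketch of randomized quicksort matching this analysis (not in-place, for clarity; each call picks its pivot uniformly at random and spends its $n-1$ comparisons partitioning):

```python
import random

def random_quicksort(x_list):
    if len(x_list) <= 1:
        return x_list
    pivot = random.choice(x_list)               # uniform random pivot
    smaller = [x for x in x_list if x < pivot]  # n - 1 comparisons in total
    equal = [x for x in x_list if x == pivot]
    larger = [x for x in x_list if x > pivot]
    return random_quicksort(smaller) + equal + random_quicksort(larger)

print(random_quicksort([3, 1, 4, 1, 5, 9, 2, 6]))
```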

Rabin-Karp string matching algorithm

    Text:    A A B A A C A A D A A B A A B A
    Pattern: A A B A

Give each letter a value, so the pattern gets a hash value (e.g., the sum of its letters' values). Then slide a window of the pattern's length over the text, computing the hash value of each window; whenever a window's hash equals the pattern's, compare the window with the pattern elementwise.

A stronger hash function is needed, one that tends to give distinct windows distinct hash values. For example, use the positional hash $a_1\times 10^3 + a_2\times 10^2 + a_3 \times 10^1 + a_4\times 10^0$.

If the value grows too large, you can reduce it modulo a (preferably prime) number.
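A minimal sketch of Rabin-Karp with a rolling version of the positional hash above, generalized to base `B` and modulus `Q` (arbitrary choices here); the point is that each window's hash is derived from the previous one in O(1):

```python
def rabin_karp(text, pattern, B=256, Q=101):
    n, m = len(text), len(pattern)
    if m > n:
        return -1
    high = pow(B, m - 1, Q)                    # weight of the leading character
    p_hash = w_hash = 0
    for i in range(m):                         # hash the pattern and first window
        p_hash = (p_hash * B + ord(pattern[i])) % Q
        w_hash = (w_hash * B + ord(text[i])) % Q
    for s in range(n - m + 1):
        if p_hash == w_hash and text[s:s + m] == pattern:
            return s                           # verified match at shift s
        if s < n - m:                          # roll: drop text[s], add text[s+m]
            w_hash = ((w_hash - ord(text[s]) * high) * B + ord(text[s + m])) % Q
    return -1

print(rabin_karp("AABAACAADAABAABA", "AABA"))  # 0 (first occurrence)
```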

Analysis of contraction algorithm

To solve the global minimum cut problem

Cut: given an undirected graph $G=(V, E)$, we define a cut of $G$ to be a partition of $V$ into non-empty sets A and B.

s-t cut: when $G$ comes with distinguished source and sink nodes s and t, an s-t cut is defined to be a partition of V into sets A and B such that $s\in A$ and $t \in B$.

size of cut (A,B): number of edges with one end in A and the other in B.

global minimum cut: a cut of minimum size, and any cut is allowed.

The global min-cut problem can be solved in polynomial time by converting it into s-t cut problems: fix one node $s$, compute a minimum s-t cut for each choice of sink $t \in V-\{s\}$, and take the smallest cut found. (This uses the known fact that a minimum s-t cut can be computed in polynomial time.)

The Contraction Algorithm

Begin by choosing an edge $e=(u, v)$ of G uniformly at random and contracting it: merge $u$ and $v$ into a single new node $w$, delete all edges between $u$ and $v$, and reattach every other edge that left $u$ or $v$, or pointed into $u$ or $v$, so that it is incident to $w$ instead.

Continue recursively until only two supernodes S(v1) and S(v2) remain, which form a partition of V. We output (S(v1), S(v2)) as the cut found by the algorithm.


Analysis: the algorithm appears to be making blind random choices, but there is a guaranteed probability with which it succeeds in finding a global min-cut.

The algorithm returns a global min-cut of G with probability at least $1/\binom{n}{2}$, where $n$ is the number of vertices.

Proof:

Suppose there is a global min-cut (A,B) of G and suppose it has size k, which implies that there is a set F of k edges with one end in A and the other in B. We want to give a lower bound on the probability that the Contraction Algorithm returns the cut (A,B).

If the algorithm fails to return the cut (A,B), it must have contracted some edge of F. So the next step is to find an upper bound on the probability that the algorithm contracts an edge of F; this upper bound determines the lower bound on the algorithm's success probability.

So, what is the upper bound on the probability that an edge in F is contracted? For this, we need a lower bound on the total number of edges. Notice that if any node $v$ had degree less than $k$, then the cut $(\{v\}, V-\{v\})$ would have size less than k, contradicting our assumption that (A,B) is a global min-cut. Thus every node in G has degree at least k, and so the total number of edges satisfies $|E|\geq\frac{1}{2}kn$. Hence the probability that an edge in F is contracted is at most $\frac{k}{\frac{1}{2} k n}=\frac{2}{n}$.

Now consider the situation after j iterations, when there are n-j supernodes in the current graph $G'$, and suppose that no edge in F has been contracted yet. Every cut of $G'$ is a cut of G, and so there are at least k edges incident to every supernode of $G'$. Thus $G'$ has at least $\frac{1}{2} k(n-j)$ edges, and so the probability that an edge of F is contracted in the next iteration j+1 is at most $\frac{k}{\frac{1}{2}k(n-j)}=\frac{2}{n-j}$.

The cut (A,B) will actually be returned by the algorithm if no edge of F is contracted in any of iterations 1,2, …, n-2. Write $\varepsilon_{j}$ for the event that an edge of F is not contracted in iteration j.

Then $\operatorname{Pr}\left[\varepsilon_{1}\right] \geq 1-2 / n$ and $\operatorname{Pr}\left[\varepsilon_{j+1} | \varepsilon_{1} \cap \varepsilon_{2} \cdots \cap \varepsilon_{j}\right] \geq 1-2 /(n-j)$

We are interested in bounding $\operatorname{Pr}\left[\varepsilon_{1} \cap \varepsilon_{2} \cdots \cap \varepsilon_{n-2}\right]$, which is
$$
\begin{aligned}
&\operatorname{Pr}\left[\varepsilon_{1}\right] \cdot \operatorname{Pr}\left[\varepsilon_{2} \mid \varepsilon_{1}\right] \cdots \operatorname{Pr}\left[\varepsilon_{j+1} \mid \varepsilon_{1} \cap \varepsilon_{2} \cdots \cap \varepsilon_{j}\right] \cdots \operatorname{Pr}\left[\varepsilon_{n-2} \mid \varepsilon_{1} \cap \varepsilon_{2} \cdots \cap \varepsilon_{n-3}\right] \\
&\quad \geq\left(1-\frac{2}{n}\right)\left(1-\frac{2}{n-1}\right) \cdots\left(1-\frac{2}{n-j}\right) \cdots\left(1-\frac{2}{3}\right) \\
&\quad =\left(\frac{n-2}{n}\right)\left(\frac{n-3}{n-1}\right)\left(\frac{n-4}{n-2}\right) \cdots\left(\frac{2}{4}\right)\left(\frac{1}{3}\right) \\
&\quad =\frac{2}{n(n-1)}=\binom{n}{2}^{-1}
\end{aligned}
$$

So we now know that a single run of the Contraction Algorithm fails to find a global min-cut with probability at most $1-\binom{n}{2}^{-1}$.

This number is very close to 1, of course, but we can amplify our probability of success simply by running the algorithm repeatedly, with independent random choices, and taking the best cut we find. By fact (13.1), if we run the algorithm $\binom{n}{2}$ times, then the probability that we fail to find a global min-cut in every run is at most:
$$
\left(1-1 / \binom{n}{2}\right)^{\binom{n}{2}} \leq \frac{1}{e}
$$

And it’s easy to drive the failure probability below $1/e$ with further repetitions: if we run the algorithm $\binom{n}{2} \ln n$ times, then the probability that we fail to find a global min-cut is at most $e^{-\ln n}=1 / n$.
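A minimal sketch of one run of the contraction algorithm on an edge-list multigraph, with labels tracking which supernode each original node belongs to; repeating it and keeping the smallest cut gives the amplification just described:

```python
import random

def contract_once(n, edges):
    # edges: list of (u, v) pairs over nodes 0..n-1 (parallel edges allowed)
    label = list(range(n))            # label[v] = supernode containing v
    supernodes = n
    while supernodes > 2:
        u, v = random.choice(edges)   # pick a remaining edge uniformly at random
        lu, lv = label[u], label[v]
        label = [lu if l == lv else l for l in label]   # merge lv into lu
        # discard self-loops created by the contraction
        edges = [(a, b) for (a, b) in edges if label[a] != label[b]]
        supernodes -= 1
    return len(edges)                 # size of the cut found by this run

# a square with one diagonal: the global min cut has size 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(min(contract_once(4, edges) for _ in range(30)))  # 2 with high probability
```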

Network Flow Problems

The Maximum-Flow Problem and the Ford-Fulkerson Algorithm

Problem description:


Here s is the source node and t is the sink node; we want to find the maximum flow that can be pushed from s to t. (The numbers on the edges are capacities: the maximum flow each edge can carry.)

Main theorem

The main theorem links the maximum flow through a network with the minimum cut of the network.

Max-flow min-cut theorem. The maximum value of an s-t flow is equal to the minimum capacity over all s-t cuts.

Ford-Fulkerson algorithm by example

Initially, we have the empty flow. (The step-by-step figures for this example are not preserved.)

To select an augmenting path, we may use any forward edge that is not yet full and any backward edge that carries nonzero flow.

First, we select S-A-D-T; the bottleneck capacity equals 8.

Then we select S-A-D-T again; the bottleneck capacity is 2.

Then we select S-C-D-A-B-T (the backward edge D-A is usable because A-D carries nonzero flow); the bottleneck capacity is 4.

Then we consider S-A-D-B-T; the bottleneck capacity is 2.

Then we consider S-C-D-B-T; the bottleneck capacity is 3.

At this point, no augmenting path of non-full forward edges and non-zero backward edges remains. We are finished.

The time complexity of this algorithm is O(max_flow · E); it is pseudopolynomial when the capacities are integers.
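A minimal sketch of Ford-Fulkerson with DFS-found augmenting paths on a dict-of-dicts residual graph (a common textbook formulation; the small graph below is a made-up example, not the one from the figures):

```python
def max_flow(capacity, s, t):
    # capacity: dict u -> {v: capacity}; copied into a residual graph
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u, vs in list(residual.items()):        # ensure reverse entries exist
        for v in vs:
            residual.setdefault(v, {}).setdefault(u, 0)

    def dfs(u, bottleneck, visited):
        if u == t:
            return bottleneck
        visited.add(u)
        for v, cap in residual[u].items():
            if cap > 0 and v not in visited:    # non-full residual edge
                pushed = dfs(v, min(bottleneck, cap), visited)
                if pushed:
                    residual[u][v] -= pushed    # consume forward capacity
                    residual[v][u] += pushed    # open the backward edge
                    return pushed
        return 0

    flow = 0
    while True:
        pushed = dfs(s, float('inf'), set())
        if not pushed:
            return flow                         # no augmenting path remains
        flow += pushed

graph = {'s': {'a': 10, 'c': 8}, 'a': {'d': 8}, 'c': {'d': 6}, 'd': {'t': 10}}
print(max_flow(graph, 's', 't'))  # 10
```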

Bipartite Graphs and Matchings

Definition of bipartite graph

A graph G = (V, E) is called bipartite if there is a partition of V into two disjoint subsets V = L ∪ R such that every edge e ∈ E joins some vertex in L to some vertex in R.

Definition of matchings and perfect matchings

Let G = (V, E) be a graph. A matching in G is a set of edges M ⊆ E such that for every pair of distinct edges e, e′ ∈ M, there is no vertex v on which both e and e′ are incident.

The matching M is called perfect if for every v ∈ V, there is some e ∈ M which is incident on v. (Every vertex is covered by exactly one edge of the matching.)

Definition of neighbors

For a set of vertices $S \subseteq V$, we define its set of neighbors $\Gamma(S)$ by:
$$
\Gamma(S)=\{v \in V \mid \exists u \in S \text { s.t. }\{u, v\} \in E\}
$$

Get a characterization of when a bipartite graph has a perfect matching

Suppose $G=(L, R, E)$ has a perfect matching M. Then for every set $S \subseteq L$, we have that $|\Gamma(S)| \geq |S|$, because M matches the vertices of S to $|S|$ distinct neighbors, all of which lie in $\Gamma(S)$.

Theorem 4 (Hall’s Marriage Theorem).

Let $G=(L, R, E)$ be a bipartite graph with $|L|=|R|$. Suppose that for every $S \subseteq L$, we have $|\Gamma(S)| \geq|S|$. Then $G$ has a perfect matching.

Hall’s condition: for every $S \subseteq L$, we have $|\Gamma(S)| \geq|S|$.

Proof:

The idea of the proof: assume the theorem holds for every bipartite graph that satisfies Hall's condition and has fewer than $m$ edges; how do we decompose the graph at hand, which has exactly $m$ edges? The most naive idea is to pick an arbitrary edge, add it to the perfect matching, and remove that edge together with the two vertices it joins (one on the left, one on the right). The remaining graph has only $m-1$ edges, so can we apply the induction hypothesis? Not yet: removing the two vertices may destroy Hall's condition. For example, take $L=\{u_1, u_2\}$, $R=\{v_1, v_2\}$ and edges $u_1v_1$, $u_1v_2$, $u_2v_1$: Hall's condition holds, but removing the edge $u_1v_1$ together with $u_1$ and $v_1$ leaves $u_2$ with no neighbors, so the remaining graph violates Hall's condition. What can we do? Add a hypothesis; this is why the proof below has two cases. Case 1 exactly rules out the situation where deleting an edge and its endpoints breaks Hall's condition.

By induction on $|E|$. Let $|E|=m$. Suppose we know the theorem for all bipartite graphs with $< m $ edges.

We take cases depending on whether there is slack in the hypothesis or not

Case 1: For every $S \subseteq L$ with $0<|S|<|L|$, we have $|\Gamma(S)| \geq|S|+1$. In this case, pick any edge $e = \{u, v\} \in E$ and include it in the matching M. Apply the induction hypothesis to the induced bipartite graph on $L \setminus \{u\}$ and $R \setminus \{v\}$ (Hall's condition still holds there, since removing $v$ decreases each $|\Gamma(S)|$ by at most one and we had slack): this gives us a matching $M'$ between $L \setminus \{u\}$ and $R \setminus \{v\}$. The desired matching between L and R is $M' \cup \{e\}$.

Case 2 is the complement of Case 1 (recall that the whole graph satisfies Hall's condition).

Case 2:

For some $S \subseteq L$ with $0 < |S| < |L|$, we have $|\Gamma(S)| = |S|$. In this case, first apply the induction hypothesis to the induced bipartite graph on $S$ and $\Gamma(S)$. This gives us a matching $M'$ between $S$ and $\Gamma(S)$.

Now let $T = L \setminus S$ and $U = R \setminus \Gamma(S)$. Applying the induction hypothesis again, we get that the induced bipartite graph on $T$ and $U$ has a perfect matching $M''$. (Why? Suppose there were some $S' \subseteq T$ such that $|\Gamma(S') \cap U| < |S'|$. Then $|\Gamma(S \cup S')| \leq |\Gamma(S') \cap U| + |\Gamma(S)| < |S'| + |S| = |S' \cup S|$, a contradiction.)

The point of this proof by contradiction: after carving out the tight set with $|\Gamma(S)| = |S|$, what remains still satisfies Hall's condition.

The desired matching between L and R is $M' \cup M''$.

Prove that a k-regular bipartite graph (with k$\geq$ 1) has a perfect matching

That is, every vertex has exactly k incident edges.

Using Hall’s Theorem: Hall’s Theorem tells us there is a matching that saturates A if |N(S)| ≥ |S| for all S ⊆ A. The number of edges from S to N(S) is k|S|, and this is at most the total number of edges incident to N(S), which is k|N(S)|. So we have k|S| ≤ k|N(S)|, which implies |N(S)| ≥ |S|.

Using the generalized Hall’s theorem: we need not compare the sizes of the two sides. As long as the whole graph satisfies Hall’s condition, we can find a matching that covers the entire left side, and then the right side clearly has at least as many vertices as the left.

Apply Hall’s theorem to the left side: the condition holds, so the right side has at least as many vertices as the left. Apply it to the right side: the condition holds there too, so the left side has at least as many vertices as the right. Hence the two sides have the same number of vertices, and any matching that covers all of the left side is in fact a perfect matching, and we are done.
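A minimal sketch of computing a maximum bipartite matching by augmenting paths, which is how a perfect matching can actually be found when Hall's condition holds and $|L| = |R|$; `adj` is a hypothetical adjacency-dict input mapping each left vertex to its right neighbors:

```python
def max_bipartite_matching(adj):
    match_right = {}                    # right vertex -> its matched left vertex

    def try_augment(u, visited):
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                # v is free, or v's partner can be rematched elsewhere
                if v not in match_right or try_augment(match_right[v], visited):
                    match_right[v] = u
                    return True
        return False

    return sum(try_augment(u, set()) for u in adj)

# a 2-regular bipartite graph on 3 + 3 vertices: perfect matching of size 3
adj = {'u1': ['v1', 'v2'], 'u2': ['v2', 'v3'], 'u3': ['v3', 'v1']}
print(max_bipartite_matching(adj))  # 3
```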

Selection and adversary arguments

Lower bound for finding the max and min number in a list

A theorem states that any algorithm that finds both the max and the min of a list of $n$ numbers must do at least $3n/2-2$ comparisons in the worst case.

We assume the elements in the list are distinct. To know that a key $x$ is the max and a key $y$ is the min, the algorithm must know that every key other than x has lost some comparison and every key other than y has won some comparison. So the algorithm needs at least 2n-2 units of information to be sure it gives the right answer.

We must show that gaining $2n-2$ units of information requires at least $3n/2-2$ comparisons in the worst case. To that end, we bound the amount of information a single comparison can gain in the worst case.

Each key carries a status:

If a key has not yet taken part in any comparison, its status is N.

If a key has won some comparison but never lost, its status is W; if it has lost but never won, its status is L.

If a key has both won and lost, its status is WL.

It is easy to see that moving from N to W or L increases the information by 1 unit,

and moving from W or L to WL also increases the information by 1 unit.

So in the worst case, each comparison of two keys moves their statuses in the direction that gains the least information. For example:


A brief explanation:

When two keys that have never been compared face each other, one must win and one must lose, so the information increases by 2 units.

When a winner is compared with a loser, the worst case is that the winner wins again and the loser loses again, so no information is gained.

And so on.

To complete the proof, we only need to show that the algorithm must do at least $3n/2-2$ comparisons to gain $2n-2$ units of information. Why?

First, the only case in which the algorithm is guaranteed to gain 2 units of information is a comparison of two keys with status N. Suppose n is even: we can do n/2 such comparisons to gain n units of information. From every other comparison, at most one unit can be gained in the worst case, so the algorithm must do n-2 more comparisons to gain the remaining n-2 units and finish the task. The total is n/2 + n - 2 = 3n/2 - 2 comparisons. When n is odd the argument is the same.
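The bound is tight: the classic pairing algorithm meets it by comparing the elements in pairs ($n/2$ comparisons) and then scanning the winners for the max and the losers for the min ($n/2 - 1$ comparisons each). A sketch for even $n$:

```python
def max_and_min(x_list):
    # assumes len(x_list) is even; uses 3n/2 - 2 comparisons
    winners, losers = [], []
    for a, b in zip(x_list[::2], x_list[1::2]):   # n/2 pairwise comparisons
        if a > b:
            winners.append(a); losers.append(b)
        else:
            winners.append(b); losers.append(a)
    best, worst = winners[0], losers[0]
    for w in winners[1:]:                         # n/2 - 1 comparisons
        if w > best:
            best = w
    for l in losers[1:]:                          # n/2 - 1 comparisons
        if l < worst:
            worst = l
    return best, worst

print(max_and_min([4, 7, 1, 9, 5, 2]))  # (9, 1)
```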

Find the second largest key

Naively, we can find the second largest key by first finding the largest, then remove the largest and find the next largest. This will cost 2n-3 comparisons, which is not optimal.

We want to show that any algorithm that can find the second largest in a list of n keys must do at least $n + \lceil \lg n \rceil - 2$ comparisons in the worst case.

We prove this by using an adversary lower-bound argument: we design the hardest possible adversary, against which even the most efficient algorithm, in the course of identifying the maximum, is forced to compare the maximum element directly with at least $\lceil \lg n \rceil$ other elements.

There is also a very natural method for finding the second largest element: pair up the n elements and compare each pair, let the n/2 winners advance to the next round, pair them up again, and so on, as in a tournament. The final winner is the maximum, and along the way the maximum was compared with $\lceil \lg n \rceil$ other elements. The second largest must be among the keys that lost directly to the maximum, so it can then be found with $\lceil \lg n \rceil - 1$ further comparisons.

This naive method attains the lower bound, while our adversary strategy shows that even an optimal method must use that many comparisons; together this means the naive method is in fact optimal (since every method's comparison count is at least the lower bound).

Now let’s see how the adversary plays.

The first step is to give each of the n elements a weight, all initialized to 1.

Then, whenever the algorithm picks two elements to compare, the adversary uses the strategy below to decide which of the two is declared larger, in the way that forces the eventual maximum into as many comparisons as possible, and then updates the weights. The adversary's strategy (reconstructed from the discussion that follows):

- If $w(x) > w(y)$: reply $x > y$; set $w(x) \leftarrow w(x) + w(y)$ and $w(y) \leftarrow 0$.
- If $w(x) = w(y) > 0$: reply $x > y$ (either answer works); update the weights the same way.
- If $w(x) < w(y)$: reply $y > x$; set $w(y) \leftarrow w(y) + w(x)$ and $w(x) \leftarrow 0$.
- If $w(x) = w(y) = 0$: reply consistently with all previous answers; no weight changes.

A simple explanation: if $w(x) > w(y)$, the adversary declares $x > y$ (in effect simulating one possible input consistent with all answers so far) and then updates the two elements' weights.

To find the max, we watch which element's weight eventually reaches n.

Naturally, suppose we have the numbers 4, 3, 2, 1.

We would like to first compare 4 with 3, learn that 4 is larger, and then compare 3 with 2 and with 1, so that 4 takes part in only one comparison before being declared the maximum. But a clever adversary will not let this work: after 4 beats 3, the weight of 4 becomes 2 and the weight of 3 drops to 0, and whenever 3 is later compared with another element, the adversary will make that other element beat 3 (choosing an input consistent with this). At that point the shortcut is stuck.

Lemma: let $x$ be the key of nonzero weight when the algorithm stops. Then $x = \max$ and $x$ has directly won against at least $\lceil \lg n \rceil$ distinct keys.

Suppose the maximum key's weight reaches n after it takes part in K comparisons.

Let $w_{k}=w(x)$ denote its weight after its k-th comparison.

By the adversary's update rule, $w_{k} \leq 2 w_{k-1}$.

Therefore

$n=w_{K} \leq 2^{K} w_{0}=2^{K}$

$K \geq\lceil\lg n\rceil$

This completes the proof.
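A sketch of the tournament method that matches this bound, assuming for simplicity that $n$ is a power of two and the keys are distinct:

```python
def second_largest(x_list):
    # round-by-round tournament; n - 1 comparisons to find the champion
    lost_to = {x: [] for x in x_list}     # keys that lost directly to x
    round_ = list(x_list)
    while len(round_) > 1:
        nxt = []
        for a, b in zip(round_[::2], round_[1::2]):
            winner, loser = (a, b) if a > b else (b, a)
            lost_to[winner].append(loser)
            nxt.append(winner)
        round_ = nxt
    champion = round_[0]
    # the runner-up lost directly to the champion: lg n - 1 more comparisons
    return max(lost_to[champion])

print(second_largest([3, 6, 2, 7, 5, 1, 8, 4]))  # 7
```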

Project Selection Problem

The project selection problem is to select a feasible set of projects with maximum total profit.

Some projects can have a positive benefit, while other projects will have a negative benefit.

The catch is that projects are connected to one another: sometimes project A is a prerequisite for project B, and so on, and a feasible set must contain every prerequisite of every project it includes.

Design the algorithm

(Construction figures omitted; the construction is the standard one: a node for each project $i$ with profit $p_i$; an edge $s \to i$ of capacity $p_i$ for each project with $p_i > 0$; an edge $i \to t$ of capacity $-p_i$ for each project with $p_i < 0$; and an edge of infinite capacity from each project to each of its prerequisites. We then compute a minimum s-t cut.)

Analyzing the algorithm

(Analysis figures omitted; the key identity is that a finite-capacity cut separating the selected set $A$ from the rest satisfies $\operatorname{cap}(A \cup \{s\},\, B \cup \{t\}) = C - \sum_{i \in A} p_{i}$, where $C = \sum_{i : p_{i} > 0} p_{i}$ is a constant independent of the cut.)

Explanation of the algorithm:

We can see that minimizing the capacity of this cut makes $\sum_{i \in A} p_{i}$ as large as possible. And because it is an s-t cut, the projects connected to s are separated from those connected to t; since the infinite-capacity prerequisite edges can never be cut, the set of projects still connected to s after the cut is a feasible set.

So we converted the problem of finding a maximum-profit set of projects into a minimum cut problem via network flow; solving that minimum cut problem is exactly the dual of solving the project selection problem. Note the trick here: we first flip each negative profit into a positive number and connect that project to the sink. This idea is genuinely hard to come up with.

Baseball Elimination

The Problem description

Toward the end of an NBA regular season, some outlets can already tell which teams have no chance at the conference title. The problem: given each team's wins and losses so far and the games remaining, decide whether some team is already out of the race for first place. It is an interesting problem, and once again we solve it with network flow.

Algorithm design

Suppose we have a set $S$ of teams, and for each $x\in S$, its current number of wins is $w_x$. Also, for any teams $x$ and $y$ in $S$, they still have to play $g_{xy}$ games against each other. Finally, we are given a specific team $z$, and the problem asks whether $z$ is eliminated from first place or not.

(Figure omitted; the elimination criterion: let $m$ be the number of wins $z$ ends with if it wins all of its remaining games. If there is a set of teams $T$ with $\sum_{x \in T} w_{x}+\sum_{\{x, y\} \subseteq T} g_{x y}>m|T|$, then $z$ is eliminated.)

Some explanation:

For z to take first place, it must win all of its remaining games, ending with m wins in total.

But if there is a set of teams whose current total wins, plus the number of games still to be played among them (whichever two of them play, somebody wins, so the total win count goes up by 1 per game), exceeds $m|T|$, then some team in the set must finish with more than m wins. That team, not z, would take first place, so z cannot be champion.

(Figures omitted; the flow network: a node for each pair of teams $\{x, y\}$ other than $z$ with games remaining, and a node for each team; an edge from the source $s$ to each pair node with capacity $g_{xy}$, infinite-capacity edges from each pair node to its two team nodes, and an edge from each team node $x$ to the sink $t$ with capacity $m - w_{x}$.)

Some explanation:

This network encodes all the ways z could still take first place. The capacity on each edge out of the source says that two teams cannot play more than their remaining number of games against each other, while the capacity $m-w_x$ on each edge into the sink guarantees that the number of wins routed to team x is never enough to push it past team z's total m.

Now saturate all edges from the source to the first level of vertices; this represents the real remaining schedule. If the network can carry all of that flow, then z still has a chance at first place. Put the other way around: first compute the maximum flow of the network; if it is smaller than the total the source would send when saturated, then z has already been eliminated. Very natural.

NP-hard and NP-complete Problems

Pseudo-polynomial

Definition: a pseudo-polynomial algorithm runs in time polynomial in the numeric value $n$ of the input but exponential in the total length (number of bits) of the input.

  • For example, a classical sorting algorithm has complexity $O(n^2)$; when the input size is $x = 32n$ bits, the complexity is $O(x^2)$, still polynomial.
  • By contrast, take a primality-checking algorithm with complexity $O(n^4)$ in the value $n$: with a binary encoding, an input of size $x$ bits can represent a number as large as $n = 2^x$, so the running time is clearly no longer polynomial in $x$.

Reduction via “Gadgets”: the satisfiability problem

The SAT and 3-SAT problems

Suppose we are given a set X of n Boolean variables $x_1,\ldots,x_n$; each can take the value 0 or 1. By a term over X, we mean one of the variables $x_i$ or its negation $\bar{x}_{i}$. Finally, a clause is simply a disjunction of distinct terms $t_{1} \vee t_{2} \vee \cdots \vee t_{\ell}$. We say the clause has length $\ell$ if it contains $\ell$ terms.

An assignment $v$ of values to the variables satisfies a collection of clauses $C_1,\ldots,C_k$ if it causes all of the $C_i$ to evaluate to 1; in other words, if it causes the conjunction $C_{1} \wedge C_{2} \wedge \cdots \wedge C_{k}$ to evaluate to 1. In this case, we will say that $v$ is a satisfying assignment w.r.t. $C_1,\ldots,C_k$, and that the set of clauses $C_1,\ldots,C_k$ is satisfiable.

Here is a simple example. Suppose we have the three clauses:

$\left(x_{1} \vee \overline{x_{2}}\right),(\overline{x_{1}} \vee \overline{x_{3}}),\left(x_{2} \vee \overline{x_{3}}\right)$

The assignment v that sets all variables to 1 is not a satisfying assignment, but the assignment v that sets all variables to 0 is a satisfying assignment.

The Satisfiability Problem is also referred to as SAT.

(Definition box omitted. SAT: given a set of clauses $C_{1}, \ldots, C_{k}$ over variables $X=\{x_{1}, \ldots, x_{n}\}$, does there exist a satisfying truth assignment?)

There is a special case of SAT which is called 3-Satisfiability:

(Definition box omitted. 3-SAT: given a set of clauses $C_{1}, \ldots, C_{k}$, each of length 3, over variables $X=\{x_{1}, \ldots, x_{n}\}$, does there exist a satisfying truth assignment?)

Reducing 3-SAT to Independent Set

The main point: suppose the Independent Set problem is solved by a black box; can we solve a 3-SAT instance by converting it into an Independent Set instance? The answer is yes.

We want to prove that $3 \text { -SAT } \leq_{P} \text { Independent Set. }$.

First, we need to know that a different way to picture the same 3-SAT instance is as follows: you need to choose one term from each clause, and then find a truth assignment that causes all of the chosen terms to evaluate to 1, thereby satisfying all clauses. So you succeed if you can select a term from each clause in such a way that no two selected terms “conflict”. For example, I cannot choose $x_1$ in the first clause and $\bar{x}_1$ in the second clause, since making both evaluate to 1 would require setting $x_1$ both ways.

(Figure omitted: one triangle per clause, with the three vertices of each triangle labeled by that clause's terms.)

Our reduction will be based on this knowledge.

First construct the graph above: one triangle per clause, with vertices labeled by the clause's terms (we build the graph precisely so that the two problems become equivalent, and this shape is quite natural). Consider what the independent sets of size k look like in this graph: since two vertices cannot be selected from the same triangle, they consist of all ways of choosing one vertex from each of the triangles. This implements our goal of choosing a term in each clause that will evaluate to 1. But we have so far not prevented ourselves from choosing two terms that conflict.

We encode conflicts by adding some more edges to the graph: for each pair of vertices whose labels correspond to conflicting terms (a variable and its negation), we add an edge between them. This may destroy some of the independent sets of size k, but it makes the two problems genuinely equivalent: solving the 3-SAT instance now amounts to deciding whether this graph has an independent set of size k.

Finally, one proves in detail that the graph constructed this way can be used to solve the 3-SAT problem.

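A minimal sketch of the construction just described, with a made-up encoding of terms as `(variable, negated)` pairs; it returns the vertex set, the triangle-plus-conflict edges, and the target size $k$:

```python
def sat_to_independent_set(clauses):
    # clauses: list of 3-term clauses; a term is (variable, negated?)
    vertices = [(i, j) for i in range(len(clauses)) for j in range(3)]
    edges = set()
    for i in range(len(clauses)):
        # triangle edges: at most one vertex per clause can be selected
        edges |= {((i, 0), (i, 1)), ((i, 1), (i, 2)), ((i, 0), (i, 2))}
    for (i, j) in vertices:
        for (p, q) in vertices:
            var1, neg1 = clauses[i][j]
            var2, neg2 = clauses[p][q]
            if (i, j) < (p, q) and var1 == var2 and neg1 != neg2:
                edges.add(((i, j), (p, q)))   # conflict edge: x vs not-x
    return vertices, edges, len(clauses)      # seek an IS of size k = #clauses

# (x1 v x2 v x3) and (not-x1 v not-x2 v x3)
clauses = [[(1, False), (2, False), (3, False)],
           [(1, True), (2, True), (3, False)]]
print(sat_to_independent_set(clauses)[2])  # k = 2
```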

Transitivity of Reductions

$\text { If } Z \leq_{p} Y, \text { and } Y \leq_{p} X, \text { then } Z \leq_{p} X$

Proof:

(Proof figures omitted. Sketch: an algorithm solving $Z$ with a black box for $Y$ can have each of its $Y$-queries answered by the polynomial-time algorithm that solves $Y$ using a black box for $X$; the composition makes polynomially many $X$-queries and still runs in polynomial time.)

Efficient Certification

Consider the problem as follows:

The input to a computational problem will be encoded as a finite binary string s. We denote the length of a string s by $|s|$. We will identify a decision problem X with the set of strings on which the answer is “yes”. An algorithm A for a decision problem receives an input string s and returns the value “yes” or “no”; we will denote this returned value by A(s). We say that A solves the problem X if for all strings s, we have A(s) = yes if and only if $s \in X$.

Efficient certification, simply put, does not attack the problem head-on; instead, we find another algorithm $B$ such that $s \in X$ if and only if there exists a string $t$ of length at most polynomial in $|s|$ satisfying $B(s, t)=\text{yes}$.

Another way to understand it: you have a poly-size certificate, which a poly-time verification algorithm then checks.

NP: A class of problems

We define NP to be the set of all problems for which there exists an efficient certifier. Here is one thing we can observe immediately.

$\mathcal{P} \subseteq \mathcal{N} \mathcal{P}$

(Proof omitted. Sketch: if $X \in \mathcal{P}$ via an algorithm $A$, then $A$ itself is an efficient certifier that simply ignores the certificate $t$.)

For instance, for the Vertex Cover problem: think of $s$ as an arbitrary input graph (together with the bound $k$), and think of $t$ as a choice of some subset of the vertices.

We can easily check that the problems introduced in the first two sections belong to NP: it is a matter of determining how an efficient certifier for each of them will make use of a ‘certificate’ string t.

For example:

(Example omitted: for instance, for 3-SAT the certificate $t$ is a truth assignment, and the certifier checks that the assignment satisfies every clause.)

But we have no way to prove that any of these problems requires more than polynomial time to solve. Indeed, we cannot prove that there is any problem in NP that does not belong to P.

So we could ask the question:
$$
\text { Is there a problem in } \mathcal{N P} \text { that does not belong to } \mathcal{P} \text { ? Does } \mathcal{P}=\mathcal{N P} ?
$$

NP-Complete Problems

What are the hardest problems in NP? Polynomial-time reducibility gives us a way of addressing this question and gaining insight into the structure of NP.

The most natural way to define a “hardest” problem X is via the following two properties:

(i) $X \in \mathcal{NP}$,

(ii) for all $Y \in \mathcal{N} \mathcal{P}, Y \leq_{P} X$.

In other words, we require that every problem in NP can be reduced to X. We will call such an X an NP-complete problem. The intuition: solving a hard problem lets us solve easy ones, so it is natural to reduce an easy problem to (a special case of) a hard one. That is, "A can be reduced to B" means that solving B suffices for solving A, so B should be at least as hard as A.


A crucial consequence: if $X$ is NP-complete, then $X$ is solvable in polynomial time if and only if $\mathcal{P}=\mathcal{NP}$.

NP-complete problem example: Circuit Satisfiability Problem


We are given the circuit as input, and we need to decide whether there is an assignment of values to the inputs that causes the output to take the value 1.

Circuit Satisfiability is NP-complete.

Let’s see how an arbitrary problem in NP reduces to this circuit satisfiability problem. (Proof figures omitted. Sketch: any $Y \in \mathcal{NP}$ has an efficient certifier $B(s, t)$; for a fixed input $s$, the computation of $B$ can be unrolled into a polynomial-size circuit whose free inputs are the bits of the certificate $t$, and this circuit is satisfiable exactly when some certificate makes $B$ accept.)

Proving Further Problems NP-Complete

If Y is an NP-complete problem, and X is a problem in NP with the property that $Y \leq_{p} X$, then $X$ is NP-complete.

General Strategy for proving new problems NP-Complete

To show a problem is NP complete, you need to:

Show it is in NP

In other words, given some information C, you can create a polynomial time algorithm V that will verify for every possible input X whether X is in your domain or not.

Example

Prove that the problem of vertex covers (that is, for some graph G, does it have a vertex cover set of size k such that every edge in G has at least one vertex in the cover set?) is in NP:

  • our input X is some graph G and some number k (this is from the problem definition)
  • Take our information C to be “any possible subset of vertices in graph G of size k”
  • Then we can write an algorithm V that, given G, k and C, will return whether that set of vertices is a vertex cover for G or not, in polynomial time.

Then for every graph G and number k, if there exists some “possible subset of vertices in G of size k” which is a vertex cover, the verifier accepts; this is exactly what membership of Vertex Cover in NP requires.

Note that we do not need to find C in polynomial time. If we could, the problem would be in P.

Note that algorithm V should work for every G, for some C. For every input there should exist information that could help us verify whether the input is in the problem domain or not. That is, there should not be an input where the information doesn’t exist.

Prove it is NP Hard

This involves getting a known NP-complete problem like SAT, the set of boolean expressions in the form:

(A or B or C) and (D or E or F) and …

where the expression is satisfiable, that is there exists some setting for these booleans, which makes the expression true.

Then reduce the NP-complete problem to your problem in polynomial time.

That is, given some input X for SAT (or whatever NP-complete problem you are using), create some input Y for your problem, such that X is in SAT if and only if Y is in your problem. The function f : X -> Y must run in polynomial time.

In the example above, the input Y would be the graph G and the size of the vertex cover k.

For a full proof, you’d have to prove both:

  • that X is in SAT => Y in your problem
  • and Y in your problem => X in SAT.


Footnote: In step 2 (Prove it is NP-hard), reducing another NP-hard (not necessarily NP-complete) problem to the current problem will do, since NP-complete problems are a subset of NP-hard problems (that are also in NP).

Example of reduction: reduce 3-SAT problem to subset sum problem

  • Reduction of 3-Sat to Subset Sum:

    n variables $x_i$ and $m$ clauses $c_j$

  • For each variable $x_i$, construct numbers $t_i$ and $f_i$ of $n+m$ digits.

    • The i-th digit of $t_i$ and $f_i$ is equal to 1.
    • For $n+1 \leq j \leq n+m$, the j-th digit of $t_i$ is equal to 1 if $x_i$ is in clause $c_{j-n}$
    • For $n+1 \leq j \leq n+m$, the j-th digit of $f_i$ is equal to 1 if $\bar{x_i}$ is in clause $c_{j-n}$.
    • All other digits of $t_i$ and $f_i$ are 0.
  • Example:


  • For each clause $c_j$, construct numbers $x_j$ and $y_j$ of n+m digits:

    • The (n+j)-th digit of $x_j$ and $y_j$ is equal to 1.
    • All other digits of $x_j$ and $y_j$ are 0.


  • Finally, construct a sum number $s$ of n+m digits:

    • For $1 \leq j \leq n$, the j-th digit of s is equal to 1.
    • For $n+1 \leq j \leq n+m$, the j-th digit of s is equal to 3.
  • (The full proof is intricate; to revisit later. Details: https://www.cs.mcgill.ca/~lyepre/pdf/assignment2-solutions/subsetSumNPCompleteness.pdf)
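For concreteness, here is a sketch that builds these numbers for a small instance, with literals encoded as signed integers (a hypothetical convention: `+i` for $x_i$, `-i` for $\bar{x}_i$). Working in base 10 is safe because each digit column sums to at most 5, so no carries occur; note the result is a multiset, since the two slack numbers of each clause are equal:

```python
def three_sat_to_subset_sum(n, clauses):
    # numbers have n + m base-10 digits: digit i (from the left) marks
    # variable i, digit n + j marks clause j
    m = len(clauses)
    def digits_to_int(ds):
        return int(''.join(map(str, ds)))
    numbers = []
    for i in range(1, n + 1):
        for sign in (+1, -1):                 # t_i (positive) and f_i (negated)
            ds = [0] * (n + m)
            ds[i - 1] = 1
            for j, clause in enumerate(clauses):
                if sign * i in clause:
                    ds[n + j] = 1             # literal appears in clause j
            numbers.append(digits_to_int(ds))
    for j in range(m):                        # two slack numbers per clause
        ds = [0] * (n + m)
        ds[n + j] = 1
        numbers.extend([digits_to_int(ds)] * 2)
    target = digits_to_int([1] * n + [3] * m)
    return numbers, target

# (x1 v x2 v not-x3) and (not-x1 v x2 v x3)
print(three_sat_to_subset_sum(3, [[1, 2, -3], [-1, 2, 3]]))
```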

There is also a cheat sheet I wrote (the two images are not preserved here).