Chapter 6: Dynamic Programming
Intro
- Summary:
There isn't always a greedy algorithm or a divide-and-conquer algorithm that solves a problem efficiently. Dynamic programming is a more subtle technique that is still dramatically faster than brute force. "…the basic idea is drawn from the intuition of divide and conquer and is essentially the opposite of the greedy strategy: one implicitly explores the space of all of the possible solutions, by carefully decomposing things into a series of subproblems, and then building up correct solutions to larger and larger subproblems." The book calls it "dangerously close to brute-force."
Section 1: Weighted Interval Scheduling: A Recursive Procedure
- Summary:
This is like the interval scheduling problem, but every job has a weight/value and we want to maximize the sum of the values. Again, we sort the jobs in order of finish time (larger indexes have later finish times). Notation: for a job i, p(i) is the largest index less than i of a job that's compatible with i. Start off by saying something obvious about an optimal solution: it either includes the nth (latest finish time) job or it doesn't. If it does include the nth job, then nothing after p(n) could possibly be in the optimal solution, so the optimal solution is the nth job plus the optimal solution to the smaller problem of all the jobs compatible with n (jobs 1 through p(n)). If the nth job is not in the optimal solution, then the optimal solution is the optimal solution for the smaller problem of all the jobs except the nth job. Whichever of those two gives a better value is actually optimal. I.e. OPT(j) = max(vj + OPT(p(j)), OPT(j-1)). And then we know j belongs to the optimal solution iff vj + OPT(p(j)) >= OPT(j-1). If we just ran this straight up with recursion (they give an algorithm, but it's pretty clear), it would take exponential time. BUT we can memoize! It only takes exponential time because of the redundancy of the recursion: we recompute some of those OPT(x)'s a bunch of times. Instead, we compute each one once and put it in an array M[] when we compute it, so every other time we need it, we only do a lookup instead of a recursive computation. And since we fill in each entry exactly once (i.e. we do OPT(0), OPT(1), OPT(2), …, OPT(n), but we only do each one once), it takes O(n) time once the jobs are sorted by finish time (wayyy better than exponential!). I already kind of mentioned how you'd figure out an actual solution (not just a value) from the result (job j is in there if vj + OPT(p(j)) >= OPT(j-1)), but they give an algorithm for it and show that it also only takes O(n) time.
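To make sure I actually get the recurrence, here's a rough Python sketch of the memoized version. The names and input format are mine (jobs as (start, finish, value) tuples), not the book's pseudocode, and I compute p(j) with a lazy linear scan instead of the binary search you'd want for the full O(n log n) total.

```python
def weighted_interval_scheduling(jobs):
    """jobs: list of (start, finish, value) tuples.
    Returns (optimal total value, list of chosen jobs)."""
    jobs = sorted(jobs, key=lambda job: job[1])   # larger index = later finish time
    n = len(jobs)

    # p[j] = largest index i < j whose job finishes no later than job j starts (0 if none).
    p = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j - 1, 0, -1):
            if jobs[i - 1][1] <= jobs[j - 1][0]:
                p[j] = i
                break

    # Memoized recursion: OPT(j) = max(v_j + OPT(p(j)), OPT(j-1)), each OPT(j) computed once.
    M = {0: 0}
    def opt(j):
        if j not in M:
            v_j = jobs[j - 1][2]
            M[j] = max(v_j + opt(p[j]), opt(j - 1))
        return M[j]
    best = opt(n)

    # Trace back: job j belongs to an optimal solution iff v_j + OPT(p(j)) >= OPT(j-1).
    chosen, j = [], n
    while j > 0:
        if jobs[j - 1][2] + M[p[j]] >= M[j - 1]:
            chosen.append(jobs[j - 1])
            j = p[j]
        else:
            j -= 1
    return best, list(reversed(chosen))
```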
- Insights and Questions:
None.
- Readability:
Awesome! This book is great.
Section 2: Principles of Dynamic Programming: Memoization or Iteration over Subproblems
- Summary:
The key to the efficient algorithm is the array M. Once we have that, we can find the solution in O(n) time, and we can compute the entries directly (iteratively) instead of recursively. They talk about this in the context of the previous problem (p. 259) if I need any more details. This new (iterative) way of computing the array gives a general template for designing dynamic programming algorithms. Properties that need to be satisfied:
(i) There are only a polynomial number of subproblems.
(ii) The solution to the original problem can be easily computed from the solutions to the subproblems (e.g. the actual solution might be one of the subproblems).
(iii) There's a natural ordering on subproblems from "smallest" to "largest," together with an easy-to-compute recurrence that determines the solution to a subproblem from the solutions to some number of smaller subproblems.
Sometimes it's easier to define the recurrence first and then think about the structure of an optimal solution, and sometimes the other way around (a chicken-and-egg-esque problem).
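Here's what that iterative template looks like for the weighted interval scheduling problem from the last section, as a small Python sketch (assuming the values v_j and the p(j) indexes from before are already computed and stored in 1-indexed arrays; names are mine):

```python
def iterative_compute_opt(values, p):
    """values[j] = v_j and p[j] = p(j) for jobs 1..n (index 0 unused).
    Fills M[j] = OPT(j) from the smallest subproblem up to the largest."""
    n = len(values) - 1
    M = [0] * (n + 1)                          # M[0] = OPT(0) = 0
    for j in range(1, n + 1):                  # natural ordering on the subproblems
        M[j] = max(values[j] + M[p[j]], M[j - 1])
    return M
```

Same O(n) work as the memoized version, just without the recursion; the loop order is exactly the "smallest to largest" ordering that property (iii) asks for.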
- Insights and Questions:
No biggies. This is pretty abstract, and I get how it applies to our problems (but I might not look at it this way to figure the problem out).
- Readability:
Hmm. I guess I get why they included this, but I don't really think it helped me understand the other examples any better. But it was easy to read and made sense.
Section 3: Segmented Least Squares: Multi-way Choices
- Summary:
The last problem had a binary decision (either include the job or not); this problem will have a multi-way decision. It also shows how a clean algorithmic definition can formalize a fuzzy problem. Say you've got a bunch of points on a graph and you want to find the line of best fit. The error of the line is the sum of its squared "distances" to the points. So we want to find the line with minimum error (they give a closed-form solution for it). But that formula doesn't handle the case where the points seem to lie on a sequence of two or more lines (i.e. one line would give lots of error, but two would be pretty accurate). So now the problem is to fit the points well with as few line segments as possible. They give formal notation and info, but basically the "penalty" of a partition is the number of segments you use times a constant multiplier C, plus the error of the optimal line through each segment (we want to minimize that, of course). Observation: the last point pn belongs to a single segment in the optimal partition, and that segment begins at some earlier point pi. If we can figure out what i is, we can remove the points pi through pn and solve recursively on the remaining points p1 through pi-1. OPT(i) is the optimal solution for points 1 through i, and e(i,j) is the minimum error of any line with respect to pi through pj. If the last segment is pi, …, pn, then the value of the optimal solution is OPT(n) = e(i,n) + C + OPT(i-1). So for each j you take OPT(j) = min over i of (e(i,j) + C + OPT(i-1)), and the segment pi, …, pj belongs to an optimal solution iff the minimum for OPT(j) is obtained using that index i (see p. 265 for more of the algorithm and formal stuff). There are O(n^2) pairs (i,j) that need an e(i,j) value, and computing each one takes O(n), so we end up with an O(n^3) running time (or just O(n^2) once all of the e(i,j) values are pre-computed).
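Here's a rough Python sketch of the recurrence, assuming the e(i,j) values have already been precomputed into a table (the closed-form least-squares error for a single line is left out); the variable names are mine:

```python
def segmented_least_squares(e, n, C):
    """e[i][j] = minimum squared error of one line through points i..j (1-based),
    C = per-segment penalty.  Returns (OPT(n), list of segments (i, j))."""
    # OPT(j) = min over 1 <= i <= j of ( e(i,j) + C + OPT(i-1) ), with OPT(0) = 0.
    M = [0.0] * (n + 1)
    best_i = [0] * (n + 1)                     # which i achieved the minimum for each j
    for j in range(1, n + 1):
        M[j], best_i[j] = min((e[i][j] + C + M[i - 1], i) for i in range(1, j + 1))

    # Trace back through best_i to recover the actual segments.
    segments, j = [], n
    while j > 0:
        i = best_i[j]
        segments.append((i, j))
        j = i - 1
    return M[n], list(reversed(segments))
```

The double loop here is the O(n^2) part; the O(n^3) in the book comes from building the e table in the first place.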
- Insights and Questions:
How do you pick a good C value? If it's too big, you don't get enough lines, but if it's too small you'd get a ton of lines!
- Readability:
Great! I actually understood this much better by reading it than by doing it in class (which is unusual).
Section 4: Subset Sums and Knapsacks: Adding a Variable
- Summary:
This is like problems we've solved before, but the "obvious" set of subproblems isn't enough, so we create more by adding a new variable to the recurrence. Problem: we've got a single machine that can process jobs, and we have requests {1, 2, …, n}. We can only use the resource between times 0 and W. Each request i requires a time wi to process. The goal is to keep the machine as busy as possible up to the cutoff. That is, we've got n items, each with a weight wi, we have a bound W, and we want a subset S of the items whose total weight is as large as possible without exceeding W (the "Subset Sum Problem"). This is a special case of the Knapsack Problem, where each request also has a value vi in addition to the weight and you're trying to maximize the total value. The strategy we used before, where the optimal solution depended only on whether the last request was accepted or rejected, doesn't work anymore; we need a better algorithm. We need to know not just the value of OPT(n-1) but also the value of the best solution using the first n-1 items with a total allowed weight of W-wn. That means we have to solve a lot more subproblems. So the solution is either going to be OPT(n,W) = OPT(n-1,W) if n's not in the optimal solution, OR OPT(n,W) = wn + OPT(n-1,W-wn) if n is in the optimal solution, since the remaining capacity decreases if we put the nth item in there. The first case happens for sure when the item's too big to fit; otherwise we pick whichever choice gives the better solution. So this problem amounts to filling in a table of subproblems (an entry for every number of items matched with every possible weight limit), and we end up with a running time proportional to the number of entries in the table, or O(nW). We can then recover the optimal set of items in O(n) time. Then they get into the Knapsack Problem itself, which we've already basically given a solution to (if I need more I can look back at class notes).
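A quick Python sketch of the table-filling for the Subset Sum version (my own names; the knapsack variant would just add a list of values and use v_i instead of w_i inside the max):

```python
def subset_sum(weights, W):
    """weights: list of item weights w_1..w_n; W: integer capacity.
    Returns (best total weight <= W, chosen items as 1-based indices)."""
    n = len(weights)
    # M[i][w] = OPT(i, w): best total weight using only items 1..i with capacity w.
    M = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w_i = weights[i - 1]
        for w in range(W + 1):
            if w_i > w:                        # item i is too big to fit
                M[i][w] = M[i - 1][w]
            else:                              # better of leaving it out or putting it in
                M[i][w] = max(M[i - 1][w], w_i + M[i - 1][w - w_i])

    # Recover the optimal set of items in O(n) time by walking back through the table.
    chosen, w = [], W
    for i in range(n, 0, -1):
        if M[i][w] != M[i - 1][w]:             # the entry changed, so item i was taken
            chosen.append(i)
            w -= weights[i - 1]
    return M[n][W], list(reversed(chosen))
```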
- Insights and Questions:
None.
- Readability:
They think the Knapsack problem is more complicated than the first problem they introduce?!? I don't think that other problem was necessary to introduce Knapsack; I think it made everything more confusing. Knapsack is pretty clear (probably because it's a physical problem) … and it felt very jammed with two such different problems in the same section. The connection between the two problems was interesting, though.
Section 5: RNA Secondary Structure: Dynamic Programming on Intervals
- Summary:
This problem will require adding a new variable like in the Knapsack problem. DNA is “zipped” together by complementary “base-pairing.” If you look at the DNA as bases chosen from {A,C,G,T} and A-T pair up and C-G pair up, then those pairings hold the DNA together. RNA is only one “strand” of bases, so it loops around and pairs back up with itself. The set of pairs formed is called the secondary structure. So we're faced with the problem of having a strand of RNA, viewed as a sequence of n symbols drawn from {A,C,G,U} (the ith symbol is denoted bi). A and U match up and C and G match up (each can only be matched to one other base). Here are the conditions for matchings: no sharp turns, elements in a pair are either A and U or C and G, no base appears in more than one pair, noncrossing. (See book p274 or notes for clarification on conditions). We want to figure out how to get a secondary structure with as many base pairs as possible for a given strand of RNA. In a first attempt at a solution, you start at the end of the strand and try to match that base to some other base. You run into problems because your subproblems will be the optimal solution for the RNA up to the attempted match and the optimal solution for the part of the RNA between the two you just matched (the other option is that that base doesn't match to anything, but that doesn't cause problems). This is an issue, because usually we call OPT(#), but here we need to specify start and end points in the call. So we just pass in two numbers as start and end points for OPT. See the book p276 or our notes for specifics on the algorithm, but it's essentially the same as others, choosing to either pair the base with something or not to. If so, we get the opt of 1+OPT(the stuff before the first thing in the pairing)+OPT(the stuff between the two pairings). If not, we just do OPT(the stuff before the base that you're not pairing). And we keep everything in a memoized table and trace back through that to get the actual secondary structure.
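To convince myself about the two-index subproblems, here's a rough Python sketch of the interval DP (my own names; I'm reading the no-sharp-turns condition the way the book states it, i.e. paired bases must have more than four positions between their indices):

```python
def rna_secondary_structure(bases):
    """bases: RNA string over 'ACGU'.  Returns the maximum number of pairs in a
    valid secondary structure (no sharp turns, A-U / C-G only, no crossings)."""
    n = len(bases)
    complement = {"A": "U", "U": "A", "C": "G", "G": "C"}
    # OPT[i][j] = max pairs using bases i..j (0-based, inclusive); stays 0 when j - i <= 4.
    OPT = [[0] * n for _ in range(n)]

    # Build up from short intervals to long ones, since OPT(i, j) depends on shorter intervals.
    for length in range(5, n):                 # j - i = length
        for i in range(0, n - length):
            j = i + length
            best = OPT[i][j - 1]               # option 1: base j is not in any pair
            for t in range(i, j - 4):          # option 2: pair base j with some base t
                if complement[bases[t]] == bases[j]:
                    left = OPT[i][t - 1] if t > i else 0
                    best = max(best, 1 + left + OPT[t + 1][j - 1])
            OPT[i][j] = best
    return OPT[0][n - 1] if n > 0 else 0
```

This only returns the count; tracing back through the table to get the actual pairs works the same way as in the other problems. It's O(n^3) overall, since each of the O(n^2) intervals tries O(n) choices of t.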
- Insights and Questions:
I'm not sure why this was so much more complicated than previous problems. I mean, it seems intuitive that you would need two variables for the start and endpoints, and I'm not sure why they needed to show the initial, incorrect attempt at a solution (I didn't think it made it any more clear).
- Readability:
I thought it was very readable; even though I don't have any idea about biology-type stuff, I understood their explanation of the problem without much trouble.
Section 6: Sequence Alignment
- Summary:
Sequence alignment is a problem that comes up when you're comparing strings. For example, it's used online a lot: when you misspell a word a little, the site asks you, "Did you mean ___?" Ex. ocurrance and occurrence. We see intuitively that they're similar, but how does a computer see that they're similar? Well, we can almost line them up letter by letter, o-currance vs. occurrence, with one gap and one mismatched vowel (a vs. e). So we're going to use a model that determines similarity based on the number of mismatches and the number of gaps. How do we decide how to line them up? It's kind of easy when we're looking at English words, but when you get into seemingly random strings like the DNA/RNA in biology, it becomes a lot less clear. They determine similarity based on the optimal alignment between two strings X and Y, where there's a gap penalty (delta, which I will write as "d") and a mismatch cost (alpha_pq, which is 0 if the letters match, and can have different values for different mismatches; e.g. a vowel-vowel mismatch might not incur as much of a penalty as others). So the cost of an alignment M is the sum of its gap and mismatch costs, and we look for an alignment of minimum cost. The recurrence for the minimum-cost alignment is OPT(i,j) = min[ alpha(xi,yj) + OPT(i-1,j-1), d + OPT(i-1,j), d + OPT(i,j-1) ]. That is, we start at the back of each string: we can either take a gap cost at the end of one string and then get the minimum alignment of all but the last character of the other string (there are two options here, the last two in the equation), or we can match the two last letters and take the cost of aligning them (0 if they match, nonzero if they don't) and then get the minimum alignment of the rest of the strings. They go on to prove the algorithm correct and show that it runs in O(mn) time and space (m and n are the numbers of letters in X and Y respectively). They also explain how you would get the optimal alignment back out of your memoized table (by following a shortest path between opposite corners of the matrix … more info on p. 284).
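A small Python sketch of the recurrence (my own names; alpha is passed in as a function, and the toy costs in the example below are just for illustration, not values from the book):

```python
def sequence_alignment(x, y, delta, alpha):
    """x, y: strings to align; delta: gap penalty; alpha(p, q): mismatch cost
    (0 when p == q).  Returns the minimum alignment cost OPT(m, n)."""
    m, n = len(x), len(y)
    # OPT[i][j] = min cost of aligning the first i letters of x with the first j of y.
    OPT = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        OPT[i][0] = i * delta                  # align x[0:i] against nothing: i gaps
    for j in range(1, n + 1):
        OPT[0][j] = j * delta
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            OPT[i][j] = min(
                alpha(x[i - 1], y[j - 1]) + OPT[i - 1][j - 1],   # match/mismatch x_i with y_j
                delta + OPT[i - 1][j],                           # x_i aligned to a gap
                delta + OPT[i][j - 1],                           # y_j aligned to a gap
            )
    return OPT[m][n]
```

For example, sequence_alignment("ocurrance", "occurrence", 1, lambda p, q: 0 if p == q else 1) should come out to 2 under those toy costs: one gap plus one vowel mismatch.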
- Insights and Questions:
I am really curious about how they pick values for delta and alpha. They explain that a lot of thought goes into it, and I definitely see how that could be. It could totally change how you align words.
- Readability:
Great! It was a good review of the algorithm, but I'll be using my class notes to review the process of getting the optimal alignment back out, because their version was harder to follow than what we did (it almost seemed rushed).
Section 7: Sequence Alignment in Linear Space via Divide and Conquer
- Summary:
- Insights and Questions:
- Readability:
Section 8: Shortest Paths in a Graph
- Summary:
- Insights and Questions:
- Readability: