It is different from the previous two problems because now we look at two variables, both the weight (not to exceed) and the value (to maximize), instead of only one quantity to optimize. On the surface it is a little bit like the weighted interval problem, because each item is a single yes/no decision: either include the item or not. However, when we look closer, to decide whether to include item n we need to see if there is a way to make better use of the weight without including it. So we calculate the optimal value at each weight w from 1 to W while allowing the 1st item, then the 1st+2nd items, the 1st+2nd+3rd items, ... up to all n items. This takes O(n*W) time. Tracing back to find which items to add takes linear time.
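As a sketch of this table-filling idea in Python (the item lists in the usage example are illustrative, not from the book):

```python
def knapsack(values, weights, W):
    """0/1 knapsack: maximize total value with total weight <= W."""
    n = len(values)
    # opt[i][w] = best value using only the first i items with capacity w
    opt = [[0] * (W + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for w in range(W + 1):
            opt[i][w] = opt[i - 1][w]          # skip item i-1
            if weights[i - 1] <= w:            # or include it, if it fits
                opt[i][w] = max(opt[i][w],
                                values[i - 1] + opt[i - 1][w - weights[i - 1]])
    # trace back in O(n) time to recover which items were chosen
    chosen, w = [], W
    for i in range(n, 0, -1):
        if opt[i][w] != opt[i - 1][w]:
            chosen.append(i - 1)
            w -= weights[i - 1]
    return opt[n][W], chosen
```

For example, `knapsack([3, 4, 5], [2, 3, 4], 5)` picks the first two items for total value 7.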
Readability is 7.

==== Section 5: RNA Secondary Structure: Dynamic Programming over Intervals ====
This is another slightly more complicated problem.
The problem is RNA base matching, which must satisfy three conditions: the two ends of a matched pair are separated by at least 4 elements, no two pairs cross, and only A-U and C-G match. We try out the possible matchings and filter out the ones that are definitely not optimal until only one is left. We try matching any two molecules that are far enough apart; if they do not match, we just go on to the next pair. The difficulty of this problem is that once a new pair appears, it imposes further constraints (no crossing, the 4-element rule) on everything else. In fact, in some cases the original problem is divided into two (uneven) subproblems. That is why we call it "dynamic programming over intervals."
Even though, realistically, …
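The interval recurrence described above can be sketched in Python like this (the sample strings in the usage example are my own, not from the book):

```python
def rna_max_pairs(s):
    """Max number of base pairs in an RNA secondary structure.

    Pairs must be complementary (A-U or C-G), separated by more than
    4 positions (no sharp turns), and non-crossing.
    """
    def match(a, b):
        return {a, b} in ({'A', 'U'}, {'C', 'G'})

    n = len(s)
    # opt[i][j] = max pairs within s[i..j]; short intervals hold 0 pairs
    opt = [[0] * n for _ in range(n)]
    for length in range(6, n + 1):           # interval s[i..j] of this length
        for i in range(n - length + 1):
            j = i + length - 1
            best = opt[i][j - 1]             # case 1: j stays unpaired
            for t in range(i, j - 4):        # case 2: j pairs with some t
                if match(s[t], s[j]):        # splits into two subintervals
                    left = opt[i][t - 1] if t > i else 0
                    best = max(best, 1 + left + opt[t + 1][j - 1])
            opt[i][j] = best
    return opt[0][n - 1] if n else 0
```

For example, `rna_max_pairs("GGGAAAACCC")` finds the 3 nested G-C pairs.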
Readability is 6.

==== Section 6: Sequence Alignment ====
This section introduces a fairly commonly seen problem: sequence alignment, which appears in Google search suggestions and in computational molecular biology. We want to measure how "similar" two sequences are.
The algorithm given by the book uses the recurrence:
OPT(i,j) = min( α(x_i, y_j) + OPT(i-1, j-1), δ + OPT(i-1, j), δ + OPT(i, j-1) )
The first possibility is to have i and j matched with each other (paying the mismatch cost α), the second is to skip i, and the third is to skip j (each paying the gap cost δ).
To prove the correctness of the dynamic programming, we argue by cases: in an optimal alignment, either positions i and j are matched to each other, or at least one of them is left unmatched, and the three terms of the recurrence cover exactly these cases.
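A minimal sketch of the recurrence in Python, assuming illustrative penalties (gap δ = 2, mismatch cost 3; the book leaves these as parameters):

```python
def alignment_cost(x, y, delta=2, alpha=lambda a, b: 0 if a == b else 3):
    """Minimum alignment cost between strings x and y.

    Fills the (m+1) x (n+1) table of the recurrence: match/mismatch
    x[i] with y[j] (cost alpha), or skip either one (gap cost delta).
    """
    m, n = len(x), len(y)
    opt = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        opt[i][0] = i * delta              # align a prefix of x against nothing
    for j in range(n + 1):
        opt[0][j] = j * delta
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            opt[i][j] = min(alpha(x[i - 1], y[j - 1]) + opt[i - 1][j - 1],
                            delta + opt[i - 1][j],     # skip x[i-1]
                            delta + opt[i][j - 1])     # skip y[j-1]
    return opt[m][n]
```

For example, `alignment_cost("abc", "abd")` is 3: one mismatch is cheaper than two gaps.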

However, in a real-world scenario we are not simply given the target to compare with what we type in. In other words, how can the computer know which words to compare our input against? It has to align the typed word against many candidate words and return the closest ones.

Readability is 7.

==== Section 7: Sequence Alignment in Linear Space via Divide and Conquer ====
This section gives a space improvement of the sequence alignment algorithm from the previous section.
In the real world, especially in molecular biology, there are many elements to be aligned, and the algorithm from the last section takes too much space (O(mn) for the table). Hence, we hope to find a more space-efficient algorithm.
The first idea that comes to mind (in the book) is to record only the current column and the one right before it, because those hold all the information needed to calculate the optimal value.
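The two-column idea can be sketched like this (assuming the same kind of illustrative penalties, δ = 2 and mismatch cost 3; note that it returns only the optimal value):

```python
def alignment_cost_linear_space(x, y, delta=2, mismatch=3):
    """Alignment cost in O(n) space.

    Keeps only the previous column and the current one; this finds
    the optimal value but cannot recover the alignment itself.
    """
    m, n = len(x), len(y)
    prev = [j * delta for j in range(n + 1)]   # column for the empty prefix
    for i in range(1, m + 1):
        cur = [i * delta] + [0] * n            # column for prefix x[0..i)
        for j in range(1, n + 1):
            a = 0 if x[i - 1] == y[j - 1] else mismatch
            cur[j] = min(a + prev[j - 1],      # match/mismatch
                         delta + prev[j],      # skip x[i-1]
                         delta + cur[j - 1])   # skip y[j-1]
        prev = cur                             # discard the older column
    return prev[n]
```

It computes the same values as the full table, using only two columns at a time.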
However, a problem appears: we cannot record the path leading to the optimal value that way. The book then elaborates the idea using divide and conquer. The algorithm is based on this theorem: for sequences of lengths m and n, fix k in {0..n} and let q be an index that minimizes the quantity f(q,k) + g(q,k), where f(q,k) is the cost of the best path from (0,0) to (q,k) and g(q,k) is the cost of the best path from (q,k) to (m,n); then there is a minimum-cost corner-to-corner path passing through (q,k).
This, together with working the problem backward from (m,n) to (0,0) as well as forward in the original way, allows us to use divide and conquer. For one of the sequences, wlog say the one of length n, we always divide it in half; the other sequence we divide at the value of q. Along the way, we record each q.
It indeed takes O(m+n) space. The running-time recurrence is T(m,n) ≤ cmn + T(q, n/2) + T(m-q, n/2), which solves to T(m,n) = O(mn), so we keep the same running time as before.
I really like the idea, and the readability is 9.

==== Section 8: Shortest Paths in a Graph ====
This is a variation of the shortest-paths-in-a-graph problem, which previously allowed only positive edges. In this extended problem we allow negative-weight edges; however, we still do not allow a negative cycle, because a path could loop around such a cycle forever, driving the cost to negative infinity and making the problem meaningless.
When there are negative edges, it is easy to come up with a counterexample showing that our old greedy algorithm no longer works. The book then tries to modify the greedy algorithm by adding a constant M, big enough to make every edge positive, which is also easily seen to be wrong (it penalizes paths with more edges). Hence, we need to turn to dynamic programming.
The recurrence is:
OPT(i,v) = min( OPT(i-1,v), min over w of ( c_{vw} + OPT(i-1,w) ) )
where OPT(i,v) is the cost of a cheapest path from v to the destination that uses at most i edges.
This algorithm, like many other dynamic programming algorithms, fills in a table of subproblem values column by column; the table takes O(n^2) space, and the computation runs in O(mn) time if each round only scans the actual edges.
In addition, to save both time and space, we set a stopping signal: when two consecutive columns come out identical, the values have converged and we can stop early.
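A sketch of this dynamic program in Python, collapsing the table to a single array (a standard space saving) while preserving the stopping test: a pass that changes nothing means two consecutive columns would have been identical.

```python
def bellman_ford(n, edges, t):
    """Shortest-path cost from every node to target t.

    Allows negative edge weights but assumes no negative cycles.
    edges is a list of (v, w, cost) triples for edge v -> w.
    """
    INF = float('inf')
    M = [INF] * n
    M[t] = 0
    for _ in range(n - 1):               # at most n-1 passes are needed
        changed = False
        for v, w, cost in edges:
            if M[w] + cost < M[v]:       # a cheaper path through w
                M[v] = M[w] + cost
                changed = True
        if not changed:                  # stopping signal: nothing updated
            break
    return M
```

For example, with edges `(0,1,4)`, `(0,2,2)`, `(2,1,-3)` and target 1, node 0's cheapest route goes through the negative edge, for cost -1.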
The readability is 8.

==== Section 9: Shortest Paths and Distance Vector Protocols ====
This section introduces an algorithm that improves on the shortest-paths algorithm from Section 8.
First we observe the shortcoming of the original algorithm: it is a pull-based algorithm, in which, in each iteration, every node contacts all its neighbors and "pulls" their newest values, even when nothing has changed. The improvement is a push-based algorithm: a node sends its new value to its neighbors only when that value actually changes, so no effort is wasted on updates that cannot make a difference.
It then goes on to discuss some concerns associated with the distance vector protocol. In a real-life scenario, if one of the edges is deleted, which totally cuts off all connections between a node a and the destination node b, the distance vector protocol will never cease updating M[a] and never realize there is no path available. This, of course, can be problematic.
The way to solve it is to keep track of the bigger picture: each node stores knowledge not only of its neighbors but of the entire network, to ensure a path actually exists. The extra storage takes more space but is indeed necessary.
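The push-based idea can be sketched with a worklist: a node's value is propagated to its in-neighbors only when it changes (the graph encoding here is my own, and this sketch ignores the asynchrony of a real protocol):

```python
from collections import defaultdict, deque

def push_based_shortest_paths(n, edges, t):
    """Push-based shortest paths to target t (distance-vector style).

    edges is a list of (v, w, cost) triples for edge v -> w;
    assumes no negative cycles.
    """
    INF = float('inf')
    incoming = defaultdict(list)        # incoming[w] = nodes v with v -> w
    cost = {}
    for v, w, c in edges:
        incoming[w].append(v)
        cost[(v, w)] = c
    M = [INF] * n
    M[t] = 0
    queue = deque([t])                  # nodes whose value just changed
    while queue:
        w = queue.popleft()
        for v in incoming[w]:           # push w's new value to its in-neighbors
            if M[w] + cost[(v, w)] < M[v]:
                M[v] = M[w] + cost[(v, w)]
                queue.append(v)         # v changed, so it must push too
    return M
```

Nodes untouched by any update do no work at all, which is exactly the savings over the pull-based version.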
Readability is 7.