Interest: 7/10. this is back to bit too abstract for me. maybe (in fact probably) this week's reading will help out if I take big data class this fall. it seems network flow would be huge part of big data since it's about partitioning too big a set to compute with x resources into smaller sets with x*y resources
the prob: graphs through nodes and switches act as transportation networks with traffic of certain capacities traveling through, and having source nodes and sink nodes for that data
-flow networks
-defining flow: 0⇐f(e)⇐ce and other than s t flow into node matches flow out of node -the max-flow prob: arrange traffic in a flow network as efficiently as poss (max the flow value)
>defining the alg: there's no general alg to do it, so to design one think in terms of pushing forward on edges with leftover capa, and backward on edges already carrying flow; this is a residual graph -the residual graph: node set of Gf is same as G, ce-f(e) leftover units of capacity on edges to push forward to, and backward edges e'=(v,u) to undo flow -aug paths in a res graph aug(f,P) let b=bottleneck(P,f) for each edge in P
if edge is forward: increase f(e) in G by b else: decrease f(e) in G by b * 7.1 f' is a flow in G * proof: 0<=f(e)<=f'(e)=f(e)+ bottleneck(p,f)<=f(e)+(ce-fe) = ce for forward edges and reverse the <= to >= and subtract bottleneck for backward edges
Max-Flow all e in G to 0 while there's path s-t in Gf
let P be that simple plath f'=augment(f,P) update f to be f' update Gf to be Gf'
return f
>ana the alg: termination and RT -7.2: at every intermediate stage of FFA, flow values{fe} and the resi caps in Gf are ints. Proof: suppose true after the while loop j iterations, then bn(P,f) will be an it and f' will have ints so will the new res graph -7.3: let f be a flow in G, and P a simple st path in G. Then vf'=vf+bn(P,f); and since bn(P,f)>0 we have v' > vf. Proof: e must be forward since path is simple, so if we increase bn and not change any incident edge to s, f' exceeds f by bn -7.4: suppose as above that all caps in G are ints. Then the FFA terminates i nat most c iterations of the while loop. Proof: by 7.3 flow in FFA increases each iter, and by 7.2 at least by 1. it starts with 0 and not higher than c, so that's max c iters -7.5 suppose that all caps in G are ints, then FFA can be implemented to run in mC time. Proof: most work done in one iter is f. have two linked lists for each node in Gf, then to find path st we can use breadfirst or depthfirst in )(m+n). then for each e we can construct forward and backward edges in Gf' in Om time
ana the alg: flows and cuts: structure is important because the max flow is determined by any point where it's just one point connecting two separate cut sets
-7.6: let f be any st flow and AB and set cut. then vf=fout(A) - fin(A). Proof: vf = foutv-finv for all v in A. Then that equals feoutA - feinA = foutA-finA -7.7 let f be any st flow and AB any st cut then vf=finB-foutB -7.8: vf<c(A,B) (the value of every flow is upper bounded by the cap of every cut
ana the alg: max-flow equals min-cut
-7.9: if f is an st flow such that there is no st in Gf, then there's st (A*,B*) in G for which vf=c(A*,B*). conseq, f has max val of any flow in G and A*B* has the min cap of any st cut in G. Proof: vf=foutA*-finA* = feoutA* - feinA* = ce-0 out of A* = c(A8,B*) -7.10:the flow f_ ret by FFA is a max flow -7.11: given f of max val we can compute st of min cap in m time. proof: 79 then bfs or dfs to find A*, then B*=v-A* and return cut(A*,B*) -7.12: in every flow network there f and cut so that vf=c(A,B) -7.13: in every flow network the max val of st flow is equal to min cap of st cut -further analysis: integer valued flows –7.14: if all caps in flow network are ints then there is max flow f for which every flow val fe is an its –real #s as caps? all the theorems here were assumed that caps had to be ints. but what if they're real? then we have no guarantee that the while loop in the FFA ever terminates since each increment can be inf small; with pathological choices for the aug path, the FFA can run forever. next section is about aug paths to avoid bad behavior of the alg (using reals I think)
the upper bound can be very bad if each bottleneck is the same in graph as in res graph, and if that bottleneck were potentially one. then to get flow value of 200, for example, we'd have to do 200 augmentations
Des a faster flow alg
Scaling Max-Flow set all e in G to 0 set D to be largest power of 2 no larger than max cap out of s while D>=1
while there's st path in GfD let P be a simple st path in GfD f'=aug(f,P) update f to be f' and GfD D = D/2
return f
ana the alg
-7.15: if the caps are int vals then throughout the SMFA the flow and res caps are ints, implying when D=1, GfD is same as Gf and hence when the alg terminates the flow, f is of max value -7.16: the number of iters of the outer while loop is at most 1+logC -7.17: during the Dscaling phase, each aug increases the flow val by at least D -7.18 let f be the flow at the end of the Dscaling phase. There is st cut AB in G for which CAB⇐vf+mD where m is the number of edges in G. conseq, the max flow in the network has val at most vf+mD. Proof: vf = feoutA - feinA >= (ce-D outA) - DeinA = ceoutA - DoutA - DinA >= c(A,B) - mD -7.19 the # of augs in scaling phase at most 2m -7.20 the SMFA in graph with m edges and its is at most 2m(1+logc) augs and runs in at most m^2logc time
extensions: strongly polynomial algs: polynomial number of bits, solved by dinitz, edmonds, karp shows a way to do those in Omn iters; algs to do mnlogn, n^3, and min(n^2/2, m^.5)mlognlogU
the prob: bipartite graph unidirected whose node set can be partitioned. a matching is a subset M such that each node apppears in at most one edge in M
des the alg: direct all X to Y, then add node s and sx to each X then to and Y to t, then each edge a cap of 1
ana the alg
-7.34: M' contains k edges cuz in the cut set, first is cardinality of M' while second is 0 since nothing enters A, so M' contains k edges -7.35: each X is the tail of at most one edge in M': if tail of 2, 2 units of flow leave from x, so 2 must have come into X but the cap is 1 so that can't be -7.36: each Y is the head of at most one edge in M' -7.37: the size of max match in G is equal to val of max flow in G' and the edges in such a matching G are the edges that carry flow from X to Y in G' -bounding the running time
-Extensions: the structure of bip graphs with no perfect matching -what if the graph doesn't have a perfect matching? 7.39: if bib G with two sides XY has perfmat, then for all A in X we must have GamA>=|A| -7.40: assume bip G with XY has |X|=|Y|. then G has either perfmat or there's subset AinX that GamA<|A|. a perf mat or appropriate subset can be found in mn time -proof 7.40: show val of maxflow in G' is n; by mfmct we know A' contains s from XY. assume by 7.11 A'B' can be found by running FFA, then moving from B' to A' on y, yt now crosses the cut. but there was at least one edge xy with x in A since y in GamA so all edges from A and y used to cross the cut and can't anymore. so now the cap of the cut can't be increased, and is |A intersect B'| + |Y intersect A'| so n-A + gamA ⇐ c(A'B') < n
the prob: circulations with demands: consider fixed supply and fixed demand values and shift flow to meet those demands (like highway or railway of shipping routes). So now G has dv>0 and supply of -dv. set S to ann nodes with negative demand T to nodes with pos demands
1) cap conditions: for each e in E we have 0⇐fe⇐ce 2) demand conditions: for each v we have v finv-foutv = dv -feasibility prob: does circulation meeting 1 and 2 exist? -7.49: if there exists feas circ with demands dv then sum of vs in dv = 0 -proof of 7.49: dv must = finv - foutv so e is counted twice and cancel out. so dv = 0
des and ana an alg for circs: attach super source s* and super sink t* to each node in T, then add those to G. for each v in T add edge vt* with cap dv and each u in S add s*u with capacity -du then carry over rest of G to G'
-7.50: there's feas circ with dv in G iff max s*-t* in G' has value D. if all caps and demands in G are ints and there's feas circ, then there's feas circ that is int -7.51: the graph G has fc with dv iff for all cuts AB, dv⇐c(AB)
the prob: circs with demands and lower bounds
-lower bounds are used to force/ensure each edge has a min flow through them, useful so that wasted paths are not there 1) cap conditions: each e in E, le⇐fe⇐ce 2) dem cond: each v in V finv-foutv = dv
des and ana an alg with lower bounds
-consider f0inv - f0outv = leeinv - leeoutv then we have ce-le units to work with in remainder, so in G' the capacities of the edges will be ce-le and demand of node v dv-Lv -7.52: there's fc in G iff there's fc in G'. if all demands, caps, and lower bounds in G are ints, and there's fc, then there's fc that is int -proof of 7.52: finv-foutv = einv(le+f'e) - eoutf(le+f'e) = Lv + (dv-Lv) = dv so G stipulations are met and conversely f'inv - f'outv = (fe-le)einv - (fe-le)eoutv = dv-Lv so G' stipulations are met Interest: 8/10. partly because now we're dealing with non-binary, but multifaceted problems, and largely because it's week11! semester almost over. but not quite.
-des a rec alg
ComputeOpt(j) if j=0 return 0 else return max(vj+ComputeOpt(pj, computeOpt(j-1))
-memoizing the recursion
MCompOpt(j) if j=1 return 0 else if M[j] not empty return M[j] else define M[j] = max(vj+MCompOpt(pj), MCompOpt(j-1)) -analyzing the memoized version
-computing a sol in addition to its value
findSol(j) if j=0 then output nothing else if vj+M(pj)>M(j-1) output j with result of findSol(pj) else output result of findSol(j-1)
-des the alg IterativeCompOpt M(0)=0 for 1..n, M(j)=max(vj+M(pj), M(j-1)) -analyzing the alg: using 6.1 and 6.3 we know by induction that opt(j) is written in M(j), then we can use findSol and IterativeCompOpt to fill out M -a basic outline of dynamic programming: breaking it down into subprobs 1)there are only a polynomial number of subproblems 2)the sol to the original problem can easily be computed from the solutions to the subprobs (the orig might even be 1 subprob) 3) there is a natural ordering from smallest to largest, and easy to compute recurrence that allows one to determine the solution to a subprob from the sols to some number of smaller subprobs There's a chicken and egg problem underlying dyprog because it's never clear that a collection of subprobs will be useful until we find the recurrence linking them, but that is hard to find without smaller subprobs to build on in the first place
-we're no longer in binary mode (ooh wow) -the problem
-formulating the problem. penalty of partition is 1) the number of segments into which we partition P times a fixed given multiplier C>0 2) for each seg, error val of opt line through that seg -designing the alg
segLeastSquares(n) array M(0..n) set M0 = 0 for all i⇐j pairs
compute the least squares error ei,j for pi..pj
for j=1..n, use 6.7 to compute M(j) return M(n)
findSegs(j) if j=0 output nothing else
find i that mins rec relation output seg pi..pj and the result of findSegs(i-1)
-analyzing the alg: it's n2 because there is a pair to check for n entries. and then to find ei,j is also linear so its n3, but we can work it down to n2 by using the ingredients of calc for ei,j-1 to find ei,j in constant time. so RT: n^2
-the problem
-des the alg
subsetSum(n,w) array M(0..n) init M(0,w) = 0 for each w=0..w for i=1..n
for w=0..w use 6.8 to compute M(i,w)
return M(n,w) -analyzing the alg
-extension: the kn prob: what about nonnegative weight wi and distinct val vj, but now we have to find max val with a restriction on what the max value is (not just the size of kn now but the value). bound wi⇐W 1) if n not in O, opt(n-1,W) 2) if n in O, opt(n-1), W-wn)
Interest: 9/10. You're right, the finding closest pair of a sorted list in linear time is cool.
DAC using recursion on q subprobs of n/2 size each
5.3: for some c, Tn ⇐ qTn/2 +cn when n>2 and T2 ⇐ c
The case of q>2 subprobs
* analyze first few levels for prob of size n; usually total work per level increases as recu does
>the case of one subprob
>a related recurrence: Tn ⇐ 2Tn/2 + n^2
the prob: rankings, collaborative filtering, having to count how many inversions your pref list has with one person/many other people's pref list(s). also about how to analyze that data and what sorts of conclusions can be drawn from them
des and ana the alg
* we can do better than taking each number and comparing all the others (n^2). we can do nlogn
maintain Current pointer init to first el maintain Count for inv, init to 0 while both A and B not empty append smaller of ai and bj to output list if bj is smaller, increment Count by num of el in A left advance pointer of list of the smaller num append remainder of other list to output return count and merged list
if list has one el, no inv else div list into 2 halves w/ a first n/2 and b second n/2 els (ra, A) = sort and count(a) (rb, B) = sort and count(b) (r,L) = merge and count(a,b) return r=ra + rb + r and sorted list L
the prob: n points in plane, find pair that is closest; part of computational geometry (graphics, comp vision, geographic info sys, molecular modeling). we gotta do better than n^2. in this chap we'll do nlogn time, but in chap 13, there's an n one (say wha??)
des the alg: set of points P, pi has coords (xi,yi) and d(pi,pj) is the distance b/t the two pts
* DAC style is to find closes pair among left half then right half then get overall sol in linear; this can be nlogn with the combining part being an additional n
construct px and py nlogn time p*0, p*1 = closest pair rec(px, py)
closest-pair-rec(px,py) if |p| ⇐ 3, then find closes pair by meas all pairwise dist
construct qx,qy,rx,ry in n time q*0,q*1 = cpr(qx,qy) r*0,r*1 = cpr(rx,ry)
de = min above 2 lines x* = max x coord of pt in Q L = {(x,y):x=x*} S = pts in P within de of L
construct Sy n time for each s in Sy, compute dist from s to each of next 15 pts in Sy; let s, s' be the min dist from these pairs ntime
if d(s,s')<de, return d(s,s') else if d(q*0,q*1)<d(r*0,r*1): return [q] else return [r]
Interest: 8/10 because this is only the 2nd new type of alg we've come across in 7 weeks of the class
the problem
* have to classify/org a collection of objects
>designing the alg
the problem
* encoding symbols using bits: say we set four bits for each letter of the alph… but fixed bits isn't eff because not all letters are used frequently; reducing av num bits/letter is a fund prob of data compression, and here's where compression algs come into play
>des the alg (greedy again)
rep prefix codes using bin trees
* 4.27: the enc of S constructed from T is a prefix code
»a first attempt: the top-down approach
»what if we knew the structure of the OPC?
»an alg to construct OPC; Huffman's alg (greedy because it's same frequencies and repps but with as few letters repping as possible; bottom up as opposed to Shannon-Fado top down) to con a pref code for alph S with given freqs: if S has 2 letters: enc one letter with 0, the other with 1 else let y* and z* be lowest FL form new alph S' without y* and z*, replaced w or the sum of their freqs recursively construct pref code y' for S' with tree T' pref code for S is: start with T', take leaf labeled w and add two children y* and z* to it
analyzing the algits optimality
* 4.32: ABL(T') = ABL(T) - fw
»impl and RT: k^2; but using pq, it can be klogk
extensions
* BW images where one color overwhelms the other can be compressed using arithmetic coding algs
>approaches to solving recurrences: 1) unroll using rec by finding RT of first few levels and then finding a pattern then summing it all up to get total RT or 2) plugging-in a guess into the recurrence relation and seeing if it works
unrolling the mergesort recurrence: 5.2: any function T satisfying 5.1 is bounded by nlogn when n>1
subbing a solution into the MR: T(n) = 2c(n/2)log(n/2)+cn=cnlogn - cn_cn = cnlogn pg 213
an approach using partial substitution (weaker of the two):T(n)⇐2k(n/2)log(n/2)+cn = kn(log2-1)+cn = T(n) ⇐ knlogn - kn + cn ⇐ knlogn
Interest: 7/10
The problem: one resource to meet n requests for an interval of time. Each req has deadline d. Lateness means the deadline was missed. We want to minimize the max lateness of a request
Designing the Alg
* Order the jobs in order of increasing length t to get the short jobs done first; but this completely ignores d, so it's bad
consider jobs i=1…n in this order
assign i to the time int from s(i) = f to f(i) = f+t let f=f+t return the set of sch ints s(i),f(i) for i=1...n
>Analyzing the Alg
Proof: a) If O has inv, then there's some job with dj < di b) After swapping i and j we get a sch with one less inversion c) the new swapped sch has a max lateness no larger than O. (i cannot be more late in the sch O than j was in the sch O)
>Extensions: say all jobs have same s, or if they have release time r before which they can't be completed
problem: given start node s, what is the shortest path from s to each other node?
Designing the Alg / Dijkstra's Alg
Dijstra's Alg(G,l)
let S be the set of explored nodes for each u in S, store distance d(u) S = {s} and d(s) = 0 while S != V select node v not in S with at least one edge from S for which d'(v) = min[e(u,v)]d(u) + le is as small as possible add v to S and define d(v) = d'(v)
>Analyzing the Alg
Proof: S=1 is easy cuz both S and d = 0. Induction hypothesis: Pu is shortest su path for each u in S. Let y be the first node on P that is not in S and let x in S be the node before y. P can't be shorter than Pv because it is already at least as long as Pv by the time it has left S *
>Imp and RT: it is O(mn) but can be improved by maintaining d'(v) for each node, and then having a pq for the nodes V-S; then have ChangeKey and ExtractMin ops -4.15 Using a pq, DA can be imp on a graph with n nodes and m edges to run in O(m) time, plus the time for n ExtractMin and m ChangeKey ops
Prob: connect n nodes with the fewest (and cheapest) edges possible
* 4.16 let T be a min cost solution to the network design problem defined above. Then (V,T) is a tree Proof: (V,T) must be connected; if it contained C, then (V,T-{e}) is also a valid solution to the problem and cheaper than taking the long way around the remainder of C; but that's a contradiction, so it has no C and is a tree
>Designing Algs
>Analyzing the Algs
Proof: we can identify edge e' in T that is more expensive than e, and then exchanging e for e' results in another spanning tree cheaper than T
Proof: v in S but w not in S, no oedge from S to V-S has been encountered yet or else KA would have added it; hence e is the cheapest edge with one end in S the other in V-S and by 4.17 belongs to every MST
Proof: in each iteration of the alg there's v and e that minimize q; 4.17
Proof: delete e from T, partitioning two components S and V-S. follow P from v to w, from S to V-S, and we find some edge e' on P that crosses over; by 4.17 T' is a spanning tree of G, and since e is the most expensive of C and e' belongs to C, e' is cheaper than e and so T' is cheaper than T
Proof: any time e is removed, it lies in C; it is the most expensive in C be def of RDA so by 4.20 it is not part of a MST
>Implementing Prim's Alg
The problem. The UF ds will support 3 ops: 1) MakeUnionFind(S) will return a UF on S where all elements are in sep sets; the goal for this is O(n) where n = |S|. 2) Find(u) will return the name of the set containing u. Goal is O(logn) although K is possible. 3) Union(A,B) will change the ds by merging A and B into one set; Goal is logn
A simple ds for uf
*maintain array Component
>A bs for uf
Proof:every time the name of the set with v nodes changes, size of this set doubles, since the set containing v starts at 1 and is never larger than n, its size can double at most logn times so there is at most logn name changes
Further improvements: compression idea is to make subsequent calls to Find cheaper by bounding the total runtime for a sequence of n Find ops rather than worst case time for any one of them. Inverse Ackermann function
Impls KA: sort edges by cost mlogm, then UF to maintain conn comps of V,T and compute Find(u) and Find(v) over and over for a total of at most 2m Find and n-1 Union ops
*4.25 KA can be impl on a graph with n nodes and m edges to run in O(mlogn) time
How to define it? Small steps, local steps that maximize work each time so that the bigger picture/problem is solved in the fastest time or in the smallest number of steps; I guess they're extremely efficient algs
Guaranteed close to optimal, and often times optimal
Approaches: 'stays ahead' and 'exchange'
Apps: shortest path in a graph, Minimum Spanning Tree Problem, Huffman codes for data compression, clustering, Minimum-Cost Arborescence Problem
Interest on 3.4-4.1: 8/10. The book is surprisingly helpful because my first exposure to these concepts is in class, so when these readings are due it's more like reviewing or at least visiting something that has vaguely imprinted on my brain. I put a 7/10 originally, but then realized that GAs would probably be great to know in the future because they're very practical - anytime many tasks need to be done in a crunched time (meaning always haha) they come in handy and automate the scheduling process for us.
We want to max number of compatible subsets; when max size, it will be called optimal
Designing a GA
-Select request i1, reject incompatibles, then i2 and so on until all requests met -Some rules: 1) always select available request that starts earliest (not opt cuz earliest request might be the longest) 2) request that takes shortest time to complete (bad if it lies over the edges of two other reqs) 3) for each req, count the number of reqs not comp and start with the fewest incomp req, go by the earliest finish times Alg: let R be set of reqs and A empty while R not empty choose req i w/ smallest finishing time add i to A del all reqs from R not comp with i
Analyzing the alg
-A is a compatible set of requests (otherwise we could not even start the rest of the alg) -Just show that A contains the same # of ints as Optimal and hence is optimal itself; 'stay ahead' -Fact (that we have to show): for all indices r⇐k, we have f(ir)⇐f(jr). The finish of r-1 ⇐ start of jr, meaning jr is in the set R f available intervals at the time the GA selects jr, and so f(ir) ⇐ f(jr) and always will be like that until all the intervals are checked through -Fact: the GA returns optimal set A -Proof: BWOC. If A is not optimal, then A has more requests m>k. Req j(k+1) in O starts after jk ends, and hence after ik ends too. So if we delete all requests not compatible with reqs i1…ik, R still contains j(k+1), but the greedy alg is set to stop after request ik /when set is empty/. -Implementation and Running Time: nlogn. ordering is logn, iterating through intervals and selecting is n
Extensions. Some issues that can arise in its impls are:
-We might not know all the reqs when we start the alg; online algs have to make decisions at the present as time proceeds, with no knowledge of future input -Prioritized requests change their order as we have it currently
A Related Prob: Scheduling All Intervals: if we have many identical resources available and must schedule all reqs with fewest res possible
-Prop: in any instance of Interval Partitioning, the # of res needed is at least the depth of the set of intervals. -Proof: suppose a set of ints has depth D, and I1 to Id all overlap; then each of I must be on different resource, at least d of them -An alg for scheduling all intervals with fewest resources sort ints by start time, breaking ties arbitrarily for j-1…n for each int i preceding and overlapping j, exclude i from consideration for j if any label has not been excluded then assign a nonexcluded label to j else leave j unlabeled -Fact: using GA above, every int will be assigned a label, and no two overlapping ints will receive the same label -Proof: argue no iterval ends up unlabeled Ij, and t intervals earlier overlap it. So t+1 intervals pass some common pt on a time line and ⇐ d. At least one of the d labels is not excluded by t and can be assigned to Ij. Then claim no two overlapping ints have the same label; if I and I' overlap, I is in the set of intervals whose label is excluded from consideration, so the alg will not assign the label for I to I' -Fact: the GA schedules every interval on a resource, using a number of resources equal to the depth of the set of intervals. This is the optimal number of resources needed
Question: In this section we take the approach of comparing in some way the GA to what I'll call “the Optimal Algorithm.” But practically speaking, if we already have the OA in the first place to compare GA to, why don't we just stick with the OA?
Bip graph is one where node set V can be partitioned into sets X and Y so that every edge has one end in X and one end in Y
Fact: if graph G is bip, it cannot contain an odd cycle
Designing the alg: assume G is connected, pick any node s in V and color it red, color its neighbors blue, then alternate; this is identical to BFS but just with an extra array Color
Analyzing the alg
-Let G be connected graph with layers L1, L2…. Then exactly one of the following is true: 1) no edge of G joins two nodes of the same layer or 2) G is not bipartite and has odd-length cycle -Proof: first consider case i where no edge joins two nodes of same layer. every edge of G joins nodes either in same or adjacent layers though. so because of i, every edge here joins two nodes in adjacent layers. but our coloring procedure makes it so that every edge has ends of opp colors, so G must be bipartite (and once again, has an odd cycle)
Repping Dir Graphs: adjacency list, each node has 2 lists assoc with it (one to which and one from which edges)
The Graph Search Alg: use BFS and it is m+n runtime. This computes the set of all nodes t with the prop that s has a path to t (although not always vv). DFS is similar impl. To get a graph of all nodes that reach s, simply get a graph that is the reverse of G
Strong Connectivity (if there is a path from u to v and vv for all two nodes; mutually reachable)
-Fact: if u and v are mut reach, and v and w are mut reach, u and w are mut reach -Proof: to get to w, first we go u to v, then from v to w. these paths are guaranteed by mutual reachability of the nodes. Then we just reverse the reasoning to go from w to u -Strong component containing node s in G: the set of all v such that s and v are mut reach -Fact: for any two nodes and t in a dir graph, their str comps are either identical or disjoint -Proof: consider any two mut reach nodes s and t; claim their strong components are identical; then use the 2nd closest to here thm.
If no cycle, then the graph is actually a tree
Dir graph w/ no cycles is called a directed acyclic graph
Can be used to model precedence relations/dependencies; node for each task, and dir edge to show which task needs to be done first; use topological ordering
Fact: if G has a topo ord, then G is a DAG
-Proof: BWOC G has top ordering and cycle C (this is the contradiction part to prove). But then we can have node vi and vj before vi, even though j > i; so this contradicts topo ordering
Designing and Analyzing the Alg (proving converse of above thm)
-Fact: in every DAG G, there is node v w/ no incoming edges -Proof: (sort of BWOC) let each node in G have at least one incoming edge. How to find a cycle to prove our claim: pick v and go backward from v to u, then x, and so on. Keep going, and after n+1 steps we will have visited some node twice, which means the sequence of nodes we visited makes a cycle -Fact: If G is a DAG, then G has a topo ordering Find node v w/ no inc edges as first Delete v from G Recu compute topo ord of G-{v} and append this order after v -We declare a node to be active if it has not yet been deleted by the alg, and we explicitly maintain two things 1) for each node w the number of inc edges that w has from active nodes 2) the set S of all active nodes in G that have no incoming edges from other active nodes
Interest: this week's 3 sections were pretty interesting for me. I'd say 8/10 because personally I tend to care more things when I can see where it would be used in real life, and graphs are everywhere in my life. I know I should care about theoretical stuff too though and am getting there.
- Graph: pairwise relations among set of objects of V nodes and E edges - There are symmetric (undirected) and asymmetric graphs (directed) with u as a tail and v as the head - They're abstract and hard to define, so here are some examples of graphs: transportation networks (airports, train stations, taxi stations), communication networks (b/t computers, internet providers, networks and routing), information networks WWW), social networks, dependency networks (classes, bank tellers, food ordering) - Paths (connectivity) is moving from one node to another through connected edges. Simple paths mean no node is visited twice, cycle is path that starts and ends in same place - Connected means every pair of nodes u and v has a path between them; distance between two nodes is the min number of edges connecting them (aka short path) - Tree - an undirected graph that is connected and doesn't have a cycle. Root is “top” of rest of nodes, parent precedes child with edge distance of 1, with ancestors and descendants being distances greater than one; leaf has no descendant → concept of hierarchy captured in rooted trees - Fact: every n-node tree has exactly n-1 edges - Statement: G is an undirected graph of n nodes. any two of the following implies the third - g is connected, does not contain a cycle, has n-1 edges
- Problem of connectivity: is there a path from node s to t in graph G? - Breadth-First Search - explore outward from s in all pos directions, adding nodes one layer at a time (find all nodes joined by one edge to s)
L1 is the neighbor of s, which is L0. For L1 to Lj, Lj+1 consists of all nodes not belonging to an earlier layer that have an edge to a node in Lj
Fact: For each layer j>=1, Lj produced by BFS consists of all nodes distance j from s; there's a path from s to t iff t appears in some layer
Fact: Let T be BFS tree, x and y belonging in layers Li and Lj, and x-y is an edge of G. The i and j differ by at most 1
Proof: suppose by way of contradiction that i and j differed by more than 1; since x belongs to layer Li, the only nodes disc from x belongs to Li+1 and earlier; hence if y is a neighbor of x it should have been discovered by this pt at the latest and hence should belong to Li+1 or earlier - Exploring a Connected Component - set of nodes BFS alg finds reachable from s
Alg: R consists of nodes to which s has a path Initially R = {s} While there is an edge u v and u in R but v not in R: Add v to R
Prop: the set R produced is precisely the connected component of G containing s. Proof: we know for any node v in R there's path sv. consider node w, and that BWOC there is sw path P in G. s is in R but w is not so there must be a node v on P not in R; node v is not s; so there is a u before v on P, and edge uv exists; u is in R; uv is an edge where u is in R but v is not in R, which contradicts the stopping rule
-Depth-First Search - follow first edge until dead end, then backtrack; the “maze algorithm”
Alg: DFS(u): Mark u explored and add u to R For each edge uv incident to u: if v is not explored then recursively invoke DFS(v) > Prop: for a given recursive call DFS(u), all explored nodes b/t invocation and end of recursive call are descendants of u in Tree > Prop: For T, nodes x y in T, and x-y an edge of G that isn't an edge of T, either x or y is an ancestor of the other. Proof: xy is an edge of G not of T, x is reached first by DFS. xy is not added to T because it's marked explored. y was not explored earlier, and must have been discovered b/t invocation and end of the recursive call DFS(s) - so it must be a descendant of x
- The Set of All Connected Components - For any two nodes s and t, their connected components are either identical or disjoint. Proof: path st in G; claim conn comp of s and t are same set; so there's a path from t to v and v to s. If there's not path st, there's no v connected comp of each, because then we could walk from sv and then to t, but there is no path
- Repping Graphs - two basic ways are adjacency matrix and adjacency list
Goal run time is O(m+n) where m is the number of edges; note that since m>=n-1, we call O(m+n) linear
Ad matrix: nxm matrix A where A[u,v] is equal to 1 if the graphs contains edge uv and 0 otherwise. It's symmetric for undirected graphs. Two disads are it takes Theta(n^2) space no matter what, and takes O(n) time to check all edges incident to a node no matter what node)
Ad list (good for sparse (fewer than n^2 edges) graphs): actually an array where the index is the node and value is a list of index's adjacent nodes. Takes just O(m+n) space
Ad matrix is constant access, and ad list linear, but ad list is more natural rep of exploring graphs cuz physically it's like learning the neighbors of u when you arrive at u, and once there it is constant time although getting there might be linear
Proof: edge vw contributes twice to the sum of degrees in a graph: once in nv and once in nw. sinc the sum is the total of contributions of each edge, it is 2m
- Queues and Stacks - used when order matters in keeping records
Queue is FIFO and stack is LIFO; both repped as doubly linked list cuz is has explicit First and Last pointers
- Implementing BFS: set only s true, L0 consisting only s, layer counter i=0, current Tree T as empty. while Li not empty, initialize empty list L[i+1] and for each node u in L[i], if edge uv is false then set it to discovered true and add it to tree T and v to list L[i+1]. Then increment layer counter i by one.
Prop: the above impl of BFS runs linear if the graph is given by the ad list rep. n for the for loop n times, and constant considering of each edge of each vertice u times using an additional array Discovered set up outside the for loop
- Implementing DFS: stack S has one element s. while S is not empty take node u from S and if u is false, set to true and add each edge v incident to u to S
Prop: the above alg implements DFS in the sense that it visits the nodes in exactly the same order as the recursive DFS procedure in the previous section (except that the ad list is in reverse order to this)
Prop: the above implementation of DFS runs in O(m+n) time if the graph is given by the ad list rep
-Finding the Set of All Connected Components: start with some node s, use BFS or DFS, then find node v not visited by search from s and use BFS and DFS (v is disjoint from s by def). Linear runtime
Question: how hard is it to set up a tree, graph, stack or queue that has all of the functions specified in these sections? Not the O notation for setting them up but the real-world time and rules for them? Maybe they just seem complicated right now but are not that bad.
-It's good to use data structures to estimate the run time of algs rather than breaking down every tiny step. We'll use lists and arrays for the GS alg
-Need to rep ranking and who is matched with whom and who is free
-Keep a list of n elements as an array of length n, where the A[i] is the ith element of list. If ordered, then search is logn time
-Linked list for the list of men since ll is more dynamic and flexible than array. Create the list and also a pointer that goes to the current pos of the last operation (has deletion and insertion)
-Have an all men or all women array, but two arrays for the preferences of both genders. The goal is that each iteration of the GS alg runs in constant time so that it runs in time proportional to the upper bound of n^2 and no worse
-What has to be C time: who is a free man (linked list, select first man on that list), who m should propose to next (use extra array Next that indicates for each m the pos of the next woman to propose to), is w engaged (use array Current where w is the index and the value is her man), if w prefers m or m' (trickiest of the four: create nxn array at start ofalg, where Ranking[m,w] contains the rank of man m in sorted order of w's preferences. Creation is linear time for each w, and to decide if w prefs m or m' just compare [w,m] with [w,m']
-“Natural space” of a prob: the set of all possible solutions. Nat space and run time are the two bounds to consider when encountering a new problem Linear Time (rt is at most a constant * size of input)
-Computing the Maximum - a linear pass, where constant per element adds up to O(n)
-Merging Two Sorted Lists A and B
(Maintain 2 pointers init to front els While both lists not empty: append smaller of A or B to new list, advance Current pointer of which was smaller If one list is empty, put all the els of the other to the new list) Here's why it's n and not n^2 although the latter is technically correct: count the cost of each iteration of the selected element as constant because once it is first selected (“charged”) it is given to the new list and not seen by the alg again → 2n iterations at most = n.
-nlogn time: the rt of any alg that splits input into 2 even pieces and recurses before combining in linear time. mergesort, or algs whose expensive step is sorting
-Quadratic Time alg for finding closes pair of points using nested loop (For each x,y
For each other x, y compute distance formula update the min accordingly)
As the class progresses though, we'll find a way to do the above task in nlogn and then… n. say wha??
-Cubic time: 3 nested loops
-n^k time alg for checking if there's an edge between two nodes (For each subset s in k, check is S is independent set and if yes, we're done If no k-node ind set was found, declare failure)
-Beyond Polynomial Time:
2^n is a search space for many problems, so if we encounter it, then don't worry because most likely that actually means there's a better alg out there to improve the op. The “dichotomy” of brute-force algs and their alternatives
n! is the number of ways to match n items with n other items (think of this as perm n from 121)
Sublinear Time: possible when input can be “queried” and not read completely. An example is binary search where we prob entries by group of “active regions.” This happens when an alg does a C work to throw away a constant fraction of the input
More complex data structures take nontrivial processing when invoked but afterwards lead to highly efficient algs that would not be using simpler data structures
-PQs support addition and deletion and each element has a key that denotes the priority of it. They are used in real life whenever a schedule of processes must be made while taking turns -Why aim for logn time / op? Proof that “a sequence of O(n) pq ops can be used to sort n numbers”: set up H by inserting each number with its value as key. Then simply extract min one by one. Thus n numbers times logn ops per number gives nlogn which is the best we can hope for at this point
Heap - balanced bin tree with a root, node with up to children, and child. Heap order means child is at least equal to or bigger than parent. We can use pointers or arrays to rep heaps-repping-pqs depending on if we know that total number of elements that will be in it at any one time. left child is 2i and right child is 2i + 1 where i is the index of any node.
-Implementing Heap Ops. Start heap n, insert logn, findmin C, delete logn, extractmin logn. Heapify up (H, i): (if i>1 j is parent, and if child smaller than parent swap up.recursive heapify up(H,j). Swapping is C. We claim that after each swap array H is either a heap of almost a heap with the key of H[j.
-The height of tree is the max number of recursions. Add new elements to the end of the array and then heapify up. To delete, we only allow deleting the root; replace the root with the last element and then heapify up or down
-Heapify down (H,i): (n = length(H). if 2i>n, done. else 2i<n set j to be the smaller of i's left and right child. else 2i=n, set j to 2i. endif if child is smaller than parent, swap them, then call heapify down (H, j)).
-Implementing PQs with Heaps (same as implementing heap ops mentioned above).
-A second category of ops is finding elements by name rather than pos (Does this mean value rather than key?). To do this have additional array Pos that stores the current pos of each el/node in the heap. To delete, it's Delete(H,Pos[v]) and to change key, find where v is in Pos, change the key and then use heapify up or down as appropriate
Algorithms apply not just to cs. They aren't just a matter of solving a problem, but are about how to solve a problem the most efficient way (what the authors are hoping this book will teach me). -Once you find the 'mathematically clean core of a problem', then design the appropriate structure to solve that problem. '[Algs don't] just provide solutions to well-posed problems; they form the language that lets you cleanly express the underlying questions.' So more than as a tool for problem solving, they allow elegant design of algs and then in turn programs.
Interest: 8/10 Gale and Shapley, two mathematicians/economists, wanted to see if any process involving an applicant and acceptor-or-rejector could be automated based on the preference ordering of both parties. It's tricky though: if self-interest allowed the parties to drop their current partner, chaos would ensue. This alg is in use today in hospitals. -Make it the 'bare-bones version', where we can assume things to go right (like for men and women, there are an even number of them); the goal is to have no instabilities, where E and A, or M and W, won't have motivation to ditch the other.
Pseudocode alg: All m w are free (with w having 2 arrays) While there is a free man who hasn't proposed to every woman (choose m choose highest w m hasn't proposed to yet if w is free, engage both else if w prefers someone else m is free, or if if w prefers m to current man they become engaged and her current m is free) return to set s of engaged pairs
Runtime can be n^2 if we set the w up with two arrays that are inverses of each other for the men they prefer. Otherwise it is n^3 (as shown in class). pg 7-12 shows that the solution is logically sound. Also see class notes for proofs (Professor, I'm not sure if saying things like the last two sentences are ok. I'm hoping they are for this one, but if not, then in the second journal entry I will type them out here)
Interest: 7/10
It's about being efficient, not just solving something. The way they're gonna measure efficiency is by run time, and then in terms of memory space. Efficiency: 'an alg is eff if, when implemented, it runs quickly on real input instances'; but more than that general definition, good algs need to be platform-independent, instance-ind, and of predictive values with respect to increasing input sizes. -In general, we start with looking at worst-case rt. The problem with average-case analysis is that 'random inputs' can not capture real, 'average' inputs. Second def of efficiency: …if it achieves qualitatively better worst-case performance, at an analytical level, than brute-force search. -Polynomial running times are characteristic of eff algs (definition 3 of eff). Pg 34 has a table from O(n) best, which is great rt, to n!, which is onion awful.
In general: n > nlogn > n^2 > n^3 > 1.5^n > 2^n > n! (pg 34)
Questions I have: just the notion of how to determine the O of an alg, because, say we need to do an operation to the xth object of some collection, sometimes we count getting to x as part of the runtime, while other times we just count O to be whatever we do once we're there already. How do I know when to count what?
Interest: 6/10 Analyzing algs is done by analyzing how fast the rt grows with each additional unit of input. The function f(n) gives a line repping the worst-case run time of some alg of input size n. 'Granularity' in order to be able to generalize improvements/solutions to inefficient algs. Pseudo-code is enough, no need to microscope an alg. O(n) reps asymptotic upper bounds, Ω(n) reps lower, and θ(n) reps the tight bounds (which is found if the upper and lower bounds are the same or very close to being the same). Asymp growth rate algs have transitivity and sum properties. -Tips on determining asymp bounds. Polynomial: ignore everything but the highest order power. Log: for b>1 and x>0, logbn = O(n^x). Exponentials are terrible and cannot be bounded by a fixed constant c like the previous two can.