Chapter 4 - Greedy Algorithms

Front Matter

  • Summary: In this prelude to Chapter 4 - Greedy Algorithms, the authors introduce us to the idea of a greedy algorithm: one that builds a solution incrementally by choosing the locally optimal step each time. They say the steps are chosen “myopically” (p 115), meaning that each step optimizes some local criterion without looking ahead to how the choice affects the rest of the solution. A greedy algorithm producing an optimal solution implies that there is a “local decision rule” for arriving at the optimal solution (p 115). Following this information, the chapter's contents are outlined: two methods for proving that a greedy algorithm gives an optimal solution (greedy stays ahead and exchange argument) and applications of greedy algorithms.
  • My Questions: The authors mention applications where a greedy solution is close to optimal. I wonder in what scenarios an almost optimal solution would be satisfactory.
  • Second Time Around: It's hard to say that anything makes more sense the second time around since this wasn't a real section of the chapter.
  • Note to Self: I like the phrasing “local decision rule” for how a greedy algorithm works.
  • Readability: 10 - this was a pseudo section, so there wasn't anything technical to digest.

Section 4.1

  • Summary & Motivations: Section 4.1 is Interval Scheduling: The Greedy Algorithm Stays Ahead. In this section, we revisit one of the five “representative problems” from Chapter 1. In this type of problem, we have a set of n requests, each with a start time and an end time. We're trying to find a compatible subset of maximum size, where compatible means that the requests in the subset do not overlap at all. For this algorithm, we select a request and reject all that are incompatible with it. We select the next, and so on, until there are no requests left to consider. Since this algorithm is greedy, though, we need a greedy rule for selecting the next request. We explore three rules that give suboptimal results: choosing the earliest start time, choosing the shortest interval, and choosing the request with the fewest conflicts. We consider counterexamples for these three approaches, and then proceed to discussing the rule that will work: choosing the earliest finish time. After exploring the algorithm's specific details, the authors provide a few ways to extend the interval scheduling problem: not all requests are known in advance and decisions need to be made in real time, or the requests have values associated with them and we want to maximize the total value of the accepted requests. Finally, we explore a related problem in detail: scheduling all intervals from a set of requests using the fewest resources possible. This is also called the interval partitioning problem, and a prime example would be scheduling all classes into the fewest number of classrooms possible.
  • About the Algorithms: There are two algorithms explored in this section.
    • Interval Scheduling: By choosing the earliest finish time, we “maximize the time left to satisfy other requests” (p 118). We solve this problem by maintaining two sets: R initialized to the set of all requests, and A initialized to empty. A shall be the set of accepted requests. While R is not empty, we add the request with the earliest finish time to A, delete it from R, and delete from R all requests incompatible with that request. Once R is empty, we return set A. By definition of the algorithm, we know that it returns a compatible set. But we want to prove that it returns an optimal set. We do so by showing that the size of A (number of requests contained therein) is equal to the size of some optimal solution. We prove optimality using the greedy stays ahead method, and conclude that “for each r, the rth interval [the algorithm] selects finishes at least as soon as the rth interval in [the optimal solution]” (p 120). This algorithm has a runtime of O(n log n): sorting requests by finish time and labeling them accordingly takes O(n log n) time, constructing an array S for the start times s(i) takes O(n) time, and selecting requests by checking whether the start time of each candidate request is >= the finish time of the last accepted request requires a single pass through the requests with constant work per request - that's O(n) time. Add that all up and the algorithm is O(n log n). (A sketch of this algorithm and the next appears after this list.)
    • Interval Partitioning: First, we define the depth of a set of intervals to be the maximum number of intervals that overlap at any single point in time. No solution can use fewer resources than the depth, so a solution that uses a number of resources exactly equal to the depth is optimal, and we show that our greedy algorithm does exactly that. We sort the intervals by start time, and then go through them in a single pass, assigning each interval to a resource that is not already in use by an overlapping interval. This algorithm also has an O(n log n) runtime because we have to sort the intervals, which is the most costly step.
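  • Code Sketch: A minimal Python sketch of both algorithms in this section, under my own assumptions about input format (lists of (start, finish) pairs); the heap in the partitioning version is just one convenient way to carry out the single pass, not the book's exact pseudocode.

      import heapq

      def interval_scheduling(requests):
          # requests: list of (start, finish) pairs.
          # Sort by finish time, then greedily accept every compatible request.
          accepted = []
          last_finish = float("-inf")
          for start, finish in sorted(requests, key=lambda r: r[1]):
              if start >= last_finish:          # compatible with all accepted requests
                  accepted.append((start, finish))
                  last_finish = finish
          return accepted

      def interval_partitioning(intervals):
          # intervals: list of (start, finish) pairs.
          # Assigns each interval a resource label, using depth-many resources.
          order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])  # by start time
          in_use = []                           # heap of (finish_time, label) for busy resources
          assignment = {}
          next_label = 0
          for i in order:
              start, finish = intervals[i]
              if in_use and in_use[0][0] <= start:
                  _, label = heapq.heappop(in_use)   # reuse a resource that has freed up
              else:
                  label = next_label                 # otherwise open a new resource
                  next_label += 1
              assignment[i] = label
              heapq.heappush(in_use, (finish, label))
          return assignment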
  • My Questions: I wonder why the authors failed to mention the runtime of interval partitioning.
  • Second Time Around: I appreciated that the authors broke down the runtime for the interval scheduling problem into runtimes for specific parts. We did that in class, too, but it's always nice when the textbook is explicit about things like that.
  • Note to Self: There can obviously be many ways to extend or tweak an algorithm, so be on the lookout.
  • Readability: 9 - the section was pretty straightforward, so it was hard to think of any questions.

Section 4.2

  • Summary & Motivations: Section 4.2 is Scheduling to Minimize Lateness: An Exchange Argument. Like the Interval Scheduling problem, this problem involves a single resource and a set of n requests. This time, though, the requests do not have specified start and end times. Rather, they have deadlines and durations (the algorithm must determine the start and end times), and can be completed at any time - but hopefully before their deadline. We want to find an ordering for the requests such that the maximum lateness is minimized. It's important to note that the latenesses do not add up. Instead, we're considering the single request that has the greatest lateness, and we want to minimize that. We design a greedy algorithm for this problem. Two rules that fail are as follows: choose minimum length (order by increasing length) and choose minimum slack time (order by increasing slack time, di - ti).
  • About the Algorithm: The rule that works for this greedy algorithm is to choose minimum deadline (order by increasing deadline). First, we sort the jobs by deadline. We initialize the finish time f to the start time. Then we go through the sorted jobs, assigning each job i to the interval [f, f + ti] and updating f to f + ti (we keep track of the last finish time). We return the set of scheduled intervals. We use the exchange argument to prove optimality. The proof involves the concept of an inversion, “a job i with deadline di is scheduled before another job j with earlier deadline dj < di” (p 128), for the exchanges. Like the algorithms from the previous section, this algorithm is O(n log n) due to the sorting.
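  • Code Sketch: A minimal Python sketch of the earliest-deadline-first rule described above; the input format ((duration, deadline) pairs), the configurable start time, and the returned maximum lateness are my own framing, not the book's pseudocode.

      def minimize_lateness(jobs, start_time=0):
          # jobs: list of (duration t_i, deadline d_i) pairs.
          # Order jobs by increasing deadline and schedule them back to back.
          f = start_time
          schedule = []
          max_lateness = 0
          for t, d in sorted(jobs, key=lambda job: job[1]):
              schedule.append((f, f + t))               # job runs in [f, f + t_i]
              f += t
              max_lateness = max(max_lateness, f - d)   # lateness is max(0, finish - deadline)
          return schedule, max_lateness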
  • My Questions: I still do not quite understand why only adjacent inversions are considered in the proof. Why define them so generally and use them so specifically?
  • Second Time Around: Seeing the algorithm written in a different form helped me understand it better.
  • Note to Self: When using the exchange method, the hardest part of the proof can be showing that the exchanged solution is still optimal (p 129).
  • Readability: 7 - not terrible, but the notation in the proof starting on page 129 is a bit involved.

Section 4.4

  • Summary & Motivations: Section 4.4 is Shortest Paths in a Graph. We have a directed graph with edges of given lengths (or times or costs) and a starting node. This problem concerns finding the shortest path from the starting node to all other nodes in the graph, where the length of a path is the sum of the lengths of all edges in the path. If the graph were an undirected graph, we would replace each undirected edge (u,v) by two directed edges (u,v) and (v,u) of the same length as the original edge. The greedy algorithm to solve this problem was proposed by Dijkstra, so the algorithm is named after him.
  • About the Algorithm: The algorithm keeps track of a set of nodes that have already been explored; the shortest path to each of them has already been determined. We initialize the explored set to contain just the starting node s. Then, while the explored set is not equal to the set of nodes in the graph, we select a node not yet in the explored set, reachable by at least one edge from the explored set, whose path length is minimum. We add that node to the explored set and update the distances to all nodes it points to, if the new value would decrease the distance. We continue until the explored set equals the set of nodes in the graph. We can produce the actual paths by keeping track of the edge (u,v) that gave node v its minimum value and backtracking from v to s. We use greedy stays ahead to prove the algorithm's optimality, and this algorithm has an O(m log n) runtime if we are clever about data structure usage (heap-based priority queue).
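  • Code Sketch: A minimal Python sketch of Dijkstra's Algorithm; the graph format is my own assumption, and instead of a true ChangeKey operation it re-inserts nodes into the heap and skips stale entries, which is a common shortcut rather than the book's exact priority queue usage.

      import heapq

      def dijkstra(graph, s):
          # graph: dict mapping node -> list of (neighbor, nonnegative edge length).
          # Returns a dict of shortest-path distances from s to every reachable node.
          dist = {s: 0}
          explored = set()
          heap = [(0, s)]
          while heap:
              d, u = heapq.heappop(heap)
              if u in explored:
                  continue                      # stale entry; u's distance was already finalized
              explored.add(u)
              for v, length in graph.get(u, []):
                  if v not in explored and d + length < dist.get(v, float("inf")):
                      dist[v] = d + length
                      heapq.heappush(heap, (dist[v], v))   # lazy "ChangeKey" by re-inserting
          return dist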
  • My Questions: Why, Kleinberg and Tardos, did you not explain the notation for the minimum path calculation (first seen at the top of page 138)? That notation is definitely not clear, so I'm glad I also saw this in lecture. After revisiting the lecture slides, I know it means we pick the node with the minimum path distance to add to the explored set and update all path distances, if they're smaller than before, for the nodes it points to. But that isn't super obvious just from the reading.
  • Second Time Around: The proof of (4.14), that the paths found are indeed shortest paths, made a bit more sense the second time around. With both the sentences reasoning through the proof and the inequalities, it's clear that the paths are shortest paths because there is no possible way to get to the node some other way with a shorter path, thanks to the fact that all edge lengths are nonnegative.
  • Note to Self: Remember that the data structures you choose can positively or negatively impact an algorithm's runtime. In this case, the runtime went from O(mn) to O(m log n) when a heap-based priority queue was used to keep track of the node with minimum distance.
  • Readability: 7 - see my question for why the score is a little low.

Section 4.5

  • Summary & motivations: Section 4.5 is The Minimum Spanning Tree Problem. The authors begin by telling us that this section uses an exchange argument to address algorithms for solving the “second fundamental problem on graphs: minimum spanning tree” (p142). The problem: given a graph G = (V,E) with edge costs, find a subset of E that keeps the graph connected and has the lowest total cost. An example of where this problem arises is communication networks - the cable or phone company wants to connect all the houses, but using the least amount of wire. We learn that the minimum cost solution is necessarily a tree (4.16), called a minimum spanning tree (MST). If the solution were to contain a cycle, that would mean that two nodes u,v are connected in more than one way, so one of the ways can be deleted. Deleting one of the ways lowers the cost but leaves all the nodes still connected. As long as the graph has more edges than a tree does, it can have a huge (even exponential) number of spanning trees, so we need an algorithm to find a minimum spanning tree rather than checking them all. The section closes with examples of ways to extend the MST problem, but they are all much more computationally complicated than the version we study here.
  • About the Algorithms: We actually explore three greedy algorithms that correctly find a minimum spanning tree. To prove optimality of the algorithms, we need to have rules that tell us when an edge should or should not be in the MST. The Cut Property tells when an edge should be in the MST, and the proof for the Cut Property utilizes an exchange argument. Kruskal's and Prim's Algorithms use the Cut Property for their optimality proofs. The Cycle Property tells when an edge should not be in the MST, and the proof for the Cycle Property utilizes an exchange argument as well. The Reverse-Delete Algorithm uses the Cycle Property for its optimality proof.
    • The first algorithm is Kruskal's Algorithm. You sort the edges by increasing cost and start with an empty tree. You add edges in increasing cost order so long as adding them does not create a cycle. If the addition of an edge would create a cycle, it's discarded/skipped. Once you go through all the edges you have the MST. This algorithm has a runtime of O(m log n) when using the right data structures. Implementation is left until Section 4.6.
    • The next algorithm is Prim's Algorithm. This one is similar to Dijkstra's Algorithm. You start at a node and repeatedly add the edge that attaches a new node to the partial tree as cheaply as possible. This algorithm, like Kruskal's Algorithm, has a runtime of O(m log n) when using the right data structures. Like Dijkstra's Algorithm, we use a heap-based priority queue to maintain the attachment costs. We have n ExtractMin operations and m ChangeKey operations, so the algorithm is O(m log n). (A sketch of Prim's Algorithm appears after this list.)
    • The last algorithm is the Reverse-Delete Algorithm. This one can be thought of as the reverse of Kruskal's Algorithm. You start with G, and delete edges in order of decreasing cost (so you start with the most expensive edge) as long as deleting the edge does not cause the graph to become disconnected. The runtime and implementation for the Reverse-Delete Algorithm are not discussed in the text because they are “difficult” (p 150).
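  • Code Sketch: A minimal Python sketch of Prim's Algorithm using a heap of candidate edges rather than per-node attachment costs with ChangeKey, so it's a simplification of what the book describes; the graph format is my own assumption. Kruskal's Algorithm is sketched in my Section 4.6 notes, where its Union-Find implementation is covered.

      import heapq

      def prim_mst(graph, root):
          # graph: dict mapping node -> list of (neighbor, edge cost); assumed connected, undirected.
          # Returns a set of (u, v, cost) edges forming a minimum spanning tree.
          in_tree = {root}
          tree_edges = set()
          heap = [(cost, root, v) for v, cost in graph.get(root, [])]
          heapq.heapify(heap)
          while heap:
              cost, u, v = heapq.heappop(heap)
              if v in in_tree:
                  continue                       # both endpoints already attached; skip this edge
              in_tree.add(v)
              tree_edges.add((u, v, cost))       # cheapest edge crossing the cut, per the Cut Property
              for w, c in graph.get(v, []):
                  if w not in in_tree:
                      heapq.heappush(heap, (c, v, w))
          return tree_edges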
  • My Questions: Well now I want to know how the Reverse-Delete Algorithm is implemented, since the text avoided discussing it.
  • Second Time Around: I think the Cut Property makes more sense the second time around. I was able to sit there and read page 145 a few times while flipping back and forth to Figure 4.10 until I followed all the variables at play. Same goes for the Cycle Property.
  • Note to Self: “Any algorithm that builds a spanning tree by repeatedly including edges when justified by the Cut Property and deleting edges when justified by the Cycle Property - in any order at all - will end up with a minimum spanning tree” (p 149). That's pretty powerful. Might come in handy for the assignment or a test.
  • Readability: 7 - I had to flip back and forth between text and figures a good bit, but I was able to follow everything with enough review.

Section 4.6

  • Summary & motivations: Section 4.6 is Implementing Kruskal's Algorithm: The Union-Find Data Structure. Here we explore the implementation that we left off in the previous section. It gets its own section, likely because it involves creating a whole new problem-specific data structure. This new data structure is called the Union-Find data structure. It allows us to quickly determine if the connected components of two nodes are the same or not and quickly merge/combine the connected components if they are different. The Union-Find data structure allows us to achieve the O(m log n) runtime for Kruskal's Algorithm we mentioned in the previous section. It's worth noting that the Union-Find data structure only works when adding edges, and not when deleting edges.
  • About the Algorithm: This is really “About the Data Structure,” but I digress. The Union-Find Data Structure has three operations: MakeUnionFind(S), Find(u), and Union(A,B). MakeUnionFind(S) returns a Union-Find Data Structure on S, meaning that every node starts in its own connected component. The goal runtime for MakeUnionFind(S) is O(n), where n is the number of nodes. Find(u) returns the name of the connected component containing node u. The goal runtime for Find(u) is O(log n), but some implementations achieve O(1). Union(A,B) merges the connected components with names A and B into one, updating the data structure in the process. The goal runtime for Union(A,B) is O(log n). We implement Kruskal's Algorithm by first calling MakeUnionFind(V) on V, the set of nodes in the graph, and then considering edges (u,v) in order of increasing cost: whenever Find(u) != Find(v), we add the edge to the tree and call Union(Find(u), Find(v)). We can use a “simple” Union-Find data structure, or we can use a “better” Union-Find data structure:
    • In the “simple” version, we keep an array Component of size n to track the name of the connected component for each node. We set Component[s] = s for all nodes s in MakeUnionFind, with O(n) runtime, as desired. Because Component is an array, Find(v) is just Component[v], with O(1) runtime, which is better than desired. We have to optimize to get a decent Union(A,B) runtime. We keep track of the list of nodes in each connected component explicitly and keep track of connected component sizes in an array Size of length n. By keeping track of the nodes explicitly, we don't have to iterate through the whole array to find nodes that get changed in a Union operation. By maintaining information about connected component size, we can relabel the smaller of the two connected components A and B in a Union, transferring over the minimum possible number of nodes. We show that any sequence of k Union operations takes at most O(k log k) time.
    • In the “better” version, we use pointers. For MakeUnionFind, each node points to itself (or the pointer is null), which is O(n) as desired. The name of a connected component is just a node in that connected component. For Union(A,B), we must update the name of either A or B. Say the name of A is given by node v and the name of B is given by node u. We can merge the connected components by updating u's pointer to v. This implementation, though, causes Find(u) to run more slowly than in the “simple” version: now we must trace through pointers to get the name of a node's connected component, since Union operations add pointers to follow. We show that this takes O(log n) time if we always keep the larger connected component's name when doing a Union. We maintain the connected component size information as an extra field on the data structure's records for the nodes (not on the graph nodes themselves). (A sketch of this version, with Kruskal's Algorithm on top of it, appears after this list.)
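  • Code Sketch: A minimal Python sketch of the pointer-based Union-Find with union by size (no path compression), plus Kruskal's Algorithm on top of it; class and variable names are mine, not the book's.

      class UnionFind:
          # Pointer-style Union-Find, merging by size so Find stays O(log n).
          def __init__(self, elements):
              self.parent = {v: v for v in elements}   # MakeUnionFind: each element is its own component
              self.size = {v: 1 for v in elements}

          def find(self, u):
              while self.parent[u] != u:               # follow pointers up to the component's name
                  u = self.parent[u]
              return u

          def union(self, a, b):
              a, b = self.find(a), self.find(b)
              if a == b:
                  return
              if self.size[a] < self.size[b]:          # keep the larger component's name
                  a, b = b, a
              self.parent[b] = a
              self.size[a] += self.size[b]

      def kruskal_mst(nodes, edges):
          # edges: list of (cost, u, v) tuples; returns MST edges for a connected graph.
          uf = UnionFind(nodes)
          tree = []
          for cost, u, v in sorted(edges):
              if uf.find(u) != uf.find(v):             # adding (u, v) creates no cycle
                  uf.union(u, v)
                  tree.append((u, v, cost))
          return tree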
  • My Questions: I'd ask the authors why they describe the desired runtime for Union(A,B) as being O(log n) but then in the “simple” data structure explanation they land on an O(k log k) runtime for k Union operations and in the “better” explanation they land on O(1) runtime. Why would you say the goal runtime is x, but example 1 gives y and example 2 gives z?
  • Second Time Around: Since we kind of breezed through this topic in class, I'd say I understand it better after reading about it more in depth and reading about the implementation.
  • Note to Self: I've seen the compression idea somewhere else before. Could come up again later.
  • Readability: 8 - even though this section was pretty long, I think it was fairly easy to read.

Section 4.7

  • Summary & motivations: Section 4.7 is Clustering. The problem of clustering arises when we want to classify the objects of a collection into “coherent groups” (p 158). To do so, we have to have a way of quantifying how similar the entities are. That can be done with a distance function. Distances don't have to be physical distances, though. Rather, they're some measure of similarity. In the end, entities in the same cluster should be similar and entities in different clusters should be dissimilar. The flavor of clustering that we explore here is Clusterings of Maximum Spacing. Given a set of entities where we can determine the distance between any two entities, we seek to find a k-clustering (divide the entities into k clusters). The spacing of a k-clustering is “the minimum distance between any pair of points lying in different clusters” (p 158). Maximum spacing suggests that the individual k clusters are as far apart from each other as possible. Since there can be exponentially many k-clusterings of a set of entities, is there an efficient algorithm for finding one with maximum spacing?
  • About the Algorithm: The algorithm is quite simple. We just add edges between pairs of points in increasing distance order (skipping pairs already in the same cluster) until we have k clusters. This algorithm is essentially Kruskal's Algorithm stopped “before it adds its last k - 1 edges” (p 159), which is the same as taking the MST produced by Kruskal's Algorithm and deleting the k - 1 most expensive edges. The text does not provide any further elaboration on this algorithm's implementation or runtime, which I assume is because it is so similar to Kruskal's Algorithm. I believe the runtime of this algorithm is O(m log n), where m is the number of point pairs. (A sketch appears below.)
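  • Code Sketch: A minimal Python sketch of the maximum-spacing clustering, reusing the UnionFind class from my Section 4.6 notes; the distance function interface and the returned (labels, spacing) pair are my own framing.

      def max_spacing_clustering(points, dist, k):
          # points: list of hashable items; dist(p, q): distance between two points; k: number of clusters.
          # Kruskal's Algorithm over all pairs in increasing distance order, stopped at k components.
          uf = UnionFind(points)
          pairs = sorted((dist(p, q), p, q)
                         for i, p in enumerate(points) for q in points[i + 1:])
          clusters = len(points)
          spacing = None
          for d, p, q in pairs:
              if uf.find(p) != uf.find(q):
                  if clusters == k:
                      spacing = d                   # first distance that would merge two of the k clusters
                      break
                  uf.union(p, q)
                  clusters -= 1
          labels = {p: uf.find(p) for p in points}  # cluster name for each point
          return labels, spacing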
  • My Questions: Am I correct that the runtime is O(m log n)?
  • Second Time Around: Either we didn't finish the lecture on this in class or I'm experiencing amnesia. Either way, I did not know how to get the k-clustering of maximum spacing before reading this section, so that's a huge chunk of information that makes more sense after reading/that I know after reading.
  • Note to Self: Another example of using algorithms we already know in a new context.
  • Readability: I was ready to give this section a 9, but then I got to the proof for 4.26. I didn't process that fully the first time around, so I'm giving this section a 7 instead.

Section 4.8

  • Summary & Motivations: Section 4.8 is Huffman Codes and Data Compression. We explore encoding symbols of an alphabet using bits. We're converting an alphabet that's readable to humans into sequences of 0's and 1's that are readable to computers. A simple approach would be a fixed-length encoding. Let's say the alphabet we want to encode is 26 characters (a through z, presumably), the space, the comma, the period, the question mark, the exclamation point, and the apostrophe. Since we can form 2^b different sequences of b bits, we can use 5 bits per symbol for 2^5 = 32 symbols. The ASCII code does this, just with a larger alphabet, so there are more bits per symbol. A better approach would be to determine the length of each symbol's encoding based on the symbol's frequency in the language, and hope to use fewer than an average of b bits per symbol (where 2^b = number of symbols). Even low-tech Morse code does this! In Morse code, higher frequency letters use fewer dots/dashes. Morse code, however, actually uses a three-letter alphabet because pauses are required between symbols to avoid ambiguity. Ambiguity would arise because some symbols have encodings that are prefixes of other symbols' encodings. We can solve this problem by disallowing any individual encoding from being a prefix of any other individual encoding. Under this regime, the encodings are called “prefix codes.” Furthermore, if we have a frequency for each letter x, equal to “the fraction of letters in the text that are equal to x” (p 165), we can determine the average number of bits per letter (ABL) of an encoding as the sum over all symbols of each symbol's frequency times the length of its encoding. Our goal is to find an algorithm that minimizes the ABL of an encoding, given the alphabet and a set of frequencies for the alphabet's symbols.
  • About the Algorithm: We employ a binary tree to represent the prefix codes, since that's easier to visualize than a list of function mappings. If we only allow the leaves of this binary tree to represent the symbols in the alphabet, the codes are naturally prefix codes: the path to a leaf never passes through another leaf, so no codeword is a prefix of another. Furthermore, we say that going left equates to a 0 and going right equates to a 1. The binary tree that gives an optimal prefix code is full, because if there were a node with only one child, you could simply replace that node with its one child, thus shortening the code. We use the knowledge that leaves of minimum depth in the tree should correspond to letters of maximum frequency to then conclude that leaves of maximum depth should correspond to letters of minimum frequency. From there, we formulate an algorithm to determine the optimal prefix code. We simply pick the two lowest-frequency letters y* and z* in the alphabet S and combine them into a “meta-letter” with frequency f_{y*} + f_{z*}. Then we “recursively find a prefix code for the smaller alphabet” (p 172), where the alphabet is now one letter smaller. When we get to the point where the alphabet only has two letters, we arbitrarily assign one 0 and the other 1. We can form the prefix code by “unfold[ing] the result back through the recursive calls” (p 173). If we maintain a heap-based priority queue using each letter's frequency as its key, this algorithm is O(n log n), where n is the number of letters in the alphabet. (A sketch appears below.)
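  • Code Sketch: A minimal Python sketch of Huffman's Algorithm using a heap-based priority queue keyed on frequency, plus the ABL calculation from my summary above; representing meta-letters as nested tuples is my own choice rather than the book's recursive tree construction.

      import heapq
      from itertools import count

      def huffman_code(frequencies):
          # frequencies: dict mapping letter -> frequency (fractions or raw counts).
          # Returns a dict mapping letter -> bit string of an optimal prefix code.
          tie = count()                             # tie-breaker so heap entries never compare letters
          heap = [(f, next(tie), letter) for letter, f in frequencies.items()]
          heapq.heapify(heap)
          while len(heap) > 1:
              f1, _, a = heapq.heappop(heap)        # the two lowest-frequency (meta-)letters
              f2, _, b = heapq.heappop(heap)
              heapq.heappush(heap, (f1 + f2, next(tie), (a, b)))   # merge into a meta-letter
          code = {}
          def unfold(node, prefix):                 # walk the tree: left edge = 0, right edge = 1
              if isinstance(node, tuple):
                  unfold(node[0], prefix + "0")
                  unfold(node[1], prefix + "1")
              else:
                  code[node] = prefix or "0"        # degenerate one-letter alphabet
          _, _, root = heap[0]
          unfold(root, "")
          return code

      def average_bits_per_letter(frequencies, code):
          # ABL: sum over letters of frequency * encoding length (normalized if counts were given).
          total = sum(frequencies.values())
          return sum(f * len(code[x]) for x, f in frequencies.items()) / total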
  • My Questions: I find the term “prefix code” confusing. To me, it sounds like a “prefix code” would be the prefix of another code, rather than it strictly not being the prefix of another code.
  • Second Time Around: Something that actually made less sense the second time around: the algorithm itself. I didn't really understand what it meant by separating the “Define a prefix code for S as follows” block from the rest.
  • Note to Self: Approaching a problem top-down vs bottom-up can affect optimality, as in the case of Shannon-Fano codes vs Huffman codes.
  • Readability: 8 - I almost gave this section a 9 or 10, but the algorithm made less sense in the reading than it did in class.