We have seen how greedy algorithms use a greedy rule to shrink the size of a problem instance, in order to solve a smaller problem by recursion. The smaller instance leads to an optimal solution for the original instance , but the global consequences do not become fully apparent until the full recursion is complete. The Problem. Encoding schemes that take text written in richer alphabets and convert this text into long strings of bits. It would be simple to use a fixed number of bits for each symbol in the alphabet, and just concatenate the bit strings of each alphabet that make up a text. You can form 2^b different sequences out of b bits, and so if we use 5bits per symbol, we have 32 different combinations of 0 and 1. 4 bits per symbol would not be enough because we have 26 alphabets and other symbols like a period. Its better to to use small number of bits for most frequent letters and larger number of bits for larger ones and hope to use fewer bits than using a fixed-length encoding. Variable-length Encoding Schemes. Morse code was used before the introduction to internet, digital computing , and other advanced comm. mechanisms; when telegraphs were used. Morse code translated each letter into a sequence of dots(short pulses) and dashes(long pulses). The ambiguity of a string of 0 and 1 have different interpretations was solved by using pauses in between characters which increases the number of bits used. Prefix codes were thus introduced to solve this problem. Prefix code basically tries to prevent any characters to be a prefix of any other, a problem which Morse Code brought up. To construct text from an encoding resulting from a prefix code function, we use these following rules:
- Scan the string from left to right
- If you see enough bits to make a character then change
- delete the set from bits from string.
Optimal prefix codes. We can use the average number of bits required per letter, ABL instead of using a fixed-length encoding to use lesser encoding. Designing the Algorithm. It is useful to develop a tree-based means of representing prefix codes that expose their structure more clearly than simply the lists of function values we used in our previous examples. We shall use a binary tree to represent prefix codes, where the number of leaves = number of alphabets, and this can work in the other direction as well. Encodings that start with a 0 are on the right and those with a 1 are on the left. We can have two identical prefix codes in structure, and the difference being the labeling of the leaves. The length of the path for every character equals the length of its encoding. Top-Down Approach. Our goal is to produce a labeled binary tree in which the leaves are as close to the root as possible. We split S into two components or as nearly balanced as possible and construct prefix codes and make these two subtrees of the root. There is no guarantee a top-down strategy will always produce an optimal prefix code. David Huffman managed to compute the optimal code without a brute-force search. The Huffman code for a given alphabet achieves the minimum average number of bits per letter of any prefix code, and this is proven on page 175. But this optimality result does not imply by any means that we have found the best way to compress data under all circumstances.
This section contained a lot of material but was not conceptually difficult. Its scale would be an 8/10. An algorithm to construct an optimal prefix code is on Page 172.