These are some of the algorithms and data structures portions of old MSCS comprehensive exams. See the file "read-me.txt" for more information. Data Structures and Algorithms [F89] Each problem is worth 25 points. Your score for this section will be the sum of the scores of your best four answers. 1. (a) Construct a 2-3 tree (that is, a B-tree of order 3) with two levels containing the maximum number of keys. (b) Insert a key into the B-tree of (a) that is larger than any key already in the tree. (c) If you delete the key inserted in (b) from the tree constructed in (b), do you get the tree of (a)? (d) Delete the two smallest keys from the tree of (a). 2. Use the Huffman algorithm to find a minimum-length encoding for a set of eight characters whose probabilities of occurrence (measured in percent) are 2, 3, 6, 8, 15, 20, 21, and 25. Express each of the eight codes as an explicit sequence of bits. 3. What are the disadvantages of best-fit compared to first-fit as a strategy for searching a free space list of blocks of various sizes? How can these disadvantages be reduced by combining the two strategies? 4. What's wrong with the following attempt at analyzing successful search in a hash table using chaining? Let a = n/b, where n is the number of keys in the table and b is the number of buckets. The average chain contains a items, so the average time to inspect the average chain is (1 + a)/2. 5. (a) Show that for an input sequence whose length n is a power of 2, if T(n) is the number of key comparisons required by mergesort, then T(n) <= en 4- T(n/2) and T(l)=0. (b) Show that the recurrence for T given above implies that T(n) <= en log n, where the log is taken to the base 2. (c) Show that T(n) is 0(n log n) even when n is not a power of 2. Comprehensive Exam Data Structures Spring 89 1. identify the tree, forward, back, and cross arcs of G (assume that nodes are alphabetically ordered, and that the arcs of G have the labels given in the adjacency matrix below): a b c d e f ---------------- a: - j h l - - b: i - - - k - c: - - - q - m d: - - - - n o e: - - - - - p f: - - - - - - 2. Compute the worst case running time of the following program: function foo(n: integer): integer; begin if n = 0 then foo := i else foo := foo(n / 2) + 2 end; 3. What 1s the "bunching up" problem associated with linear rehashing and what can be done to avoid it? 4. Recall that Kruskal's algorithm empioys the greedy method for finding the cheapest spanning tree of a weighted graph G, Modify this algorithm to fine a heuristic solution to the Traveling Salesman Problem. Why is this algorithm oniy a heuristic solution? 5. Draw a 2-3 tree representing the set {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] Show the steps involved in deleting 7 from this structure. PART I: Data Structures [S83] Do 3 out of 5. 1. Define garbage collection and compaction and describe methods for their implementation. What are the advantages and disadvantages of garbage collection compared to the use of reference counts? 2. Describe as many methods as you can of representing the data structure "set". In each case, discuss hew the union operation could be efficiently performed. 3. Discuss the application of hashing to file organization. Include a discussion of how buckets could be organized and of collision resolution policy. 4. Briefly describe the following sorting methods: Quicksort, Bubblesort, Heapsort. Conpare these three methods. What are the advantages and disadvantages of each? 5. What are threaded trees? Hew do they differ from other trees? Give an algorithm for inorder traversal using threaded trees. MSCS Comprehensive Exam Data Structures and Algorithms Each problem is worth 25 points. Your total score will be the sum of the scores of your four best answers. 1. For a B-tree of order m with n keys, (a) What is the worst-case height? (b) What is the worst-case time for using binary search to find a key within a node? (c) Using (a) and (b), what is the worst-case time to search for a key in a B-tree? How does this compare with other balanced search trees? Why might B-trees be superior? Asymptotic analysis (using 0-notation) is sufficient for this problem. 2. Consider an abstract data type with operations INSERT, FIND-MEDIAN, and DELETE-MEDIAN. Show how such a data type can be implemented in terms of two heaps, one a max-heap and the other a min-heap. Assume that the median of a set of size 2m is the mth largest element. 3. Trace the operation of Dijkstra's algorithm in finding the shortest path from vertex 1 to each other vertex in the graph with vertex set {1,2,3,4,5} and cost matrix - 3 1 9 7 9 - 5 5 6 8 4 - 8 4 5 1 4 - 2 6 3 2 4 - 4. Let A(h) be the maximum number of keys in an AVL tree of height h. (a) Give an lower bound for A(h) in terms of A(h-l) and A(h-2). (b) Show formality (i.e., by using recurrence relations, generating functions, or induction) that A(h) is 0mega(b**h), (i.e., that b**h is 0(A(h))) for some b>l. (c) How does the result of (b) show that AVL trees have logarithmic height? 5. In the cases of (a) chaining and (b) linear probing, describe briefly how you would implement the operation of deletion from a hash table. Also describe any changes you need to make in the insertion and search operations as a result. Data Structures and Algorithms Each problem is worth 25 points. Your score for this section will be the sum of the scores of your best four answers. 1. (a) Construct a 2-3 tree (that is, a B-tree of order 3) with two levels containing the maximum number of keys. (b) Insert a key into the B-tree of (a) that is larger than any key already in the tree. (c) If you delete the key inserted in (b) from the tree constructed in (b), do you get the tree of (a)? (d) Delete the two smallest keys from the tree of (a). 2. Use the Huffman algorithm to find a minimum-length encoding for a set of eight characters whose probabilities of occurrence (measured in percent) are 2, 3, 6, 8, 15, 20, 21, and 25. Express each of the eight codes as an explicit sequence of bits. 3. What are the disadvantages of best-fit compared to first-fit as a strategy for searching a free space list of blocks of various sizes? How can these disadvantages be reduced by combining the two strategies? 4. What's wrong with the following attempt at analyzing successful search in a hash table using chaining? Let a = n/b, where n is the number of keys in the table and b is the number of buckets. The average chain contains a items, so the average time to inspect the average chain is (1 + a)/2. 5. (a) Show that for an input sequence whose length n is a power of 2, if T(n) is the number of key comparisons required by mergesort, then T(n) <= cn + T(n/2) and T(l)=0. (b) Show that the recurrence for T given above implies that T(n) <= cn log n, where the log is taken to the base 2. (c) Show that T(n) is 0(n log n) even when n is not a power of 2. 1. Suppose that a binary search tree has 100 nodes and an internal date path of length 1000. What is its external path length? What would the expected successful search time in the tree be? What would the unsuccessful search time in the tree be? What assumptions are you making in these last two computations? 2. Sketch an algorithm for finding the successor in inorder of a node in a binary search tree. You may assume that the node has a right child. Is it possible to find the successor of such a node given only a pointer to the node? Show that the successor node cannot have two children. One strategy for deletion in a binary search tree involves replacing a node by its inorder successor and then deleting the successor. Why is the assumption that the node whose successor is to be found has two children relevant here? 3. Find the mistake in the following argument and correct it: In hashing with chaining (the use of linked lists to represent buckets), the average number of key comparisons needed to search a chain of length s is (1 + s)/2. If there are n keys and b buckets, then the average chain length is just the load factor a=n/b, so the average-case successful search time measured in number of key comparisons is (1 + a)/2. 4. Suppose that a directed graph G is represented by an adjacency matrix. Describe an algorithm for determining, for all vertices v and w in G, whether there is a (directed) path from v to w. What is the time complexity of your algorithm? Justify your analysis. Give an example. 5. Recall that in Floyd's algorithm for the all-pairs shortest paths problem, the matrix A[k,*,*] was computed via A[k,i,j] = min {A[k-1,i,j], A[k-1,i,k]+A[k-1,k,j]}. Why is it possible to use only one array to represent all the matrices A[k,*,*] for different k? Comprehensive Examination - Analysis of Algorithms Page 1 of 5 1) Show that the GRAPH ISOMORPHISM problem is in the class NP. Recall that an instance of the problem is a pair of graphs; the question is whether there is a 1-1 correspondence between the vertex sets that preserves incidence. 2) Show that the PARTITION problem is polynomially reducible to the ZERO-ONE KNAPSACK problem. Recall that an instance of the knapsack (decision) problem consists of a capacity M, a bound B, a sequence (w[i]) of weights and a sequence (p[i]) of profits. The question is whether there is a sequence (x[i]) of elements such that each x[i] is in {0,l}, the sum of the products x[i]w[i] is at most M, and the sum of the products x[i]p[i] is at least B. An instance of the partition problem is a set S of n integers; the question is whether there is a subset T of S such that the sum of the integers in T equals the sum of the integers in S-T. 3) (a) (15 points) Sketch a parallel algorithm for an n-processor system that finds in 0(log n) time the maximum of n integers, where each integer is stored in a separate processor. What assumptions are you making about the architecture of the system? (b) (5 points) Does your answer for (a) have time complexity 0(log n) even if the number N of processors is greater than the number n of integers? If not, sketch an algorithm that does. 4) Consider the problem of formatting a paragraph of text, in full justification. Let us make a number of simplifying assumptions, (i) We use a font in which all characters (and spaces) have the same size. (ii) The last line of the paragraph shall stretch all the way to the right margin, just like all other lines in the paragraph. Let M be the maximum number of characters and spaces in a line. We are given a paragraph of text containing the words w[1], ..., w[n] of length l[1], ..., l[n] that need to be formatted in lines of length M, separated by spaces and with additional spaces added for full justification. We assume l[i] <= M for every i. Experience shows that large gaps between words are perceived as much uglier than small gaps. We define the badness of a line containing the words w[1], ..., w[j] to be b(i,j) = (M - l[1] - ... - l[j] - j + i)**2 if l[1] + ... + l[j] + j - i <= M and infinity otherwise i.e. the square of the number of excess spaces in the line. (Note that j-i spaces are necessary to separate the words, and that the badness is infinity if the words do not fit on the line.) The badness of the paragraph is the sum of the badnesses of its lines. The optimization problem is to find line breaks that yield a paragraph with minimal badness. The following example shows that the first fit algorithm "cram as many words as possible onto the first line, then move on to the next" does not yield an optimal solution. Example. Let M = 25. The text is: Approximate solutions of the determinants are computable efficiently. Here is the first fit solution (^ = space, lines are followed by their badness values) Approximatea^solutions^of 1 the^^^^determinants^^^are 25 computable^^^efficiently. 4 Moving the "of" to the next line reduces the total badness. Approximate^^^^^solutions 16 of^^the^^determinants^are 4 computable^^^efficiently. 4 (a) Give a dynamic programming algorithm to find an optimal solution. Describe how your algorithm produces the set of line breaks. [10 points] (b) Apply your algorithm to the example above. [5 points] (c) Estimate the time and space complexity of your algorithm. [5 points] 5) In this problem, you may assume the known fact that 0(b) operations are required to add or subtract two numbers with b bits, and 0(b**2) operations are required to multiply them as well as to compute their integer quotient and remainder. Consider the following version of Euclid's algorithm to compute the gcd of two b-bit numbers: gcd( x, y ) == gcd( x, -y ) if y < 0 |x| if y = 0 gcd( y, x mod y) if 0 < y <= x/2 gcd( x, x-y) if x/2 < y <= x gcd( y, x ) if y > x (a) Show that this algorithm terminates for arbitrary (positive or negative) inputs x, y. [10 points] (b) Estimate the running time for this algorithm for b-bit inputs x, y in terms of b. Hint: Show first that after a bounded number of recursive calls the algorithm computes gcd( x, y ) == gcd( z, w ) with z < x/2. [10 points] PART III. DATA STRUCTURES 1. a) Construct a B-tree of order 2 from the following set of keys: 1 24 61 53 24 19 17 14 16 43 b) Delete the root. c) What is the expected height of a B-tree of order m of N nodes? What is the best case height? The worst case height? d) Describe an application of B-trees. Why would you use a B tree in your chosen application instead of a binary tree? 2. Sorting Methods: Choose two sorting methods from heapsort, select sort, and quick sort. (Some texts refer to quicksort as "partition sort.") a) Write pseudo-Pascal (or pseudo Algol or pseudo-Lisp) programs to sort N keys using your chosen methods. b) Compute the total cost of sorting N keys using your methods. Please do this for both the "average" case and the worst case. 3. a) List and describe three uses of stacks in computers. b) How can space be utilized efficiently in a memory in which two stacks are existing simultaneously? c) How can space be allocated efficiently for a memory when N stacks are used simultaneously? d) What is the cost of reorganizing memory when one of the N stacks in part (c) overflows? e) How often do you expect an overflow? 4. a) What is an optimal search tree? b) Construct an optimal search tree for the following keys with the given frequency distribution: A = .4 B = .2 C = .3 D = .1 c) Prove that your answer to (b) is correct. 5. a) Define: binary tree traversal of a binary tree preorder, inorder, postorder, traversal b) Give the sequences of keys in a preorder, inorder and postorder traversal of A / \ B C / \ / \ D E F G c) Write a recursive program to perform an inorder traversal of a binary tree. d) Give the parse tree for the algebraic expression (a+b)*(c+(d*c)). e) Change the expression in d) to an equivalent expression in prefix notation. Fall 1992 Comprehensive Exam-Data Structures and Algorithms (1) In this problem, we consider Huffman codes. Suppose six letters can occur with the following frequencies: letter frequency e .45 d .16 l .13 o .12 h .09 w .05 (a) Use Huffman's algorithm to construct a code tree. (b) Encode the word hello". (2) Consider a hash table for closed hashing with 17 elements. We hash nonnegative integer values, using as hash function h(x) = (sum of the digits of x) mod 17. For example, h( 795 ) = (7+9+5) mod 17 = 4. (a) Perform the following operations, using linear probing to resolve collisions. Show the table after each operation, using -1 for "empty" and -2 for "deleted" locations. Insert 725 Insert 536 Insert 347 Remove 536 Insert 158 (b) Explain the phenomenon of "clustering" or "bunching up", using the following sequence of insertions: Insert 725 Insert 536 Insert 277 Insert 347 Insert 655 Insert 158 (c) Show that bunching up disappears in this example when the quadratic probing function is used: h(i,x) = h(x) +- i**2 That is, if h(x) is occupied, try h(1,x), then h(2,x) etc. (3) In this problem, we consider Hoare's quicksort algorithm. Always use the first element as the "pivot" to partition a subinterval. Intervals of length two can be sorted manually. (a) Show how the quicksort algorithm sorts the sequence of numbers 5 3 2 6 4 1 3 7 Explain your steps (i.e. indicate partitioning and recursive calls). (b) What is the worst case runtime performance of quicksort? For an arbitrary n, indicate a sequence of n numbers that makes quicksort run in worst time. (4) Floyd's "all-pairs shortest path" algorithm accepts graphs with negative edge weights as long as there is no negative weight cycle. (a) Perform Floyd's algorithm on the directed graph with incidence matrix 0 * 4 * -2 -1 0 * * * * 5 0 6 * * * * 0 7 * 3 * * 0 where * represents an infinite weight (b) Why can the algorithm not work if the graph has a negative weight cycle? (c) Suppose a negative weight cycle is present. What effect does that have on the c(i,j,k) matrices in Floyd's algorithm? (If you don't know the answer, you can get a hint by replacing the entry 3 in the above matrix with -3 and running a few steps of the algorithm). (5) Consider a complete undirected graph G = {v[1], v[2], ..., v[n]}, in which each edge {v[i], v[j]} has a positive weight w[i,j]. The "Travelling Salesman Problem" (TSP) is to find a cycle C that includes all vertices exactly once with minimal total weight w(C), defined as the sum over e in C of w(e). In general, the TSP is a computationally hard problem. G is said to have the "triangle property" if for any triangle of vertices v[i], v[j], v[k], w[i,k] <= w[i,j] + w[j,k] (i.e. the cost of the direct way from v[i] to v[k] is no longer than the cost of the detour through v[j]). For these graphs, a fast algorithm exists that yields an approximate solution to the TSP. This is the algorithm: (i) Find a minimum weight spanning tree of G. (ii) Turn the tree into a cycle by "walking around it". This cycle traverses each vertex twice. (iii) Remove duplicate vertices from that cycle by exploiting the triangle prperty. (That is, rather than walking back to a previously encountered vertex, go to the next unused one in preorder traversal. The resulting cycle without duplicate vertices is the approximate TSP solution. Problems: (a) Perform this algorithm with the graph with incidence matrix below. 0 2 4 3 1 2 0 4 5 3 4 4 0 2 4 3 5 2 0 4 1 3 4 4 0 (b) Show that, while this algorithm does not necessarily return an optimum solution of the TSP, that the cost w(C) of the computed cycle C is no more than twice the cost of an optimum cycle. (c) What is the time complexity of this algorithm, in terms of the number of vertices n? COMPREHENSIVE EXAMINATION on ANALYSIS OF ALGORITHMS October 3, 1992 1) (a) Show that the problem of determining whether an integer is composite is in NP. (b) What's wrong with the following argument, purporting to show that determining whether an integer is composite is in P? Let n be an integer. For each integer k less than n, divide k into n. Return "yes" if some k divides n evenly, otherwise return "no". Since division is a polynomial-time algorithm and there are only 0(n) divisors to test, the overall algorithm has polynomial-time complexity. 2) Consider the graph with adjacency matrix below. This graph satisfies the triangle inequality. 0 2 4 3 1 2 0 4 5 3 4 4 0 2 4 3 5 2 0 4 1 3 4 4 0 a) Sketch the graph. b) Find a minimum-cost spanning tree for the graph. c) Show that there is an algorithm that for any graph satisfying the triangle inequality can construct an approximate solution to the traveling salesperson problem (TSP) whose cost is no more than twice the optimal cost. d) Show that the algorithm of (c) is in the class P. e) Apply the algorithm of (c) to the graph of (a) and (b). 3) (a) What lower bound does the decision tree model give for the worst-case behavior of sorting algorithms? Justify your answer. (b) The radix sort algorithm can sort numbers of bounded size in time Theta(n). Why does this not violate the lower bound of (a)? 4) (a) Show the result of inserting each element in the sequence (8 5 4 9 1 7 6 3 2) into an initially empty heap. Assume that the largest key is to be in the root. (b) Show the result of applying the heapify algorithm to the sequence of (a). (c) What are the time complexities of the algorithms used in (a) and (b)? (d) Give informal justifications of your answers to (c). 5) (a) Fill in the blanks in the B-tree deletion algorithm below. Note that one of the three cases must apply. Here a poor node is one with the minimal number of keys; all other nodes are rich. You may assume that the key to be deleted appears at the lowest level, and that each key appears only once in the B-tree. If the node N to be deleted from is rich, then ... Else if N has an adjacent sibling that is rich, then ... Else if all adjacent siblings of N are poor, then ... (b) Use your algorithm to delete first 52, then 47, and finally 33 from the B-tree of order 3 below. (c) Show that in each of the three cases the tree obtained by applying the deletion algorithm you gave in (a) satisfies the B-tree properties that (i) every node (except at the lowest level) has one more child than key (ii) every node has at most n-1 keys (iii) every node (except perhaps for the root) has at least n/2 - l keys. (30) / \ (15) (45) / \ / \ (6) (19) (36 40) (50) /\ /\ / | \ /\ (1) (8) (17) (20) (33) (38) (42) (47) (52 60)