Sorting

  1. Introduction One of the most common applications in computer science is sorting, the process through which data are arranged according to their values.
  2. Sort Classifications Sorts are generally classified as either internal or external sorts. An internal sort is a sort in which all of the data are held in primary memory during the sorting process. An external sort uses primary memory for the data currently being sorted and secondary storage for any data that will not fit in primary memory.
    1. Internal
      1. Insertion Insertion sorting is one of the most common sorting techniques used by card players. As they pick up each card, they insert it into the proper sequence in their hand. The concept extends well into computer sorting. In each pass of an insertion sort, one or more pieces of data are inserted into their correct location in an ordered list. In this section we study two insertion sorts, the straight insertion sort and the Shell sort.
        1. Insertion In the straight insertion sort, the list is divided into two parts: sorted and unsorted. In each pass, the first element of the unsorted sublist is transferred to the sorted sublist by inserting it at the appropriate place. If we have a list of n elements, it will take at most n-1 passes to sort the data. The insertion sort efficiency is O(n^2).
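The straight insertion sort described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation; the function name insertion_sort and the variable name wall (the boundary between the sorted and unsorted sublists) are assumptions introduced here.

```python
def insertion_sort(items):
    """Return a new list sorted in ascending order by straight insertion."""
    result = list(items)              # work on a copy; the caller's list is untouched
    # result[:wall] is the sorted sublist, result[wall:] the unsorted sublist.
    for wall in range(1, len(result)):
        key = result[wall]            # first element of the unsorted sublist
        pos = wall - 1
        # Shift larger sorted elements one slot right to open a gap for key.
        while pos >= 0 and result[pos] > key:
            result[pos + 1] = result[pos]
            pos -= 1
        result[pos + 1] = key
    return result


print(insertion_sort([77, 62, 14, 9, 30, 21, 80, 25, 70, 55]))
```

The outer loop runs n-1 times, one pass per element moved across the wall, and the inner loop may shift up to wall elements, which is where the O(n^2) bound comes from.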
        2. Shell The Shell sort algorithm, named after its creator, Donald L. Shell, is an improved version of the straight insertion sort in which diminishing increments are used to sort the data. Given a list of n elements, the list is divided into k segments, where k is known as the increment. Each segment contains ceiling(n/k) or fewer elements; segment j holds the elements at positions j, j+k, j+2k, and so on. In each pass, every segment is sorted by straight insertion. After each pass through the data, the increment is reduced until, in the final pass, it is one. Typical increment sequences are 5, 2, 1 or 7, 3, 2, 1.

           Example (k = 5, 3, 1)

           k = 5
             Position:          1  2  3  4  5  6  7  8  9 10
             Before the pass:  77 62 14  9 30 21 80 25 70 55
             Sorted segments:  (21 77) (62 80) (14 25) (9 70) (30 55)
             After the pass:   21 62 14  9 30 77 80 25 70 55

           k = 3
             Before the pass:  21 62 14  9 30 77 80 25 70 55
             Each line shows the part of the list processed so far, after the next element has been inserted into its segment:
               21 62 14
                9 62 14 21
                9 30 14 21 62
                9 30 14 21 62 77
                9 30 14 21 62 77 80
                9 25 14 21 30 77 80 62
                9 25 14 21 30 70 80 62 77
                9 25 14 21 30 70 55 62 77 80

           k = 1
             Before the pass:   9 25 14 21 30 70 55 62 77 80
             With an increment of one the pass is a straight insertion sort; each line shows the sorted sublist after one more insertion:
                9
                9 25
                9 14 25
                9 14 21 25
                9 14 21 25 30
                9 14 21 25 30 70
                9 14 21 25 30 55 70
                9 14 21 25 30 55 62 70
                9 14 21 25 30 55 62 70 77
                9 14 21 25 30 55 62 70 77 80

           Selecting the increment size
           First, let's recognize that there is no increment size that is best for all situations. Knuth suggests, however, that you should not start with an increment greater than one-third of the list size. Other computer scientists have suggested that the increments be a power of two minus one, or a Fibonacci series. Knuth tells us that the sort effort for the Shell sort cannot be determined mathematically. From his empirical studies he estimates that the average sort effort is about 15 n^1.25. Reducing Knuth's estimate to Big-O notation, we treat the Shell sort as O(n^1.25); note that this is an empirical figure, not a proven bound.
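A minimal Python sketch of the Shell sort with the increments from the example (5, 3, 1) is given below. The function name shell_sort and the optional trace parameter are assumptions introduced for illustration; the sketch also assumes the last increment is 1, without which the result is not guaranteed to be sorted.

```python
def shell_sort(items, increments=(5, 3, 1), trace=None):
    """Return a sorted copy of items using the given diminishing increments.

    If trace is a list, a (k, snapshot) pair is appended after each pass.
    Assumes the final increment is 1.
    """
    result = list(items)
    for k in increments:
        # One pass: insertion-sort every segment whose elements are k apart.
        for i in range(k, len(result)):
            key = result[i]
            j = i - k
            while j >= 0 and result[j] > key:
                result[j + k] = result[j]   # shift within the segment
                j -= k
            result[j + k] = key
        if trace is not None:
            trace.append((k, list(result)))
    return result


passes = []
shell_sort([77, 62, 14, 9, 30, 21, 80, 25, 70, 55], trace=passes)
for k, snapshot in passes:
    print(k, snapshot)
```

Rather than sorting one segment completely before starting the next, the inner loop visits the elements in list order and inserts each into its own segment, which is how the k = 3 trace in the example proceeds.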
      2. Selection Selection sorts are among the most intuitive of all sorts. Given a list of data to be sorted, we simply select the smallest item and place it in a sorted list. These steps are then repeated until all of the data have been sorted. In this section we study two selection sorts, the straight selection sort and the heap sort.
        1. Selection In the straight selection sort, the list at any moment is divided into two sublists, sorted and unsorted, which are separated by an imaginary wall. In each pass we select the smallest element from the unsorted sublist and exchange it with the element at the beginning of the unsorted data. After each selection and exchange, the wall between the two sublists moves one element, increasing the number of sorted elements and decreasing the number of unsorted ones. Each time we move one element from the unsorted sublist to the sorted sublist, we say that we have completed one sort pass. If we have a list of n elements, therefore, we need n-1 passes to completely rearrange the data. Note: Alternatively, you may select the largest element and move it to the right end of the list; the list is then scanned from right to left. The straight selection sort efficiency is O(n^2).
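The selection and exchange steps can be sketched in Python as follows. This is an illustrative sketch only; selection_sort and wall are names assumed here.

```python
def selection_sort(items):
    """Return a sorted copy of items using the straight selection sort."""
    result = list(items)
    n = len(result)
    # result[:wall] is sorted; each pass moves the wall one element right.
    for wall in range(n - 1):
        smallest = wall
        # Select the smallest element of the unsorted sublist.
        for i in range(wall + 1, n):
            if result[i] < result[smallest]:
                smallest = i
        # Exchange it with the first element of the unsorted sublist.
        result[wall], result[smallest] = result[smallest], result[wall]
    return result


print(selection_sort([77, 62, 14, 9, 30, 21, 80, 25, 70, 55]))
```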
        2. Heap A heap is a tree structure in which the root contains the largest (or smallest) element in the tree. The heap sort algorithm is an improved version of the selection sort in which the largest element (the root) is selected and exchanged with the last element in the unsorted list. Heap sort begins by turning the array to be sorted into a heap. This is done only once for each sort. We then exchange the root, which is the largest element in the heap, with the last element in the unsorted list. This exchange adds the largest element to the front of the sorted list. We then reheap the remaining unsorted elements and exchange again. The reheap and exchange process continues until the entire list is sorted. Following the branches of a binary tree from the root to a leaf requires log2 n loops. The sort effort, the outer loop times the inner loop, for the heap sort is therefore n(log2 n). When we include the processing to create the original heap, the Big-O notation is the same: creating the heap requires at most n log2 n loops through the data, so adding it to the sort effort only changes the coefficient, which is dropped to determine the final sort effort. The heap sort efficiency is O(n log2 n).
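The build, exchange, and reheap cycle can be sketched in Python as follows. The names reheap_down and heap_sort are assumptions for this sketch, and the heap is stored in the array itself with the children of index i at 2i+1 and 2i+2. The heap is built bottom-up here, which is one common choice among several.

```python
def reheap_down(heap, root, last):
    """Sift heap[root] down until heap[root..last] is a max-heap again."""
    while True:
        child = 2 * root + 1                  # left child
        if child > last:
            return                            # root is a leaf
        if child + 1 <= last and heap[child + 1] > heap[child]:
            child += 1                        # pick the larger child
        if heap[root] >= heap[child]:
            return                            # heap property holds
        heap[root], heap[child] = heap[child], heap[root]
        root = child


def heap_sort(items):
    """Return a sorted copy of items using heap sort."""
    result = list(items)
    last = len(result) - 1
    # Build the heap once, from the last parent up to the root.
    for root in range((last - 1) // 2, -1, -1):
        reheap_down(result, root, last)
    # Exchange the root with the last unsorted element, then reheap.
    for end in range(last, 0, -1):
        result[0], result[end] = result[end], result[0]
        reheap_down(result, 0, end - 1)
    return result


print(heap_sort([77, 62, 14, 9, 30, 21, 80, 25, 70, 55]))
```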
      3. Exchange The third category of sorts, exchange sorting, contains the most common sort taught in computer science, the bubble sort, and the most efficient general purpose sort, quick sort.
        1. Bubble In the bubble sort, the list at any moment is divided into two sublists: sorted and unsorted. The smallest (or largest) element is bubbled from the unsorted sublist and moved to the sorted sublist. After moving the smallest (largest) element to the sorted list, the wall moves one element to the right, increasing the number of sorted elements and decreasing the number of unsorted ones. Each time we move one element from the unsorted sublist to the sorted sublist, we say that we have completed one sort pass. If we have a list of n elements, therefore, we need n-1 passes to completely rearrange the data. The bubble sort makes n(n-1)/2 comparisons, so its efficiency is O(n^2).
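The following Python sketch bubbles the smallest element toward the front in each pass and counts the comparisons, so the n(n-1)/2 figure can be checked directly. The function name bubble_sort and the returned comparison count are assumptions made for illustration.

```python
def bubble_sort(items):
    """Return (sorted copy, number of comparisons) using the plain bubble sort."""
    result = list(items)
    n = len(result)
    comparisons = 0
    # result[:wall] is sorted; each pass bubbles the smallest unsorted
    # element leftward until it reaches position wall.
    for wall in range(n - 1):
        for i in range(n - 1, wall, -1):
            comparisons += 1
            if result[i] < result[i - 1]:
                result[i], result[i - 1] = result[i - 1], result[i]
    return result, comparisons


ordered, count = bubble_sort([77, 62, 14, 9, 30, 21, 80, 25, 70, 55])
print(ordered, count)   # for n = 10 the count is 10 * 9 / 2 = 45
```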
        2. Bubble Sort with a Flag This is a variation of the bubble sort that stops as soon as a pass through the list makes no exchange. It gives better performance when the original list is sorted or partially sorted. Its performance ranges from O(N) to O(N^2).
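A minimal sketch of the flagged variation follows. The names bubble_sort_flag and exchanged are assumptions; the function also returns the number of passes made so that the early exit is visible.

```python
def bubble_sort_flag(items):
    """Return (sorted copy, passes made); stops after a pass with no exchange."""
    result = list(items)
    n = len(result)
    passes = 0
    for wall in range(n - 1):
        exchanged = False                 # the flag
        passes += 1
        for i in range(n - 1, wall, -1):
            if result[i] < result[i - 1]:
                result[i], result[i - 1] = result[i - 1], result[i]
                exchanged = True
        if not exchanged:
            break                         # list is already in order
    return result, passes


print(bubble_sort_flag([1, 2, 3, 4, 5]))   # one pass is enough
print(bubble_sort_flag([5, 4, 3, 2, 1]))   # needs all n-1 passes
```

On an already sorted list the single pass costs N-1 comparisons, which is the O(N) end of the range; a reversed list needs every pass, which is the O(N^2) end.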
        3. Quick Quick sort is an exchange sort in which a pivot key is placed in its correct position in the array while other elements, widely dispersed across the list, are rearranged around it. It is a divide-and-conquer sorting method that runs in average time O(n log n). The main idea behind Quicksort is as follows: You first choose some key in the array A as a pivot key. This pivot key is used to separate the keys in A into two partitions: (1) a left partition containing keys less than or equal to the pivot key, and (2) a right partition containing keys greater than or equal to the pivot key. Quicksort is then applied recursively to sort the left and right partitions.

           Strategy for Quicksort

             Procedure Quicksort (var A: SortingArray; m, n: Integer);
             Begin
               If {there is more than one key to sort in A[m..n]} then begin
                 {Partition A[m..n] into a LeftPartition and a RightPartition,
                  using one of the keys in A[m..n] as a pivot key}
                 {Quicksort the LeftPartition}
                 {Quicksort the RightPartition}
               end {if}
             end {Quicksort};

           Pascal code for Quicksort

             Procedure Quicksort (var A: SortingArray; m, n: Integer);
             Var
               i, j: Integer;
             Begin
               If m < n then begin
                 i := m; j := n;
                 Partition(A, i, j);    {assumed to update i and j so that A[m..j] and A[i..n] are the two partitions}
                 Quicksort(A, m, j);    {sort the left partition}
                 Quicksort(A, i, n);    {sort the right partition}
               end {if}
             end {Quicksort};

           Topics to consider:
             Methods of selecting the pivot key
             The worst case for Quicksort
             The best case for Quicksort
             The average case for Quicksort
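The Pascal procedure relies on a Partition routine that these notes do not show. The Python sketch below fills that gap with one possible partition, a Hoare-style scan using the middle key as the pivot. That choice, and the names partition, quicksort, and quick_sort, are assumptions made here, not necessarily the routine the notes had in mind.

```python
def partition(a, i, j):
    """Partition a[i..j] around its middle key.

    Returns the final (i, j), with j < i, such that every key in the
    original range up to j is <= pivot and every key from i on is >= pivot.
    """
    pivot = a[(i + j) // 2]               # assumed pivot choice: middle key
    while i <= j:
        while a[i] < pivot:               # skip keys already on the left side
            i += 1
        while a[j] > pivot:               # skip keys already on the right side
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]       # exchange a misplaced pair
            i += 1
            j -= 1
    return i, j


def quicksort(a, m, n):
    """Sort a[m..n] in place, mirroring the Pascal procedure above."""
    if m < n:
        i, j = partition(a, m, n)
        quicksort(a, m, j)                # left partition
        quicksort(a, i, n)                # right partition


def quick_sort(items):
    """Convenience wrapper: return a sorted copy of items."""
    a = list(items)
    quicksort(a, 0, len(a) - 1)
    return a


print(quick_sort([77, 62, 14, 9, 30, 21, 80, 25, 70, 55]))
```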
        4. Internal Merge Sort MergeSort divides the list into two sublists, sorts each sublist, and combines the sorted sublists by merging them into a single list. The number of levels is the ceiling of log2 N. Merging two sorted sublists of N/2 elements each takes a minimum of N/2 comparisons and a maximum of N-1 comparisons. The maximum number of comparisons to sort the list is therefore at most log N * (N-1). Merge sort runs in time O(N log N).
           Array implementation: extra space and copying.
           List implementation: no extra space, but extra time to divide.
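An array-style merge sort can be sketched in Python as follows. The names merge and merge_sort are assumptions; merge also returns the number of comparisons it made so that the N/2 and N-1 bounds can be observed.

```python
def merge(left, right):
    """Merge two ascending lists; return (merged list, comparisons made)."""
    merged = []
    comparisons = 0
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])               # at most one of these is non-empty
    merged.extend(right[j:])
    return merged, comparisons


def merge_sort(items):
    """Return a sorted copy of items (array version: copies the halves)."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    merged, _ = merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
    return merged


print(merge([1, 2, 3], [4, 5, 6]))   # best case: N/2 = 3 comparisons
print(merge([1, 3, 5], [2, 4, 6]))   # worst case: N-1 = 5 comparisons
print(merge_sort([77, 62, 14, 9, 30, 21, 80, 25, 70, 55]))
```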
        5. RadixSort Radix sort was first devised for use with punched cards. The idea is to consider the key one character at a time and to divide the entries into as many sublists as there are possibilities for the given character from the key. If our keys, for example, are words or other alphabetic strings, then we divide the list into 26 sublists at each stage. That is, we set up a table of 26 lists and distribute the entries into the lists according to one of the characters in the key, starting with the last character and working toward the first. Within each sublist the entries keep the order they had before the pass.

           Example

             Initial list   Sorted by letter 3   Sorted by letter 2   Sorted by letter 1
             rat            mop                  map                  car
             mop            map                  rap                  cat
             cat            top                  car                  cot
             map            rap                  tar                  map
             car            car                  rat                  mop
             top            tar                  cat                  rap
             cot            rat                  mop                  rat
             tar            cat                  top                  tar
             rap            cot                  cot                  top

           RadixSort makes exactly k linear passes through the list of n keys when the keys have k digits (letters), so its effort is proportional to k times n. For a fixed key length k it is an O(n) sorting process.
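The distribution passes in the example can be sketched in Python as follows. The names radix_pass and radix_sort are assumptions, and the sketch assumes every key consists of exactly key_length lowercase letters a to z.

```python
def radix_pass(words, pos):
    """Distribute words into 26 sublists by the letter at index pos, then rejoin."""
    buckets = [[] for _ in range(26)]             # one sublist per letter
    for word in words:
        buckets[ord(word[pos]) - ord('a')].append(word)
    # Concatenating the sublists in order keeps earlier passes' ordering intact.
    return [word for bucket in buckets for word in bucket]


def radix_sort(words, key_length=3):
    """Sort fixed-length lowercase words, last letter first."""
    result = list(words)
    for pos in range(key_length - 1, -1, -1):
        result = radix_pass(result, pos)
    return result


words = ["rat", "mop", "cat", "map", "car", "top", "cot", "tar", "rap"]
print(radix_pass(words, 2))   # the "sorted by letter 3" column
print(radix_sort(words))
```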
        6. Proxmap Sort In proxmap sorting, you compute a "proximity map," or proxmap for short, which indicates, for each key, K, the beginning of a subarray of the array A in which K will reside in final sorted order. Let's proceed by example in order to help reveal the main ideas.

           Example: initial unsorted list

             i      =    1    2    3    4    5    6    7    8    9   10   11   12   13
             A[i]   =  6.7  5.9  8.4  1.2  7.3  3.7 11.5  1.1  4.8  0.4 10.5  6.1  1.8

           Step 1: Map the keys to an index array as follows: Mapkey(K) = ceiling(K). For example, Mapkey(3.7) = 4. If we were to use i := Mapkey(K) to send K into a location, A[i], in an array A where we kept a linked list of keys sorted in ascending order, we could scan through the original list, A, and send its keys into a collection of sorted linked lists, as shown. Each column is the linked list kept at location i.

             i      =    1    2    3    4    5    6    7    8    9   10   11   12   13
                       0.4  1.1       3.7  4.8  5.9  6.1  7.3  8.4      10.5 11.5
                            1.2                      6.7
                            1.8

           Step 2: Compute hit counts, H[i], for each position, i, in A. H[i] starts at zero for every i.

             For i := 1 to 13 do begin
               j := Mapkey(A[i]);
               H[j] := H[j] + 1
             end

             i      =    1    2    3    4    5    6    7    8    9   10   11   12   13
             A[i]   =  6.7  5.9  8.4  1.2  7.3  3.7 11.5  1.1  4.8  0.4 10.5  6.1  1.8
             H[i]   =    1    3    0    1    1    1    2    1    1    0    1    1    0

             Note: H[i] is the length of the linked list at location i in the index array above.

           Step 3: Computing the proxmap. From the hit counts, H[i], we compute a proxmap, P[i], where each entry P[i] gives the location of the beginning of the future reserved subarray of A that will contain the keys, K, mapping to location i under the mapping Mapkey(K) = i.
             {Convert hit counts to a proxmap}
             RunningTotal := 1;
             For i := 1 to 13 do begin
               If H[i] > 0 then begin
                 P[i] := RunningTotal;
                 RunningTotal := RunningTotal + H[i];
               End
             End

             i      =    1    2    3    4    5    6    7    8    9   10   11   12   13
             A[i]   =  6.7  5.9  8.4  1.2  7.3  3.7 11.5  1.1  4.8  0.4 10.5  6.1  1.8
             H[i]   =    1    3    0    1    1    1    2    1    1    0    1    1    0
             P[i]   =    1    2    0    5    6    7    8   10   11    0   12   13    0

           Step 4: Computing insertion locations, L[i], for each key, K = A[i], in array A.

             For i := 1 to 13 do begin
               L[i] := P[MapKey(A[i])];
             End

             i      =    1    2    3    4    5    6    7    8    9   10   11   12   13
             A[i]   =  6.7  5.9  8.4  1.2  7.3  3.7 11.5  1.1  4.8  0.4 10.5  6.1  1.8
             P[i]   =    1    2    0    5    6    7    8   10   11    0   12   13    0
             L[i]   =    8    7   11    2   10    5   13    2    6    1   12    8    2

           The final phase of proxmap sort consists of moving each key, A[i], in the original unsorted array A into the location L[i] at the beginning of its reserved future subarray, and inserting it in ascending order into the sequence of keys already occupying its reserved subarray. If we had two copies of A, say A1 and A2, where A1 was the original unsorted array and A2 was an initially empty copy of A designed to accumulate the keys of A in final sorted order as they were being inserted, then we could map each key, A1[i], into its insertion location, L[i], in A2, and insert it in ascending order into the sequence of keys beginning at L[i] in A2.

             After moving 7 keys into their reserved subarrays:

             i      =    1    2    3    4    5    6    7    8    9   10   11   12   13
             A1[i]  =  6.7  5.9  8.4  1.2  7.3  3.7 11.5  1.1  4.8  0.4 10.5  6.1  1.8
             L[i]   =    8    7   11    2   10    5   13    2    6    1   12    8    2
             A2[i]  =  -.-  1.2  -.-  -.-  3.7  -.-  5.9  6.7  -.-  7.3  8.4  -.- 11.5

             After moving 11 keys into their reserved subarrays:

             i      =    1    2    3    4    5    6    7    8    9   10   11   12   13
             A1[i]  =  6.7  5.9  8.4  1.2  7.3  3.7 11.5  1.1  4.8  0.4 10.5  6.1  1.8
             L[i]   =    8    7   11    2   10    5   13    2    6    1   12    8    2
             A2[i]  =  0.4  1.1  1.2  -.-  3.7  4.8  5.9  6.7  -.-  7.3  8.4 10.5 11.5

           To sort keys that lie between 0 and 1, the mapping must be changed so that the keys map into locations 1 through N.

           Mapping alphanumeric fields: r(K) = Base26Value(K) / (1 + Base26Value('Z..Z')).
             Function Base26Value (K: AirportCodeKey): Integer;
             Var
               n1, n2, n3: Integer;
             Begin
               n1 := (ord(K[1]) - ord('A')) * 26 * 26;   {first letter}
               n2 := (ord(K[2]) - ord('A')) * 26;        {second letter}
               n3 := (ord(K[3]) - ord('A'));             {third letter}
               Base26Value := n1 + n2 + n3;
             end;

             Note: We assume the airport code is exactly three characters long.

           Analysis of ProxmapSort
           In the worst case, ProxmapSort can take time O(n^2), if all the keys map to a single location. Its average is O(n).
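The four steps and the final insertion phase can be put together in a short Python sketch. It uses the two-array formulation (A1 and A2) from the example and the mapping Mapkey(K) = ceiling(K). The names build_proxmap and proxmap_sort are assumptions, and the sketch assumes every key maps to an integer between 1 and n. Index 0 of each array is left unused so that the numbers match the 1-based tables.

```python
import math


def build_proxmap(keys, map_key=math.ceil):
    """Return (hits, proxmap) as 1-based lists; index 0 is unused."""
    n = len(keys)
    hits = [0] * (n + 1)
    for key in keys:                       # Step 2: hit counts
        hits[map_key(key)] += 1
    proxmap = [0] * (n + 1)
    running_total = 1
    for i in range(1, n + 1):              # Step 3: hit counts to proxmap
        if hits[i] > 0:
            proxmap[i] = running_total
            running_total += hits[i]
    return hits, proxmap


def proxmap_sort(keys, map_key=math.ceil):
    """Return a sorted copy of keys; assumes 1 <= map_key(key) <= len(keys)."""
    n = len(keys)
    _, proxmap = build_proxmap(keys, map_key)
    a2 = [None] * (n + 1)                  # initially empty output array
    for key in keys:
        pos = proxmap[map_key(key)]        # Step 4: insertion location L[i]
        # Skip keys of the reserved subarray that belong before this one.
        while a2[pos] is not None and a2[pos] <= key:
            pos += 1
        # Insert key, pushing any larger keys one slot to the right.
        carry = key
        while carry is not None:
            a2[pos], carry = carry, a2[pos]
            pos += 1
    return a2[1:]


a = [6.7, 5.9, 8.4, 1.2, 7.3, 3.7, 11.5, 1.1, 4.8, 0.4, 10.5, 6.1, 1.8]
hits, proxmap = build_proxmap(a)
print(hits[1:])
print(proxmap[1:])
print(proxmap_sort(a))
```

Because each reserved subarray has exactly as many slots as its hit count, the shifting loop never spills into a neighboring subarray.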
      4. External (Merges) All of the algorithms we have studied so far have been internal sorts, that is, sorts that require the data to be entirely stored in primary memory during the sorting process. We now turn our attention to external sorting: sorts that allow portions of the data to be stored in secondary memory during the sorting process.
      5. Merging Ordered Files A merge is the process that, given two files ordered on a given key, combines the files into one ordered file on the same key.
           File 1 (input):  1, 3, 5
           File 2 (input):  2, 4, 6, 8, 10
           File 3 (output): 1, 2, 3, 4, 5, 6, 8, 10
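A merge of two ordered files reads one record at a time from each input, so neither file has to fit in memory. The Python sketch below models the files as iterables; the name merge_ordered is an assumption made for illustration.

```python
def merge_ordered(file1, file2):
    """Yield the records of two ascending iterables as one ascending stream."""
    it1, it2 = iter(file1), iter(file2)
    end = object()                         # marks end of file
    rec1, rec2 = next(it1, end), next(it2, end)
    while rec1 is not end and rec2 is not end:
        if rec1 <= rec2:
            yield rec1
            rec1 = next(it1, end)
        else:
            yield rec2
            rec2 = next(it2, end)
    # One file is exhausted; copy the rest of the other.
    while rec1 is not end:
        yield rec1
        rec1 = next(it1, end)
    while rec2 is not end:
        yield rec2
        rec2 = next(it2, end)


print(list(merge_ordered([1, 3, 5], [2, 4, 6, 8, 10])))
```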
      6. Merging Unordered Files In merge sorting, however, we usually have a different situation than shown above: The input files are not completely sorted. The data will run in a sequence and then there will be a sequence break followed by another series of data in sequence. The series of consecutively ordered data in a file is known as a merge run. Many different merge concepts have been developed over the years. Here are three that are representative:
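To make the idea of a merge run concrete, the following Python sketch splits a sequence of records at each sequence break. The function name merge_runs and the sample data are assumptions made for illustration.

```python
def merge_runs(records):
    """Split records into merge runs: maximal ascending stretches."""
    runs = []
    for rec in records:
        if runs and runs[-1][-1] <= rec:
            runs[-1].append(rec)           # still in sequence: extend the run
        else:
            runs.append([rec])             # sequence break: start a new run
    return runs


print(merge_runs([3, 7, 9, 2, 5, 1, 8]))   # three runs
```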
        1. Natural In the natural merge, each phase merges a constant number of input files into one output file. Between each merge phase, a distribution phase is required to redistribute the merge runs to the input files for remerging.

           Example: input of 2,300 unsorted records

             Merge 1: three merge runs: 1 - 500, 1,001 - 1,500, 2,001 - 2,300
             Merge 2: two merge runs:   501 - 1,000, 1,501 - 2,000
           Merge (merge 1 and merge 2)
             Merge 3: three merge runs: 1 - 1,000, 1,001 - 2,000, 2,001 - 2,300
           Distribution
             Merge 1: two merge runs:   1 - 1,000, 2,001 - 2,300
             Merge 2: one merge run:    1,001 - 2,000
           Merge (merge 1 and merge 2)
             Merge 3: two merge runs:   1 - 2,000, 2,001 - 2,300
           Distribution
             Merge 1: one merge run:    1 - 2,000
             Merge 2: one merge run:    2,001 - 2,300
           Merge (merge 1 and merge 2)
             Merge 3: one merge run:    1 - 2,300
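The alternation of distribution and merge phases can be simulated in memory with a short Python sketch. Each "file" is modeled as a list of runs, which is an assumption of this sketch; a real external sort would read and write records on secondary storage. The names split_into_runs and natural_merge_sort are also assumptions, while heapq.merge is the standard library routine for merging sorted inputs.

```python
import heapq


def split_into_runs(records):
    """Split records into merge runs (maximal ascending stretches)."""
    runs = []
    for rec in records:
        if runs and runs[-1][-1] <= rec:
            runs[-1].append(rec)
        else:
            runs.append([rec])
    return runs


def natural_merge_sort(records):
    """Return a sorted copy of records using two input files and one output file."""
    merge3 = split_into_runs(records)      # runs currently on the output file
    while len(merge3) > 1:
        # Distribution phase: deal the runs alternately to the two input files.
        merge1, merge2 = merge3[0::2], merge3[1::2]
        # Merge phase: merge run i of merge 1 with run i of merge 2.
        merge3 = []
        for i, run in enumerate(merge1):
            if i < len(merge2):
                merge3.append(list(heapq.merge(run, merge2[i])))
            else:
                merge3.append(run)         # unpaired last run is copied as is
    return merge3[0] if merge3 else []


print(natural_merge_sort([77, 62, 14, 9, 30, 21, 80, 25, 70, 55]))
```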
        2. Balanced A balanced merge uses a constant number of input merge files and the same number of output merge files. Because the output of one merge phase is already spread over the files that serve as input to the next, the balanced merge eliminates the distribution phase.

           Example: input file

             Merge 1: three merge runs: 1 - 500, 1,001 - 1,500, 2,001 - 2,300
             Merge 2: two merge runs:   501 - 1,000, 1,501 - 2,000
           Merge (merge 1 and merge 2 into merge 3 and merge 4)
             Merge 3: two merge runs:   1 - 1,000, 2,001 - 2,300
             Merge 4: one merge run:    1,001 - 2,000
           Merge (merge 3 and merge 4 into merge 1 and merge 2)
             Merge 1: one merge run:    1 - 2,000
             Merge 2: one merge run:    2,001 - 2,300
           Merge (merge 1 and merge 2 into merge 3)
             Merge 3: one merge run:    1 - 2,300
           The file is sorted.
        3. Polyphase In the polyphase merge, a constant number of input merge files are merged to one output merge file, and an input merge file is immediately reused as soon as its input has been completely merged.

           First merge phase
             Merge 1 (input):  three merge runs: 1 - 500, 1,001 - 1,500, 2,001 - 2,300
             Merge 2 (input):  two merge runs:   501 - 1,000, 1,501 - 2,000
             Merge 3 (output): two merge runs:   1 - 1,000, 1,001 - 2,000
             The first two runs of merge 1 and merge 2 are merged into merge 3. Merge 1 still has one run; merge 2 is empty and becomes the next output file.

           Second merge phase
             Merge 1 (input):  one merge run:    2,001 - 2,300
             Merge 3 (input):  two merge runs:   1 - 1,000, 1,001 - 2,000
             Merge 2 (output): one merge run:    records 1 - 1,000 and 2,001 - 2,300

           Third merge phase
             Merge 2 (input):  one merge run:    records 1 - 1,000 and 2,001 - 2,300
             Merge 3 (input):  one merge run left: 1,001 - 2,000
             Merge 1 (output): one merge run:    1 - 2,300
    2. Comparison of Methods
      1. Use of Storage Space
      2. Use of Computer Time
      3. Programming Effort
    3. Other factors in selecting a sorting method
      1. Contiguous version
      2. Linked version
      3. Record Size
      4. Recursive version
      5. Non-recursive version
      6. Programming language used
      7. Its average case
      8. Best case
      9. Worst case
      10. One-time sorting
      11. Frequently used
      12. List size

      Last update October 3,1998.