sort stability

When sorting some kinds of data, only part of the data is examined when determining the sort order. For example, when sorting playing cards by rank alone, their suit is ignored. The result is that it is possible to have multiple different correctly sorted versions of the original list. Stable sorting algorithms choose one of these according to the following rule: if two items compare as equal, like two 5 cards of different suits, then their relative order will be preserved, so that if one came before the other in the input, it will also come before the other in the output.
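
For instance, Python's built-in sorted() is a stable sort; a minimal sketch (the card data below is made up purely for illustration):

# Cards as (rank, suit) pairs; we sort by rank only and ignore suit.
cards = [("5", "hearts"), ("3", "spades"), ("5", "clubs"), ("2", "diamonds")]

by_rank = sorted(cards, key=lambda card: int(card[0]))

# Because the sort is stable, ("5", "hearts") still precedes ("5", "clubs"),
# just as it did in the input.
print(by_rank)
# [('2', 'diamonds'), ('3', 'spades'), ('5', 'hearts'), ('5', 'clubs')]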

XOR swap algorithm

http://en.wikipedia.org/wiki/XOR_swap_algorithm

Conventional swapping requires the use of a temporary storage variable. Using the XOR swap algorithm, however, no temporary storage is needed. The algorithm is as follows:

X := X XOR Y
Y := X XOR Y
X := X XOR Y

To understand it, think about the PLUS swap algorithm:

a = a + b
b = a - b
a = a - b

Then interpret XOR as a binary PLUS operation without carry: each bit position is added modulo 2, so the same add-then-subtract trick works bit by bit.
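
A minimal sketch in Python (the function name and sample numbers are illustrative only):

# XOR swap without a temporary variable.  In languages where X and Y can
# alias the same storage location, the first XOR would zero the value, so
# real implementations check that the two locations are distinct first.
def xor_swap(x, y):
    x = x ^ y   # x now holds x XOR y
    y = x ^ y   # (x XOR y) XOR y == original x
    x = x ^ y   # (x XOR y) XOR original x == original y
    return x, y

print(xor_swap(5, 9))   # (9, 5)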

bloom filter

http://en.wikipedia.org/wiki/Bloom_filter

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.

To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.
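
A minimal sketch in Python, where hash((i, item)) salted with an index i stands in for the k independent hash functions, and m, k, and the example items are illustrative choices rather than values from the text:

# A toy Bloom filter: m-bit array, k salted hashes per element.
class BloomFilter:
    def __init__(self, m=1024, k=7):
        self.m = m
        self.k = k
        self.bits = [0] * m              # the m-bit array, all zeros

    def _positions(self, item):
        # Derive k array positions by salting the hash with the index i.
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1           # set all k positions to 1

    def __contains__(self, item):
        # Any 0 bit means definitely absent; all 1s means present
        # or a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print("alice" in bf)   # True
print("bob" in bf)     # False (with high probability)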

While risking false positives, Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, in contrast, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. The 1% false-positive rate can be reduced by a factor of ten by adding only about 4.8 bits per element.
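
The 9.6-bit figure can be checked from the standard optimal-parameter relations m/n = -log2(p)/ln 2 and k = (m/n) ln 2; a quick sketch of the arithmetic for p = 1%:

import math

p = 0.01                                          # target false-positive rate
bits_per_element = -math.log2(p) / math.log(2)    # m/n ≈ 1.44 * log2(1/p)
optimal_k = bits_per_element * math.log(2)        # k = (m/n) * ln 2

print(round(bits_per_element, 1))   # 9.6 bits per element
print(round(optimal_k, 1))          # about 6.6, so k = 7 in practice
# Each further factor-of-ten reduction in p costs -log2(0.1)/ln 2 ≈ 4.8 bits.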

However, if the number of potential values is small and many of them can be in the set, the Bloom filter is easily surpassed by the deterministic bit array, which requires only one bit for each potential element. Note also that hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with k = 1.[2]

Bloom filters also have the unusual property that the time needed either to add items or to check whether an item is in the set is a fixed constant, O(k), completely independent of the number of items already in the set. No other constant-space set data structure has this property, but the average access time of sparse hash tables can make them faster in practice than some Bloom filters. In a hardware implementation, however, the Bloom filter shines because its k lookups are independent and can be parallelized.

To understand its space efficiency, it is instructive to compare the general Bloom filter with its special case when k = 1. If k = 1, then in order to keep the false positive rate sufficiently low, a small fraction of bits should be set, which means the array must be very large and contain long runs of zeros. The information content of the array relative to its size is low. The generalized Bloom filter (k greater than 1) allows many more bits to be set while still maintaining a low false positive rate; if the parameters (k and m) are chosen well, about half of the bits will be set, and these will be apparently random, minimizing redundancy and maximizing information content.

count sort

http://en.wikipedia.org/wiki/Counting_sort

In summary, the algorithm loops over the items, computing a histogram of the number of times each key occurs within the input collection. It then performs a prefix sum computation (a second loop, over the range of possible keys) to determine, for each key, the starting position in the output array of the items having that key. Finally, it loops over the items again, moving each item into its sorted position in the output array.[1][2][3]

In pseudocode, this may be expressed as follows:

# calculate histogram:
# k is the right end of the keys' range
# allocate an array Count[0..k] and initialize each cell to zero
for each input item x:
    increment Count[key(x)]

# calculate starting index for each key:
total = 0
for i = 0, 1, ..., k:
    oldCount = Count[i]
    Count[i] = total
    total = total + oldCount

# copy inputs into the output array in order:
# allocate an output array Output[0..n-1]
for each input item x:
    Output[Count[key(x)]] = x
    increment Count[key(x)]
return Output

After the first for loop, Count[i] stores the number of items with key equal to i. After the second for loop, it instead stores the number of items with key less than i, which is the same as the first index at which an item with key i should be stored in the output array. Throughout the third loop, Count[i] always stores the next position in the output array into which an item with key i should be stored, so each item is moved into its correct position in the output array.[1][2][3] The relative order of items with equal keys is preserved here; i.e., this is a stable sort.
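
The same steps as a runnable Python sketch (the key function defaulting to the identity and the sample input are illustrative assumptions):

# Counting sort, assuming key(x) returns an integer in the range 0..k.
def counting_sort(items, k, key=lambda x: x):
    count = [0] * (k + 1)

    # Histogram: count[i] = number of items with key i.
    for x in items:
        count[key(x)] += 1

    # Prefix sum: count[i] = first output index for items with key i.
    total = 0
    for i in range(k + 1):
        count[i], total = total, total + count[i]

    # Place each item; incrementing count keeps the sort stable.
    output = [None] * len(items)
    for x in items:
        output[count[key(x)]] = x
        count[key(x)] += 1
    return output

print(counting_sort([3, 1, 4, 1, 5, 1, 2], k=5))
# [1, 1, 1, 2, 3, 4, 5]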

Analysis

Because the algorithm uses only simple for loops, without recursion or subroutine calls, it is straightforward to analyze. The initialization of the Count array, and the second for loop which performs a prefix sum on the count array, each iterate at most k + 1 times and therefore take O(k) time. The other two for loops, and the initialization of the output array, each take O(n) time. Therefore the time for the whole algorithm is the sum of the times for these steps, O(n + k).[1][2]

Because it uses arrays of length k + 1 and n, the total space usage of the algorithm is also O(n + k).[1] For problem instances in which the maximum key value is significantly smaller than the number of items, counting sort can be highly space-efficient, as the only storage it uses other than its input and output arrays is the Count array which uses space O(k).[5]

test randomness

  1. http://en.wikipedia.org/wiki/Statistical_randomness
  • The frequency test was very basic: checking to make sure that there were roughly the same number of 0s, 1s, 2s, 3s, etc.
  • The serial test did the same thing but for sequences of two digits at a time (00, 01, 02, etc.), comparing their observed frequencies with their hypothetical predictions were they equally distributed.
  • The poker test tested for certain sequences of five numbers at a time (aaaaa, aaaab, aaabb, etc.) based on hands in the game poker.
  • The gap test looked at the distances between zeroes (00 would be a distance of 0, 030 would be a distance of 1, 02250 would be a distance of 3, etc.).
  • The chi-squared test, the standard statistical tool for comparing observed counts with expected counts; see the links below.

https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test

https://en.wikipedia.org/wiki/Chi-squared_distribution

If a given sequence was able to pass all of these tests within a given degree of significance (generally 5%), then it was judged to be, in their words, “locally random”.
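
As a rough illustration of how such a test is applied, here is a minimal sketch of the frequency test on decimal digits using Pearson's chi-squared statistic; the generated sequence is arbitrary, and 16.92 is the standard 5% critical value for 9 degrees of freedom (10 digit categories minus 1):

import random

# Frequency test: compare digit counts against a uniform expectation.
digits = [random.randint(0, 9) for _ in range(10000)]

observed = [digits.count(d) for d in range(10)]
expected = len(digits) / 10                      # uniform expectation

chi_sq = sum((o - expected) ** 2 / expected for o in observed)
print(chi_sq, "locally random (by this test)" if chi_sq < 16.92 else "suspicious")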

  2. http://www.drdobbs.com/testing-random-number-generators/184403185