Look! Car accident!

published: Sun, 24-Feb-2008   |   updated: Thu, 28-Feb-2008

Let me apologize in advance, but I just couldn't help it. You know the way traffic slows down to look at the aftermath of an accident in the other lane? Well, that's me with Data Structures and Algorithms in C#: I just had to go read another chapter last night. After all, my reasoning went, that chapter on basic sorting could have been the exception, not the rule. Surely I should check another chapter, just to make sure? It would be the honorable thing to do, right? So, I got my flashlight, made a tent under the bed covers so no one could see me, and started reading.

The chapter was chapter 4, Basic Searching Algorithms, pp. 55-67. You know, sequential search and binary search; standard stuff. Now, it's been well reported that sometimes programmers make mistakes in writing a binary search routine (see Tim Bray on the subject), but I'll admit this is the first time I've seen someone who can't write a sequential search properly.

Look at this code (p.56): can you spot the bug?

public bool SeqSearch(int[] arr, int sValue) {
  for (int index = 0; index < arr.Length-1; index++) 
    if (arr[index] == sValue)
      return true;
    return false;
}

No, really, peruse the code and find the bug before reading on. I will say, though, that if Mr. McMillan had used foreach there would be no bug. But, for some reason, foreach remains just outside his grasp. There is another subtle issue, which isn't necessarily a bug: this method doesn't use any instance fields from the class it's declared in, so it could be marked as static. (Yes, you guessed it, I think the class in question is our old friend CArray.)

So the bug is that the for loop excludes the last element in the array: the ending condition includes a spurious subtraction of 1. If you are searching for the value of the last element, this method will return false. Now, get this: all of the sequential search methods printed in this chapter suffer from the same bug. Want to know the index of the found element? Certainly, but not if it's at the end. Want to find the maximum value in an array? Sure, so long as it's not the last element. The minimum? Yep, but you've got to make sure that it doesn't happen to be the final one.

This bug alone shows that Mr. McMillan is just not comfortable or familiar with the for loop in C-style languages.
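For the record, here's roughly what a fixed version would look like (my sketch, not the book's); I've marked it static while I'm at it:

public static bool SeqSearch(int[] arr, int sValue) {
  // corrected sequential search: the loop condition now covers the
  // whole array, and the method is static since it uses no instance state
  for (int index = 0; index < arr.Length; index++)
    if (arr[index] == sValue)
      return true;
  return false;
}

// Or with foreach, where the off-by-one error simply can't happen:
public static bool SeqSearch2(int[] arr, int sValue) {
  foreach (int item in arr)
    if (item == sValue)
      return true;
  return false;
}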

But wait, I hear you cry: surely there's some test code that properly exercises the method and shows how to call it? Wouldn't that have found this error? Well, there is, kinda. Imagine a test that reads a set of numbers from a text file in the root directory of your C: drive into an array, outputs "Enter a number to search for" on the console, reads the value from the user, and then outputs whether that number is in the set or not. Oh, and there had better be exactly as many values in the text file as there are elements in the array (that would be 100), otherwise you'd be debugging some other problem. If you're talking about that kind of test code, well, you're in luck (p.56). If you're talking about the kind of test code that checks that the first item is found, the last item too, and one in the middle of the array (and that a number smaller than the first or larger than the last is not found, and so on), then you're going to have to write it yourself, since the author evidently never has.
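Something along these lines would do; a minimal sketch, assuming the corrected static SeqSearch above:

static void TestSeqSearch() {
  // needs using System.Diagnostics for Debug.Assert
  int[] arr = { 2, 4, 6, 8, 10 };
  Debug.Assert(SeqSearch(arr, 2));    // first element
  Debug.Assert(SeqSearch(arr, 6));    // middle element
  Debug.Assert(SeqSearch(arr, 10));   // last element: the book's version fails here
  Debug.Assert(!SeqSearch(arr, 1));   // smaller than the first
  Debug.Assert(!SeqSearch(arr, 11));  // larger than the last
  Debug.Assert(!SeqSearch(arr, 7));   // within range, but not present
}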

For fun, here's a sentence from the section on finding the maximum and minimum values in an array, using sequential search (p.59): "Notice that the array search starts at position 1 and not position 0." Here's the for loop declaration it's talking about:

for (int i = 0; i < arr.Length-1; i++) 

Yes, our old friend the off-by-one bug is still there, but the code actually starts at 0, not 1. In this case the text is right (when finding the maximum you'd initialize with the element at position 0 and start comparing from position 1), but the code is (slightly) wrong. No one read this chapter (or the rest of the book, it seems) from the technical proofing viewpoint.

Then the chapter veers off into the mud of self-organizing data, but it's no 4x4. On page 60 we see this: "See Knuth (1998, pp. 399-401) for [...]". Nice, our first reference. So I looked at the References section on page 339 to see exactly which book by Knuth he's talking about — I mean, geez, Don Knuth has written so bloody many — only to find that there are no Knuth books cited. Hello? (It's the second edition of volume 3 of The Art of Computer Programming, if you really want to know; it talks about Pareto distributions.)

In essence, the point of this section is that sequential search becomes faster if the items you search for most often end up near the front of the array, so after a successful search you move the found item to the front. This is a great idea for linked lists, since removing a node and inserting it somewhere else are fast, constant-time operations (careful though, remember we can't talk about big-Oh notation in this book), and I've used it in the past for buckets in hash tables when using chaining. After using this algorithm for a while (deleting the found item from where it was in the list and reinserting it at the front), you find that the most frequently searched-for items tend to congregate at the beginning of the list.
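If that's not clear, here's a quick sketch of the idea using the BCL's generic LinkedList<T> (my code, not the book's):

// Move-to-front in a linked list: a successful search unlinks the found
// node and reinserts it at the head, both of them constant-time operations.
// (needs using System.Collections.Generic)
static bool MoveToFrontSearch(LinkedList<int> list, int sValue) {
  for (LinkedListNode<int> node = list.First; node != null; node = node.Next)
    if (node.Value == sValue) {
      list.Remove(node);    // unlink the found node
      list.AddFirst(node);  // reinsert it at the head
      return true;
    }
  return false;
}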

Unfortunately, I have to say, this section is pretty much a disaster, both in terms of the text and the code. An example is this:

If the search is successful, the item found is switched with the element at the first of the array using a swap function

And, if that weren't bad enough, the code illustrating this point shows the found item being switched with the one immediately prior to it in the array, not with the first:

static bool SeqSearch(int sValue) {
  for (int index = 0; index < arr.Length-1; index++)
    if (arr[index] == sValue) {
      swap(index, index-1);
      return true;
    }
  return false;
}

Now, there are several things wrong here. First of all, the algorithm described in the text is just wrong -- well, okay, call it pretty useless -- since the first item in the array just acts as a one-item cache: every time you successfully search for an item, the previously searched-for item just gets swapped away. There's no aggregation of frequently used items.

Second, the code just won't compile: it's a static method referencing instance fields. Third, as I said before, the code doesn't illustrate the subject of the sentence at all. Fourth, there's our old friend the off-by-one bug, and then there's a new bounds-check bug. (What happens if the item at index 0 is the one found? A swap with the item at position -1. Oops, crash.)
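For what it's worth, here's what the algorithm the text describes might actually look like (my sketch, assuming, as the book seems to, that arr is an instance field and swap is a helper that exchanges two elements):

// Move-to-front as the text describes it: swap the found item with the
// element at index 0. (A pretty useless strategy for an array, as noted
// above, but at least it matches the prose.)
public bool SeqSearchMoveToFront(int sValue) {
  for (int index = 0; index < arr.Length; index++)
    if (arr[index] == sValue) {
      if (index > 0)
        swap(index, 0);  // guard: nothing to swap if it's already first
      return true;
    }
  return false;
}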

A bit later on, after some code I really don't want to get into, we get the slightly better idea of just swapping a found item with its neighbor to the left (ah ha!), so that favored items slowly percolate to the front. We are then introduced to the ultimate sequential search, one that returns the index at which the value was found (rather than true/false) and has self-organization thrown in as a bonus. It's everything rolled into one, a burrito from Chipotle:

static int SeqSearch(int sValue) {
  for (int index = 0; index < arr.Length - 1; index++)
    if (arr[index] == sValue) {
      swap(index, index - 1);
      return index;
    }
  return -1;
}

With this bit of code you not only get all the bugs you had before, but this time there's a brand new one thrown in for the same price: if the item is found, the method returns the wrong index (the value has just been swapped one place to the left, so it's now at index - 1). Yessss! This is possibly the crème de la crème, the pièce de résistance of the whole chapter. It is breathtaking: four bugs in six lines of code, and printed in a book. I am in awe.
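Just to show it can be done, here's a version with all four bugs fixed (my sketch again, same assumptions about arr and swap):

// Self-organizing sequential search: scan the whole array, swap the found
// item one place toward the front (guarding index 0), and return the index
// at which the caller will now actually find the value.
public int SeqSearchSelfOrganizing(int sValue) {
  for (int index = 0; index < arr.Length; index++)
    if (arr[index] == sValue) {
      if (index == 0)
        return 0;              // already at the front: nothing to do
      swap(index, index - 1);  // percolate one place toward the front
      return index - 1;        // the value now lives one slot earlier
    }
  return -1;
}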

It seems almost churlish to point out that this sentence in the text explaining this method is wrong too:

This technique [swapping the found item with its neighbor] also guarantees that if an item is already at the beginning of the data set, it won't move back down.

(Well, duh, if the second item is found it'll move to the beginning, swapping out the "guaranteed never to move" first item.)

And then, after all that, we get to binary search. Remember: the algorithm that's hard to get right? Well, the code was correct. OK, a teensy white lie: it wouldn't compile because of that pesky static keyword, but if you got rid of that, it would compile and work. A classic implementation of binary search; and by classic I mean that the calculation of the middle index could overflow (see Tim Bray's post again).
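To be concrete about that overflow: with a big enough array, lower + upper can exceed int.MaxValue and wrap to a negative number. The standard fix, which I use in my own code below, is to rearrange the arithmetic:

static int Midpoint(int lower, int upper) {
  // (lower + upper) / 2 can overflow int when both bounds are large;
  // upper - lower never can, since both are valid non-negative indexes
  return lower + ((upper - lower) / 2);
}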

The text was wrong, though: "The algorithm iterates until the lower bound equals the upper bound" should read "is greater than", but that's a minor point since the code was essentially right.

The test code that exercised the binary search, though, was uncompilable for a start (no, I'm not typing it in so you can see it) and had a quite remarkable bug. The binary search method returned the index of the found item, or -1 if it wasn't there, but the test code had something similar to this:

int position = someObj.binSearch(...);
if (position >= -1)
  Console.WriteLine("Found item");
else
  Console.WriteLine("Not in the array");

Which is just a hoot: every possible return value is greater than or equal to -1, so the condition is always true. It seems the test code was never tested, since it would report that all items, whether in the array or not, were really in the array. (The check should, of course, read position >= 0.)

Then came a recursive binary search whose code didn't work and would never have worked (hint: you have to return the result of a recursive call, Mr. McMillan). But we'll gloss over that, since I can't be bothered to type the code in again (there's only so much CodeRush can help with). At least it compiles. He reports it's 10 times slower than the iterative version. Hmm, I wouldn't have thought that recursion would add that much overhead.
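For reference, a recursive version that does work might look like this (my sketch, with the result of each recursive call actually returned):

static int RecBinarySearch(int[] data, int value, int lower, int upper) {
  if (lower > upper)
    return -1;                              // empty range: not found
  int mid = lower + ((upper - lower) / 2);  // overflow-safe midpoint
  if (data[mid] < value)
    return RecBinarySearch(data, value, mid + 1, upper);  // note the return
  if (value < data[mid])
    return RecBinarySearch(data, value, lower, mid - 1);  // and here too
  return mid;
}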

Then the real red flag. Mr. McMillan reports that the Array class already has a static binary search method (true), and that "it consistently performs 10 times faster than the custom-built method". That's it. No pondering from the author on how this might be (or indeed how he measured it — we have no code, remember), no whimsical thoughts on how the Microsoft CLR team might have improved binary search so much that it always performs 10 times faster than a hand-coded implementation, no curiosity at all.

Well, bugger that, Mr. McMillan, I'm going to verify this myself. Since I have no idea what you measured (although it wouldn't surprise me if you measured just one call; shock me, yes, but not surprise me), I'll devise my own test. I'll add all the odd numbers from 0 to 10000 to an int[] array, and then call my binary search and Microsoft's on every number, odd and even, in that range. Probably several times, in order to get something measurable rather than just noise. May the best man win!

static private int MyBinarySearch(int[] data, int value) {
  int lower = 0;
  int upper = data.Length - 1;
  while (lower <= upper) {
    // overflow-safe midpoint: (lower + upper) / 2 could wrap around
    int mid = lower + ((upper - lower) / 2);
    if (data[mid] < value)
      lower = mid + 1;
    else if (value < data[mid])
      upper = mid - 1;
    else
      return mid;
  }
  return -1;
}

static private void TestBinarySearch() {
  int[] data = new int[5000];

  // fill the array with the odd numbers 1, 3, 5, ..., 9999
  int value = 1;
  for (int i = 0; i < data.Length; i++) {
    data[i] = value;
    value += 2;
  }

  Stopwatch timer = new Stopwatch();  // needs using System.Diagnostics
  timer.Start();
  for (int cycles = 0; cycles < 100; cycles++) {
    bool expectedResult = false;  // i starts at 0 (even, so absent) and toggles each iteration
    for (int i = 0; i <= 10000; i++) {
      if ((MyBinarySearch(data, i) >= 0) != expectedResult)
        Console.WriteLine("error");
      expectedResult = !expectedResult;
    }
  }
  timer.Stop();
  Console.WriteLine("Time for MyBinarySearch is " + timer.Duration());

  timer.Reset();  // Stopwatch accumulates, so clear it before the second run
  timer.Start();
  for (int cycles = 0; cycles < 100; cycles++) {
    bool expectedResult = false;
    for (int i = 0; i <= 10000; i++) {
      if ((Array.BinarySearch<int>(data, i) >= 0) != expectedResult)
        Console.WriteLine("error");
      expectedResult = !expectedResult;
    }
  }
  timer.Stop();
  Console.WriteLine("Time for Array.BinarySearch is " + timer.Duration());
}

Oh, and I decided to add the verification check to the performance-testing method just to make sure that I'd coded binary search properly. And yes, there's code duplication in there that could be refactored out, but this is a quick one-off.

And the result? Even with Grace Jones playing on iTunes (Pull Up to the Bumper, from Nightclubbing), the custom-built method wins, running in about 4 units of time versus the built-in method's 5 units; the latter takes about 25% more time than the former. So, in conclusion, I have absolutely no idea where Mr. McMillan gets his figures from, but they're wrong. (Note that the binary search code above was written by me; it's pretty similar to the code in the book, but I'm using the now-accepted way of calculating the mid-point.)

Oh, and the recursive version is about the same speed as Array.BinarySearch. It seems that if Mr. McMillan reports some timing values, you just have to take them with a truckload of salt.

Increasingly, it's becoming obvious that you should buy this book: it's a laugh a minute.

[This is part 2 of a series of posts on this book. Part 1. Part 3. Part 4. Enjoy.]