Car accident? Or train wreck?

published: Mon, 25-Feb-2008   |   updated: Thu, 28-Feb-2008

...what's the difference?

Last night I read some more of Data Structures and Algorithms Using C# by Michael McMillan, but this time I wasn't reading in depth, just skimming over the material. Through it all, my emotions veered from horrified awe (the old "look at that car accident" syndrome) to anger.

Why anger? This took me some thought to understand after I'd switched off the light. Part of the reason is that I find data structures and algorithms fascinating. I am passionate about the subject. I have a good half of a 6-foot-tall bookcase behind me full of algorithms books. I collect them. I love seeing how different authors approach the same algorithms; it gives me new perspectives.

Looking at an algorithm, trying to understand how it works, teasing out the peculiarities, seeing how the edge cases are catered for, gives me the same sense of wonder I experience when I see the interior of a Swiss-made mechanical watch: the precision and accuracy are just awe-inspiring. Take binary search, for example. It's an archetypal example of a divide-and-conquer technique (each step of the algorithm halves the amount of work we have to do, glorious!), something that we programmers should always be on the lookout for. From divide-and-conquer, it's a short step to dynamic programming (itself a fascinating area of study).

Binary search is a great algorithm in its own right, but it took a long while for it to be proven valid (at first, in the 50s, computer science researchers were only certain it worked for sorted arrays whose size was a power of 2). If you're stuck on a desert island with no internet connection or books but with a laptop (hey, OK, bear with me here), it's a hard algorithm to implement correctly at first, showing you that you positively and utterly need to test your implementation thoroughly. So you become more aware of the edge cases that can bite you in the leg and test the heck out of them so that their danger is well contained.

And then, in 2006, Joshua Bloch at Google found a bug: the classic midpoint calculation can overflow once your arrays get big enough. It was like Einstein pointing out that Newton's law of gravitation was only a first approximation. Our understanding of binary search improved, we got to appreciate its edge cases all over again, and it remains as elegant as it always was.
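
Just to make that concrete, here is a quick sketch of mine (nothing to do with the book) of the iterative version, with the midpoint computed the way Bloch's fix computes it:

static int BinarySearch(int[] sortedData, int target) {
  int lo = 0;
  int hi = sortedData.Length - 1;
  while (lo <= hi) {
    // lo + (hi - lo) / 2 rather than (lo + hi) / 2: the latter is the
    // overflow bug Bloch found, which only bites for enormous arrays
    int mid = lo + (hi - lo) / 2;
    if (sortedData[mid] == target)
      return mid;
    if (sortedData[mid] < target)
      lo = mid + 1;   // the target, if present, is in the upper half
    else
      hi = mid - 1;   // the target, if present, is in the lower half
  }
  return -1;          // range exhausted: not found
}

Ten-odd lines, and yet every comparison and every plus-or-minus-one in there is an edge case waiting to bite.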

For me, this sense of wonder, of accomplishment, of passion, of inquiry, is not present in Michael McMillan's book. Instead what comes across to this particular reader is a sense of "let's just get this over with as fast as possible". Care is not taken, either in the descriptions or in the code, and this perceived carelessness makes me angry. We are not talking about algorithms that are particularly hard here. No, we are talking about the simplest algorithms, the ones that an undergraduate computer science course covers and is done with in the first few weeks of the first semester.

I can see this carelessness in Michael McMillan because I've suffered from it myself. I used to write a monthly article on algorithms for The Delphi Magazine (now sadly dead). There were months when I wrote about an algorithm I just didn't care enough about to research properly or to make interesting. It would be a slog to finish the article and the code. We've all had jobs or projects like that: just get the bloody thing done so you can go do something else more interesting. For me, Michael McMillan's book is the embodiment of that attitude and feeling.

An example: Mr. McMillan avoids big-Oh notation altogether. Sure, it uses mathematical notation (what's a logarithm again?) so it might put off some readers, but what it describes is fundamental to algorithms. Robert Sedgewick says that if an algorithm runs in O(n²) or worse, it's worthless in the real world, but to understand where he's coming from, to argue with his viewpoint, you have to have a gut feeling for how algorithms run. And so, in an algorithms book, I'd say you have no option but to include some kind of analysis. Profile the damn algorithm and show that if you double the number of items, or double the problem space, it'll take four times as long to process. That's important information, from which you can derive estimates that help you in the real world. Fine, you don't have to write the conclusion as O(n²), but the information itself is of vital importance. Not including it is a major disservice to your readers.
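
To show the sort of thing I mean, here's a rough sketch (the names and the choice of Stopwatch are mine, not the book's) of a doubling test you could run on any sorting routine:

using System;
using System.Diagnostics;

static class DoublingTest {
  // Run the same algorithm on n items and then on 2n items and print the
  // timings. A quadratic algorithm roughly quadruples; an O(n log n) one
  // takes a bit more than twice as long.
  public static void Run(Action<int[]> algorithm, int n) {
    Random rand = new Random(42);
    foreach (int size in new int[] { n, 2 * n }) {
      int[] data = new int[size];
      for (int i = 0; i < size; i++)
        data[i] = rand.Next();
      Stopwatch timer = Stopwatch.StartNew();
      algorithm(data);
      timer.Stop();
      Console.WriteLine("n = {0}: {1} ms", size, timer.ElapsedMilliseconds);
    }
  }
}

Run it against a quadratic sort and the second timing comes out around four times the first; run it against quicksort and it's a little more than double. You don't need the notation to learn something useful from that.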

And so this feeling of anger creeps over me. People will have bought this book to understand algorithms and data structures, especially when using C# and the .NET Framework, and Mr. McMillan seems to systematically hide vital information without even saying that he is doing so. For example, in the chapter on binary search trees, there is one sentence at the end of the summary that warns against unbalanced binary search trees. Is there an additional sentence that exhorts the reader to look up AVL or red-black trees in this, that, or the other advanced algorithms book? Nope. The reader is left with the vague worry that they'll have to guard against unbalancing their trees without any real understanding of how to do that. I'm sorry but that attitude is unconscionable. It just pisses me off. It's made worse because Chapter 15 does discuss AVL trees and red-black trees. Cripes! Is writing a quick "See Chapter 15 for some ways around this." so bloody hard?
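
For what it's worth, here's a tiny sketch of mine (again, not the book's) showing just how bad "unbalanced" gets: feed a plain binary search tree its keys in sorted order and you don't get a tree at all, you get a linked list wearing a tree costume.

// A bare-bones, unbalanced BST. Insert the keys 1..n in ascending order
// and every new node hangs off the right of the previous one: the "tree"
// is a chain of depth n, so searches take O(n) rather than O(log n).
class Node {
  public int Key;
  public Node Left, Right;
  public Node(int key) { Key = key; }
}

static int DepthAfterSortedInserts(int n) {
  Node root = null;
  for (int key = 1; key <= n; key++) {
    Node parent = null, current = root;
    while (current != null) {             // standard unbalanced insert
      parent = current;
      current = key < current.Key ? current.Left : current.Right;
    }
    Node node = new Node(key);
    if (parent == null) root = node;
    else if (key < parent.Key) parent.Left = node;
    else parent.Right = node;
  }
  int depth = 0;                          // all the inserts went right, so
  for (Node p = root; p != null; p = p.Right)
    depth++;                              // the right spine is the whole tree
  return depth;                           // returns n, not roughly log2(n)
}

That is precisely the degenerate case the reader is left to worry vaguely about, and precisely the case Chapter 15's AVL and red-black trees exist to prevent.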

Another reason for my anger is how utterly badly this book has been tech edited. I realized last night that this book is a translation of the author's previous book written for VB.NET. How do I know this? Direct evidence, that's how (and then I went online to verify). There are places in the text (p.96 as an example) where it assumes that the reader is a VB.NET programmer. There are places where the code block is just in VB.NET (p.131, for instance). There are other places where it seems that Mr. McMillan merely converted the text of the code (badly, I might add) in the document to C# and never, ever took it anywhere near Visual Studio or, heaven forbid, even tried to compile it (p.139 has a code block that's partly C# — look, Ma, semicolons! — and partly VB — New with a capital N, etc.). There are even some places where it looks like search-and-replace was used with no one reading the result; for example, the following sentence only really makes sense if "C#" was originally "VB" (p.93): "The ability to derive new classes from classes in the .NET Framework class library is one of the major strengths of the .NET version of C#." Elsewhere the text refers to "Sub Main", "ReDim Preserve", the "ASC function", "Nothing" (instead of null), and so on.

And I've already gone on at length about the source code that's printed in the book. Look, it's really not as if I were deliberately showing only the bad bits and ignoring the good bits. The bad bits are all over, practically in every code block. To prove it, I opened the book just now at a page I haven't read yet (pp. 260-261). It happens to be the code for quicksort (oh, no). I'll type in the method at the beginning of the code block:

public void RecQSort(int first, int last) {
  if ((last-first) <= 0) 
    return;
  else {
    int pivot = arr[last];
    int part = this.Partition(first, last);
    RecQSort(first, part-1);
    RecQSort(part+1, last);
  }
}

It looks pretty good, yes? Here are my issues. First, it doesn't declare which class this method belongs to. One assumes that arr is an array-like field in this unknown class. And what an awful name for the method. Next, why use such a complicated condition for the if? Why not "if (first >= last)"? It expresses the intent a little more clearly, no? (Later on in the same code block there is a direct comparison between first and last, without resorting to subtracting them and comparing the result to 0.) The else keyword is superfluous: if the if clause is taken, you get out of the method immediately (OK, I'll agree that this is personal preference, but I find it easier to read if there is not so much indentation). Next, a pivot value is calculated and then never used. (The screamingly funny thing I now see is that the first line of the Partition() method calculates its pivot as being the first element in the range and not the last.) And the use of "this" to scope the call to Partition()? Why?

Now, I'll agree that my little example here didn't actually turn up any bugs, but even so, this simple method could be cleaned up to make it easier to read:

class CArray {
  ...
  public static void RecursiveQuickSort(int[] arr, int first, int last) {
    if (first >= last) 
      return;

    int pivot = Partition(arr, first, last);
    RecursiveQuickSort(arr, first, pivot-1);
    RecursiveQuickSort(arr, pivot+1, last);
  }
}

Or even better, remove the tail recursion (a technique not explored in this book, by the way, although it's of vital importance to algorithm design and implementation):

class CArray {
  ...
  public static void RecursiveQuickSort(int[] arr, int first, int last) {
    while (first < last) {
      int pivot = Partition(arr, first, last);
      RecursiveQuickSort(arr, first, pivot-1);
      first = pivot + 1;
    }
  }
}
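
And if you want to go one step further (this is me noodling, nothing from the book), you can cap the stack depth at roughly log2(n) frames, however badly the pivots behave, by always recursing into the smaller partition and looping on the larger one:

class CArray {
  ...
  public static void RecursiveQuickSort(int[] arr, int first, int last) {
    while (first < last) {
      int pivot = Partition(arr, first, last);
      // recurse into the smaller half, loop on the larger half: the stack
      // then never gets deeper than about log2(n) frames
      if (pivot - first < last - pivot) {
        RecursiveQuickSort(arr, first, pivot - 1);
        first = pivot + 1;
      } else {
        RecursiveQuickSort(arr, pivot + 1, last);
        last = pivot - 1;
      }
    }
  }
}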

(Oh, god, I've just started reading the code in Partition(). You bastards, it's your fault, you made me do this. There's a variable called okSide. There's a do loop without a while clause. My eyes, my eyes, they burn...)

For a technical book, all this is just inexcusable. The editors at Cambridge University Press are very much to blame here too: they should have insisted on it being properly tech edited. (I will ignore the sudden thought that it was, but the tech editor's recommendations were passed over.) A technical book like this lives or dies by the facts it presents, and for me this book mostly dies. This lack of tech editing just pisses me off.

And here's another reason for anger: it just seems to me that when Michael McMillan doesn't know something, he simply makes stuff up. Either that, or his research or note-taking is woefully inadequate. Let me explain.

This book is targeted at C# programmers, right? It's there in the title. It is assumed, therefore, that they are using the .NET Framework. Ditto for Mr. McMillan. In fact, parts of the book are explanations of how to use the existing data structure classes in the .NET Framework. So you would assume that Mr. McMillan is au fait with the Framework and that, when he doesn't know something, he'll use Reflector to go directly to the source to find out, or write tests to deduce the answer. After all, he is teaching basic computer science here, and a scientist will research and experiment to find the answer, not just invent it on the spot.

Here goes (I'm using Reflector on .NET Framework 2.0.50727.1433, by the way). These are only the things I've found; there may be others — remember, I'm admitting to not having read it all.

pp. 72-73: "The [Stack] class [in the .NET Framework] is implemented as a circular buffer." No, it's not, and I have to say it would be the most bizarre way to build a stack. The non-generic version is implemented using an array of object and a stack pointer. The generic version is pretty much the same implementation, but using an array of T.
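
Strip away the housekeeping and the essence of what Reflector shows is about as dull as data structures get. Something like this in spirit (my sketch, not the Framework's actual source):

using System;

// The gist: an object[] plus an index to the next free slot. Push grows
// the array when it's full, Pop just decrements the index. Nothing
// circular about it.
class SimpleStack {
  private object[] items = new object[16];
  private int count;                      // doubles as the "stack pointer"

  public void Push(object item) {
    if (count == items.Length)
      Array.Resize(ref items, items.Length * 2);   // grow when full
    items[count++] = item;
  }

  public object Pop() {
    if (count == 0)
      throw new InvalidOperationException("Stack is empty");
    object top = items[--count];
    items[count] = null;                  // let the GC have the reference
    return top;
  }
}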

p.172: "the SortedList class is a specialization of the DictionaryBase class." No, it isn't, whether you are talking about the non-generic or generic versions. Both of the non-generic classes implement IDictionary, sure, but since I've found no mention of interfaces in this book (and, no, I'll admit I haven't read it all), it's not surprising that this is ignored or smeared into "specialization of DictionaryBase".

p.184: "The strategy the [HashTable] class uses to avoid collisions is the concept of a bucket. A bucket is a virtual grouping of objects together that have the same hash code, much like we used an ArrayList to handle collisions when we discussed chaining. If two keys have the same hashcode, they are placed in the same bucket." Where to start with this? It's just completely false, but the flights of fancy are impressive: "virtual grouping"? The HashTable class uses probing with double hashing to resolve collisions, end of story. Since the book doesn't even begin to talk about how to insert or delete entries from a hash table that uses probing, it's a little rich to say so much about so little. The Dictionary<K,V> class does use chaining, but this particular generic class is completely ignored in this book. Now, I could almost take the linguistic gymnastics and hand-waving that says if two keys hash to the same value they form part of a "virtual grouping" (I'm air-quoting with my fingers here), but there's no assumption or discussion here that justifies it.

p.184: "The number of buckets used in a HashTable objects [sic] is called the load factor. The load factor is the ratio of the elements to the number of buckets. Initially the factor is set to 1.0. When the actual factor reaches the initial factor, the load factor is increased to the smallest prime number that is twice the current number of buckets. The load factor is important because the smaller the load factor, the better the performance of the HashTable object." This is a complete mishmash between half-truths and utter falsehoods, and it's hard to separate them. First off, the first two sentences define load factor in different ways: it's the number of buckets — oh no, it's not, it's the ratio of elements to buckets. This is panto, not technical writing. The second answer is correct, by the way; the number of buckets is just the capacity of the hash table. Onwards. Although the load factor defaults to 1.0, internally it's stored as 0.72 (the reason for this weird number is discussed in Knuth's TAoCP, and I think the CLR team settled on this actual value through experimentation), and when the actual load factor reaches this, the hash table size (the number of buckets) is roughly doubled, meaning the run-time load factor is roughly halved. And yes, the new size is selected as the nearest prime. The final sentence is true, though it would have been nice for Mr. McMillan to say why (in essence, at the 0.72 ceiling, the average successful probe length is about 2, and the average unsuccessful probe length is 5, and going beyond that will slow things down dramatically). However, that would have required more information in the section on probing to give a flavor of the reason. By the way, I just love the whole "when the load factor reaches 1.0, we double it" explanation. Jus' kickin' in the turbo boost. Sheer brilliance. It reminds me of that bit in Spinal Tap when the guitarist says that the knobs on his amplifiers go to 11, one more notch than everyone else's.

p.216: "The .NET Framework library uses a circularly linked list design to implement the ArrayList data structure." This is just so mind-bogglingly false that my head exploded on reading it. For the record, ArrayList is nothing more exotic than an object array that grows when it fills up. And mind you, Mr. McMillan uses ArrayList all over the place; it's far and away his favorite class in the whole of the .NET Framework (List<T> is mentioned nowhere that I can see). To use it so often and not even know how it's implemented is frankly bizarre. Now, it may just be that he meant to say something else here, but I've read and reread this section and I can't see an alternative meaning. Maybe some other words are missing, maybe some other class was supposed to be mentioned, maybe the sentence should have been struck, or maybe, you know, just a hint here, the whole ruddy book should bloody well have been properly tech-edited. Ya think?

Finally, there is another reason for my being angry. I had an opportunity to write a C# algorithms book a while back, but I did nothing and have done nothing about it. This book makes me bitter at myself for being a lazy bastard and missing this opportunity, and in some ways that's the hardest anger to bear. The book is holding up a mirror to myself.

This is it, the final essay I want to write on this book. The pages I've read in depth are scribbled with pencilled annotations, ready for more ranting, but I've now had enough. The book will go onto my bookshelves, only to be brought down when I need a laugh and to shake my head in wonder. If you want to do the same, or if you want to see how not to write C#, or if you're a budding author ready to start your first computer book, then I can recommend this book. It's full of the mistakes you should be avoiding like the plague. I guarantee that if you can see them and can vocalize why they're bad, you are well on the way to being a successful programmer or author.

For everyone else, I certainly do not recommend it. In my ever so humble opinion, Data Structures and Algorithms Using C# is quite simply the worst computer book I've ever seen.

[This is part 3 of a series of posts on this book. Part 1. Part 2. Part 4. Enjoy.]