Lock-free data structures: the stack

published: Thu, 27-Oct-2005   |   updated: Sat, 18-Apr-2009

Notes: Several people have emailed me to say this code suffers from the ABA problem. It doesn't, but only because it's written in C# and implicitly uses the .NET garbage collector. If you just, quote, convert it to C++ or Delphi, unquote, it will not work and will suffer from some form of ABA. See here for details.

------

It's a fact that lock-based concurrency doesn't scale: only one thread at a time can execute the thread-safe code that's being protected by a lock or a mutex. The more threads there are accessing the lock, the more will be just sitting there waiting. (Actually "waiting" implies that they're awake: in fact, the operating system will have put them to sleep and will awaken them one at a time to acquire the lock.)

In pseudo-code, here's what each thread does:

do forever
  acquire mutex (or other lock) to resource
  do work on protected resource 
  release mutex
  maybe do other work not requiring resource
end do
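
In C#, that loop might look like the following minimal sketch (the resource object and the DoWork/DoOtherWork methods are stand-ins of mine, not real APIs):

  while (true) {
    lock (resource) {     // acquire the lock; block if another
      DoWork(resource);   //   thread currently holds it
    }                     // release the lock
    DoOtherWork();        // work that doesn't need the resource
  }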

Meanwhile, the other threads may be doing other work, but eventually they grind to a halt, all stuck waiting for the lock to be released. And when that happens only one thread can be woken up to acquire the lock. We have the effect of an hourglass where only one grain of sand can get through the restriction at any one time, and grains that did get through can make their way to the top again. The more grains of sand, the bigger the wait. Similarly, the more threads there are using the protected resource, the longer they'll have to wait in proportion to the work they actually do.

It's even worse if the machine has two or more CPUs: using a lock-based approach means, in effect, that you can't really take advantage of all of them. And more and more systems are being sold with two CPUs on the same die, or with hyperthreading to mimic two CPUs.

That's why an awful lot of work has been done over the past 10 or 15 years on lock-free or non-blocking data structures. There is a precise meaning to the term lock-free, by the way: a data structure is lock-free if it guarantees that after a finite number of steps of any operation on the data structure, some operation completes.

Lock-free data structures are usually implemented around some retry logic (a simple loop) together with a strong synchronization primitive. (A synchronization primitive is an atomic operation that cannot be interrupted and is usually implemented in hardware.) The most widely used primitive in these data structures is known as Compare-And-Swap (CAS). Windows (and therefore the .NET Framework) doesn't have CAS in exactly that form, but it can easily be implemented in software using another primitive that it does have.

Let's implement a lock-free stack using a linked list as the underlying data structure (you'll see why in a moment) to explore lock-free programming.

There are two operations on a stack: Push adds an item to the top of the stack and Pop removes the top item. It's extremely easy to implement such a stack in a single-threaded environment.

  public class SimpleStack<T> {

    private class Node<V> {
      public Node<V> Next;
      public V Item;
    }

    // Dummy node: the real top of the stack is head.Next.
    private Node<T> head;

    public SimpleStack() {
      head = new Node<T>();
    }

    public void Push(T item) {
      Node<T> node = new Node<T>();
      node.Item = item;
      // Link the new node in front of the current top.
      node.Next = head.Next;
      head.Next = node;
    }

    public T Pop() {
      Node<T> node = head.Next;
      if (node == null)
        return default(T);
      // Unlink the top node and return its item.
      head.Next = node.Next;
      return node.Item;
    }
  }

This implementation uses a dummy node for head, off from which we hang the rest of the stack. This makes it easier to code the Push() method: we don't have to check for a null head.

Anyway, let's analyze Push() from a multithreaded viewpoint now. Getting a new node doesn't affect the stack object at all: node is just a local variable. Admittedly we are using a global service (the heap manager) and that may be affected by other threads. (Hold that thought as we continue.) The next line is all about local variables and can't be affected by other threads.

The next line is more interesting. The head variable is a field of the (shared) object, but it's never changed. We get the value of its Next field and store that in a local variable (actually a field of a local variable, which amounts to the same thing). You may not know this, but .NET guarantees that reading head.Next is an atomic operation. In other words, no other thread can jump in halfway through us reading head.Next and change its value so that we somehow get half of the old value and half of the new one.

The next line is the horrific one, in a multithreaded sense. Just before this line executes we have made sure that our node points to the node currently pointed to by head. But, and it's a very big but, any other thread can come along and do a Push() (head now points to another new node) or a Pop() (head now points to the next node along) before our next line executes. If that happens, we'll trash the linkage and the stack. This is why we usually add a lock to protect our access to the head node:

  public void Push(T item) {
    Node<T> node = new Node<T>();
    node.Item = item;
    lock (head) {
      node.Next = head.Next;
      head.Next = node;
    }
  }

Instead of doing this, we're going to be clever and use the CAS function. In C# the CAS function would look a little like this:

  // WARNING: for illustration only; as written this is NOT atomic.
  static bool CAS(ref object destination, object currentValue, object newValue) {
    if (destination == currentValue) {
      destination = newValue;
      return true;
    }
    return false;
  }

But of course the big problem is that, as written, it's not atomic: any old thread could mess us up in the middle. Some operating systems and some hardware provide a routine that is guaranteed to do this compare-and-swap atomically, but Windows doesn't have one in this form. So we use another: Interlocked.CompareExchange. This method compares destination to currentValue and, if they're equal, sets destination to newValue and returns currentValue. If they aren't equal, it just returns the current value of destination. And all this is done in an atomic fashion: no other thread can interrupt it. (In fact, on x86 it's done in hardware with the CMPXCHG assembly instruction.)
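
To make those semantics concrete, here's a small sketch using the plain integer overload (the values are purely for illustration):

  int destination = 5;

  // destination equals the comparand (5), so it's set to 10 and the
  // original value, 5, is returned.
  int original = Interlocked.CompareExchange(ref destination, 10, 5);
  // destination == 10, original == 5

  // destination (10) doesn't equal the comparand (42), so nothing is
  // changed and the call just returns the current value, 10.
  original = Interlocked.CompareExchange(ref destination, 99, 42);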

Since we're in C# generics-land, here's the one I actually wrote:

  private static bool CAS(
    ref Node<T> location, Node<T> comparand, Node<T> newValue) {
    return 
      comparand == Interlocked.CompareExchange<Node<T>>(
                     ref location, newValue, comparand);
  }

Now we can use this method with some retry logic to effect linking the new node without blocking other threads. Are you ready?

  public void Push(T item) {
    Node<T> node = new Node<T>();
    node.Item = item;
    do {
      node.Next = head.Next;
    } while (!CAS(ref head.Next, node.Next, node));
  }

In other words, we set our node's Next value to head.Next, and then try to set head.Next to our new node, provided that head.Next hasn't changed in the meantime. If it has, we just try again. And again. Eventually we shall succeed. Or rather: there's no guarantee that any particular thread will succeed quickly, but there is a guarantee that, at any given time, some thread will be making progress.

If you think that this is not good enough (this loop not being guaranteed to exit), consider this: in a normal locking situation, it's not guaranteed that a particular thread will not wait for a very long time at a lock. The operating system makes no guarantees about how long a particular thread waits, just that only one thread will be woken up to acquire the lock once the current thread releases it.

The pattern shown here is essentially how you do lock-free programming: read the current value of the variable, then try to set the new value provided the variable hasn't been changed in the meantime. Rinse and repeat until you're successful.
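
The same loop works for any shared variable, not just a stack's head. For instance, here's a sketch of mine of a lock-free increment written in that style (a toy example: in real code, Interlocked.Increment does this in a single call):

  private static int counter;

  private static void LockFreeIncrement() {
    int current;
    do {
      // read the current value...
      current = counter;
      // ...and try to swap in current + 1, provided counter
      // hasn't changed since we read it
    } while (Interlocked.CompareExchange(
               ref counter, current + 1, current) != current);
  }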

So, now the Pop() method looks like this:

  public T Pop() {
    Node<T> node;
    do {
      node = head.Next;
      if (node == null)
        return default(T);
    } while (!CAS(ref head.Next, node, node.Next));
    return node.Item;
  }

Of course, you shouldn't take my word for it that this contraption of bizarre code works: we have to try it out on a dual (or more) processor machine. (Big tip for multithreaded development: always use a machine with more than one processor to test your multithreaded code. It'll fail faster if there's a bug.)

So I wrote this juicy little application: it pushes 10000 integers onto a stack in one thread and pops them all from another at the same time. Since I have a dual processor machine here, I figured that would be the best test to do.

The first thread just pushes the integers 1 to 10000 onto the stack. The second thread pops them off and keeps tallies of what it's seen to make sure that it pops off exactly one of each value and there are exactly 10000 of them. (Note: the way I wrote the Pop() method is to return the default value for the instantiating type if there was nothing on the stack. For an int, the default value is 0, which is why I don't use 0 as a value to push on the stack.)
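
As an aside, if using default(T) as an "empty" sentinel bothers you, a TryPop variant along the following lines would remove the ambiguity. This is my own addition, not part of the stack class above:

  // Hypothetical TryPop for SimpleStack<T>: reports emptiness
  // separately instead of overloading default(T) as a sentinel.
  public bool TryPop(out T item) {
    Node<T> node;
    do {
      node = head.Next;
      if (node == null) {
        item = default(T);
        return false;       // the stack really was empty
      }
    } while (!CAS(ref head.Next, node, node.Next));
    item = node.Item;
    return true;
  }

Anyway, on with the test program: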

  using System;
  using System.Threading;

  class Program {

    public const int topValue = 10000;

    class PusherEngine {
      private SimpleStack<int> stack;
      public PusherEngine(SimpleStack<int> stack) {
        this.stack = stack;
      }
      public void Execute() {
        for (int i = 1; i <= topValue; i++) {
          stack.Push(i);
        }
      }
    }

    class PopperEngine {
      private SimpleStack<int> stack;
      private Thread pusherThread;

      public PopperEngine(SimpleStack<int> stack, Thread pusherThread) {
        this.stack = stack;
        this.pusherThread = pusherThread;
      }

      public void Execute() {
        bool[] poppedValues = new bool[topValue + 1];
        int poppedInt;

        do {
          poppedInt = stack.Pop();
          if (poppedInt != 0) {
            if (poppedValues[poppedInt])
              Console.WriteLine(string.Format("{0} has been popped before!", poppedInt));
            poppedValues[poppedInt] = true;
          }
        } while (poppedInt != topValue);

        pusherThread.Join();
        Console.WriteLine("pusher now finished");

        poppedInt = stack.Pop();
        while (poppedInt != 0) {
          if (poppedValues[poppedInt])
            Console.WriteLine(string.Format("{0} has been popped before!", poppedInt));
          poppedValues[poppedInt] = true;
          poppedInt = stack.Pop();
        }

        Console.WriteLine("checking output");
        for (int i = 1; i <= topValue; i++) {
          if (!poppedValues[i])
            Console.WriteLine(string.Format("{0} is missing", i));
        }
      }
    }

    static void Main(string[] args) {

      Console.WriteLine("create the shared stack");
      SimpleStack<int> stack = new SimpleStack<int>();

      Console.WriteLine("create the threads");
      PusherEngine pusher = new PusherEngine(stack);
      Thread pusherThread = new Thread(new ThreadStart(pusher.Execute));
      PopperEngine popper = new PopperEngine(stack, pusherThread);
      Thread popperThread = new Thread(new ThreadStart(popper.Execute));

      Console.WriteLine("start the threads");
      pusherThread.Start();
      popperThread.Start();

      popperThread.Join();

      Console.WriteLine("Done");
      Console.ReadLine();
    }
  }

Because of the amount of code that gets executed in the popper thread, the pusher thread will usually finish pushing stuff much earlier than the popper thread can pop it off.

So what about the couple of items I asked you to bear in mind earlier on? First, we used a linked list as the underlying data structure so that we didn't have to pre-size the stack, as we would with an array, say. The big issue with an array would be how to grow it when the stack filled up: that's difficult to do without locking the array. Say we had a boolean flag we could test to see if it was safe to use the internal array. The problem is that if we read true (it's OK to use the internal array), another thread might creep in, clear the flag, and grow the array before we had a chance to use it (so we could end up altering an array that was being thrown away).
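
In code, that broken check-then-act would look something like this (the fields are entirely hypothetical, just to illustrate the race):

  // Hypothetical internals of an array-backed stack; this shows the
  // race, it is not working code.
  private T[] items;          // current backing array
  private int top;            // next free slot
  private bool arrayIsSafe;   // "OK to use items" flag

  public void BrokenPush(T item) {
    if (arrayIsSafe) {
      // Between the test above and the write below, another thread
      // can clear arrayIsSafe, allocate a bigger array, and copy the
      // old contents across...
      items[top++] = item;    // ...so this write can land in the old,
                              // abandoned array and the push is lost.
    }
  }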

Second, the heap manager has to support many threads allocating memory at once and probably uses some locking methodology internally. So although we now have a super-cool lock-free stack, we're using a locking heap manager inside. There's not a lot we can do about the garbage collector either, but at least that's an equal-opportunity system: all .NET threads are going to be suspended during a collection.

In my article on implementing a simple priority queue with queues, I talked about making it lock-free. Well, this article goes a little way towards that goal ("Uh, this is all about a stack, dummy, where's the queue? Let alone where's the priority queue?"), but we've a way to go yet. See you next time, when we'll talk about the lock-free queue.

Here's a bunch of links to tie all this lock-free stuff together.