Writing a parser for CSV data

published: Mon, 11-Jul-2005   |   updated: Thu, 27-Oct-2005

(Note: this article was published in an earlier form on Falafel's web site. It has been expanded and the code rewritten for this version.)

Introduction

One of the simplest parsing jobs we can undertake is to read a CSV (comma separated values) file as a set of records, and to retrieve the individual fields in each record.

For those who haven't had to deal with these types of files before, a CSV file is a simple text file. In general, each line in the file is a separate record, and each record is partitioned into a set of fields, each field separated from its neighbor by a comma (hence, comma separated values). Everything is plain text: no control characters apart from line breaks, just simple single-byte characters (like ASCII). Many programs (Excel, for example) can produce CSV files; they're a common, simple method of transferring data between applications. Unlike XML though (another, fairly simple, data exchange format), there is no way to define the character set being used in the CSV file, so if you do have to parse CSV files, check which character set the data source is using. The basic rules of the format are these:

  • Each record is usually delimited by a line break, such as a line feed character (0x0A) or a carriage return/line feed pair (0x0D, 0x0A). Note though that the definition of the CSV format allows for a field within a record to contain line breaks (we'll see how in a moment).
  • All fields are strings. (If a field value represents a different data type, such as an integer value, it is up to the consumer of the CSV file to convert it.)
  • Fields within a record are delimited with commas.
  • Any spaces adjacent to a comma separator are ignored.
  • Field values containing spaces, or that have significant leading or trailing spaces must be enclosed in double quotes.
  • Field values containing line breaks must be enclosed in double quotes.
  • Field values containing commas must be enclosed in double quotes.
  • Field values containing double quotes must have each double quote character escaped by doubling it.
  • If a field is enclosed in double quotes, the quotes will be discarded when the field is read.
  • Fields can be empty or null by having no data between the commas.

Here is an example that illustrates many of these rules in a CSV record.

"boyet.com", 48 , ,"Saturday, April 23, 2005", "Mack ""The Knife"""

There are 5 fields in this record, and all are strings (even the one that looks like a number will be returned as a string):

boyet.com
48
<null>
Saturday, April 23, 2005
Mack "The Knife"

Notice that the enclosing double quotes have been stripped from the first field (they weren't really needed in the first place), that the spaces around the second field have been stripped, that the third field is missing (indicated here by <null>), that the fourth field still contains its commas, and that the fifth field's escaped double quotes have been reduced to single ones.

So, how would we read such a record? The first thing we should do is think about the actual format of the CSV record. We should, as the computer scientists would have us say, write down the grammar for CSV records. A grammar is merely a description of how the various pieces in our language (the CSV record) are put together, and what makes sense and what doesn't. (The computer scientists say that the grammar generates the language.) The grammar is described above in our list of what works in a CSV file and what doesn't (the requirements for a CSV file, if you like), but it's better to put it in a more formal way.

CSV grammar

The first grammar rule for a record is the one about comma delimiters between fields. If you like, a record is either a single field, or is a field, followed by a comma, followed by another field, followed by a comma, and so on... If we invent the notion of a "CSV string list" we could write

  • A CSV file is a set of zero or more records, followed by the end of file.
  • A record is a CSV string list followed by a line break.
  • A CSV string list is a raw string, or is a raw string followed by a comma followed by a CSV string list.

Seems pretty obvious when you see it now, although it's a little tricky/clever defining a CSV string list recursively like that (otherwise we'd have to resort to a definition that tried to encapsulate the "many fields, many commas" requirement with ellipses or something). So, now we have to define a raw string.

  • A raw string consists of optional spaces, or optional spaces followed by a raw field followed by optional spaces.

OK. Pretty good. So, what are optional spaces?

  • Optional spaces consist of zero or more whitespace characters.
  • A whitespace character is either space (0x20) or tab (0x09).

That last rule is complete: we've defined it perfectly. In fact, it's known as a terminal rule since we cannot subdivide it further. Next up is the definition of a raw field, and the further definitions that it requires:

  • A raw field is either a simple field or a quoted field.
  • A simple field consists of a series of one or more 8-bit characters. None of these can be space, tab, newline, comma, or double quote.
  • A quoted field consists of a double quote, followed by an escaped field, followed by a double quote.
  • An escaped field consists of a subfield, or a subfield followed by two double quotes followed by an escaped field.
  • A subfield is any series of one or more characters not including a double quote.

In that set of definitions we have a few terminal rules as well as another recursive definition. However, looking back at the complete set of rules, you get the feeling that it's still all a little too verbose.

Usually, when you define a grammar like this you use a special shorthand. One of the better known shorthands is BNF (Backus Naur Form). Here's the same rules expressed in BNF.

csvFile ::= (csvRecord)* 'EOF'
csvRecord ::= csvStringList ('\n' | 'EOF')
csvStringList ::= rawString [',' csvStringList]
rawString ::= optionalSpaces [rawField optionalSpaces]
optionalSpaces ::= whitespace*
whitespace ::= ' ' | '\t'
rawField ::= simpleField | quotedField 
simpleField ::= (any char except \n, EOF, \t, space, comma or double quote)+
quotedField ::= '"' escapedField '"'
escapedField ::= subField ['"' '"' escapedField]
subField ::= (any char except double quote or EOF)+

From reading this, I'm sure you can infer that the ::= operator means "is defined as"; | means OR; * means zero or more repetitions; and + means one or more (like regular expressions, in fact). One you may not know is an expression in square brackets: the part between the brackets occurs zero or one times, or, if you like, it's optional. Parentheses group items together, and an English description in parentheses (such as "any char except double quote or EOF") just saves us from having to list every possibility one by one. 'EOF' is end of file, a pseudo character.

In general, we try to remove the recursive rules in our grammar, those where the definition of a rule involves itself. An example above is csvStringList: it's defined as a rawString optionally followed by a comma followed by a csvStringList again. We can easily remove the recursion (it will, in fact, be converted to a while loop, as we'll see) using the following technique.

Suppose the rule is "a is defined as b, followed optionally by c followed by a again":

a ::= b [c a]

This can be rewritten as

a ::= b | b c a

In other words, a is either just b on its own, or it's b followed by c followed by a.

Expand it one level, by substituting for a:

a ::= b | b c b | b c b c a

And again, to see the pattern:

a ::= b | b c b | b c b c b | b c b c b c a

As I'm sure you can now see, this can be more tersely written as

a ::= b (c b)*

In other words, a is defined as b followed by zero or more sequences of (c b). Using this technique we can now rewrite a couple of the rules:

csvStringList ::= rawString (',' rawString)*
escapedField ::= subField ('"' '"' subField)*
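
Just to make the pattern concrete before we tackle the real thing, here's a tiny throwaway parser of my own (it isn't part of the CSV code; DigitListParser is purely an illustration) for the rule digitList ::= digit (',' digit)*. Notice how the (',' digit)* part becomes a while loop:

public class DigitListParser {
  private string s;
  private int pos;

  public DigitListParser(string s) {
    this.s = s;
    pos = 0;
  }

  // the "b" rule: a single digit
  private bool parseDigit() {
    if ((pos < s.Length) && char.IsDigit(s[pos])) {
      pos++;
      return true;
    }
    return false;
  }

  // digitList ::= digit (',' digit)*
  public bool ParseList() {
    if (!parseDigit())                            // the mandatory leading digit
      return false;
    while ((pos < s.Length) && (s[pos] == ',')) { // zero or more (',' digit) pairs
      pos++;                                      // consume the comma
      if (!parseDigit())                          // ...and the digit that follows
        return false;
    }
    return (pos == s.Length);                     // valid only if we used all the input
  }
}

The methods we write shortly for csvStringList and escapedField have exactly this shape.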

Translating the grammar to code

The amazing thing about BNF is that it's pretty easy to translate it into code. Let's see how.

What we do is to convert the definitions into separate methods on a class. Let's assume that the class is called CsvParser. Now take the first definition. We write a method of CsvParser called parseCsvFile(). What's the implementation of parseCsvFile()? Well, first it needs to test if the end of file has been reached (in which case the file is empty and there are no records). If not it needs to call a method called parseCsvRecord() to parse out the next record. After the call to parseCsvRecord, it checks for the end of file again. This continues until end of file is reached. This is what I wanted to write:

private void parseCsvFile(TextReader reader) {
  while (reader.Peek() != EOF) {
    parseCsvRecord(reader);
  }
}

This causes me one problem, though it only really shows up deeper in the parser: although TextReader does have a Peek() method, it gives me no way to push a character back once I've Read() it. As we'll see, the parser regularly reads a character, discovers it's not the one it's looking for (it belongs to the first field of the next record, say), and needs to stuff it back somewhere. I can't do that with a TextReader.

No, we're going to have to bite the bullet and write a more flexible character reader class into which we can push back characters we don't need and from which we can peek at the next character. In our case, the maximum number of characters we'd like to peek or push back at any one time will be just the one. "Read the next character. Is it one of those we're looking for? No. Oops, let's push it back."

I decided to implement this via an interface, ICharTokenizer, with a first implementation being a class that gets its data from a string. The following code shows the interface and that class. As its name suggests, the interface returns characters as individual tokens.

public interface ICharTokenizer {
  char Peek();
  char Read();
  void Unread(char c);
}  

public class StringCharTokenizer : ICharTokenizer {
  private string s;
  private int index;
  private bool haveUnreadChar;
  private char unreadChar;

  public StringCharTokenizer(string s) {
    this.s = s;
    index = 0;
    haveUnreadChar = false;
  }

  // if the current character is the CR of a CR/LF pair, skip it
  // so that the pair gets read as a single '\n'
  private void skipCrInCrLf() {
    if ((s[index] == '\r') &&
        (index + 1 < s.Length) &&
        (s[index + 1] == '\n'))
      index++;
  }

  // map a CR to LF so that callers only ever see '\n' for a line ending
  private char mapCrToLf(char c) {
    if (c == '\r')
      return '\n';
    return c;
  }

  public char Peek() {
    if (haveUnreadChar)
      return unreadChar;
    if (index < s.Length)
      return mapCrToLf(s[index]);
    return CsvConstants.EOF;
  }

  public char Read() {
    if (haveUnreadChar) {
      haveUnreadChar = false;
      return unreadChar;
    }
    if (index < s.Length) {
      skipCrInCrLf();
      return mapCrToLf(s[index++]);
    }
    return CsvConstants.EOF;
  }

  public void Unread(char c) {
    if (haveUnreadChar) {
      throw new CharTokenizerException(
         "Unread() cannot accept more than one pushed back character");
    }
    haveUnreadChar = true;
    unreadChar = c;
  }
}

It's a little messier than you'd imagine because I'm doing some conversions on the fly for line endings. In the UNIX world the end of a text line is \n; in the Windows world it's \r\n; and in the Mac world it's \r. What's happening in Peek() and Read() is that all three of these possibilities get converted to \n, which makes our parser easier to write.
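
By the way, the tokenizer refers to CsvConstants.EOF and to a CharTokenizerException class, neither of which is listed in this article (both come with the downloadable source mentioned at the end). If you're typing the code in as you read, minimal versions would look something like this; the actual EOF value in the real code may well differ, the only requirement being that it's a character that can never appear in the data:

public class CsvConstants {
  // pseudo end-of-file character (assumed value; see note above)
  public const char EOF = char.MaxValue;
}

public class CharTokenizerException : System.Exception {
  public CharTokenizerException(string message) : base(message) { }
}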

So the parseCsvFile() method becomes:

private void parseCsvFile(ICharTokenizer reader) {
  while (reader.Peek() != CsvConstants.EOF) {
    parseCsvRecord(reader);
  }
}

Next in the grammar is a record; we implement this in the obvious way:

private void parseCsvRecord(ICharTokenizer reader) {
  parseCsvStringList(reader);
  char ch = reader.Read();
  if (ch == CsvConstants.EOF) {
    reader.Unread(ch);
    ch = '\n';
  }
  if (ch != '\n') {
    throw new CsvParserTooMuchDataException(
       "End of record was expected but more data exists.");
  }
}

Here we parse a string list by calling parseCsvStringList() and then we read the next character, which we're expecting to be a newline. (End of file is fine too: we push the EOF character back, so that parseCsvFile() will see it, and then treat it as if it were a newline.) If the character is anything else, the data in the file is invalid and so we signal it by throwing an exception. The next method to write then is obviously parseCsvStringList():

private void parseCsvStringList(ICharTokenizer reader) {
  char ch;
  do {
    parseRawString(reader);
    ch = reader.Read();
  } while (ch == ',');
  reader.Unread(ch);
}

In this method we parse a raw string, and then we look to see if the next character is a comma. If it is, great, we parse another raw string, and continue like this in a loop until we hit a character that's not a comma. At that point we have to push the not-a-comma character back into the tokenizer.

Your turn. Think about how you'd write the next parsing method on the list, parseRawString(), and try it out before coming back and comparing it to my answer.

private bool isFieldTerminator(char c) {
  return ((c == ',') || (c == '\n') || (c == CsvConstants.EOF));
}

private bool isSpace(char c) {
  return ((c == ' ') || (c == '\t'));
}

private void parseOptionalSpaces(ICharTokenizer reader) {
  char ch;
  do {  
    ch = reader.Read();
  } while (isSpace(ch));
  reader.Unread(ch);
}

private void parseRawString(ICharTokenizer reader) {
  parseOptionalSpaces(reader);
  parseRawField(reader);
  if (!isFieldTerminator(reader.Peek()))
    parseOptionalSpaces(reader);
}

The parseRawField() method poses a small problem: how do we tell whether a raw field is a simple field, a quoted field, or just not there at all? (Check the grammar: a raw string must comprise at least an optional set of spaces; the remainder of its definition is also optional. In other words, the raw field part could be missing entirely.) So parseRawField() must first check that the initial character isn't one that terminates a field; if it is, the field is empty.

Here's the implementation of parseRawField(). What I've assumed here is that parseRawField() will want to get the field value as a string, and will do something with it, eventually.

private void parseRawField(ICharTokenizer reader) {
  string fieldValue = String.Empty;

  char ch = reader.Peek();
  if (!isFieldTerminator(ch)) {
    if (ch == '"')
      fieldValue = parseQuotedField(reader);
    else 
      fieldValue = parseSimpleField(reader);
  }
}

The rest of the code should be fairly obvious by now. Writing it is just a case of methodically converting the grammar to code. Don't get too fancy, and don't worry about performance at this stage. What we want is a working parser; once we have that, we'll have plenty of time and tests to tweak it and validate it.

private string parseQuotedField(ICharTokenizer reader) {
  reader.Read(); // read and discard initial quote

  string field = parseEscapedField(reader);

  char ch = reader.Read();
  if (ch != '"') {
    reader.Unread(ch);
    throw new CsvParserNoTermQuoteException(
        "Quoted field has no terminating double quote");
  }
  return field;
}

private string parseEscapedField(ICharTokenizer reader) {
  StringBuilder sb = new StringBuilder();

  parseSubField(reader, sb);
  char ch = reader.Read();
  while (processDoubleQuote(reader, ch)) {
    sb.Append('"');
    parseSubField(reader, sb);
    ch = reader.Read();
  }
  reader.Unread(ch);

  return sb.ToString();
}

private void parseSubField(ICharTokenizer reader, StringBuilder sb) {
  char ch = reader.Read();
  while ((ch != '"') && (ch != CsvConstants.EOF)) {
    sb.Append(ch);
    ch = reader.Read();
  }
  reader.Unread(ch);
}

private bool isBadSimpleFieldChar(char c) {
  return isSpace(c) || isFieldTerminator(c) || (c == '"');
}

private string parseSimpleField(ICharTokenizer reader) {

  char ch = reader.Read();
  if (isBadSimpleFieldChar(ch)) {
    reader.Unread(ch);
    return String.Empty;
  }

  StringBuilder sb = new StringBuilder();
  sb.Append(ch);
  ch = reader.Read();
  while (!isBadSimpleFieldChar(ch)) {
    sb.Append(ch);
    ch = reader.Read();
  }
  reader.Unread(ch);

  return sb.ToString();
}

private bool processDoubleQuote(ICharTokenizer reader, char ch) {
  if ((ch == '"') && (reader.Peek() == '"')) {
    reader.Read(); // discard second quote of double
    return true;
  }
  return false;
}

Of course, having written all that private code, we now need a method to start the whole parser off. It's simplicity itself:

public void Parse(ICharTokenizer reader) {
  parseCsvFile(reader);
}

Getting the output from the parser

At this stage of the game what do we have? Well, we have a working character tokenizer and we have a parser class that can parse a CSV file through the tokenizer into the field values. The parser has the required error checking as well, throwing an exception should the data be invalid in some way.

What we don't have is something that can take the field values found by the parser and do something with them (like write them out to the console, or put them in a grid, for example). Since there could be many possible uses of the field values, it doesn't make sense to jam that knowledge into the parser. It's pretty clean at the moment.

What we should do is to create an interface that can consume the data and state information produced by the parser, and pass an instance of it to the parser. That way, all the parser knows regarding the consumer of the data it finds is the interface; it doesn't know anything concrete about who or what is using the data. Nicely decoupled code, in other words.

What kind of data and information would the consumer like to know, do you think? I can think of a few items: the individual field values, obviously; when an end of record is encountered, so that the previous record's fields can be processed as a set; and when the end of file is reached. It's unlikely that knowing when the field separators are found is important, and it would be downright idiotic if the discarded whitespace were needed. We can write the interface like this:

public interface ICsvConsumer {
  void ConsumeField(string s);
  void SignalEndOfRecord();
  void SignalEndOfFile();
}

A simple example consumer class is one that writes everything to the console:

public class CsvConsumerConsole : ICsvConsumer {

  public void ConsumeField(string s) {
    Console.WriteLine('[' + s + ']');
  }

  public void SignalEndOfRecord() {
    Console.WriteLine("[end of record]");
  }

  public void SignalEndOfFile() {
    Console.WriteLine("[end of file]");
  }
}

We would then pass an implementation of this interface to the parse methods that need it, like this:

public void Parse(ICharTokenizer reader, ICsvConsumer consumer) {
  parseCsvFile(reader, consumer);
}

private void parseCsvFile(ICharTokenizer reader, ICsvConsumer consumer) {
  while (reader.Peek() != CsvConstants.EOF) {
    parseCsvRecord(reader, consumer);
  }
  consumer.SignalEndOfFile();
}

private void parseCsvRecord(ICharTokenizer reader, ICsvConsumer consumer) {
  parseCsvStringList(reader, consumer);
  char ch = reader.Read();
  if (ch == CsvConstants.EOF) {
    reader.Unread(ch);
    ch = '\n';
  }
  if (ch != '\n') {
    throw new CsvParserTooMuchDataException(
        "End of record was expected but more data exists.");
  }
  consumer.SignalEndOfRecord();
}

private void parseCsvStringList(ICharTokenizer reader, ICsvConsumer consumer) {
  char ch;
  do {
    parseRawString(reader, consumer);
    ch = reader.Read();
  } while (ch == ',');
  reader.Unread(ch);
}
  
private void parseRawString(ICharTokenizer reader, ICsvConsumer consumer) {
  parseOptionalSpaces(reader);
  parseRawField(reader, consumer);
  if (!isFieldTerminator(reader.Peek()))
    parseOptionalSpaces(reader);
}

private void parseRawField(ICharTokenizer reader, ICsvConsumer consumer) {
  string fieldValue = String.Empty;

  char ch = reader.Peek();
  if (!isFieldTerminator(ch)) {
    if (ch == '"')
      fieldValue = parseQuotedField(reader);
    else 
      fieldValue = parseSimpleField(reader);
  }
  consumer.ConsumeField(fieldValue);
}
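
At this point it's worth seeing the pieces working together. Here's a small driver of my own (the Program class below isn't part of the article's code) that pushes the example record from the introduction through the parser, with CsvConsumerConsole reporting what's found:

public class Program {
  public static void Main() {
    CsvParser parser = new CsvParser();
    StringCharTokenizer reader = new StringCharTokenizer(
      "\"boyet.com\", 48 , ,\"Saturday, April 23, 2005\", \"Mack \"\"The Knife\"\"\"");

    parser.Parse(reader, new CsvConsumerConsole());
    // expected output:
    //   [boyet.com]
    //   [48]
    //   []
    //   [Saturday, April 23, 2005]
    //   [Mack "The Knife"]
    //   [end of record]
    //   [end of file]
  }
}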

Actually this design gives us a major benefit. So far, we've been a little remiss: we have not written a whole lot of code to test the parser. What we would like to do is to write some unit test code with NUnit (a well-known .NET unit testing framework) to thoroughly check our code. Since all unit tests have to be run in an automated fashion, often and regularly (in other words, we cannot rely on manual checking of outputs from inputs), we need to be able to write code that checks that the field values found by our parser are in fact the ones we're expecting.

...And the simplest way to do that, now we've defined the ICsvConsumer interface, is to write a simple consumer class that does the testing for us. The one I came up with was this one:

public class CsvConsumerTester : ICsvConsumer {

  private string[] expectedFields;
  private int fieldIndex;

  public CsvConsumerTester(params string[] expectedFields) {
    this.expectedFields = expectedFields;
  }

  public void ConsumeField(string s) {
    if (expectedFields[fieldIndex] != s) {
      throw new AssertionException(
         string.Format("field [{0}] expected, but [{1}] returned", 
         expectedFields[fieldIndex], s));
    }
    fieldIndex++;
  }

  public void SignalEndOfRecord() {
    if (expectedFields[fieldIndex++] != "eor") {
      throw new AssertionException(
          "End of record signalled but not expected");
    }
  }

  public void SignalEndOfFile() {
    if (expectedFields[fieldIndex++] != "eof") {
      throw new AssertionException(
          "End of file signalled but not expected");
    }
  }
}

As you can see, we create an instance of this class by passing in a string array of the field values we expect to find, with the strings "eor" and "eof" marking the points where we expect the end-of-record and end-of-file signals. The parser will call the interface's methods when it needs to, and this class will check whether the values and signals found were the ones expected.

Here's a typical test case using this class.

[Test]
public void TestSingleRecordManyFields() {
  CsvParser cp = new CsvParser();
  StringCharTokenizer sct = new StringCharTokenizer("a,b,c");
  CsvConsumerTester cct = new CsvConsumerTester("a", "b", "c", "eor", "eof");

  cp.Parse(sct, cct);
}
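
Another test I'd be tempted to add (again, this one is mine rather than one from the downloadable test suite) checks that a line break in the data produces an end-of-record signal partway through:

[Test]
public void TestTwoRecords() {
  CsvParser cp = new CsvParser();
  StringCharTokenizer sct = new StringCharTokenizer("a,b\nc,d");
  CsvConsumerTester cct = new CsvConsumerTester(
    "a", "b", "eor", "c", "d", "eor", "eof");

  cp.Parse(sct, cct);
}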

Summary

You can get the entire source code and NUnit tests from here.

Now that we are at the end of the article, what have we learnt? We've learnt that writing parsers can be fairly easy, providing that we take the time to devise a valid grammar for the text we want to parse. Sometimes writing the grammar is a non-trivial exercise, but once it's written in a BNF format, we can translate it into code without really breaking into a sweat. On the way we discovered a couple of techniques that can isolate the parsing part from the input part (the tokenizer) and the output part (the consumer), leading to less coupled code that is easier to write, understand, test, and maintain.