Simple pattern matching

published: Mon, 15-Sep-2003   |   updated: Thu, 27-Oct-2005

The regular expression classes in the .NET Framework are all very well, but something you need something a little simpler to match a pattern to a string. Especially if your users are not wizards at constructing regular expressions.

There is a simpler language for defining patterns, wildcards, one that we've used from the old DOS days: a ? for a single character, and a * for zero or more characters. Any other character is counted as a literal. For example, the pattern "f?o*" will match the strings "foo", "food", and "flows", but not "goo" or "fob".

As it happens matching with such a pattern language is very simple. You take the pattern and the string to be matched, and start comparing a character from one to a character from the other. As you move through the strings, you'll tend to advance further through the match string than through the pattern, so let's assume that we are tracking the current index through each string.

For a literal character in the pattern at its current index, the match string must have the same character at its index. If they do, advance both indexes by one and try again. If they don't, we failed, of course, but more on that later.

For a ? character at the current index in the pattern, it doesn't matter what's at the current index in the match string, it matches. So, we advance both indexes on by one, and try again.

For a * character at the current index in the pattern, things get more interesting. Since * means "zero or more matches of any character", we could advance the pattern's index on by one and leave the match string's index alone and try again. Or, we could advance the match string's index by one, leave the pattern alone and try again.

All this description seems to indicate that we write a recursive method as follows:

public static bool DoMatch(string pattern, int patternIndex, string s, int sIndex) {
  switch (pattern[patternIndex]) {
    case '?' :
      return DoMatch(pattern, patternIndex + 1, s, sIndex + 1);
    case '*' : 
      bool testMatch = DoMatch(pattern, patternIndex + 1, s, sIndex);
      if (!testMatch) 
        testMatch = DoMatch(pattern, patternIndex, s, sIndex + 1);
      return testMatch;
    default : 
      if (pattern[patternIndex] != s[sIndex]) 
        return false;
      return DoMatch(pattern, patternIndex + 1, s, sIndex + 1);
  }
}

No, don't run it just yet! The reason for the admonishment is that this is a recursive method and we haven't taken any steps to avoid runaway recursion.

To avoid recursion we need to check that the pattern or the match string (or both) haven't been exhausted, that we haven't checked all the characters.

Let's suppose when the method is called deep in the recursion that the match string index is greater than the length of the string (in other words, the previous call to the method checked the final character in the string). If the pattern index is also greater than the length of the pattern, we can return true: the pattern and the match string were exhausted at the same time, therefore we had a positive match.

If we still have one or more characters left in the pattern, we have to check to see if they are all * characters. If there is another character altogether, we know we shall have to match it exactly to a character in the match string, but we're assuming that there are no others left. Here's that method:

private static bool CheckAllAsterisks(string s, int index) {
  for (int i = index; i < s.Length; i++) 
    if (s[i] != '*') 
      return false;
  return true;
}

Now, having looked at all the cases for when the match string is exhausted, we can assume that it is not and look at the cases for when the pattern string is exhausted. This is simple in the extreme: if the pattern is exhausted but the match string is not, they don't match.

The final DoMatch() method looks like this:

private static bool DoMatch(string pattern, int patternIndex, string s, int sIndex) {
  if (sIndex == s.Length) {
    if (patternIndex == pattern.Length) 
      return true;
    return CheckAllAsterisks(pattern, patternIndex);
  }

  if (patternIndex == pattern.Length) 
    return false;

  switch (pattern[patternIndex]) {
    case '?' :
      return DoMatch(pattern, patternIndex + 1, s, sIndex + 1);
    case '*' : 
      if (patternIndex == pattern.Length - 1)
        return true;
      bool testMatch = DoMatch(pattern, patternIndex + 1, s, sIndex);
      if (!testMatch) 
        testMatch = DoMatch(pattern, patternIndex, s, sIndex + 1);
      return testMatch;
    default : 
      if (pattern[patternIndex] != s[sIndex]) 
        return false;
      return DoMatch(pattern, patternIndex + 1, s, sIndex + 1);
  }
}

Notice that I also added a small optimization to check whether the last character in the pattern was a * character. If so, we know that the pattern will match the string, no matter what other characters are present in the string. I also wrote a calling method to start it all off (which explains the private access modifier above):

public static bool MatchString(string pattern, string s) {
  return DoMatch(pattern, 0, s, 0);
}