Regular Expression: Negative Lookahead Part II

When we left off, I had set out to replace any ampersand outside a CDATA segment with the XML-encoded version. Instead of ditching the regex approach and using a lenient DOM parser, I chose to escalate the ugly hack even further! The helpful folks over at Stack Overflow had a ready-made solution:

    &(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)(?!(?>(?:(?!<!\[CDATA\[|\]\]>).)*)\]\]>)

I think this is self-explanatory. See you next time!

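OK, maybe not. I don't know about you, but I certainly didn't grok that when I first saw it, or even an hour later. I think it helps to start with a simplified example. Instead of CDATA, let's say that the start token is "a", and the end token is "b". In the sample below, the three ampersands in the first group sit outside any "a" ... "b" segment and should match. The four in the second group sit inside one (a segment may span lines) and should be left alone: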

 & 
 & a b
 a b &

 a & b a
 &
 b
 a & & b

My idea was to match any "&" unless it is followed by a "b" with no "a" in between. The smallest regex I could come up with for this case is:

    &(?![^a]*b)

The only piece that might be news to the regex journeyman is the "(?!" syntax. This is a negative lookahead. In other words, this regex won't match an ampersand if it is followed by a match for "[^a]*b". It works like a charm, but it's cheating. The solution is so simple because we're leaning heavily on a hidden assumption: that the start token is exactly one character.
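To see the toy regex in action, here is a minimal sketch in Python (my choice for illustration; the names OUTSIDE_AMP and escape_outside are made up for this example). It runs the pattern over the sample text from above and escapes only the ampersands that match.

```python
import re

# Toy tokens from the example: "a" opens a protected segment, "b" closes it.
# Match "&" unless a "b" follows with no "a" in between.
OUTSIDE_AMP = re.compile(r"&(?![^a]*b)")

# The sample text from above, as one multi-line string.
sample = " & \n & a b\n a b &\n\n a & b a\n &\n b\n a & & b"

def escape_outside(text):
    """Replace every ampersand the toy regex matches with &amp;."""
    return OUTSIDE_AMP.sub("&amp;", text)

print(len(OUTSIDE_AMP.findall(sample)))  # 3
print(escape_outside(sample))
```

Only the first three ampersands are rewritten. Note that "[^a]" happily matches newlines, which is why the ampersand sitting on its own line between an "a" and a "b" is still treated as inside the segment. The pattern also assumes the tokens are balanced: a stray "b" with no opening "a" would shield any ampersand before it.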

The "[^a]*" means any run of characters, none of which is "a". A character class can only exclude single characters, though. When you try to extend this to literal strings (such as "<![CDATA["), you will find that the only way to negate a literal is the aforementioned "(?!" syntax. Also, you must escape certain characters (here, the square brackets) so they are evaluated as literals:

    (?!<!\[CDATA\[)
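A quick way to get a feel for this clause on its own is to poke at it in Python. This is just a sketch; the variable name not_cdata_start is mine. It shows that the clause succeeds anywhere the literal does not begin, fails exactly where it does begin, and never consumes any text either way.

```python
import re

# Negative lookahead for the literal start token; each "[" is escaped.
not_cdata_start = re.compile(r"(?!<!\[CDATA\[)")

# Succeeds where the literal does not begin, but the match is empty.
m = not_cdata_start.match("plain text")
print(m is not None, m.end())  # True 0

# Fails right where the literal begins.
print(not_cdata_start.match("<![CDATA[ & ]]>"))  # None
```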

The end token, our "b", is simply the escaped literal "\]\]>". If we substitute both tokens into the simple regex, we get:

    &(?!(?!<!\[CDATA\[)*\]\]>)

Now we have two lookaround clauses, which is confusing. But the bigger problem is that this doesn't actually work. This is because the (?!) clause itself isn't matching anything; it's a zero-width token. To actually advance through the text, we need to throw a good old "." in there. But we don't want the expanding match to run past the start token, so the lookahead has to be checked before every character consumed. We wrap the lookahead and the "." in their own group and put the "*" on that group:

    &(?!((?!<!\[CDATA\[).)*\]\]>)

Note: this regex requires the "dot matches newline" option (often called DOTALL or single-line mode) to be turned on. This is not the default. Alternatively, you can replace the lone "." in the regex with "(.|\s)".
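Here is a minimal sketch of the shorter regex doing the actual replacement, assuming Python's re module, where the "dot matches newline" option is spelled re.DOTALL. The function name escape_ampersands and the sample document are made up for illustration.

```python
import re

# Match "&" unless "]]>" follows with no "<![CDATA[" in between.
# re.DOTALL lets "." cross line breaks inside a multi-line CDATA segment.
AMP_OUTSIDE_CDATA = re.compile(r"&(?!((?!<!\[CDATA\[).)*\]\]>)", re.DOTALL)

def escape_ampersands(xml_text):
    """Escape ampersands that are not inside a CDATA segment."""
    return AMP_OUTSIDE_CDATA.sub("&amp;", xml_text)

doc = "<a>Tom & Jerry</a>\n<b><![CDATA[R&D\n& more]]></b>\n<c>Q&A</c>"
print(escape_ampersands(doc))
# <a>Tom &amp; Jerry</a>
# <b><![CDATA[R&D
# & more]]></b>
# <c>Q&amp;A</c>
```

Without the flag, the "." stops at the line break after "R&D", the lookahead never reaches "]]>", and that ampersand gets escaped even though it sits inside the CDATA segment.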

That's the basic idea, anyway. The Stack Overflow regex has some other stuff in it. It starts with "&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)", which is an attempt to not match ampersands that are already part of an XML entity such as "&amp;" or "&#38;". That's not necessary for my case, but nice to have. Their CDATA clause is also slightly different: it uses non-capturing and atomic groups, and its inner lookahead also stops at "]]>". But it's all just optimizations. Maybe we can explore this in a later post. For my part, I'll stick with the shorter, slightly more readable version at the cost of performance.
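If you do want the entity check, it can be bolted onto the shorter version. The sketch below is my own combination, not the Stack Overflow regex verbatim: it keeps their first lookahead and swaps in the simpler CDATA clause from this post. It assumes Python's re module, and the names BARE_AMP and escape_bare_ampersands are hypothetical.

```python
import re

BARE_AMP = re.compile(
    # Not already an entity such as &amp; or &#38;
    r"&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\d+);)"
    # Not inside a CDATA segment (the shorter clause from this post)
    r"(?!((?!<!\[CDATA\[).)*\]\]>)",
    re.DOTALL,
)

def escape_bare_ampersands(text):
    """Escape ampersands that are neither entities nor inside CDATA."""
    return BARE_AMP.sub("&amp;", text)

doc = "AT&T &amp; co &#38; <![CDATA[a & b]]> x & y"
print(escape_bare_ampersands(doc))
# AT&amp;T &amp; co &#38; <![CDATA[a & b]]> x &amp; y
```

One caveat with the entity check as written: "#\d+" only covers decimal character references, so a hexadecimal one such as "&#x26;" would still be treated as a bare ampersand.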

Still no word from HotJobs on a fix to actually return valid XML from their web service.

This post made possible by the excellent utility RegexBuddy. Highly recommended.

Chase Seibert