
Replacing regexps

By Jonathan Corbet
March 2, 2011
Regular expressions are a pain. Their power cannot be doubted; a regular expression can describe complicated text patterns in an exceedingly concise manner. Using regular expressions, a program can perform all kinds of string parsing and recognition tasks. But they are also difficult to write, difficult to read, difficult to understand, and difficult to debug. Any but the most trivial of regular expressions are quite likely to contain errors. So it is not surprising that developers would think about replacing them with something better. But, as a recent discussion in the Python community shows, that replacement, like regular expressions themselves, may be difficult.

The compactness of the regular expression syntax is part of their power, but also part of the problem. Consider even a very simple expression:

    <A\b[^>]*>(.*)</A>

A reader familiar with this syntax will recognize that this expression matches the HTML <A> tag and sets aside the anchor text for later processing. But even experienced regular expression developers must look at that expression for a moment and think about how the various metacharacters affect each other before being able to say for sure what it does. It takes even longer to notice the subtle bug: this expression will be confused by the presence of multiple <A> tags in the text being searched.
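
To make the subtle bug concrete, here is a short, purely illustrative session with Python's re module: the greedy (.*) spans from the first opening tag to the last closing tag, and the usual fix is a non-greedy (.*?).

    import re

    text = 'See <A href="/one">first</A> and <A href="/two">second</A>.'

    greedy = re.search(r'<A\b[^>]*>(.*)</A>', text)
    print(greedy.group(1))    # first</A> and <A href="/two">second

    lazy = re.search(r'<A\b[^>]*>(.*?)</A>', text)
    print(lazy.group(1))      # first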

So how might one do better? That was Mike Meyer's question as he sought a more "pythonic" way of doing text matching. Needless to say, he is not the first to ask that kind of question; there are a number of attempts at better string matching out there. The first of those is arguably not "pythonic" at all: it is SnoPy, a port of the venerable SNOBOL language to Python.

SNOBOL was developed during the 1960's; it included pattern matching as a core feature of the language. Unlike regular expressions, SNOBOL was anything but concise. Concatenation of strings was explicit, "[abc]" was "Any("abc")", and so on. Nonetheless, SNOBOL was highly influential in this area, and one can see echoes of the language in current regular expressions. That said, SNOBOL is not heavily used now, and the Python SNOBOL module seems to have suffered the same fate; its last release was in 2002.

Another approach is the rxb.py module by Ka-Ping Yee. This module, posted in 2005, creates a new, relatively verbose but relatively readable language for the creation of patterns. Using this language, the regular expression shown above would look something like:

    <A + any(wordchars + whitespace)> + label(1, anychars) + </A>

(Note to readers: the above is totally untested and should not be relied upon for production use). This module, too, has not seen a great deal of use.
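
For comparison, the standard re module's verbose mode gets part of the way toward readability by allowing whitespace and comments inside the pattern. A rough, untested rendering of the expression above:

    import re

    anchor = re.compile(r"""
        <A\b        # opening A tag
        [^>]*       # any attributes, up to the closing bracket
        >
        (.*?)       # the anchor text (non-greedy, to dodge the bug noted above)
        </A>
        """, re.VERBOSE)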

Various other packages are out there. For example, one can try to use Icon-style pattern matching with Python. For something completely different, there is the eGenix mxTextTools module, which allows the creation of text-matching programs in an assembly-like language, complete with goto constructs. mxTextTools is intimidating and not necessarily any easier to read than regular expressions, but it is said to be powerful and fast, and there are a number of real users.

Still, none of these seem likely to replace regular expressions as the first tool Python programmers reach for when they need to perform string matching. Python creator Guido van Rossum thinks things will stay that way:

I fear that regular expressions have this market cornered, and there isn't anything possible that is so much better that it'll drive them out.

Pushing aside an established incumbent is always hard, and regular expressions are well established indeed. It is never enough to simply be better in this situation; the proposed replacement has to be a lot better. As Guido noted, nothing seems to have come along which is that much better, and it may be that nothing ever will. For some medium-term value of "ever," anyway.

But, then, one also should not underestimate the ingenuity of free software developers. Or their persistence. People will almost certainly continue to throw themselves against this problem, and, maybe, somebody will come up with something interesting. Until then, we'll have to continue beating our heads against our desks as we try to figure out why our expressions don't work as intended.



Replacing regexps

Posted Mar 3, 2011 2:34 UTC (Thu) by tshow (subscriber, #6411) [Link] (6 responses)

This is kind of funny, actually, because one of the big things that chased me away from Python was regular expressions. Or rather, my difficulty in figuring out which of the several RE libraries was not yet deprecated.

If you're going to eat the cost of doing something in a scripting language, then (IMO) that language had better have:

- garbage collection in the core system
- robust string handling including a full RE suite in the core language
- lambda or some useful equivalent in the core language
- an associative array type in the core language
- a sane roadmap

Python fails on the RE suite and the sane roadmap. In fairness, they have a relatively sane roadmap, but they followed their GPS into the weeds trying to get to 3.0 and are currently on what my wife refers to as a "crop tour".

I've been using ruby instead, and enjoying it greatly. Maybe I'll try python again when the 3.0 transition is complete, but if they still haven't nailed down what they want to do for REs that may be a legacy my grandkids will have to fulfill for me.

Replacing regexps

Posted Mar 3, 2011 3:33 UTC (Thu) by foom (subscriber, #14868) [Link] (3 responses)

Oh wow...when's the last time you used python, 15 years ago?? The module "re" has been *the* *one* regexp API since Python 1.5, released end of 1997!

Replacing regexps

Posted Mar 4, 2011 18:10 UTC (Fri) by tshow (subscriber, #6411) [Link] (2 responses)

> Oh wow...when's the last time you used python, 15 years ago??

2006 or so, actually. But while "re" may well have been the "one" regexp API, the others are (were?) all still there in the docs. At least at the time, it wasn't particularly obvious from the docs which library was the canonical one.

Replacing regexps

Posted Mar 4, 2011 20:14 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

Are you sure you're talking about Python at all?

Python 2.4 was the current version until 2.5 was released at the end of 2006, so, looking at its docs:
http://docs.python.org/release/2.4/

The only regular expression library mentioned is "re".

You actually have to go all the way back to 1.6 to find a reference to the old regex module:
http://docs.python.org/release/1.5/lib/node69.html#SECTIO...
http://docs.python.org/release/1.6/lib/module-regex.html

which says rather clearly:
"""
Obsolescence note: This module is obsolete as of Python version 1.5; [...]. All new code in need of regular expressions should use the new re module
"""

Replacing regexps

Posted Mar 7, 2011 3:10 UTC (Mon) by tshow (subscriber, #6411) [Link]

Perhaps I was, then. I'd thought it was around 2006, and I clearly remember reference to at least 3 RE libraries with some confusion as to which one was canonical.

That said, in retrospect I may also have been relying on O'Reilly books that may not have been up to date at the time, so the fault may have been mine.

Replacing regexps - Python

Posted Mar 5, 2011 0:16 UTC (Sat) by giraffedata (guest, #1954) [Link] (1 responses)

If you're going to eat the cost of doing something in a scripting language,

How is Python a scripting language?

I don't write Python, but I've worked with a fair amount of it, and it looks about as script-like as typical C code.

Replacing regexps - Python

Posted Mar 7, 2011 3:06 UTC (Mon) by tshow (subscriber, #6411) [Link]

> How is Python a scripting language?

To be clearer: "If you're going to eat the runtime cost of an interpreted language..."

I tend to use "scripting language" and "interpreted language" interchangeably, which I'll admit is arguably bad form.

Replacing regexps

Posted Mar 3, 2011 2:41 UTC (Thu) by mrons (subscriber, #1751) [Link] (2 responses)

Of course the perl6 design process had a look at regexps as well. This is a good read:

http://dev.perl.org/perl6/doc/design/apo/A05.html

Replacing regexps

Posted Mar 4, 2011 9:50 UTC (Fri) by jezuch (subscriber, #52988) [Link] (1 responses)

Actually, the Perl6 regexps are no longer regexps... AFAIR they added a full-blown grammar support! How awesome is that? :D

Replacing regexps

Posted Mar 9, 2011 2:42 UTC (Wed) by jtc (guest, #6246) [Link]

"Perl6 ... they added a full-blown grammar support! How awesome is that?"

It's very awesome! It's at a level of about 98.975%, where 100% is maximum possible awesomeness.

Replacing regexps

Posted Mar 3, 2011 3:44 UTC (Thu) by dw (subscriber, #12017) [Link] (1 responses)

If performance isn't a primary concern, PyParsing is worth a gander. While its reuse of Python syntax can be a bit confusing, the result is quite readable even for more complex grammars.
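
For the curious, PyParsing's classic link-extraction recipe looks roughly like this (an untested sketch; makeHTMLTags takes care of the tag's attributes):

    from pyparsing import makeHTMLTags, SkipTo

    a_start, a_end = makeHTMLTags("A")
    link = a_start + SkipTo(a_end)("text") + a_end

    page = 'one <A href="/1">first</A> two <A href="/2">second</A>'
    for tokens, start, end in link.scanString(page):
        print(tokens.text)    # -> first, then second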

Replacing regexps

Posted Mar 4, 2011 11:45 UTC (Fri) by debacle (subscriber, #7114) [Link]

Yes, I can only confirm this. PyParsing is IMHO the most powerful and most Pythonic parsing library for parsing any kind of string. The last time I tried (about three years back), performance was terrible, however.

Replacing regexps

Posted Mar 3, 2011 4:29 UTC (Thu) by JoeBuck (subscriber, #2330) [Link] (4 responses)

I don't think that it's really a question of replacing regexps, the underlying mathematical construct. Instead, maybe we should think about alternative source representations that are easier to understand and maintain. These alternative representations would still be compiled into the same automata that regular expression libraries use. Traditional regexps would still be the best choice for simple cases, but for more complex cases the traditional regexp would play a role similar to assembly language.
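
As a rough illustration of that idea, a purely hypothetical set of helpers could assemble an ordinary re pattern and leave the matching to the familiar engine; none of these names come from an existing library.

    import re

    def literal(s):      return re.escape(s)
    def none_of(chars):  return "[^" + re.escape(chars) + "]"
    def repeated(expr):  return "(?:" + expr + ")*"
    def group(expr):     return "(" + expr + ")"
    def seq(*exprs):     return "".join(exprs)

    pattern = seq(
        literal("<A"),
        repeated(none_of(">")),
        literal(">"),
        group(repeated(none_of("<"))),
        literal("</A>"),
    )
    print(pattern)    # the generated "assembly language" regexp
    print(re.search(pattern, '<A href="/x">hello</A>').group(1))    # hello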

Replacing regexps

Posted Mar 3, 2011 5:25 UTC (Thu) by wahern (subscriber, #37304) [Link] (2 responses)

Maybe someone could create a Python output module for Ragel.

http://www.complang.org/ragel/

You can already inline Ragel code with C, C++, Objective-C, D, Java and Ruby source code. Ragel is a DFA-based regex compiler which, among other things, has a syntax for named expressions.

The example from the article would look something like:

    wordchars  = alnum | '"' | "'" | "=";
    whitespace = " " | [\t\n\r];
    text       = [^<]*;

    A_begin = "<A" . (wordchars|whitespace)* . ">";
    A_close = "</A>";

    match = A_begin . text . A_close;

Of course, pure regular expressions can't match recursive grammars like HTML. DFA implementations can be orders of magnitude faster than NFA implementations (like Python, et al), and provide guaranteed worst-case performance; but they can't offer all the extra bells and whistles other implementations provide. However, Ragel has other tricks up its sleeves, like state charts, and the ability to execute code or jump into and out of matching machines at arbitrary points. But using these features means you probably can't just map the representation to existing Python engines.
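
This is not a Ragel backend, but the named-sub-expression style can be approximated with the standard re module by composing pattern strings; in this untested sketch, \w stands in loosely for Ragel's alnum.

    import re

    wordchars  = r"""[\w"'=]"""      # rough stand-in for alnum | '"' | "'" | "="
    whitespace = r"[ \t\n\r]"
    text       = r"[^<]*"
    a_open     = r"<A(?:%s|%s)*>" % (wordchars, whitespace)
    a_close    = r"</A>"
    link       = re.compile("%s(%s)%s" % (a_open, text, a_close))

    m = link.search('see <A name=top>the anchor</A> here')
    print(m.group(1) if m else None)    # -> the anchor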

Replacing regexps

Posted Mar 3, 2011 23:59 UTC (Thu) by Simetrical (guest, #53439) [Link]

HTML isn't a recursive grammar. It doesn't have any grammar. As now implemented in the latest Firefox, Chrome/Safari, and Opera builds (IE will doubtless follow), its syntax is defined algorithmically:

http://www.whatwg.org/specs/web-apps/current-work/multipa...

Where by "algorithmically" I mean "by pseudocode measuring about 90 printed pages".

Maybe you were thinking about XML.

Replacing regexps

Posted Mar 5, 2011 16:54 UTC (Sat) by Tet (subscriber, #5433) [Link]

Maybe someone could create a Python output module for Ragel

Yes please. It's long been on my todo list, but given my current lack of free time, it looks like it'll stay there for some time.

Replacing regexps

Posted Mar 3, 2011 13:11 UTC (Thu) by nix (subscriber, #2304) [Link]

Yes indeed. Olin Shivers' SRE notation springs to mind. It probably doesn't fit well with any non-Lisp-like language, but it's wonderful to use with those. (Emacs's rx.el is quite compatible with it.)

Replacing regexps

Posted Mar 3, 2011 4:45 UTC (Thu) by jzbiciak (guest, #5246) [Link] (9 responses)

The example immediately reminded me of the warning that you can't correctly parse general (X)HTML with a regexp, because the language isn't regular.

Replacing regexps

Posted Mar 3, 2011 5:32 UTC (Thu) by wahern (subscriber, #37304) [Link] (6 responses)

You can with PEGs (Parsing Expression Grammars), and Python has at least one decent implementation--PyPEG--with a syntax almost like that suggested.

http://fdik.org/pyPEG/

pegs are the bee's knees

Posted Mar 3, 2011 16:34 UTC (Thu) by wingo (guest, #26929) [Link]

Yay PEGs! Everyone should use PEGs.

Sorry for the content-less pile-on but PEGs are pretty sweet and folks should check them out.

Replacing regexps

Posted Mar 4, 2011 0:02 UTC (Fri) by Simetrical (guest, #53439) [Link] (4 responses)

PyPEG seems to be only for parsing languages that can be described with BNF. It can presumably parse X(HT)ML, but HTML can't even remotely be described with BNF. You should use html5lib for parsing HTML from Python if you want it to work reliably on web pages, or if you want it to work the same way browsers do: http://code.google.com/p/html5lib/
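
A minimal sketch of what that looks like in practice; the default tree builder and namespacing shown here may vary between html5lib versions.

    import html5lib

    # html5lib builds the same tree a browser would, even from tag soup
    doc = html5lib.parse('<p>one <a href="/x">first<a href="/y">second')
    XHTML = '{http://www.w3.org/1999/xhtml}'
    for a in doc.iter(XHTML + 'a'):
        print(a.get('href'), a.text)    # /x first, then /y second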

Replacing regexps

Posted Mar 4, 2011 18:44 UTC (Fri) by wahern (subscriber, #37304) [Link] (3 responses)

I never understood the WHATWG's contempt for XML. It's one thing to create a standard with an eye to malformed content; it's another to be so contemptuous of XML. IIRC, in the beginning they specifically rejected an XHTML5 mode; now it's just the red-headed step child.

I'm not the world's biggest fan of XML as a use-everywhere markup language. But when it comes to the processing pipeline of web page generation, XML is a beautiful thing.

Replacing regexps

Posted Mar 4, 2011 20:19 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

Now that the parsing of HTML is actually fully specified, HTML is a perfectly good format for interchange in its own right. You can build and manipulate a node tree parsed from HTML in a processing pipeline (via html5lib) just as easily as if it were rendered in XML.

So, making everyone support a *second* markup format everywhere (because, of course, HTML support can't possibly be dropped) really doesn't buy you much. XHTML should probably just be discarded as a failed idea.

Replacing regexps

Posted Mar 4, 2011 21:17 UTC (Fri) by Simetrical (guest, #53439) [Link]

To be fair, I wouldn't describe HTML as "a perfectly good format" so much as "a nightmare format from hell that happens to be well-defined enough that you can now reliably use it if absolutely necessary, such as if your boss would fire you otherwise". It's very easy to create markup that will result in extremely surprising DOMs, like if you misnest something near a table. Just because handling of parse errors is well-defined doesn't mean it's sane.

But for web page use, that pales in comparison to XML's flaws (even leaving aside lack of support in IE < 9). I agree that at this point, there's really no reason for anyone to use XHTML, if there ever was. The HTML5 spec only allows it because it's just a few extra paragraphs, and does a lot to shut up XML diehards who will reject tag soup on ideological principles.

Replacing regexps

Posted Mar 4, 2011 21:11 UTC (Fri) by Simetrical (guest, #53439) [Link]

The contempt for XML can be boiled down to the following points, in decreasing order of importance:

1) For some years, there was a strong political faction that insisted on XML for everything, to the exclusion of all else. This faction took control at the W3C and was so radical that it refused to even permit work to happen at the W3C on improving non-XML versions of HTML. This is the entire reason the WHATWG exists in the first place: browser implementers determined that they had to continue adding features to non-XML versions of HTML, but the W3C refused to cooperate.

2) XML is fragile. It works great if you write it using only XML tools, which are guaranteed to output well-formed XML, but in practice, it's often much easier to use string concatenation, and you want to edit it by hand, etc. When people try to do this with XML, they routinely break well-formedness and the entire page fails to display. This can even happen to well-designed applications -- I've seen with my own eyes a yellow screen of death when using Firefox's built-in directory browser to browse my /tmp, because I had a file there with a control character in its name, and Firefox passed that into an XML file, which thus wasn't well-formed. You only need one place where you allow strings to creep in without checking them for your entire program to break.

3) XML is, in some ways, gratuitously more complicated or harder to work with than HTML. Namespaces are one giant wart that XML introduced, which complicate everything related to the DOM but actually wind up being totally useless baggage for web browsers. Entity references are another headache: in HTML there's a fixed list, but XML requires a DTD if you use named entities and want to maintain well-formedness. The DTD must in practice be fetched from a server before you can parse the page, unless you happen to have hardcoded that DTD into your application.

As far as the processing pipeline of web page generation, HTML5 parsers now exist for some common languages (like Python). These are drop-in replacements for DOM/SAX/etc.-based XML parsers, so an app that works with XML can support text/html with no extra effort if the library is available. That removes almost the entire reason for preferring XML for web pages, or will once HTML5 parsers are more widespread.

Replacing regexps

Posted Mar 3, 2011 16:15 UTC (Thu) by cdmiller (guest, #2813) [Link]

So stuff like this crops up (perl6):

    grammar XML {
        token TOP  { ^ <xml> $ };
        token xml  { <text> [ <tag> <text> ]* };
        token text { <-[<>&]>* };

        rule tag {
            '<'(\w+) <attributes>*
            [
                | '/>'                  # a single tag
                | '>'<xml>'</' $0 '>'   # an opening and a closing tag
            ]
        };

        token attributes { \w+ '="' <-["<>]>* '"' };
    };

Replacing regexps

Posted Mar 3, 2011 16:26 UTC (Thu) by jzbiciak (guest, #5246) [Link]

Judging by the replies, I think at least some missed that I was mainly trying to be funny.

Generative regular expressions

Posted Mar 3, 2011 10:28 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

In CS theory, a regular expression is often thought of as a way to generate strings, rather than to test a given string. So a regexp corresponds to a (possibly infinite) set of strings which it generates. The two definitions are equivalent, although the common grep-like behaviour of scanning through a string for a match at any point muddies the waters a bit.

A useful debugging tool might be to take a regexp and generate all strings it matches (or the thousand shortest such strings). You could then check for unexpected behaviour.

Generating strictly every string (with almost all possible Unicode characters for an occurrence of '.', for example) would not be necessary; instead, 'a', 'b', and 'c' could be used to stand for any character, provided those characters do not appear in the regexp or elsewhere in the source string.
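
A minimal sketch of such a generator, restricted to literals, '.', character sets, and a bounded '*', with 'a', 'b', 'c' as the stand-in alphabet suggested above; the function names and limits are invented for this example.

    from itertools import product

    ALPHABET = "abc"     # stand-ins for "any character"
    MAX_REPEAT = 2       # bound '*' so the generated set stays finite

    def parse(pattern):
        """Split the pattern into (choices, repeatable) atoms."""
        atoms, i = [], 0
        while i < len(pattern):
            c = pattern[i]
            if c == '[':
                j = pattern.index(']', i)
                choices, i = list(pattern[i + 1:j]), j + 1
            elif c == '.':
                choices, i = list(ALPHABET), i + 1
            else:
                choices, i = [c], i + 1
            if i < len(pattern) and pattern[i] == '*':
                atoms.append((choices, True))
                i += 1
            else:
                atoms.append((choices, False))
        return atoms

    def generate(pattern):
        """Yield every string the (restricted) pattern can generate."""
        per_atom = []
        for choices, repeatable in parse(pattern):
            if repeatable:
                opts = {''}     # zero repetitions, then 1..MAX_REPEAT of them
                for n in range(1, MAX_REPEAT + 1):
                    opts |= {''.join(p) for p in product(choices, repeat=n)}
                per_atom.append(sorted(opts))
            else:
                per_atom.append(choices)
        for parts in product(*per_atom):
            yield ''.join(parts)

    print(sorted(generate("a[bc]*"), key=len))    # a, ab, ac, abb, abc, ...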

Generative regular expressions

Posted Mar 3, 2011 20:32 UTC (Thu) by dtlin (subscriber, #36537) [Link]

Something like Regexp::Genex?

Replacing regexps

Posted Mar 3, 2011 16:53 UTC (Thu) by martinfick (guest, #4455) [Link] (4 responses)

I suspect there is more than just one subtle bug in that expression. In fact, there are a few I would consider not subtle at all, but rather obvious, including the one mentioned in the article (multiple A tags).

For starters, and rather more obviously, most A tags also have an href="" in them, and the presence of a > inside those quotes would likely also break things. I suspect that a fair number of mailto: URLs have > in them. Of course, plenty of illegal URLs and other attributes with > in them would also break things.

Another obvious one: a closing </A> on a different line is likely to confuse some RE implementations if not told to span newlines.

I also suspect that whitespace (including newlines) before and after the As is legal too, and missed by this expression?

What all this points out to me is not so much a problem with REs as a problem with trying to parse a complex format like HTML. I fail to see how a better parsing language would help to avoid the problems mentioned above if someone doesn't think about those possibilities. But I can see how a better approach might at least help express a solution to the above problems more easily than REs can.

Replacing regexps

Posted Mar 4, 2011 0:25 UTC (Fri) by Simetrical (guest, #53439) [Link] (3 responses)

If we're talking about text/html here instead of XML, the number of ways this could theoretically not parse a tag the same way a browser would is probably very large. For one fun example, browsers will auto-close the <a> tag in some cases. "<a>Foo<a>Bar" will parse the same as "<a>Foo</a><a>Bar", as you can see by visiting this URL (in a browser other than IE; ignore LWN's helpful auto-linking):

data:text/html,<!doctype html><a href=http://google.com>Foo<a>Bar

Notice how only "Foo" is a link. Or how about this:

data:text/html,<!doctype html><a href=http://google.com>Foo<xmp></a></xmp>bar

where the end tag is ignored because it's inside <xmp>. <plaintext> would work similarly. There are probably lots of really fun examples that I could come up with if I were an HTML parser expert.

Of course, these aren't "valid" HTML, but this stuff exists in real web pages and the HTML specification says how to parse it and that's how browsers do parse it, so . . .

Replacing regexps

Posted Mar 4, 2011 5:12 UTC (Fri) by nicooo (guest, #69134) [Link] (2 responses)

It depends on what you're trying to do. Someone grepping for links on a site they are designing doesn't need to worry about parsing every page on the internet or about getting a few false positives.

Replacing regexps

Posted Mar 4, 2011 17:14 UTC (Fri) by Simetrical (guest, #53439) [Link] (1 responses)

No, of course not. Often it's easiest to use regex and just accept that it will break sometimes and you'll have to fix it, rather than pulling out an HTML5 parser and messing around with a DOM (which is also much more computationally expensive). But it's good to be aware that HTML is an extremely complicated format and it's impossible to parse in general except with a full-fledged HTML parser.

Replacing regexps

Posted Mar 4, 2011 20:08 UTC (Fri) by dmarti (subscriber, #11625) [Link]

I like...

    d = BeautifulSoup(string)
    for atag in d.findAll('a'):
        ...

Another POV on this issue: "You can't parse [X]HTML with regex."

Replacing regexps

Posted Mar 5, 2011 16:41 UTC (Sat) by cras (guest, #7000) [Link] (6 responses)

I wonder if it would be useful to have some kind of utility that colorizes regexps and maybe splits them across multiple lines (e.g. stuff inside (..) would go on its own line). It could help with reading and understanding them faster. So usage would be like:

echo "regexp" | regexp-colorize

I guess someone's already done something like that.

Replacing regexps

Posted Mar 7, 2011 9:19 UTC (Mon) by dtlin (subscriber, #36537) [Link] (3 responses)

$ perl -Mre=debugcolor -e'qr/foo(..)bar/'
Compiling REx "foo(..)bar"
Final program:
   1: EXACT <foo> (3)
   3: OPEN1 (5)
   5:   REG_ANY (6)
   6:   REG_ANY (7)
   7: CLOSE1 (9)
   9: EXACT <bar> (11)
  11: END (0)
anchored "foo" at 0 (checking anchored) minlen 8 
Freeing REx: "foo(..)bar"

Replacing regexps

Posted Mar 7, 2011 17:53 UTC (Mon) by cras (guest, #7000) [Link] (2 responses)

I was thinking about something that would make it quicker for a person who already understands regexps well to see what a regexp does. /foo(..)bar/ is far quicker for me to read and parse than the verbose debug output.

I was thinking that, for example, /foo(bar\((.*)baz)/ could be output as:

foo(
  bar\((
    .*)
  baz)
Where \( would be colored differently to make it clearly distinct from the '('. And other regexp characters could be colored with different colors as well. This could of course use more thought to make it usable when regexps contain spaces.
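
A rough sketch of that splitting step (colors left out, and the regexp-colorize name above is hypothetical); it reproduces the layout shown for /foo(bar\((.*)baz)/.

    def explode(pattern, indent="  "):
        out, depth, line, i = [], 0, "", 0
        while i < len(pattern):
            c = pattern[i]
            if c == "\\" and i + 1 < len(pattern):
                line += pattern[i:i + 2]    # keep escapes like \( intact
                i += 2
                continue
            line += c
            if c == "(":
                out.append(indent * depth + line)
                depth, line = depth + 1, ""
            elif c == ")":
                out.append(indent * depth + line)
                depth, line = max(depth - 1, 0), ""
            i += 1
        if line:
            out.append(indent * depth + line)
        return "\n".join(out)

    print(explode(r"foo(bar\((.*)baz)"))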

Replacing regexps

Posted Mar 8, 2011 7:04 UTC (Tue) by spaetz (guest, #32870) [Link] (1 responses)

Like the python .VERBOSE option?

This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.VERBOSE)

Replacing regexps

Posted Mar 8, 2011 11:59 UTC (Tue) by anselm (subscriber, #2796) [Link]

The same thing also works in Perl, and has done so for the last 15 years or so – I just checked my second-edition copy of »Programming Perl«, which came out in 1996 and explains the feature.

Replacing regexps

Posted Mar 10, 2011 14:35 UTC (Thu) by tmg (guest, #46149) [Link]

There is actually a GUI that lets you visualize, edit and even test regular expressions: KRegexpeditor

See http://www.stat.tamu.edu/~dahl/teaching/605/3.8/kregexped...

Replacing regexps

Posted Mar 17, 2011 10:13 UTC (Thu) by Sho (subscriber, #8956) [Link]

Related: In Python, you can pass re.DEBUG to re.compile, and it will print the regex parse tree. An example: http://stackoverflow.com/questions/101268/hidden-features...
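
For example, using the expression from the article (the exact output format varies between Python versions):

    import re

    re.compile(r'<A\b[^>]*>(.*)</A>', re.DEBUG)    # prints the parse tree to stdout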

Replacing regexps

Posted Mar 27, 2011 7:32 UTC (Sun) by sensille (subscriber, #73175) [Link]

"There is a theory which states that if ever anyone explains exactly what Regular Expressions are and why they are here, they will instantly disappear and be replaced by something even more bizarre and inexplicable.

There is another theory which states that this has already happened."

(based on Douglas Adams)


Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds