Replacing regexps
The compactness of the regular expression syntax is part of their power, but also part of the problem. Consider even a very simple expression:
<A\b[^>]*>(.*)</A>
A reader familiar with this syntax will recognize that this expression matches the HTML <A> tag and sets aside the anchor text for later processing. But even experienced regular expression developers must look at that expression for a moment and think about how the various metacharacters affect each other before being able to say for sure what it does. It takes even longer to notice the subtle bug: this expression will be confused by the presence of multiple <A> tags in the text being searched.
So how might one do better? That was Mike Meyer's question as he sought a more "pythonic" way of doing text matching. Needless to say, he is not the first to ask that kind of question; there are a number of attempts at better string matching out there. The first of those is arguably not "pythonic" at all: it is SnoPy, a port of the venerable SNOBOL language to Python.
SNOBOL was developed during the 1960's; it included pattern matching as a core feature of the language. Unlike regular expressions, SNOBOL was anything but concise. Concatenation of strings was explicit, "[abc]" was "Any("abc")", and so on. Nonetheless, SNOBOL was highly influential in this area, and one can see echoes of the language in current regular expressions. That said, SNOBOL is not heavily used now, and the Python SNOBOL module seems to have suffered the same fate; its last release was in 2002.
Another approach is the rxb.py module by Ka-Ping Yee. This module, posted in 2005, creates a new, relatively verbose but relatively readable language for the creation of patterns. Using this language, the regular expression shown above would look something like:
<A + any(wordchars + whitespace)> + label(1, anychars) + </A>
(Note to readers; the above is totally untested and should not be relied upon for production use). This module, too, has not seen a great deal of use.
Various other packages are out there. For example, one can try to use Icon-style pattern matching with Python. For something completely different, there is the eGenix mxTextTools module, which allows the creation of text-matching programs in an assembly-like language, complete with goto constructs. mxTextTools is intimidating and not necessarily any easier to read than regular expressions, but it is said to be powerful and fast, and there are a number of real users.
Still, none of these seem likely to replace regular expressions as the first tool Python programmers reach for when they need to perform string matching. Python creator Guido van Rossum thinks things will stay that way:
Pushing aside an established incumbent is always hard, and regular expressions are well established indeed. It is never enough to simply be better in this situation; the proposed replacement has to be a lot better. As Guido noted, nothing seems to have come along which is that much better, and it may be that nothing ever will. For some medium-term value of "ever," anyway.
But, then, one also should not underestimate the ingenuity of free software
developers. Or their persistence. People will almost certainly continue
to throw themselves against this problem, and, maybe, somebody will come up
with something interesting. Until then, we'll have to continue beating our
heads against our desks as we try to figure out why our expressions don't
work as intended.
Replacing regexps
Posted Mar 3, 2011 2:34 UTC (Thu)
by tshow (subscriber, #6411)
[Link] (6 responses)
Posted Mar 3, 2011 2:34 UTC (Thu) by tshow (subscriber, #6411) [Link] (6 responses)
If you're going to eat the cost of doing something in a scripting language, then (IMO) that language had better have:
- garbage collection in the core system
- robust string handling including a full RE suite in the core language
- lambda or some useful equivalent in the core language
- an associative array type in the core language
- a sane roadmap
Python fails on the RE suite and the sane roadmap. In fairness, they have a relatively sane roadmap, but they followed their GPS into the weeds trying to get to 3.0 and are currently on what my wife refers to as a "crop tour".
I've been using ruby instead, and enjoying it greatly. Maybe I'll try python again when the 3.0 transition is complete, but if they still haven't nailed down what they want to do for REs that may be a legacy my grandkids will have to fulfill for me.
Replacing regexps
Posted Mar 3, 2011 3:33 UTC (Thu)
by foom (subscriber, #14868)
[Link] (3 responses)
Posted Mar 3, 2011 3:33 UTC (Thu) by foom (subscriber, #14868) [Link] (3 responses)
Replacing regexps
Posted Mar 4, 2011 18:10 UTC (Fri)
by tshow (subscriber, #6411)
[Link] (2 responses)
Posted Mar 4, 2011 18:10 UTC (Fri) by tshow (subscriber, #6411) [Link] (2 responses)
2006 or so, actually. But while "re" may well have been the "one" regexp API, the others are (were?) all still there in the docs. At least at the time, it wasn't particularly obvious from the docs which library was the canonical one.
Replacing regexps
Posted Mar 4, 2011 20:14 UTC (Fri)
by foom (subscriber, #14868)
[Link] (1 responses)
Posted Mar 4, 2011 20:14 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)
Python 2.4 was the current version until 2.5 was released at the end of 2006, so, looking at its docs:
http://docs.python.org/release/2.4/
The only regular expression library mentioned is "re".
You actually have to go all the way back to 1.6 to find a reference to the old regex module:
http://docs.python.org/release/1.5/lib/node69.html#SECTIO...
http://docs.python.org/release/1.6/lib/module-regex.html
which says rather clearly:
"""
Obsolescence note: This module is obsolete as of Python version 1.5; [...]. All new code in need of regular expressions should use the new re module
"""
Replacing regexps
Posted Mar 7, 2011 3:10 UTC (Mon)
by tshow (subscriber, #6411)
[Link]
Posted Mar 7, 2011 3:10 UTC (Mon) by tshow (subscriber, #6411) [Link]
That said, in retrospect I may also have been relying on O'Reilly books that may not have been up to date at the time, so the fault may have been mine.
Replacing regexps - Python
Posted Mar 5, 2011 0:16 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (1 responses)
Posted Mar 5, 2011 0:16 UTC (Sat) by giraffedata (guest, #1954) [Link] (1 responses)
If you're going to eat the cost of doing something in a scripting language,
How is Python a scripting language?
I don't write Python, but I've worked with a fair amount of it, and it looks about as script-like as typical C code.
Replacing regexps - Python
Posted Mar 7, 2011 3:06 UTC (Mon)
by tshow (subscriber, #6411)
[Link]
Posted Mar 7, 2011 3:06 UTC (Mon) by tshow (subscriber, #6411) [Link]
To be clearer: "If you're going to eat the runtime cost of an interpreted language..."
I tend to use "scripting language" and "interpreted language" interchangeably, which I'll admit is arguably bad form.
Replacing regexps
Posted Mar 3, 2011 2:41 UTC (Thu)
by mrons (subscriber, #1751)
[Link] (2 responses)
Posted Mar 3, 2011 2:41 UTC (Thu) by mrons (subscriber, #1751) [Link] (2 responses)
Replacing regexps
Posted Mar 4, 2011 9:50 UTC (Fri)
by jezuch (subscriber, #52988)
[Link] (1 responses)
Posted Mar 4, 2011 9:50 UTC (Fri) by jezuch (subscriber, #52988) [Link] (1 responses)
Replacing regexps
Posted Mar 9, 2011 2:42 UTC (Wed)
by jtc (guest, #6246)
[Link]
Posted Mar 9, 2011 2:42 UTC (Wed) by jtc (guest, #6246) [Link]
It's very awesome! It's at a level of about 98.975%, where 100% is maximum possible awesomeness.
Replacing regexps
Posted Mar 3, 2011 3:44 UTC (Thu)
by dw (subscriber, #12017)
[Link] (1 responses)
If performance isn't a primary concern, PyParsing is worth a gander. While its reuse of Python syntax can be a bit confusing, the result is quite readable even for more complex grammars.
Posted Mar 3, 2011 3:44 UTC (Thu) by dw (subscriber, #12017) [Link] (1 responses)
Replacing regexps
Posted Mar 4, 2011 11:45 UTC (Fri)
by debacle (subscriber, #7114)
[Link]
Posted Mar 4, 2011 11:45 UTC (Fri) by debacle (subscriber, #7114) [Link]
Replacing regexps
Posted Mar 3, 2011 4:29 UTC (Thu)
by JoeBuck (subscriber, #2330)
[Link] (4 responses)
I don't think that it's really a question of replacing regexps, the underlying mathematical construct. Instead, maybe we should think about alternative source representations that are easier to understand and maintain. These alternative representations would still be compiled into the same automata that regular expression libraries use. Traditional regexps would still be the best choice for simple cases, but for more complex cases the traditional regexp would play a role similar to assembly language.
Posted Mar 3, 2011 4:29 UTC (Thu) by JoeBuck (subscriber, #2330) [Link] (4 responses)
Replacing regexps
Posted Mar 3, 2011 5:25 UTC (Thu)
by wahern (subscriber, #37304)
[Link] (2 responses)
Posted Mar 3, 2011 5:25 UTC (Thu) by wahern (subscriber, #37304) [Link] (2 responses)
http://www.complang.org/ragel/
You can already inline Ragel code with C, C++, Objective-C, D, Java and Ruby source code. Ragel is a DFA-based regex compiler which, among other things, has a syntax for named expressions.
The example from the article would look something like:
wordchars = alnum | '"' | "'" | "=";
whitespace = " " | [\t\n\r];
text = [^<]*;
A_begin = "<A" . (wordchars|whitespace)* . ">"
A_close = "</A>";
match = A_begin . text . A_close;
Of course, pure regular expressions can't match recursive grammars like HTML. DFA implementations can be orders of magnitude faster than NFA implementations (like Python, et al), and provide guaranteed worst case performance; but it can't add all the extra bells and whistles other implementations provide. However, Ragel has other tricks up its sleeves, like state charts, and the ability execute code or jump into and out of matching machines at arbitrary points. But using these features means you probably can't just map the representation to existing Python engines.
Replacing regexps
Posted Mar 3, 2011 23:59 UTC (Thu)
by Simetrical (guest, #53439)
[Link]
Posted Mar 3, 2011 23:59 UTC (Thu) by Simetrical (guest, #53439) [Link]
http://www.whatwg.org/specs/web-apps/current-work/multipa...
Where by "algorithmically" I mean "by pseudocode measuring about 90 printed pages".
Maybe you were thinking about XML.
Replacing regexps
Posted Mar 5, 2011 16:54 UTC (Sat)
by Tet (subscriber, #5433)
[Link]
Maybe someone could create a Python output module for Ragel
Posted Mar 5, 2011 16:54 UTC (Sat) by Tet (subscriber, #5433) [Link]
Yes please. It's long been on my todo list, but given my current lack of free time, it looks like it'll stay there for some time.
Replacing regexps
Posted Mar 3, 2011 4:45 UTC (Thu)
by jzbiciak (guest, #5246)
[Link] (9 responses)
Posted Mar 3, 2011 4:45 UTC (Thu) by jzbiciak (guest, #5246) [Link] (9 responses)
The example immediately reminded me of the warning that you can't correctly parse general (X)HTML with a regexp, because the language isn't regular.
Replacing regexps
Posted Mar 3, 2011 5:32 UTC (Thu)
by wahern (subscriber, #37304)
[Link] (6 responses)
Posted Mar 3, 2011 5:32 UTC (Thu) by wahern (subscriber, #37304) [Link] (6 responses)
pegs are the bee's knees
Posted Mar 3, 2011 16:34 UTC (Thu)
by wingo (guest, #26929)
[Link]
Posted Mar 3, 2011 16:34 UTC (Thu) by wingo (guest, #26929) [Link]
Sorry for the content-less pile-on but PEGs are pretty sweet and folks should check them out.
Replacing regexps
Posted Mar 4, 2011 0:02 UTC (Fri)
by Simetrical (guest, #53439)
[Link] (4 responses)
Posted Mar 4, 2011 0:02 UTC (Fri) by Simetrical (guest, #53439) [Link] (4 responses)
Replacing regexps
Posted Mar 4, 2011 18:44 UTC (Fri)
by wahern (subscriber, #37304)
[Link] (3 responses)
Posted Mar 4, 2011 18:44 UTC (Fri) by wahern (subscriber, #37304) [Link] (3 responses)
I'm not the world's biggest fan of XML as a use-everywhere markup language. But when it comes to the processing pipeline of web page generation, XML is a beautiful thing.
Replacing regexps
Posted Mar 4, 2011 20:19 UTC (Fri)
by foom (subscriber, #14868)
[Link] (1 responses)
Posted Mar 4, 2011 20:19 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)
So, making everyone support a *second* markup format everywhere (because, of course, HTML support can't possibly be dropped) really doesn't buy you much. XHTML should probably just be discarded as a failed idea.
Replacing regexps
Posted Mar 4, 2011 21:17 UTC (Fri)
by Simetrical (guest, #53439)
[Link]
Posted Mar 4, 2011 21:17 UTC (Fri) by Simetrical (guest, #53439) [Link]
But for web page use, that pales in comparison to XML's flaws (even leaving aside lack of support in IE < 9). I agree that at this point, there's really no reason for anyone to use XHTML, if there ever was. The HTML5 spec only allows it because it's just a few extra paragraphs, and does a lot to shut up XML diehards who will reject tag soup on ideological principles.
Replacing regexps
Posted Mar 4, 2011 21:11 UTC (Fri)
by Simetrical (guest, #53439)
[Link]
Posted Mar 4, 2011 21:11 UTC (Fri) by Simetrical (guest, #53439) [Link]
1) For some years, there was a strong political faction that insisted on XML for everything, to the exclusion of all else. This faction took control at the W3C and was so radical that it refused to even permit work to happen at the W3C on improving non-XML versions of HTML. This is the entire reason the WHATWG exists in the first place: browser implementers determined that they had to continue adding features to non-XML versions of HTML, but the W3C refused to cooperate.
2) XML is fragile. It works great if you write it using only XML tools, which are guaranteed to output well-formed XML, but in practice, it's often much easier to use string concatenation, and you want to edit it by hand, etc. When people try to do this with XML, they routinely break well-formedness and the entire page fails to display. This can even happen to well-designed applications -- I've seen with my own eyes a yellow screen of death when using Firefox's built-in directory browser to browse my /tmp, because I had a file there with a control character in its name, and Firefox passed that into an XML file, which thus wasn't well-formed. You only need one place where you allow strings to creep in without checking them for your entire program to break.
3) XML is, in some ways, gratuitously more complicated or harder to work with than HTML. Namespaces are one giant wart that XML introduced, which complicate everything related to the DOM but actually wind up being totally useless baggage for web browsers. Entity references are another headache: in HTML there's a fixed list, but XML requires a DTD if you use named entities and want to maintain well-formedness. The DTD must in practice be fetched from a server before you can parse the page, unless you happen to have hardcoded that DTD into your application.
As far as the processing pipeline of web page generation, HTML5 parsers now exist for some common languages (like Python). These are drop-in replacements for DOM/SAX/etc.-based XML parsers, so an app that works with XML can support text/html with no extra effort if the library is available. That removes almost the entire reason for preferring XML for web pages, or will once HTML5 parsers are more widespread.
Replacing regexps
Posted Mar 3, 2011 16:15 UTC (Thu)
by cdmiller (guest, #2813)
[Link]
Posted Mar 3, 2011 16:15 UTC (Thu) by cdmiller (guest, #2813) [Link]
grammar XML {
token TOP { ^ <xml> $ };
token xml { <text> [ <tag> <text> ]* };
token text { <-[<>&]>* };
rule tag {
'<'(\w+) <attributes>*
[
| '/>' # a single tag
| '>'<xml>'</' $0 '>' # an opening and a closing tag
]
};
token attributes { \w+ '="' <-["<>]>* '"' };
};
Replacing regexps
Posted Mar 3, 2011 16:26 UTC (Thu)
by jzbiciak (guest, #5246)
[Link]
Posted Mar 3, 2011 16:26 UTC (Thu) by jzbiciak (guest, #5246) [Link]
Generative regular expressions
Posted Mar 3, 2011 10:28 UTC (Thu)
by epa (subscriber, #39769)
[Link] (1 responses)
In CS theory, a regular expression is often thought of as a way to generate strings, rather than to test a given string. So a regexp corresponds to a (possibly infinite) set of strings which it generates. The two definitions are equivalent, although the common grep-like behaviour of scanning through a string for a match at any point muddies the waters a bit.Posted Mar 3, 2011 10:28 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)
A useful debugging tool might be to take a regexp and generate all strings it matches (or the thousand shortest such strings). You could then check for unexpected behaviour.
Generating strictly every string (with almost all possible Unicode characters for an occurrence of '.', for example) would not be necessary, instead 'a', 'b', 'c' could be used to stand for any character, provided these characters do not appear in the regexp or elsewhere in the source string.
Generative regular expressions
Posted Mar 3, 2011 20:32 UTC (Thu)
by dtlin (subscriber, #36537)
[Link]
Posted Mar 3, 2011 20:32 UTC (Thu) by dtlin (subscriber, #36537) [Link]
Something like Regexp::Genex?
Replacing regexps
Posted Mar 3, 2011 16:53 UTC (Thu)
by martinfick (guest, #4455)
[Link] (4 responses)
Posted Mar 3, 2011 16:53 UTC (Thu) by martinfick (guest, #4455) [Link] (4 responses)
For starters, rather even more obvious, most A tags also have an href="" in them, and the presence of a > in those quotes would likely also break things. I suspect that a fair amount of mailto: URLs have > in them. Of course, plenty of illegal URLs and other attributes with > in them would also break things.
Another obvious one: a closing </A> on a different line is likely to confuse some RE implementations if not told to span newlines.
I also suspect that whitespace (including newlines) before and after the As are legal also, and missed by this expression?
What all this points out to me is not so much a problem with REs, but a problem with trying to parse a complex format like html. I fail to see how a better parsing language would help to avoid the problems mentioned above if someone doesn't think about those possibilities. But I can see how a better approach might at least help express a solution to the above problems more easily than REs can.
Replacing regexps
Posted Mar 4, 2011 0:25 UTC (Fri)
by Simetrical (guest, #53439)
[Link] (3 responses)
Posted Mar 4, 2011 0:25 UTC (Fri) by Simetrical (guest, #53439) [Link] (3 responses)
data:text/html,<!doctype html><a href=http://google.com>Foo<a>Bar
Notice how only "Foo" is a link. Or how about this:
data:text/html,<!doctype html><a href=http://google.com>Foo<xmp></a></xmp>bar
where the end tag is ignored because it's inside <xmp>. <plaintext> would work similarly. There are probably lots of really fun examples that I could come up with if I were an HTML parser expert.
Of course, these aren't "valid" HTML, but this stuff exists in real web pages and the HTML specification says how to parse it and that's how browsers do parse it, so . . .
Replacing regexps
Posted Mar 4, 2011 5:12 UTC (Fri)
by nicooo (guest, #69134)
[Link] (2 responses)
Posted Mar 4, 2011 5:12 UTC (Fri) by nicooo (guest, #69134) [Link] (2 responses)
Replacing regexps
Posted Mar 4, 2011 17:14 UTC (Fri)
by Simetrical (guest, #53439)
[Link] (1 responses)
Posted Mar 4, 2011 17:14 UTC (Fri) by Simetrical (guest, #53439) [Link] (1 responses)
Replacing regexps
Posted Mar 4, 2011 20:08 UTC (Fri)
by dmarti (subscriber, #11625)
[Link]
I like...
Posted Mar 4, 2011 20:08 UTC (Fri) by dmarti (subscriber, #11625) [Link]
d = BeautifulSoup(string) for atag in d.findAll('a'):another POV on this issue: "You can't parse [X]HTML with regex."
Replacing regexps
Posted Mar 5, 2011 16:41 UTC (Sat)
by cras (guest, #7000)
[Link] (6 responses)
Posted Mar 5, 2011 16:41 UTC (Sat) by cras (guest, #7000) [Link] (6 responses)
echo "regexp" | regexp-colorize
I guess someone's already done something like that.
Replacing regexps
Posted Mar 7, 2011 9:19 UTC (Mon)
by dtlin (subscriber, #36537)
[Link] (3 responses)
Posted Mar 7, 2011 9:19 UTC (Mon) by dtlin (subscriber, #36537) [Link] (3 responses)
$ perl -Mre=debugcolor -e'qr/foo(..)bar/' Compiling REx "foo(..)bar" Final program: 1: EXACT <foo> (3) 3: OPEN1 (5) 5: REG_ANY (6) 6: REG_ANY (7) 7: CLOSE1 (9) 9: EXACT <bar> (11) 11: END (0) anchored "foo" at 0 (checking anchored) minlen 8 Freeing REx: "foo(..)bar"
Replacing regexps
Posted Mar 7, 2011 17:53 UTC (Mon)
by cras (guest, #7000)
[Link] (2 responses)
I was thinking about something that would make it quicker to understand what a regexp does to a person who already understands regexps well. /foo(..)bar/ is far quicker for me to read and parse then the verbose debug output.
Posted Mar 7, 2011 17:53 UTC (Mon) by cras (guest, #7000) [Link] (2 responses)
I was thinking like for example /foo(bar\((.*)baz)/ could be outputted with:
foo( bar\(( .*) baz)Where \( would be colored differently to make it clearly distinct from the '('. And other regexp characters could be colored with different colors as well. This could of course use more thought to make it usable when regexps contain spaces.
Replacing regexps
Posted Mar 8, 2011 7:04 UTC (Tue)
by spaetz (guest, #32870)
[Link] (1 responses)
Like the python .VERBOSE option?
Posted Mar 8, 2011 7:04 UTC (Tue) by spaetz (guest, #32870) [Link] (1 responses)
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.
a = re.compile(r"""\d + # the integral part \. # the decimal point \d * # some fractional digits""", re.VEBOSE)
Replacing regexps
Posted Mar 8, 2011 11:59 UTC (Tue)
by anselm (subscriber, #2796)
[Link]
Posted Mar 8, 2011 11:59 UTC (Tue) by anselm (subscriber, #2796) [Link]
The same thing also works in Perl, and has done so for the last 15 years or so – I just checked my second-edition copy of »Programming Perl«, which came out in 1996 and explains the feature.
Replacing regexps
Posted Mar 10, 2011 14:35 UTC (Thu)
by tmg (guest, #46149)
[Link]
Posted Mar 10, 2011 14:35 UTC (Thu) by tmg (guest, #46149) [Link]
See http://www.stat.tamu.edu/~dahl/teaching/605/3.8/kregexped...
Replacing regexps
Posted Mar 17, 2011 10:13 UTC (Thu)
by Sho (subscriber, #8956)
[Link]
Posted Mar 17, 2011 10:13 UTC (Thu) by Sho (subscriber, #8956) [Link]
Replacing regexps
Posted Mar 27, 2011 7:32 UTC (Sun)
by sensille (subscriber, #73175)
[Link]
Posted Mar 27, 2011 7:32 UTC (Sun) by sensille (subscriber, #73175) [Link]
There is another theory which states that this has already happened"
(based on Douglas Adams)