[Catalyst] Alien::Dojo uses regexes to parse HTML, so what?

Tue May 30 10:52:25 CEST 2006

Dominique Quatravaux said:
> No it's not. We are not trying to address the problem of parsing HTML
> in general, we are trying to address the problem of parsing *one
> single page*.

*One single page of HTML.* HTML is not a data structure, it's a (partly)
fuzzy markup language.

> Since I apparently have to be that explicit to make my point, consider

Well, seeing how you don't seem to *want* to argue about it, but rather
just prove your point, I think it might be better if we end this discussion?

>   my ($url) = qr{<a ^>+
> href="(http://download.dojotoolkit.org/release[^"]+)"}sx

<a href='http://download.dojotoolkit.org/release/foo'>
<a href="http://www.dojotoolkit.org/download/release/foo">
<a href="ftp://ftp.dojotoolkit.org/realease/foo">
<A HREF="http://download.DojoToolkit.org/release/foo">

> or even
>
>   my ($url) = qr{href="http://download.dojotoolkit.org/release[^"]+)"}sx

Same as above.
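
To make that concrete, here is a minimal sketch that runs the four anchors
listed above through a pattern in the spirit of the quoted one. The pattern
is my own cleaned-up variant (explicit whitespace, escaped dots, no /x), so
treat it as an assumption rather than the exact code under discussion;
extract_url is a hypothetical helper name used only for illustration.

```perl
use strict;
use warnings;

# Assumption: a cleaned-up variant of the quoted pattern, not the original.
my $re = qr{<a\s[^>]*href="(http://download\.dojotoolkit\.org/release[^"]+)"};

# The four anchor variants from the reply above.
my @variants = (
    q{<a href='http://download.dojotoolkit.org/release/foo'>},
    q{<a href="http://www.dojotoolkit.org/download/release/foo">},
    q{<a href="ftp://ftp.dojotoolkit.org/realease/foo">},
    q{<A HREF="http://download.DojoToolkit.org/release/foo">},
);

# Hypothetical helper: returns the captured URL, or undef if nothing matched.
sub extract_url {
    my ($html) = @_;
    my ($url) = $html =~ $re;
    return $url;
}

# Report what the pattern finds in each variant.
for my $html (@variants) {
    my $url = extract_url($html);
    printf "%s => %s\n", $html, defined $url ? $url : 'no match';
}
```

None of the four variants match: single quotes, a different host, a
different scheme, and upper-case tag and attribute names each defeat the
pattern, even though a browser would treat all of them as ordinary links.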

> and pray tell me what's wrong with those. HTML is a *text* language,
> for chrissake, it was designed *purposefully* so that I am able to do
> that sort of thing.

Perl is also just a "*text* language," so please show me the regex to parse
it. Just accept it: regular expressions were *not* made to parse HTML.
They might be built to be utilized by an *HTML parser* to work with the
HTML, but they don't really parse it themselves.

I hope *I* have been explicit enough this time.

p