[Catalyst] Alien::Dojo uses regexes to parse HTML, so what?

Thomas Hartman tphyahoo at gmail.com
Tue May 30 11:06:59 CEST 2006

I hate to throw fuel on the fire, but what personally convinced *me*
that regexes are a bad idea for parsing html was the issue of html



*Real* html parsing with regex is either impossible, or so hard that
the mind melts.


2006/5/30, phaylon <phaylon at dunkelheit.at>:
> Dominique Quatravaux said:
> > No it's not. We are not trying to address the problem of parsing HTML
> > in general, we are trying to address the problem of parsing *one
> > single page*.
> *One single page of HTML.* HTML is not a data structure, it's a (partly)
> fuzzy markup language.
> > Since I apparently have to be that explicit to make my point, consider
> Well, seeing how you don't seem to *want* to argue about it, but rather
> just prove your point, I think it might be better if we end this discussion?
> >   my ($url) = qr{<a [^>]+
> > href="(http://download.dojotoolkit.org/release[^"]+)"}sx
> <a href='http://download.dojotoolkit.org/release/foo'>
> <a href="http://www.dojotoolkit.org/download/release/foo">
> <a href="ftp://ftp.dojotoolkit.org/realease/foo">
> <A HREF="http://download.DojoToolkit.org/release/foo">
> > or even
> >
> >   my ($url) = qr{href="(http://download.dojotoolkit.org/release[^"]+)"}sx
> Same as above.
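
The four counterexamples above can be checked mechanically. What follows is only a minimal sketch, assuming the first quoted pattern was meant to read "<a [^>]+ href=..." with the URL captured under /sx. The helper name extract_url and the first, well-behaved tag in the list are made up for illustration. Since qr{} only compiles a pattern, the sketch applies it with =~ to actually perform the match.

```perl
use strict;
use warnings;

# The pattern as read from the quoted message (assumption: the character
# class [^>]+ and the capture group around the URL were intended).
my $PATTERN = qr{<a [^>]+ href="(http://download.dojotoolkit.org/release[^"]+)"}sx;

# Hypothetical helper: return the captured URL, or undef if nothing matched.
sub extract_url {
    my ($html) = @_;
    my ($url) = $html =~ $PATTERN;
    return $url;
}

my @tags = (
    # Illustrative baseline: the one form the pattern was written for.
    q{<a href="http://download.dojotoolkit.org/release/foo">},
    # The four variants from the reply.
    q{<a href='http://download.dojotoolkit.org/release/foo'>},
    q{<a href="http://www.dojotoolkit.org/download/release/foo">},
    q{<a href="ftp://ftp.dojotoolkit.org/realease/foo">},
    q{<A HREF="http://download.DojoToolkit.org/release/foo">},
);

for my $tag (@tags) {
    my $url = extract_url($tag);
    printf "%s\n    => %s\n", $tag, (defined($url) ? $url : '(no match)');
}
```

Only the baseline tag yields a URL. The single-quoted, relocated, ftp and upper-case variants all come back empty, although a browser treats every one of them as a link.
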
> > and pray tell me what's wrong with those. HTML is a *text* language,
> > for chrissake, it was designed *purposefully* so that I am able to do
> > that sort of thing.
> Perl is also just a "*text* language," please show me the Regex to parse
> it. Just accept it, regular expressions were *not* made to parse HTML.
> They might be built to be utilized by an *HTML Parser* to work with the
> HTML, but they don't really parse it themselves.
> I hope *I* have been explicit enough this time.
> p
> _______________________________________________
> Catalyst mailing list
> Catalyst at lists.rawmode.org
> http://lists.rawmode.org/mailman/listinfo/catalyst
