[Catalyst] Alien::Dojo uses regexes to parse HTML, so what?

A. Pagaltzis pagaltzis at gmx.de
Mon May 29 21:28:59 CEST 2006


* Dominique Quatravaux <dom at idealx.com> [2006-05-29 19:20]:
> or even
> 
>   my ($url) = qr{href="(http://download.dojotoolkit.org/release[^"]+)"}sx

You’re getting closer; that has fewer failure modes than trying
to parse the whole anchor tag. Off the top of my head, assuming
the page source is in $html:

    my (undef, $url) = $html =~ m{href\s*=\s*(["'])?(http://download\.dojotoolkit\.org/release(?:(?!(?(1)\1|[\s>])).)+)}si;

I think that would be enough to catch all possible variations. Untested.
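
For what it's worth, here is a rough sketch of how a pattern along these
lines could be checked against the three quoting styles. The helper name
dojo_release_url, the sample markup and the "release-x" file name are all
made up for illustration; only the host and the "release" prefix come from
the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first Dojo release link found in $html,
# or undef if there is none.
# The optional opening quote is captured in group 1. The URL then runs up
# to that same quote or, for an unquoted attribute, up to whitespace or '>'.
sub dojo_release_url {
    my ($html) = @_;
    my (undef, $url) = $html =~ m{
        href \s* = \s* (["'])?
        ( http://download\.dojotoolkit\.org/release
          (?: (?! (?(1) \1 | [\s>] ) ) . )+
        )
    }six;
    return $url;
}

# Made-up sample markup: double-quoted, single-quoted and unquoted href.
my @samples = (
    q{<a href="http://download.dojotoolkit.org/release-x/dojo.tar.gz">},
    q{<A HREF = 'http://download.dojotoolkit.org/release-x/dojo.tar.gz'>},
    q{<a href=http://download.dojotoolkit.org/release-x/dojo.tar.gz>},
);

for my $html (@samples) {
    my $url = dojo_release_url($html);
    print defined $url ? "$url\n" : "no match\n";
}
```

The negative lookahead sits in front of the dot, so each character is
checked before it is consumed and the URL stops just short of the
terminator.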

But:

> and pray tell me what's wrong with those. HTML is a *text*
> language, for chrissake, it was designed *purposefully* so that
> I am able to do that sort of thing.

You have an XY problem (where X is “parse the page” and Y
is “pattern”). Matt is right: the correct answer is not to parse
at all.

Regards,
-- 
#Aristotle
*AUTOLOAD=*_;sub _{s/(.*)::(.*)/print$2,(",$\/"," ")[defined wantarray]/e;$1};
&Just->another->Perl->hacker;


