[Catalyst-dev] Request URI (path) normalisation

Sat Sep 27 00:11:01 BST 2008

On Sun, Sep 14, 2008 at 8:04 PM, Florian Zumbiehl <florz at florz.de> wrote:
> Hi,
>
> this email basically arose from a discussion on #catalyst/irc.perl.org
> where my (more or less) original question was about the format of the
> string that Regex actions match against.
>
> As nobody really seemed to know the answer, it got into a discussion
> of basic URI semantics and finally kind of to the conclusion that
> the current implementation of Regex (at least) probably is broken.
> Part of that conclusion actually isn't from first-hand experience on
> my part, but rather from Sebastian Riedel's examination of the source
> of the current version, AFAICT - the Debian backport package (5.7006)
> I am using behaves differently. So, please forgive me, should this
> invalidate parts of the following.
>
> So, to finally get to the meat of it: According to sri's examination,
> Catalyst simply extracts the path component from the URI, but
> doesn't do any normalisation on it. This would mean that a request
> for http://bar/foo would have a different string being matched against
> the regexes than a request for http://bar/f%6fo . As those two URIs
> are mandated to be equivalent (to refer to the same resource) by the
> URI RFC (3986, 2.3), this kind of behaviour does make it pretty difficult
> to write standards-compliant software, as you'd have to match against
> ^(?:f|%66)(?:o|%6[fF]){2}$ for the example given above to meet the
> requirements.
>
> I've got no clue whether other action types may be affected by
> this, too.
>
> The behaviour I would consider sensible would be the normalisation
> of the path in such a way that any two URI paths that are mandated
> by the RFC to be equivalent will result in the exact same string,
> and any two URI paths that are not mandated by the RFC to be
> equivalent will result in different strings.
>
> IMO, in addition, as many characters as possible should be in
> unescaped form after normalisation. For the path alone, that
> would mean that only slashes in path components would really have
> to be escaped. I assume that also escaping the ASCII control range
> might be a good idea for security reasons with regard to use
> on syscall/shell interfaces. If it's supposed to be safe for direct
> injection into a URI, any other URI reserved characters probably
> should be escaped, too. But above all, I think the important
> thing is consistent, documented normalisation, independent of the
> engine.
>
> Well, I guess that this email is somewhat open-ended so far.
> But I don't really know what the next step should be - so, I'll
> leave it at that. Please don't flame me for it ;-)
>
> Florian

Not sure if I'm on the right track or not, but I think the
normalisation of the URL would be very good. I'm guessing the Regex
problem is connected with
 |
 | sub foo :Local { my($self, $c, @args)=@_ }
 |
...where @args might contain "f%6fo" instead of the decoded "foo".
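
To make the mismatch concrete, here is a minimal sketch in Python (used
only for illustration; this is not Catalyst code, and raw_args is a
made-up helper). It splits an un-normalised path into arguments and tries
both a plain pattern and the workaround pattern from Florian's mail:

```python
import re

def raw_args(path):
    """Hypothetical helper: split a path into segments without any decoding."""
    return [seg for seg in path.split("/") if seg]

# What one would naturally write.
plain = re.compile(r"^foo$")
# Workaround pattern quoted from Florian's mail.
workaround = re.compile(r"^(?:f|%66)(?:o|%6[fF]){2}$")

for path in ("/foo", "/f%6fo"):
    (arg,) = raw_args(path)
    # Print whether each pattern matches the raw, undecoded segment.
    print(path, bool(plain.match(arg)), bool(workaround.match(arg)))
```

The plain pattern only matches the first path, while the workaround
matches both, which is the burden Florian describes.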

I haven't dug into the source myself, but would there be any issues with
making the path "sane" before it's actually handled in any way?
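
As a rough idea of what "sane" could mean, here is a conservative sketch
(again Python for illustration only; normalise_path is a hypothetical
name, not an existing Catalyst function). It assumes the rules from
RFC 3986, 6.2.2: escapes of unreserved characters are decoded, and the
hex digits of all other escapes are upper-cased. It deliberately decodes
less than Florian suggests, and it does not remove dot segments. Note
that the percent sign itself has to stay escaped as well as the slash,
otherwise "%252F" and "%2F" would collapse into the same string.

```python
import re

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def normalise_path(path):
    """Decode escapes of unreserved characters; upper-case all other escapes."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char  # e.g. %6f becomes o
        return "%" + match.group(1).upper()  # e.g. %2f becomes %2F
    return _PCT.sub(fix, path)

for path in ("/foo", "/f%6fo", "/a%2fb"):
    print(path, normalise_path(path))
```

With something like this applied before dispatch, "/foo" and "/f%6fo"
would both be matched as "/foo", while "/a%2Fb" would stay distinct from
"/a/b".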

-- 
Best regards,
 Jan Henning Thorsen