[Catalyst-dev] Request URI (path) normalisation

Sun Sep 14 19:04:31 BST 2008

Hi,

this email arose from a discussion on #catalyst/irc.perl.org,
where my (more or less) original question was what format the
string has that Regex actions are matched against.

As nobody really seemed to know the answer, it turned into a
discussion of basic URI semantics and finally, kind of, arrived at the
conclusion that the current implementation of Regex (at least) is
probably broken. Part of that conclusion isn't from first-hand
experience on my part, but rather, AFAICT, from Sebastian Riedel's
examination of the source of the current version; the Debian backport
package (5.7006) I am using behaves differently. So please forgive me
should this invalidate parts of the following.

So, to finally get to the meat of it: According to sri's examination,
Catalyst simply extracts the path component from the URI, but
doesn't do any normalisation on it. This would mean that a request
for http://bar/foo would have a different string matched against
the regexes than a request for http://bar/f%6fo . As those two URIs
are mandated to be equivalent (to refer to the same resource) by the
URI RFC (3986, 2.3), this behaviour makes it pretty difficult
to write standards-compliant software, as you'd have to match against
^(?:f|%66)(?:o|%6[fF]){2}$ for the example given above to meet the
requirements.
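
To make the problem concrete, here is a minimal sketch (in Python
rather than Perl, purely for illustration) that runs a plain ^foo$
and the escape-tolerant regex from above against a few raw,
un-normalised path strings. The list of sample paths and the helper
name matches() are assumptions made up for this example; they are not
taken from Catalyst.

```python
import re

# What one would naturally write for an action matching "foo".
NAIVE = re.compile(r"^foo$")

# The escape-tolerant pattern quoted in the paragraph above, unchanged.
TOLERANT = re.compile(r"^(?:f|%66)(?:o|%6[fF]){2}$")

# Hypothetical raw path strings (leading slash already stripped) that
# RFC 3986 treats as equivalent spellings of "foo".
RAW_PATHS = ["foo", "f%6fo", "%66%6F%6f", "f%6Fo"]


def matches(regex, paths):
    """Return the subset of paths that the compiled regex matches."""
    return [p for p in paths if regex.match(p)]


if __name__ == "__main__":
    print("naive:   ", matches(NAIVE, RAW_PATHS))
    print("tolerant:", matches(TOLERANT, RAW_PATHS))
```

Without normalisation, only the tolerant pattern accepts all four
spellings, and every regex in an application would need that
treatment.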

I've got no clue whether other action types may be affected by
this, too.

The behaviour I would consider sensible would be the normalisation
of the path in such a way that any two URI paths that are mandated
by the RFC to be equivalent will result in the exact same string,
and any two URI paths that are not mandated by the RFC to be
equivalent will result in different strings.
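
As a rough sketch of what such a normalisation could look like (in
Python rather than Perl, purely for illustration): decode
percent-escapes of unreserved characters and uppercase the hex digits
of all remaining escapes, as described in RFC 3986, 6.2.2.1 and
6.2.2.2. The function name normalise_path() is hypothetical, and the
sketch deliberately ignores dot-segment removal and anything outside
the path; it is not how Catalyst does or should implement it.

```python
import re

# Unreserved characters per RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# One percent-escape: "%" followed by exactly two hex digits.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalise_path(raw_path: str) -> str:
    """Hypothetical helper: map equivalent path spellings to one string."""

    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            # Escaped unreserved character: equivalent to the literal one.
            return char
        # Everything else stays escaped; only the hex case is normalised.
        return "%" + match.group(1).upper()

    return _ESCAPE.sub(fix, raw_path)


if __name__ == "__main__":
    for path in ["/foo", "/f%6fo", "/a%2fb", "/a/b"]:
        print(path, "->", normalise_path(path))
```

With this, /foo and /f%6fo end up as the same string, while /a%2fb
and /a/b stay distinct.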

IMO, in addition, as many characters as possible should be in
unescaped form after normalisation. For the path alone, that
would mean that only slashes inside path components would really have
to be escaped. I assume that also escaping the ASCII control range
might be a good idea for security reasons with regard to use
with syscall/shell interfaces. If the result is supposed to be safe
for direct injection into a URI, any other URI reserved characters
probably should be escaped, too. But above all, I think the important
thing is consistent, documented normalisation, independent of the
engine.
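
A minimal sketch of this "unescape as much as possible" variant, again
in Python purely for illustration. The names unescape_for_matching()
and _keep_escaped() are hypothetical. Two things in it are my own
assumptions rather than part of the proposal above: the percent sign
itself is kept escaped (otherwise %252F and %2F would collapse into
the same string), and non-ASCII bytes are left escaped so that no
character set has to be guessed.

```python
import re

# One percent-escape: "%" followed by exactly two hex digits.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _keep_escaped(byte: int) -> bool:
    """Decide whether a decoded byte must stay in %XX form."""
    return (
        byte == 0x2F     # "/" would otherwise look like a path separator
        or byte == 0x25  # "%" (assumption): keeps %252F distinct from %2F
        or byte < 0x20   # ASCII control characters
        or byte >= 0x7F  # DEL and non-ASCII (assumption: no charset guess)
    )


def unescape_for_matching(raw_path: str) -> str:
    """Hypothetical helper: decode every escape that is safe to decode."""

    def fix(match):
        byte = int(match.group(1), 16)
        if _keep_escaped(byte):
            # Stay escaped, with uppercase hex for consistency.
            return "%%%02X" % byte
        return chr(byte)

    return _ESCAPE.sub(fix, raw_path)


if __name__ == "__main__":
    for path in ["/f%6fo/a%2fb", "/a%20b", "/x%00y", "/%252f"]:
        print(repr(path), "->", repr(unescape_for_matching(path)))
```

For use inside a URI, the set in _keep_escaped() would have to grow
to cover the other reserved characters as well.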

Well, I guess that this email is somewhat open-ended so far.
But I don't really know what the next step should be - so, I'll
leave it at that. Please don't flame me for it ;-)

Florian