[Catalyst] Url Encoded UTF8 parameters

Bill Moseley moseley at hank.org
Sun Aug 2 14:37:17 GMT 2015


BTW -- I wonder about the Catalyst behavior here.

On Sat, Aug 1, 2015 at 10:36 PM, Bill Moseley <moseley at hank.org> wrote:

>
>
> On Sat, Aug 1, 2015 at 6:31 AM, Stefan <maillist at s.profanter.me> wrote:
>
>> Hi,
>>
>> if a URL parameter contains a Unicode character (e.g.
>> www.example.com/?param=%D6lso%DF which stands for param=Ölsoße), the
>> parameter is not correctly parsed as Unicode.
>>
>
One note here -- data sent over the wire must be encoded into octets.  So all
Unicode characters must be encoded before sending and decoded when received.
(You can't send "Unicode characters".)  UTF-8 is the encoding in general use
for this now.  See RFC 3986: http://tools.ietf.org/html/rfc3986

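You are specifying %D6 -- although the Unicode character is U+00D6, its
UTF-8 octet sequence is 0xC3 0x96, so in a URL it should be sent as %C3%96.
(A lone 0xD6 octet is the ISO-8859-1 encoding of that character.)  See:
http://www.fileformat.info/info/unicode/char/00D6/index.htm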
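
To see the difference between the two encodings, here is a minimal sketch
using only the core Encode module.  The helper percent_encode_octets() is
made up for illustration; it is not part of Catalyst or Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper (not a Catalyst or Encode API):
# percent-encode every octet of a byte string.
sub percent_encode_octets {
    my ($octets) = @_;
    return join '', map { sprintf( '%%%02X', ord($_) ) } split //, $octets;
}

my $char = "\x{D6}";    # U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS

# Same character, two different encodings, two different octet sequences.
my $as_latin1 = percent_encode_octets( encode( 'ISO-8859-1', $char ) );
my $as_utf8   = percent_encode_octets( encode( 'UTF-8',      $char ) );

print "ISO-8859-1: $as_latin1\n";    # %D6
print "UTF-8:      $as_utf8\n";      # %C3%96
```

The first form is what the original URL contained; the second is what a
UTF-8 decoder expects to receive.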

Unless otherwise instructed, Catalyst uses UTF-8
<https://github.com/perl-catalyst/catalyst-runtime/blob/master/lib/Catalyst/Engine.pm#L579>
as the encoding for decoding query parameters -- query parameters are
decoded from UTF-8 octets to Perl characters.

As your example showed, if you send invalid UTF-8 sequences then
Encode::decode(), as used by Catalyst, will replace each of them with U+FFFD,
the replacement character
<http://www.fileformat.info/info/unicode/char/fffd/index.htm> "�".
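
That substitution can be reproduced outside Catalyst.  This is a sketch
that calls Encode::decode() directly with its default CHECK behavior; it
assumes the input octets are the ISO-8859-1 bytes for "Ölsoße" and does
not go through Catalyst's own request handling.

```perl
use strict;
use warnings;
use Encode qw(decode);

# ISO-8859-1 octets for "Ölsoße": 0xD6 and 0xDF are not valid UTF-8 here.
my $octets = "\xD6lso\xDFe";

# No CHECK argument: malformed input is substituted rather than rejected.
my $chars = decode( 'UTF-8', $octets );

# Print code points instead of the characters themselves.
my $dump = join ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $chars;
print "$dump\n";    # U+FFFD U+006C U+0073 U+006F U+FFFD U+0065
```

Both non-ASCII octets come back as U+FFFD and no error is raised, so
nothing downstream can tell what the client originally sent.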

This may or may not be what you want.   Personally, I think it's not
correct to silently modify user input.   You intended to pass "Ölsoße" but
ended up with "�lso�e" -- is that really the data you would want to
process/store for the request?   Seems unlikely.

If "param" is supposed to be passed as textual, UTF-8-encoded octets, and it
isn't, then maybe returning a 400 is a better way of handling that.  That
probably would have helped you see what was wrong in this case.

For example, use "eval { decode( $default_query_encoding, $str, FB_CROAK |
LEAVE_SRC ); }" to catch invalid data, and return to the client the "$str"
that failed and why.
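
Spelled out as a standalone sketch: decode_query_value() is a made-up name,
and how you turn the error into a 400 response depends on your application.
Only decode(), FB_CROAK and LEAVE_SRC are real Encode names.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strictly decode one query value.
# Returns ($chars, undef) on success or (undef, $error) on failure.
sub decode_query_value {
    my ( $encoding, $octets ) = @_;

    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC keeps $octets intact so it can be reported back.
    my $chars = eval { decode( $encoding, $octets, FB_CROAK | LEAVE_SRC ) };

    return ( undef, "$@" ) unless defined $chars;
    return ( $chars, undef );
}

my $str = "\xD6lso\xDFe";    # not valid UTF-8
my ( $chars, $error ) = decode_query_value( 'UTF-8', $str );

print defined $error
    ? "would return 400: $error"
    : "decoded ok\n";
```

Because of LEAVE_SRC, $str still holds the original octets after the failed
decode, so it can be included in the error sent back to the client.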

Of course, it is also possible that you have some query parameters that you
want decoded as UTF-8 and some that might represent something else (a raw
sequence of bytes), and want more manual control.  In that case
$c->config->{do_not_decode_query} could be used to bypass the decoding.
But then you must call decode() yourself on every parameter that is text.
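
A rough sketch of what that manual step might look like.  It runs without
Catalyst: %raw is a stand-in for the undecoded query parameters you would
read from the request, and %is_text is a made-up way of marking which
parameters are text.  Neither is a Catalyst API.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Stand-in for the raw (undecoded) query parameters.
my %raw = (
    name  => "\xC3\x96lso\xC3\x9Fe",    # UTF-8 octets for "Ölsoße"
    token => "\xD6\x00\xFF",            # opaque bytes, not text
);

# Hypothetical list of the parameters that carry UTF-8 text.
my %is_text = ( name => 1 );

my %params;
for my $key ( keys %raw ) {
    $params{$key} = $is_text{$key}
        ? decode( 'UTF-8', $raw{$key}, FB_CROAK | LEAVE_SRC )    # characters
        : $raw{$key};                                            # bytes, untouched
}

printf "name has %d characters, token has %d bytes\n",
    length $params{name}, length $params{token};
```

In a real application the decode() call would sit inside an eval, since
FB_CROAK dies on malformed input.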

-- 
Bill Moseley
moseley at hank.org