<div dir="ltr">BTW -- I wonder about the Catalyst behavior here.<br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Aug 1, 2015 at 10:36 PM, Bill Moseley <span dir="ltr"><<a href="mailto:moseley@hank.org" target="_blank">moseley@hank.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span class="">On Sat, Aug 1, 2015 at 6:31 AM, Stefan <span dir="ltr"><<a href="mailto:maillist@s.profanter.me" target="_blank">maillist@s.profanter.me</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div lang="DE" link="#0563C1" vlink="#954F72"><div><p class="MsoNormal"><span lang="EN-US">Hi,</span></p><p class="MsoNormal"><span lang="EN-US">if a URL parameter contains a Unicode character (e.g. <a href="http://www.example.com/?param=%D6lso%DF" target="_blank">www.example.com/?param=%D6lso%DF</a> which stands for param=Ölsoße), the parameter is not correctly parsed as Unicode.</span></p></div></div></blockquote></span></div></div></div></blockquote><div><br></div><div>One note here -- data sent over the wire must be encoded into octets, so Unicode characters have to be encoded by the sender and decoded by the receiver. (You can't send "Unicode characters" as such.) In URLs those octets are then percent-encoded, and UTF-8 is the encoding normally used today. See <a href="http://tools.ietf.org/html/rfc3986">http://tools.ietf.org/html/rfc3986</a>.</div><div><br></div><div>You are specifying %D6 -- but although the Unicode code point for "Ö" is U+00D6, its UTF-8 octet sequence is 0xC3 0x96, so it should be sent as %C3%96. A lone 0xD6 octet is the ISO-8859-1 (Latin-1) encoding of "Ö" and is not valid UTF-8 on its own. 
See: <a href="http://www.fileformat.info/info/unicode/char/00D6/index.htm">http://www.fileformat.info/info/unicode/char/00D6/index.htm</a></div><div><br></div><div>Unless instructed otherwise, <a href="https://github.com/perl-catalyst/catalyst-runtime/blob/master/lib/Catalyst/Engine.pm#L579">Catalyst uses UTF-8</a> when decoding query parameters -- that is, query parameters are decoded from UTF-8 octets into Perl characters.</div><div><br></div><div>As your example showed, if you send invalid UTF-8 sequences then Encode::decode(), as used by Catalyst, will replace them with the <a href="http://www.fileformat.info/info/unicode/char/fffd/index.htm">U+FFFD replacement character</a> "�".</div><div><br></div><div>This may or may not be what you want. Personally, I think it's not correct to silently modify user input. You intended to pass "Ölsoße" but ended up with "�lso�e" -- is that really the data you would want to process/store for the request? Seems unlikely.</div><div><br></div><div>If "param" is supposed to be passed as text in UTF-8-encoded octets, and it isn't, then maybe returning a 400 is a better way of handling that. That probably would have helped you see what is wrong in this case.</div><div><br></div><div>For example, use "eval { decode( $default_query_encoding, $str, FB_CROAK | LEAVE_SRC ); }" to catch invalid data, and tell the client which "$str" failed and why.</div><div><br></div><div>Of course, it is also possible that you have some query parameters that you want decoded as UTF-8 and some that represent something else (a raw sequence of bytes), and you want more manual control. In that case $c->config->{do_not_decode_query} could be used to bypass the decoding. But then you must call decode() on the values yourself.</div></div><div><br></div>-- <br><div class="gmail_signature">Bill Moseley<br><a href="mailto:moseley@hank.org" target="_blank">moseley@hank.org</a></div>
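<div><br></div><div>P.S. To make the strict-decoding suggestion above concrete, here is a minimal sketch using only the core Encode module. The helper names percent_unescape and strict_decode_param are made up for illustration and are not part of Catalyst, and the sketch assumes the expected encoding is UTF-8. It only shows the decode step, not how you would hook it into a Catalyst application.</div><div><br></div>

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not a Catalyst API): turn %XX escapes into raw octets.
sub percent_unescape {
    my ($raw) = @_;
    (my $octets = $raw) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $octets;
}

# Hypothetical helper (not a Catalyst API): strictly decode a percent-encoded
# value as UTF-8. Returns ($text, undef) on success, (undef, $error) on failure.
sub strict_decode_param {
    my ($raw) = @_;
    my $octets = percent_unescape($raw);

    # FB_CROAK makes decode() die on malformed input instead of substituting
    # U+FFFD; LEAVE_SRC keeps $octets unmodified.
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return (undef, "invalid UTF-8 in '$raw': $@") unless defined $text;
    return ($text, undef);
}

# First value carries Latin-1 octets, second carries UTF-8 octets.
for my $raw ('%D6lso%DFe', '%C3%96lso%C3%9Fe') {
    my ($text, $err) = strict_decode_param($raw);
    if (defined $err) {
        # This is where an application could respond with a 400.
        print "400 Bad Request: $err\n";
    }
    else {
        printf "ok: decoded %d characters\n", length $text;
    }
}
```

<div><br></div><div>The first value is rejected because 0xD6 followed by "l" is not a valid UTF-8 sequence; the second decodes cleanly to the six characters of "Ölsoße". Whether rejecting the request is the right policy depends on the application.</div>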
</div></div>