[Catalyst] Re: decoding in core

Bill Moseley moseley at hank.org
Sun Feb 22 16:54:21 GMT 2009


On Fri, Feb 20, 2009 at 11:57:29AM -0600, Jonathan Rockway wrote:
> 
> The problem with writing a plugin or making this core is that people
> really really want to misuse Unicode, and will whine when you try to
> force correctness upon them.

I'm not sure what you mean by wanting to misuse Unicode.  You mean
like decoding with a different encoding than the charset given in the
HTTP headers?

> The only place where you are really allowed to use non-ASCII characters
> are in the request and response.  (HTTP has a way of representing the
> character encoding of its payload -- URLs and Cookies don't.)
> 
> C::P::Unicode handles this correct usage correctly.

I disagree there.  First, it assumes utf8 instead of the encoding the
request states.  That is generally okay (where you set
accept-charset on your forms), but why not decode using the encoding
the request states?

Second, it only decodes the request parameters.  The body_parameters
and query_parameters are left undecoded.

Is that by design?  That is, is it expected that in a POST
$c->req->parameters->{foo} would be characters while
$c->req->body_parameters->{foo} is undecoded octets?  I would not want
or expect that.
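Roughly what I'd want instead, as a sketch (not the plugin's actual
code; the decode_parameters name, the hashref layout, and the UTF-8
default are my own assumptions):

```perl
use strict;
use warnings;
use Encode ();

# Sketch: decode parameters, body_parameters and query_parameters
# with the same charset, so a POST never yields characters in one
# hash and octets in another.  $charset is assumed to come from the
# request's Content-Type header, defaulting to UTF-8.
sub decode_parameters {
    my ( $req, $charset ) = @_;
    $charset ||= 'UTF-8';

    for my $hash ( @{$req}{qw(parameters body_parameters query_parameters)} ) {
        for my $key ( keys %$hash ) {
            for ( ref $hash->{$key} eq 'ARRAY'
                    ? @{ $hash->{$key} }
                    : $hash->{$key} ) {
                $_ = Encode::decode( $charset, $_ );
            }
        }
    }
}
```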


> The problem is that
> people want Unicode to magically work where it's not allowed.  This
> includes HTTP headers (WTF!?), and URLs.  (BTW, when I say Unicode, I
> don't necessarily mean Unicode... I mean non-ASCII characters.  The
> Japanese character sets contain non-Unicode characters, and some people
> want to put these characters in their URLs or HTTP headers.  I wish I
> was making this up, but I am not.  The Unicode process really fucked over
> the Asian languages.)

I'm not sure we want to go down that path.  Maybe a plugin for doing
crazy stuff with HTTP header encoding, but my initial email was really
just about moving decoding of the body (when we have a charset in the
request) and encoding on sending (again if there's a charset in the
response headers) into core.

Trying to do more than that is probably asking for headaches (and
whining).


I think there's reasonable debate about the point in the request at
which decoding should happen, though.  Frankly, I'm not sure Catalyst
should decode; rather, HTTP::Body should.  HTTP::Body looks at the
content-type header, and if it's application/x-www-form-urlencoded it
parses the body into separate parameters.  But why should it ignore
the charset that may also be specified there?
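Something along these lines (a sketch of what HTTP::Body could do,
not what it does; the parse_urlencoded_body name, the simplified
parsing, and the UTF-8 fallback are mine):

```perl
use strict;
use warnings;
use Encode ();

# Sketch: honor the charset on an application/x-www-form-urlencoded
# Content-Type when splitting the body into parameters.
sub parse_urlencoded_body {
    my ( $content_type, $body ) = @_;

    my ($charset) = $content_type =~ /charset=["']?([\w-]+)/i;
    $charset ||= 'UTF-8';    # assumption: default when none is given

    my %params;
    for my $pair ( split /[&;]/, $body ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        for ( $key, $value ) {
            next unless defined;
            tr/+/ /;
            s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # octets, not characters
            $_ = Encode::decode( $charset, $_ );  # now characters
        }
        $params{$key} = $value;
    }
    return \%params;
}
```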



The query parameters are more troublesome, of course.  The common
case seems to be utf8 as the encoding in URLs, and in the end the
encoding just has to be assumed (or specified as a separate
parameter).  uri_for()'s current behavior of encoding to utf8 is
probably a good way to go, along with always decoding the query
parameters as utf8 in Catalyst.  I suppose uri_for() could add an
additional "_enc=utf8" parameter to allow for different encodings, but
I can't imagine a case where just assuming utf8 would not be fine.

Of course, someone will want to mix encodings in different query
parameters.
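The convention I mean, sketched (escape_utf8 and query_string are my
names, not uri_for()'s implementation): encode character strings to
UTF-8 octets first, then percent-escape the octets, so the receiving
end can always decode the query string as utf8.

```perl
use strict;
use warnings;
use Encode ();

# Characters -> UTF-8 octets -> percent-escaped octets.
sub escape_utf8 {
    my ($chars) = @_;
    my $octets = Encode::encode( 'UTF-8', $chars );
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $octets;
}

sub query_string {
    my (%params) = @_;
    return join '&',
        map { escape_utf8($_) . '=' . escape_utf8( $params{$_} ) }
        sort keys %params;
}
```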


> There are subtle issues, like knowing not to touch XML (it's binary),
> dealing with $c->res->body( <filehandle> ), and so on.

The layer can be set on the file handle.  XML will be treated as
application/octet-stream by HTTP::Body, so that should be ok.
Although, if there's a charset in the request I would still
probably argue that decoding would be the correct thing to do.
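What I mean by the layer on the handle, sketched with a throwaway
file (the filename is mine):

```perl
use strict;
use warnings;

# The PerlIO layer on a handle decides whether reads give you octets
# or characters, so a filehandle body can carry its own policy.
my $file = 'layer_demo.txt';
open my $out, '>:encoding(UTF-8)', $file or die $!;
print {$out} "caf\x{e9}\n";    # characters in, UTF-8 octets on disk
close $out;

open my $raw,   '<:raw',             $file or die $!;  # untouched octets
open my $chars, '<:encoding(UTF-8)', $file or die $!;  # decoded characters
```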

For custom processing I currently extend HTTP::Body.  For example:

    $HTTP::Body::TYPES->{'text/xml'} = 'My::XML::Parser';

which does stream parsing of the XML and thus handles the XML
charset decoding.
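The shape of such a handler, very roughly (a skeleton only: it
assumes HTTP::Body's add()/spin() protocol and accessors, and
My::XML::Parser stands in for the real streaming parser):

```perl
package My::XML::Parser;
use strict;
use warnings;
use parent 'HTTP::Body';

# HTTP::Body::add() buffers each chunk and calls spin(); a subclass
# overrides spin().  A real version would feed chunks to a streaming
# XML parser that honors the document's own encoding declaration,
# instead of waiting for the whole body as this sketch does.
sub spin {
    my ($self) = @_;
    return unless $self->length == $self->content_length;

    # ... hand $self->{buffer} to an XML stream parser here ...
    $self->body( $self->{buffer} );
    $self->state('done');
}

1;
```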

> One last thing, if this becomes core, it will definitely break people's
> apps.  Many, many apps are blissfully unaware of characters and treat
> text as binary... and their apps kind-of appear to work.  As soon as
> they get some real characters in their app, though, they will have
> double-encoded nonsense all over the place, and will blame you for this.

That may be true for some.  Most, though, have probably simply
ignored encoding and don't realize they are working with octets
instead of characters, and thanks to Perl it all just works.  So
working with real characters instead will likely be transparent for
them.

Catalyst::Plugin::Unicode blindly decodes using utf8::decode(), and I
think that's a no-op if the content has already been decoded (utf8
flag is already set).  Likewise, it only encodes if the utf8 flag is
set.  So users of that plugin should be ok if character encoding
were handled in core and they didn't remove the plugin.
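For what it's worth, that seems to hold for a typical string (a
sketch; utf8::decode()'s edge cases are subtle, so I'm only claiming
the common case here):

```perl
use strict;
use warnings;

# Octets in, characters out; for a string like this a repeated
# utf8::decode() leaves the value unchanged, so decoding in core on
# top of the plugin wouldn't corrupt it.
my $s = "caf\xc3\xa9";    # the UTF-8 octets for "caf\x{e9}"
utf8::decode($s);          # now a 4-character string
utf8::decode($s);          # repeated decode: value unchanged here
```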

-- 
Bill Moseley
moseley at hank.org
Sent from my iMutt
