[Catalyst] Re: decoding in core

Jonathan Rockway jon at jrock.us
Fri Feb 20 17:57:29 GMT 2009


Braindump follows.

* On Fri, Feb 20 2009, Tomas Doran wrote:
> On 6 Feb 2009, at 17:36, Bill Moseley wrote:
>>
>> Sure.  IIRC, I think there's already been some patches and code posted
>> so maybe I can dig that up again off the archives.
>
> Please do.
>
>> But, sounds like
>> it's not that important of an issue.
>
> The fact that nobody is working on it currently is not an indication
> that it isn't an important problem to try to solve.

I meant to write a plugin to do this a long time ago, but I guess I
never cared enough.

The problem with writing a plugin or making this core is that people
really really want to misuse Unicode, and will whine when you try to
force correctness upon them.

The only places where you are really allowed to use non-ASCII characters
are the request and response bodies.  (HTTP has a way of representing the
character encoding of its payload -- URLs and Cookies don't.)

C::P::Unicode handles this correct usage correctly.  The problem is that
people want Unicode to magically work where it's not allowed.  This
includes HTTP headers (WTF!?) and URLs.  (BTW, when I say Unicode, I
don't necessarily mean Unicode... I mean non-ASCII characters.  The
Japanese character sets contain non-Unicode characters, and some people
want to put these characters in their URLs or HTTP headers.  I wish I
were making this up, but I am not.  The Unicode process really fucked
over the Asian languages.)

So anyway, the plugin basically needs the following config options, so
users can specify what they want.  Inside Catalyst, only Perl character
strings should be allowed, unless you mark the string as binary (there
is a CPAN module that does this, Something::BLOB).  A hypothetical
sketch of such a config follows the list.

  * Input HTTP header encoding (ASCII default)
    (this is for data in $c->req->headers, cookies, etc.)
    (perhaps cookies should be separately configured)

  * Input URI encoding (probably UTF-8 default)
    (the dispatcher will dispatch on the decoded characters)
    (source code encoding is handled by Perl, hopefully)

  * Input request body encoding (read HTTP headers and decide)

  * Output HTTP header encoding (maybe die if this happens, because
    it's totally illegal to have non-ASCII in the headers)

  * Output URI encoding ($c->uri_for and friends will use this to
    translate the names of actions that are named with wide characters)

  * Output response body encoding (this needs to update the HTTP
    headers, namely the charset= part of Content-Type)
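
Just to give that a shape, here is a purely hypothetical config sketch --
none of these key names exist anywhere yet, they only mirror the list
above:

  __PACKAGE__->config(
      encoding => {
          request_headers  => 'ascii',   # $c->req->headers, cookies
          request_uri      => 'UTF-8',   # dispatcher sees decoded characters
          request_body     => undef,     # undef = read the HTTP headers
          response_headers => undef,     # undef = die on non-ASCII
          response_uri     => 'UTF-8',   # used by $c->uri_for
          response_body    => 'UTF-8',   # also sets charset= on Content-Type
      },
  );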

I think that is everything.
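
The two URI options are the ones people trip over most, so here is a rough
illustration (again just my sketch, with made-up values) of what they
imply: inbound, percent-decode and then character-decode before dispatch;
outbound, encode to octets and then percent-encode.

  use Encode qw(encode decode);
  use URI::Escape qw(uri_escape uri_unescape);

  # inbound: "/caf%C3%A9" -> Perl characters for the dispatcher
  my $path = decode('UTF-8', uri_unescape('/caf%C3%A9'));

  # outbound: a wide-character action name -> "/caf%C3%A9" for uri_for
  my $link = uri_escape(encode('UTF-8', "/caf\x{e9}"), '^A-Za-z0-9\-._~/');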

There are subtle issues, like knowing not to touch XML (it declares its
own encoding, so treat it as binary), dealing with
$c->res->body( <filehandle> ), and so on.
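
One way to handle those cases -- again just a sketch, with my own guesses
about which content types to skip -- is to only touch the body when it is
a plain text string:

  use Encode qw(encode);

  sub maybe_encode_body {
      my ($res, $charset) = @_;
      my $body = $res->body;
      my $type = $res->content_type || '';

      return if ref $body;        # a filehandle or other ref; leave it alone
      return if $type =~ /xml/i;  # XML carries its own encoding declaration
      return unless $type =~ m{^text/};

      $res->body(encode($charset, $body));
      $res->content_type("$type; charset=$charset");
  }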

One last thing: if this becomes core, it will definitely break people's
apps.  Many, many apps are blissfully unaware of characters and treat
text as binary... and they kind-of appear to work.  As soon as some real
characters flow through, though, they will have double-encoded nonsense
all over the place, and will blame you for it.
("I loaded Catalyst::Plugin::Unicode, and my app broke!  It's all your
fault."  Yup, people mail that to me privately all the time.  For some
reason, they think I am going to personally fix their app, despite
having written volumes of documentation about this.  Wrong.)
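
For the record, the double-encoding failure mode looks like this (a tiny
demo of mine, not anything from the plugin):

  use Encode qw(encode decode);

  my $chars  = "caf\x{e9}";               # Perl characters: cafe with e-acute
  my $octets = encode('UTF-8', $chars);   # correct: "caf\xC3\xA9"
  my $broken = encode('UTF-8', $octets);  # encoded twice: "caf\xC3\x83\xC2\xA9"

  # decoding $broken now yields "caf" followed by mojibake instead of the
  # original character -- exactly the "it worked until real characters
  # showed up" symptom.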

Anyway, I just wanted to get this out of my head and onto paper, for
someone else to look at and perhaps implement. :)

Regards,
Jonathan Rockway

--
print just => another => perl => hacker => if $,=$"


