[Catalyst] My Life with UTF-8

Sun Aug 13 06:39:21 CEST 2006

Konnichiwa,

On 8/12/06, Jonathan Rockway <jon at jrock.us> wrote:

> The first unicode breakage I had was when I added Japanese-style dates
> as timestamps on the pages.  (Japanese day-name character in
> parentheses.) What's weird was, adding this to the page worked fine --
> but it broke OTHER unicode characters on the page (sourced from a file
> or file attribute).  Adding "use utf8" to the top of my source file
> fixed my problems, on Linux anyway.  (Never tried on OpenBSD.)

Sounds like the classic "Unicode string + UTF-8 bytes = BOOM"
problem. To solve it, handle everything either as Unicode strings
(UTF-8 flagged) or as UTF-8 bytes (utf8::encode($str)). If you mix
the two, Perl upgrades the byte strings as if they were Latin-1,
which garbles them.
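
To make the breakage concrete, here is a minimal sketch (not taken from
your code). The variable names are just for illustration, and I use
\x{...} escapes so the example does not depend on how the source file is
saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A Unicode (character) string: one character, U+65E5
my $chars = "\x{65E5}";

# UTF-8 bytes for a different character, U+672C: three bytes, no UTF-8 flag
my $bytes = encode('UTF-8', "\x{672C}");

# Mixing: Perl upgrades $bytes as if each byte were a Latin-1 character,
# so the result has 1 + 3 = 4 characters instead of 2.
my $broken = $chars . $bytes;

# Fix A: everything as Unicode strings (decode the bytes first).
my $all_chars = $chars . decode('UTF-8', $bytes);

# Fix B: everything as UTF-8 bytes (utf8::encode works in place).
my $all_bytes = $chars;
utf8::encode($all_bytes);
$all_bytes .= $bytes;

printf "broken=%d chars=%d bytes=%d\n",
    length($broken), length($all_chars), length($all_bytes);
```

With Fix A you encode once on output; with Fix B you never let a flagged
string touch the data. Either is fine as long as you pick one.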

But that is sometimes hard, since some CPAN modules ignore Unicode
strings and just return UTF-8 bytes.
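
One way to cope, sketched under assumptions: decode whatever such a
module returns right at the boundary. fetch_title_bytes() below is a
hypothetical stand-in for a module that returns UTF-8 bytes, and
to_chars() is a helper name I made up. Checking utf8::is_utf8() is only
a heuristic, so treat this as a starting point.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for a CPAN module that returns raw UTF-8 bytes
# (here the bytes for U+66DC U+65E5).
sub fetch_title_bytes {
    return encode('UTF-8', "\x{66DC}\x{65E5}");
}

# Hypothetical helper: return a Unicode string, decoding only when the
# value is not already flagged as one.
sub to_chars {
    my ($s) = @_;
    return utf8::is_utf8($s) ? $s : decode('UTF-8', $s);
}

my $raw   = fetch_title_bytes();   # 6 bytes
my $title = to_chars($raw);        # 2 characters

printf "raw=%d title=%d\n", length($raw), length($title);
```

If you know for sure that a module always returns bytes, calling decode
unconditionally is simpler and safer than testing the flag.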

> The next problem I noticed was that C::V::TT::ForceUTF8 broke TT's "uri"
> filter.  According to the HTML validator, URIs can't be unicode, so you
> have to encode the URI to UTF-8.  TT's URI filter was documented to do
> this, but it translated anything with the 8th bit set to nothing,

Yeah, Template::Stash::ForceUTF8 and Template::Provider::Encoding were
made just to fix that issue. Interesting to hear that TT's uri filter
gets broken by them. Do you have any working code that shows the
breakage?

BTW we use Stash::ForceUTF8 and Provider::Encoding on our production
boxes and they work fine.

> Any way I can tell perl, "trust me, everything is already UTF-8... don't
> #^$ing touch it."?

encoding::warnings might help you. I'm not sure whether it actually
works, but the documentation should be a great help at least.
http://search.cpan.org/~audreyt/encoding-warnings-0.10/

-- 
Tatsuhiko Miyagawa