[Catalyst] Re: Avoiding UTF8 in Catalyst

Mon Nov 23 18:14:16 GMT 2009

* Carl Johnstone <catalyst at fadetoblack.me.uk> [2009-11-23 18:50]:
> Aristotle Pagaltzis wrote:
> > Please please don’t make statements like “not in this case”
> > without knowing what the thing you are talking about, i.e.
> > in this case bytes::length, does. There are enough
> > misconceptions about Unicode in Perl already.
>
> As far as the usage of bytes::length goes, yes, I agree with
> you that the code is wrong as it's taking the byte length of
> Perl's internal representation - which happens to be utf-8 and
> whilst correct in that case, isn't for any other character set
> and shouldn't be relied upon.

No: the internal representation can be either of two formats, and
which of the two you get is not reliable, because it’s purely an
implementation detail. So bytes::length is never correct, not even
for UTF-8 output. It just accidentally works much of the time,
getting the right answer by using the wrong method.
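
To make that concrete, here is a small sketch (the variable names are
mine, purely for illustration) that stores the same string in both
internal formats using utf8::downgrade and utf8::upgrade. Perl treats
the two copies as equal, yet bytes::length reports different numbers
for them:

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without enabling byte semantics

# Illustrative string: 4 characters, all with code points below 0x100.
my $packed = "caf\x{e9}";
utf8::downgrade($packed);   # force the packed-bytes internal format

my $wide = $packed;
utf8::upgrade($wide);       # same string, variable-width internal format

# As far as Perl semantics go, these are the same string.
my $same     = ($packed eq $wide) ? 1 : 0;
my $chars_a  = length $packed;
my $chars_b  = length $wide;

# bytes::length looks at the internal storage, so the answers differ.
my $bytes_a  = bytes::length($packed);
my $bytes_b  = bytes::length($wide);

print "equal: $same, characters: $chars_a / $chars_b\n";
print "bytes::length: $bytes_a / $bytes_b\n";   # 4 / 5
```

Neither number is "the" byte length of anything you are about to
send; each only tells you how Perl happens to be storing the string
right now.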

> You *do* have to take a byte length of the string in the
> destination character set though

Yes.

> so I'm interested in what the correct solution would be.

Encode the string to the destination encoding (not just character
set), so that the string represents an encoded octet stream, and
then look at the plain old character length of that string. That
will always give you the right answer, regardless of whether that
string is packed bytes or variable-width integers.
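
A minimal sketch of that, assuming the core Encode module and a
made-up sample string (UTF-8 and UTF-16LE are picked only as example
destination encodings):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative text: 7 characters, one of them above U+00FF.
my $text = "na\x{ef}ve \x{263a}";

# Encode to the destination encoding; the results are octet strings.
my $utf8_octets  = encode('UTF-8',    $text);
my $utf16_octets = encode('UTF-16LE', $text);

# Plain length() on an octet string is the byte count.
my $char_count = length $text;           # 7 characters
my $utf8_len   = length $utf8_octets;    # 10 octets
my $utf16_len  = length $utf16_octets;   # 14 octets

# Changing the internal format of the octet string changes nothing.
my $upgraded = $utf8_octets;
utf8::upgrade($upgraded);
my $upgraded_len = length $upgraded;     # still 10

print "chars=$char_count utf8=$utf8_len utf16le=$utf16_len\n";
print "after upgrade: $upgraded_len\n";
```

The count depends on the destination encoding, which is why it has
to be taken after encoding and not before.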

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>