[Catalyst] CSV / UTF-8 / Unicode

Craig Chant craig at homeloanpartnership.com
Fri Jul 5 14:59:30 GMT 2013

Nice one Mike spot on!

I did a bit of digging myself before trying anything and found this blog : http://blogs.msdn.com/b/qingsongyao/archive/2009/04/10/sql-server-and-utf-8-encoding-1-true-or-false.aspx

You learn something new every day!

It seems MS encoding is 'UCS-2' which is equivalent to 'UTF-16' (sort of, but not exact!)

Perhaps Win32::ODBC handles this natively whereas DBI::DBD::ODBC doesn't?

Having read the UCS-2 info at the bottom of the DBD::ODBC CPAN docs it seems DBD::ODBC is messy as hell trying to support UCS-2 / UTF-16.

I have tried within Perl to use decode against the string from the DBI fetch, but it just errors...

[error] Caught exception in My_App::Controller:: XLSExport->export "Wide character in subroutine entry at C:/Perl/lib/Encode.pm line 176."

And that was using UTF-16 and UCS-2 / UCS-2BE encoding types, is seems as per the encode docs "Bogus surrogates result in death."

And I guess as utf8::is_utf8 comes back as true and the fact it states on the DBD::ODBC docs...

" Perl scalars which are UTF-8 and are sent through the ODBC API will be converted to UTF-16 and passed to the ODBC wide APIs or signalled as SQL_WCHARs (e.g., in the case of bound columns). Retrieved data which are wide characters are converted from UTF-16 to UTF-8. However, you should realise most ODBC drivers do not support UTF-16"

It seems the DBD::ODBC driver is auto converting the data from UTF-16 to UTF-8, and that is where the corruption must be occurring, it even states that UTF-16 isn't really supported under ODBC?

So what I have is UTF-8, but after a conversion by the ODBC driver from UCS-2 using UTF-16 encoding, which appears to be making a mess of things.

I have checked and I was running DBI v 1.616, so installed latest version 1.627 plus upgraded DBD::ODBC to 1.43 but exactly the same problem!

I'm at a loss what else I can do other than throw DBI away and refactor to Win32::ODBC , it seems DBI::DBD::ODBC is converting the data on retrieval into UTF-8 and making a pigs ear of it! I can't see any simple switch or parameter to tell DBD::ODBC to do things differently, can you?

There is a whole bunch of Unicode maybe's and caveats on the DBD::ODBC CPAN docs, yet if you scan the Win32::ODBC CPAN docs for a mention of UTF-8 / UTF-16 / UCS-2 or the word Unicode, none of them are mentioned even once, so is that a significant indicator of something?

I've checked with DBD::ODBC and Win32::ODBC , that the ODBC.pm is not the same module being used, I can see a flag in the DBI ODBC.pm module of "odbc_has_unicode", but it is an read only  attribute.

Can you suggest anything else that might prevent DBI from corrupting the data on UTF-8 conversion?


-----Original Message-----
From: Mike Whitaker [mailto:mike at altrion.org]
Sent: 04 July 2013 10:35
To: The elegant MVC web framework
Subject: Re: [Catalyst] CSV / UTF-8 / Unicode

On 4 Jul 2013, at 10:30, Mike Whitaker <mike at altrion.org> wrote:
> On 4 Jul 2013, at 09:56, Craig Chant <craig at homeloanpartnership.com> wrote:
>> Yes it's NVARCHAR(max) , which I understood is MS's data-typing for uNicode VARiable CHARacters, looking at some sample column data via the Windows SQL Management GUI, it appears to display ok.
> It probably isn't UTF-8, though. UTF-8 is only one possible encoding in which you can store unicode character points.

Try, on the off-chance I've read the spec right

- not bothering with mysql_encode_utf8 in the DBI connect args
- passing all data from DBI through decode("UTF-16",...) or decode("UCS-2"...) - MS's docs aren't that clear which!
- reencoding it as utf8 on the way out.

List: Catalyst at lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/
This Email and any attachments contain confidential information and is intended solely for the individual to whom it is addressed. If this Email has been misdirected, please notify the author as soon as possible. If you are not the intended recipient you must not disclose, distribute, copy, print or rely on any of the information contained, and all copies must be deleted immediately. Whilst we take reasonable steps to try to identify any software viruses, any attachments to this e-mail may nevertheless contain viruses, which our anti-virus software has failed to identify. You should therefore carry out your own anti-virus checks before opening any documents. HomeLoan Partnership will not accept any liability for damage caused by computer viruses emanating from any attachment or other document supplied with this e-mail. HomeLoan Partnership reserves the right to monitor and archive all e-mail communications through its network. No representative or employee of HomeLoan Partnership has the authority to enter into any contract on behalf of HomeLoan Partnership by email. HomeLoan Partnership is a trading name of H L Partnership Limited, registered in England and Wales with Registration Number 5011722. Registered office: 26-34 Old Street, London, EC1V 9QQ. H L Partnership Limited is authorised and regulated by the Financial Conduct Authority.

More information about the Catalyst mailing list