Unicode patch for DBD::ODBC 1.13

Date: 2006-03-04

Goals

Legal disclaimer

The Unicode patch for DBD::ODBC was initially written by Alexander Foken (alexander at foken dot de). I wrote this patch during my working hours for a project that needs Unicode support in DBD::ODBC on Windows, and I have the permission of my former employer to publish this patch.

You may distribute under the terms of either the GNU General Public License or the Artistic License, as specified in the Perl README file.

Applying the Patch

  1. get DBD::ODBC 1.13 from CPAN
  2. unpack it
  3. patch -p1 < DBD-ODBC-1.13-Unicode.patch
  4. build DBD::ODBC as usual (perl Makefile.pl, make, make test, make install)

Tested Databases

Microsoft SQL Server 2000 SP3
Works, using the ODBC driver from the MS SQL 2000 CDROM.

You need to install SP1, SP2, and SP3 on both client and server, because DBD::ODBC needs a recent set of MDAC libraries.

Oracle 9.2
Works, using the ODBC driver from the Oracle 9.2 CDROM.

You have to set the environment variables NLS_NCHAR=AL32UTF8 and NLS_LANG=AMERICAN_AMERICA.AL32UTF8 (or any other language setting ending with ``.AL32UTF8'') before loading the patched DBD::ODBC to make Oracle return Unicode data. (See also ``Oracle and Unicode'' in the POD of DBD::Oracle.)

And you have to enable the ``Force SQL_WCHAR support'' Workaround in the Oracle ODBC Driver Configuration to make Oracle return Unicode to a non-Unicode application. Alternatively, you can include ``FWC=T'' in your connect string.

Better try to use DBD::Oracle to access Oracle with less overhead and better support for Oracle-specific stuff.

Oracle 8.x
Not tested but known not to support Unicode very well.

Quoting the DBD::Oracle man page: ``Oracle 8 client libraries have a number of bugs related to character set handling, especially when connected to an Oracle 9+ server. [...] If you wish to use Unicode, I recommend upgrading client and server to Oracle 9 or later.''

MS Access 2000
Unicode tests fail because MS Access 2000 seems not to give any hints about Unicode, so all data is treated as non-Unicode. You do not want to use this combination.

You may want to try the MSDE, it has the SQL Server engine, but with a lower connection limit and without GUI tools. There are several 3rd party GUIs for the MSDE.

PostgreSQL 8.0.3
Some tests from the original DBD::ODBC 1.13 fail with PostgreSQL 8.0.3, so you may not want to use DBD::ODBC to connect to PostgreSQL 8.0.3.

Unicode tests fail because PostgreSQL seems not to give any hints about Unicode, so all data is treated as non-Unicode.

Better try to use DBD::Pg to access PostgreSQL with less overhead and better support for PostgreSQL-specific stuff. DBD::Pg has a driver attribute named pg_enable_utf8, set it to 1 and you have proper Unicode support.

Tested Operating Systems and ODBC Managers

MS Windows 2000 Professional and Server, using the standard ODBC Manager from Microsoft.

(Yes, this list should be longer.)

Known Problems

Perl 5.8.1 or newer is required. Older Perl before 5.8.0 lacked proper Unicode support. Perl 5.8.0 lacks some auxillary functions for Unicode.

Unicode is supported only for SQL statement parameters and data returned by the fetch methods, SQL statements are still treated as native encoding. If you need a unicode constant in an SQL statement, you have to pass it as parameter or use SQL functions to convert your constant from native encoding to Unicode.

All data passed to the patched DBD::ODBC for SQL_C_WCHAR, SQL_WCHAR, SQL_WVARCHAR, and SQL_WLONGVARCHAR is treated as Unicode, even if it is not Unicode. unicode_helper.c should check the UTF8 flag of the scalar and pass a value different from CP_UTF8 as first argument to WideCharToMultiByte(). The problem is to know what encoding is used for the data in the scalar.

Binding of unicode output parameters is untested (I don't need them) and likely to fail.

The patched DBD::ODBC may fail to compile on non-Win32 platforms. It needs a header file named wchar.h defining at least the following:

A WCHAR data type capable of storing a single Unicode character.
Microsoft uses typedef wchar_t WCHAR in wchar.h, and typedef unsigned short wchar_t; in ctype.h.

WCHAR * wcscpy(WCHAR *, const WCHAR *)
strcpy() for wide characters. Microsoft declares this function in both string.h and wchar.h.

size_t wcslen(const WCHAR *)
strlen() for wide characters, returns character count, not bytes. Microsoft declares this function in both string.h and wchar.h.

WideCharToMultiByte() and MultiByteToWideChar() functions
Encoding converter functions. WideChar in this context means the native Unicode representation of the ODBC API (UCS-2 or UTF-16LE for Windows), MultiByte in this context means Perls native Unicode representation (UTF-8). Microsoft declares the two functions in winnls.h. The CP_UTF8 argument tells the function that the MultiByte string is in UTF-8 encoding.

Technical

This patch adds support for SQL_C_WCHAR, SQL_WCHAR, SQL_WVARCHAR, and SQL_WLONGVARCHAR. Strings written to columns that are reported as SQL_WCHAR, SQL_WVARCHAR, or SQL_WLONGVARCHAR are automatically converted to 16 bit Unicode using the Windows API function WideCharToMultiByte(), return values reported as SQL_C_WCHAR, SQL_WCHAR, SQL_WVARCHAR, or SQL_WLONGVARCHAR are converted back to 8 bit Unicode (UTF-8) using the Windows API function MultiByteToWideChar() and have Perl's UTF8 flag set except for empty strings.

Tests

This patch adds two new tests, t/40UnicodeRoundTrip.t and t/41Unicode.t. Test 40 checks that Unicode strings can be entered as bound parameters and are returned unmodified. Test 41 creates a table, writes and reads unicode data using various bind variants.

When using Oracle, the empty string test in t/40UnicodeRoundTrip.t is skipped because Oracle converts empty strings to NULL in this situation.

I had to add SQL_WCHAR, SQL_WVARCHAR, SQL_WLONGVARCHAR to t/ODBCTEST.pm, because Oracle in the setup described above returns Unicode more often than expected.

I added t/UChelp.pm, that exports two utility functions for unicode string tests:

dumpstr($)
Dumps a string, indicating its Unicode flag, length and all characters in ASCII notation.

utf_eq_ok($$$)
Compares two strings that may be contain Unicode, and calls pass() or cmp_ok().

See also