What are the consequences of storing a C# string (UTF-16) in a SQL Server nvarchar (UCS-2) column?

It seems that SQL Server uses UCS-2, a 2-byte fixed-length character encoding, for nchar/nvarchar fields, while C# uses UTF-16 for its strings. (Note: some people don't consider UCS-2 to be Unicode at all, but it encodes the same code points as UTF-16 in the range 0-0xFFFF, and as far as SQL Server is concerned, it's the closest thing to "Unicode" it natively supports for character strings.)

While UCS-2 encodes the same code points as UTF-16 within the Basic Multilingual Plane (BMP), it doesn't reserve the bit patterns that UTF-16 sets aside for surrogate pairs; to UCS-2, a surrogate code unit is just another character.
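To make this concrete, here's a small C# sketch (nothing SQL-specific, just the standard string APIs) showing how a code point above 0xFFFF is stored as two 16-bit code units:

```csharp
using System;

class SurrogatePairDemo
{
    static void Main()
    {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above 0xFFFF, so UTF-16
        // represents it as a surrogate pair of two 16-bit code units.
        string clef = "\U0001D11E";

        Console.WriteLine(clef.Length);                             // 2 (code units, not characters)
        Console.WriteLine("0x{0:X4}", (int)clef[0]);                // 0xD834 (high surrogate)
        Console.WriteLine("0x{0:X4}", (int)clef[1]);                // 0xDD1E (low surrogate)
        Console.WriteLine("0x{0:X}", char.ConvertToUtf32(clef, 0)); // 0x1D11E (the actual code point)
    }
}
```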

If I write a C# string to a SQL Server nvarchar (UCS-2) field and read it back, will I always get back exactly the same string?
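For reference, this is the kind of round trip I mean; a minimal sketch, assuming a scratch table `CREATE TABLE Test (Id int, Value nvarchar(100))` and a placeholder connection string:

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

class RoundTripDemo
{
    static void Main()
    {
        const string connStr = "Server=.;Database=Scratch;Integrated Security=true"; // placeholder
        string original = "clef: \U0001D11E"; // contains a surrogate pair

        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();

            using (var insert = new SqlCommand(
                "INSERT INTO Test (Id, Value) VALUES (1, @v)", conn))
            {
                // nvarchar parameters travel as 16-bit code units, which a
                // UCS-2 column should store verbatim, two bytes at a time.
                insert.Parameters.Add("@v", SqlDbType.NVarChar).Value = original;
                insert.ExecuteNonQuery();
            }

            using (var select = new SqlCommand(
                "SELECT Value FROM Test WHERE Id = 1", conn))
            {
                string roundTripped = (string)select.ExecuteScalar();
                Console.WriteLine(original == roundTripped); // True, if storage is byte-for-byte lossless
            }
        }
    }
}
```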

It seems that while UTF-16 is a superset of UCS-2 in terms of code points (it can encode code points above 0xFFFF), it's actually more restrictive at the level of 2-byte code units: every well-formed UTF-16 sequence is valid UCS-2, but a UCS-2 string containing an unpaired surrogate is not well-formed UTF-16.

To answer my own question, I suspect that if my C# string contains code points above 0xFFFF (represented by surrogate pairs), these would be stored and retrieved just fine in the database, but if I tried to manipulate them in the database (e.g. calling UPPER, or attempting to blank out every other character), then I could run into problems displaying the string later... unless SQL Server has functions that acknowledge surrogate pairs and effectively treat nchar/nvarchar strings as UTF-16.
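To illustrate the failure mode I have in mind, here's a C# simulation of what a UCS-2-minded "blank out every other character" operation would do to the code units; the same logic applied server-side would leave an unpaired surrogate behind:

```csharp
using System;
using System.Text;

class LoneSurrogateDemo
{
    static void Main()
    {
        string s = "a\U0001D11Eb"; // code units: 'a', 0xD834, 0xDD1E, 'b'

        // Blank every other 16-bit code unit, the way a routine that treats
        // the string as UCS-2 (one character per code unit) would.
        var sb = new StringBuilder(s);
        for (int i = 1; i < sb.Length; i += 2)
            sb[i] = ' ';
        string mangled = sb.ToString(); // 'a', ' ', 0xDD1E, ' '

        // The low surrogate at index 2 has lost its partner, so the string
        // is no longer well-formed UTF-16; re-encoding it substitutes the
        // replacement character U+FFFD for the lone surrogate.
        Console.WriteLine(char.IsLowSurrogate(mangled[2])); // True
        byte[] utf8 = Encoding.UTF8.GetBytes(mangled);
        Console.WriteLine(Encoding.UTF8.GetString(utf8));   // "a \uFFFD " -- the clef is gone
    }
}
```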
