UTF-16 decoder not working as expected

I have a part of my Unicode library that decodes UTF-16 into raw Unicode code points. However, it isn't working as expected.

Here's the relevant part of the code (omitting UTF-8 and string manipulation stuff):

#include <stdio.h>
#include <stdlib.h>

typedef struct string {
    unsigned long length;
    unsigned *data;
} string;

string *upush(string *s, unsigned c) {
    if (!s->length) s->data = (unsigned *) malloc((s->length = 1) * sizeof(unsigned));
    else            s->data = (unsigned *) realloc(s->data, ++s->length * sizeof(unsigned));
    s->data[s->length - 1] = c;
    return s;
}

typedef struct string16 {
    unsigned long length;
    unsigned short *data;
} string16;

string u16tou(string16 old) {
    unsigned long i, cur = 0, need = 0;
    string new;
    new.length = 0;
    for (i = 0; i < old.length; i++)
        if (old.data[i] < 0xd800 || old.data[i] > 0xdfff) upush(&new, old.data[i]); /* not a surrogate: pass straight through */
        else
            if (old.data[i] > 0xdbff && !need) {
                cur = 0; continue;                    /* stray low surrogate: ignore it */
            } else if (old.data[i] < 0xdc00) {
                need = 1;                             /* high surrogate: keep its low 10 bits as the high bits */
                cur = (old.data[i] & 0x3ff) << 10;
                printf("cur 1: %lx\n", cur);
            } else if (old.data[i] > 0xdbff) {
                cur |= old.data[i] & 0x3ff;           /* low surrogate: fill in the low 10 bits */
                upush(&new, cur);                     /* append the combined code point */
                printf("cur 2: %lx\n", cur);
                cur = need = 0;
            }
    return new;
}

How does it work?

string is a struct that holds 32-bit values, one per code point, and string16 holds 16-bit values, i.e. UTF-16 code units. All upush does is append a single Unicode code point to a string, reallocating memory as needed.
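For example, just to illustrate upush as defined above (the values here are arbitrary):

string s;
s.length = 0;        /* upush mallocs on the first push when length is 0 */
s.data = NULL;
upush(&s, 0x41);     /* append U+0041 */
upush(&s, 0x10FFFD); /* append U+10FFFD */
/* now s.length == 2, s.data[0] == 0x41, s.data[1] == 0x10FFFD */
free(s.data);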

u16tou is the part that I'm focusing on. It loops through the string16, passing non-surrogate values through as normal, and converting surrogate pairs into full code points. Misplaced surrogates are ignored.

The first surrogate in a pair has its lowest 10 bits shifted 10 bits to the left, so it forms the high 10 bits of the final code point. The second surrogate contributes its lowest 10 bits as the low 10 bits of the final value, which is then appended to the string (see the sketch below).
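In other words, for a pair hi, lo (with hi in 0xD800-0xDBFF and lo in 0xDC00-0xDFFF), the combination I intend is exactly what the loop above does:

cur  = (hi & 0x3ff) << 10; /* low 10 bits of the first surrogate become bits 10-19 */
cur |=  lo & 0x3ff;        /* low 10 bits of the second surrogate become bits 0-9  */
upush(&new, cur);          /* append the combined value */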

The problem?

Let's try a code point right at the top of the range, shall we?

U+10FFFD, the highest code point that isn't a noncharacter, is encoded as 0xDBFF 0xDFFD in UTF-16. Let's try decoding that.

string16 b;
b.length = 2;
b.data = (unsigned short *) malloc(2 * sizeof(unsigned short));
b.data[0] = 0xdbff;
b.data[1] = 0xdffd;
string a = u16tou(b);
puts(utoc(a));

Using the utoc function (not shown; I know it works, see below) to convert the result back to a UTF-8 char * for printing, I can see in my terminal that I'm getting U+0FFFFD, not U+10FFFD.

In the calculator

Doing all the conversions manually in gcalctool gives the same wrong answer, so the implementation isn't at fault; the algorithm itself must be. Yet the algorithm seems right to me, and it still ends up with the wrong answer.
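Concretely, the by-hand calculation (the same steps the loop performs) comes out like this; a tiny standalone reproduction:

#include <stdio.h>

int main(void) {
    unsigned long cur;
    cur  = (0xDBFFul & 0x3ff) << 10; /* 0x3ff << 10 = 0xffc00 */
    cur |=  0xDFFDul & 0x3ff;        /* | 0x3fd     = 0xffffd */
    printf("cur: %lx\n", cur);       /* prints "cur: ffffd", i.e. U+0FFFFD */
    return 0;
}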

What am I doing wrong?

asked by hippietrail, 17 April 2011 at 08:08