How can I determine the preferred display width (in columns) of Unicode characters?

In the various Unicode encodings, for example UTF-16LE or UTF-8, a character may occupy 2 or 3 bytes, yet many Unicode applications ignore the display width of characters and treat them all like Latin letters. For example, an 80-column line should fit 40 Chinese characters or 80 Latin letters, but most applications (Eclipse, Notepad++, and every well-known text editor I know of; I would be glad to hear of a good exception) count each Chinese character as 1 column, the same as a Latin letter. This makes the resulting layout ugly and misaligned.

For example, a tab width of 8 produces the following ugly result when every character is counted as 1 column:

apple   10
banana  7
苹果      6
猕猴桃     31
pear    16

However, the expected format (counting each Chinese character as 2 columns) is:

apple   10
banana  7
苹果    6
猕猴桃  31
pear    16

This incorrect calculation of display width makes these editors practically useless for tab alignment, line wrapping, and paragraph reformatting.

Although the width of a character may vary between fonts, in every fixed-width terminal font a Chinese character is always double width. That is to say, regardless of font, each Chinese character is preferably displayed at 2 columns.

One possible solution: I can get the correct width by converting the text to GB2312, since in the GB2312 encoding each Chinese character takes 2 bytes. However, some Unicode characters don't exist in the GB2312 charset (or the GBK charset), and in general it is not a good idea to derive the display width from the encoded size in bytes.
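
For illustration, a minimal sketch of that workaround (the method name displayWidth is mine, and it assumes the JRE ships a GB2312 charset):

    import java.nio.charset.Charset;

    // Approximate the display width as the GB2312-encoded byte length:
    // Latin letters encode to 1 byte, Chinese characters to 2 bytes.
    // Characters outside GB2312 are typically replaced by a single '?'
    // byte and are therefore miscounted -- exactly the flaw noted above.
    static int displayWidth(String s) {
        return s.getBytes(Charset.forName("GB2312")).length;
    }

    // displayWidth("apple") == 5, displayWidth("苹果") == 4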

Simply counting every character in the range (\u0080..\uFFFF) as 2 columns is not correct either, because many 1-column characters are scattered throughout that range.
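
For illustration, that (incorrect) heuristic would look like this; it miscounts, for example, Cyrillic and Greek letters, which lie in that range but render single width:

    // Naive heuristic: count every char above U+007F as 2 columns.
    // Wrong for the many single-width characters (Cyrillic, Greek, ...)
    // scattered through U+0080..U+FFFF.
    static int naiveWidth(String s) {
        int width = 0;
        for (int i = 0; i < s.length(); i++) {
            width += s.charAt(i) >= 0x80 ? 2 : 1;
        }
        return width;
    }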

It is also difficult to calculate the display width of Arabic and Korean text, because a word/character there may be composed of an arbitrary number of Unicode code points.
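
As a sketch of one half of that problem, java.text.BreakIterator can at least group such sequences into user-perceived characters (grapheme clusters), though it assigns no width to them:

    import java.text.BreakIterator;

    // Count user-perceived characters: a base letter plus its combining
    // marks counts as one cluster. This only segments the text; it does
    // not say how many columns each cluster occupies.
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }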

So the display width of a Unicode code point may not even be an integer; I think that is acceptable, since it can be rounded to an integer in practice, which is still better than nothing.

So, is there any attribute in the Unicode standard for the preferred display width of a character? Or any Java library function that calculates it?

asked by Xiè Jìléi, 26 July 2012 at 11:30