I am curious about character sets and collations, especially how they are used in databases. I got some time to play with them recently. I did some testing today on MySQL. I will do the same test on Sql Server, Oracle, and PostgreSql, time permitting. I am only dealing with simplified Chinese at this point. I may take up traditional Chinese too.
Here is my setup:
1. I created a table that stores simplified Chinese characters in different character set, along with collation used, pinyin, number of strokes, and tone value. There are 126 collations in MySQL, only 10 of which are suited for simplified Chinese.
2. I used the Chinese version of the golden rule, 己所不欲, 勿施于人, from Confucius, as my sample characters. Of those 8 characters, 欲 and 于 have the same pinyin. To make the test more interesting, I added one more character, 旗, since it has the same radical as 施.
Here are the findings:
1. I don’t see any difference between the _bin and _chinese_ci collations. In all cases, they produced identical results;
2. Big5 sorts on number of strokes. gb2312 and gbk sort on pinyin. Yes, all of this is stated in the documentation, but I want to see it for myself;
3. ucs2 and utf8 collations generate the same sort results. Interestingly, it is only through those two that characters share the same radicals are sorted next to each other, at least based on my sample. I tend to believe that should be true in general. Another interesting thing is that with ucs2 and utf8, all 独体字 are sorted together, and all 合体字 are sorted together, at least based on my sample.
4. (This is an updated item, thanks to the comment from reader Jedy) However, if you have any English letter as a beginning of the string, regardless of the string is all English letters or English mixed with Chinese characters, the _bin and _CI collation do make a difference. For _bin collations, capital English letters always come first. For _CI collations, non-capitalized English letters come first.
Below is the script I used and the results, if you are interested.
SQL query: SELECT big5_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY big5_bin
big5_bin | PinYin | StrokeNo | Tone |
---|---|---|---|
人 | ren | 2 | 2 |
于 | yu | 3 | 2 |
己 | ji | 3 | 3 |
不 | bu | 4 | 2 |
勿 | wu | 4 | 4 |
所 | suo | 8 | 3 |
施 | shi | 9 | 1 |
欲 | yu | 11 | 4 |
旗 | qi | 14 | 2 |
SQL query: SELECT big5_chinese_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY big5_chinese_ci
big5_chinese_ci | PinYin | StrokeNo | Tone |
---|---|---|---|
人 | ren | 2 | 2 |
己 | ji | 3 | 3 |
于 | yu | 3 | 2 |
不 | bu | 4 | 2 |
勿 | wu | 4 | 4 |
所 | suo | 8 | 3 |
施 | shi | 9 | 1 |
欲 | yu | 11 | 4 |
旗 | qi | 14 | 2 |
SQL query: SELECT gb2312_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY gb2312_bin
gb2312_bin | PinYin | StrokeNo | Tone |
---|---|---|---|
不 | bu | 4 | 2 |
己 | ji | 3 | 3 |
旗 | qi | 14 | 2 |
人 | ren | 2 | 2 |
施 | shi | 9 | 1 |
所 | suo | 8 | 3 |
勿 | wu | 4 | 4 |
于 | yu | 3 | 2 |
欲 | yu | 11 | 4 |
SQL query: SELECT gb2312_chinese_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY gb2312_chinese_ci
gb2312_chinese_ci | PinYin | StrokeNo | Tone |
---|---|---|---|
不 | bu | 4 | 2 |
己 | ji | 3 | 3 |
旗 | qi | 14 | 2 |
人 | ren | 2 | 2 |
施 | shi | 9 | 1 |
所 | suo | 8 | 3 |
勿 | wu | 4 | 4 |
于 | yu | 3 | 2 |
欲 | yu | 11 | 4 |
SQL query: SELECT gbk_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY gbk_bin
gbk_bin | PinYin | StrokeNo | Tone |
---|---|---|---|
不 | bu | 4 | 2 |
己 | ji | 3 | 3 |
旗 | qi | 14 | 2 |
人 | ren | 2 | 2 |
施 | shi | 9 | 1 |
所 | suo | 8 | 3 |
勿 | wu | 4 | 4 |
于 | yu | 3 | 2 |
欲 | yu | 11 | 4 |
SQL query: SELECT gbk_chinese_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY gbk_chinese_ci
gbk_chinese_ci | PinYin | StrokeNo | Tone |
---|---|---|---|
不 | bu | 4 | 2 |
己 | ji | 3 | 3 |
旗 | qi | 14 | 2 |
人 | ren | 2 | 2 |
施 | shi | 9 | 1 |
所 | suo | 8 | 3 |
勿 | wu | 4 | 4 |
于 | yu | 3 | 2 |
欲 | yu | 11 | 4 |
SQL query: SELECT ucs2_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY ucs2_bin
ucs2_bin | PinYin | StrokeNo | Tone |
---|---|---|---|
不 | bu | 4 | 2 |
于 | yu | 3 | 2 |
人 | ren | 2 | 2 |
勿 | wu | 4 | 4 |
己 | ji | 3 | 3 |
所 | suo | 8 | 3 |
施 | shi | 9 | 1 |
旗 | qi | 14 | 2 |
欲 | yu | 11 | 4 |
SQL query: SELECT ucs2_unicode_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY ucs2_unicode_ci
ucs2_unicode_ci | PinYin | StrokeNo | Tone |
---|---|---|---|
不 | bu | 4 | 2 |
于 | yu | 3 | 2 |
人 | ren | 2 | 2 |
勿 | wu | 4 | 4 |
己 | ji | 3 | 3 |
所 | suo | 8 | 3 |
施 | shi | 9 | 1 |
旗 | qi | 14 | 2 |
欲 | yu | 11 | 4 |
SQL query: SELECT utf8_bin , PinYin , StrokeNo , Tone FROM t1 ORDER BY utf8_bin
utf8_bin | PinYin | StrokeNo | Tone |
---|---|---|---|
不 | bu | 4 | 2 |
于 | yu | 3 | 2 |
人 | ren | 2 | 2 |
勿 | wu | 4 | 4 |
己 | ji | 3 | 3 |
所 | suo | 8 | 3 |
施 | shi | 9 | 1 |
旗 | qi | 14 | 2 |
欲 | yu | 11 | 4 |
SQL query: SELECT utf8_unicode_ci , PinYin , StrokeNo , Tone FROM t1 ORDER BY utf8_unicode_ci
utf8_unicode_ci | PinYin | StrokeNo | Tone |
---|---|---|---|
不 | bu | 4 | 2 |
于 | yu | 3 | 2 |
人 | ren | 2 | 2 |
勿 | wu | 4 | 4 |
己 | ji | 3 | 3 |
所 | suo | 8 | 3 |
施 | shi | 9 | 1 |
旗 | qi | 14 | 2 |
欲 | yu | 11 | 4 |
3 responses to “Character set and collation for simplified Chinese — MySQL”
当字段中有英文字符时,gb2312_bin和gb2312_chinese_ci的排序结果不同。
谢谢你,Jedy!你的评论给了我启发,我会再做一些实验看一看。
Jedy,
我重新做了实验。谢谢你的启发。我已经把新的结果补充到上面第四条。