sql/share/charsets/README +19 −20

-This directory holds configuration files which allow MySQL to work with
+This directory holds configuration files that enable MySQL to work with
 different character sets. It contains:
-*.conf
-    Each conf file contains four tables which describe character types,
+charset_name.xml
+    Each charset_name.xml file contains information for a simple character
+    set. The information in the file describes character types,
     lower- and upper-case equivalencies and sorting orders for the
     character values in the set.
-Index
-    The Index file lists all of the available charset configurations.
+Index.xml
+    The Index.xml file lists all of the available charset configurations,
+    including collations.
-    Each charset is paired with a number. The number is stored
-    IN THE DATABASE TABLE FILES and must not be changed. Always
-    add new character sets to the end of the list, so that the
-    numbers of the other character sets will not be changed.
+    Each collation must have a unique number. The number is stored
+    IN THE DATABASE TABLE FILES and must not be changed.
+    The max-id attribute of the <charsets> element must be set to
+    the largest collation number.
 Compiled in or configuration file?
     When should a character set be compiled in to MySQL's string library
-    (libmystrings), and when should it be placed in a configuration
-    file?
+    (libmystrings), and when should it be placed in a charset_name.xml
+    configuration file?
     If the character set requires the strcoll functions or is a
     multi-byte character set, it MUST be compiled in to the string
     library. If it does not require these functions, it should be
-    placed in a configuration file.
+    placed in a charset_name.xml configuration file.
     If the character set uses any one of the strcoll functions, it
     must define all of them. Likewise, if the set uses one of the

@@ -30,11 +33,7 @@ Compiled in or configuration file?
     more information on how to add a complex character set to MySQL.
 Syntax of configuration files
-    The syntax is very simple. Comments start with a '#' character and
-    proceed to the end of the line. Words are separated by arbitrary
-    amounts of whitespace.
-    For the character set configuration files, every word must be a
-    number in hexadecimal format. The ctype array takes up the first
-    257 words; the to_lower, to_upper and sort_order arrays take up 256
-    words each after that.
+    The syntax is very simple. Words in <map> array elements are
+    separated by arbitrary amounts of whitespace. Each word must be a
+    number in hexadecimal format. The ctype array has 257 words; the
+    other arrays (lower, upper, etc.) take up 256 words each after that.

strings/CHARSET_INFO.txt +94 −48

@@ -3,9 +3,8 @@ CHARSET_INFO
 ============
 A structure containing data for charset+collation pair implementation.
-Virtual functions which use this data are collected
-into separate structures MY_CHARSET_HANDLER and
-MY_COLLATION_HANDLER.
+Virtual functions that use this data are collected into separate
+structures, MY_CHARSET_HANDLER and MY_COLLATION_HANDLER.
 typedef struct charset_info_st

@@ -56,7 +55,7 @@
                   character set. Not really used now. Intended to optimize
                   some parts of the code where we need to find the default
                   collation using its non-default counterpart for the given
                   character set.
-  binary_numner - ID of a charset+collation pair, which consists
+  binary_number - ID of a charset+collation pair, which consists
                   of the same character set and the binary collation of this
                   character set. Not really used now.

@@ -65,15 +64,15 @@ Names
   csname  - name of the character set for this charset+collation pair.
   name    - name of the collation for this charset+collation pair.
-  comment - a text comment, dysplayed in "Description" column of
+  comment - a text comment, displayed in "Description" column of
             SHOW CHARACTER SET output.
 Conversion tables
 -----------------
   ctype - pointer to array[257] of "type of characters"
-          bit mask for each chatacter, e.g. if a
-          character is a digit or a letter or a separator, etc.
+          bit mask for each character, e.g., whether a
+          character is a digit, letter, separator, etc.
           Monty 2004-10-21:
           If you look at the macros, we use ctype[(char)+1].

@@ -87,17 +86,64 @@ Conversion tables
   to_upper   - pointer to array[256] used in UCASE()
   sort_order - pointer to array[256] used for strings comparison
+In all Asian charsets these arrays are set up as follows:
+- All bytes in the range 0x80..0xFF were marked as letters in the
+  ctype array.
+- The to_lower and to_upper arrays map only ASCII letters.
+  UPPER() and LOWER() doesn't really work for multi-byte characters.
+  Most of the characters in Asian character sets are ideograms
+  anyway and they don't have case mapping. However, there are
+  still some characters from European alphabets.
+  For example:
+  _ujis 0x8FAAF2 - LATIN CAPITAL LETTER Y WITH ACUTE
+  _ujis 0x8FABF2 - LATIN SMALL LETTER Y WITH ACUTE
+  But they don't map to each other with UPPER and LOWER operations.
+- The sort_order array is filled case insensitively for the
+  ASCII range 0x00..0x7F, and in "binary" fashion for the multi-byte
+  range 0x80..0xFF for these collations:
+  cp932_japanese_ci,
+  euckr_korean_ci,
+  eucjpms_japanese_ci,
+  gb2312_chinese_ci,
+  sjis_japanese_ci,
+  ujis_japanese_ci.
+  So multi-byte characters are sorted just according to their codes.
+- Two collations are still case insensitive for the ASCII characters,
+  but have special sorting order for multi-byte characters
+  (something more complex than just according to codes):
+  big5_chinese_ci
+  gbk_chinese_ci
+  So handlers for these collations use only the 0x00..0x7F part
+  of their sort_order arrays, and apply the special functions
+  for multi-byte characters
+In Unicode character sets we have full support of UPPER/LOWER mapping,
+for sorting order, and for character type detection.
+"utf8_general_ci" still has the "old-fashioned" arrays
+like to_upper, to_lower, sort_order and ctype, but they are
+not really used (maybe only in some rare legacy functions).
 Unicode conversion data
 -----------------------
-For 8bit character sets:
+For 8-bit character sets:
 tab_to_uni  : array[256] of charset->Unicode translation
 tab_from_uni: a structure for Unicode->charset translation
-Non-8 bit charsets have their own structures per charset
-hidden in correspondent ctype-xxx.c file and don't use
+Non-8-bit charsets have their own structures per charset
+hidden in corresponding ctype-xxx.c file and don't use
 tab_to_uni and tab_from_uni tables.

@@ -106,9 +152,9 @@ Parser maps
 state_map[]
 ident_map[]
-These maps are to quickly identify if a character is
-an identificator part, a digit, a special character,
-or a part of other SQL language lexical item.
+These maps are used to quickly identify whether a character is
+an identifier part, a digit, a special character,
+or a part of another SQL language lexical item.
 Probably can be combined with ctype array in the future.
 But for some reasons these two arrays are used in the parser,

@@ -116,32 +162,32 @@ while a separate ctype[] array is used in the other part of the
 code, like fulltext, etc.
-Misc fields
------------
+Miscellaneous fields
+--------------------
-strxfrm_multiply - how many times a sort key (i.e. a string
-                   which can be passed into memcmp() for comparison)
+strxfrm_multiply - how many times a sort key (that is, a string
+                   that can be passed into memcmp() for comparison)
                    can be longer than the original string.
                    Usually it is 1. For some complex
-                   collations it can be bigger. For example
+                   collations it can be bigger. For example,
                    in latin1_german2_ci, a sort key is up to
-                   twice longer than the original string.
+                   two times longer than the original string.
                    e.g. Letter 'A' with two dots above is
                    substituted with 'AE'.
-mbminlen         - mininum multibyte sequence length.
-                   Now always 1 except ucs2. For ucs2
+mbminlen         - minimum multi-byte sequence length.
+                   Now always 1 except for ucs2. For ucs2,
                    it is 2.
-mbmaxlen         - maximum multibyte sequence length.
-                   1 for 8bit charsets. Can be also 2 or 3.
+mbmaxlen         - maximum multi-byte sequence length.
+                   1 for 8-bit charsets. Can be also 2 or 3.
 max_sort_char    - for LIKE range
-                   in case of 8bit character sets - native code
+                   in case of 8-bit character sets - native code
                    of maximum character (max_str pad byte);
                    in case of UTF8 and UCS2 - Unicode code of the maximum
                    possible character (usually U+FFFF). This code is
-                   converted to multibyte representation (usually 0xEFBFBF)
+                   converted to multi-byte representation (usually 0xEFBFBF)
                    and then used as a pad sequence for max_str.
-                   in case of other multibyte character sets -
+                   in case of other multi-byte character sets -
                    max_str pad byte (usually 0xFF).
 MY_CHARSET_HANDLER

@@ -151,10 +197,10 @@ MY_CHARSET_HANDLER is a collection of character-set
 related routines. Defined in m_ctype.h. Have the
 following set of functions:
-Multibyte routines
+Multi-byte routines
 ------------------
-ismbchar()  - detects if the given string is a multibyte sequence
-mbcharlen() - returns length of multibyte sequence starting with
+ismbchar()  - detects whether the given string is a multi-byte sequence
+mbcharlen() - returns length of multi-byte sequence starting with
               the given character
 numchars()  - returns number of characters in the given string, e.g.
               in SQL function CHAR_LENGTH().

@@ -163,29 +209,29 @@ charpos() - calculates the offset of the given position in the string.
               INSERT()
 well_formed_length()
-            - finds the length of correctly formed multybyte beginning.
+            - finds the length of correctly formed multi-byte beginning.
               Used in INSERTs to cut a beginning of the given string
               which is
               a) "well formed" according to the given character set.
               b) can fit into the given data type
               Terminates the string in the good position, taking in account
-              multibyte character boundaries.
+              multi-byte character boundaries.
-lengthsp()  - returns the length of the given string without traling spaces.
+lengthsp()  - returns the length of the given string without trailing spaces.
 Unicode conversion routines
 ---------------------------
-mb_wc       - converts the left multibyte sequence into it Unicode code.
-mc_mb       - converts the given Unicode code into multibyte sequence.
+mb_wc       - converts the left multi-byte sequence into its Unicode code.
+mc_mb       - converts the given Unicode code into multi-byte sequence.
-Case and sort convertion
+Case and sort conversion
 ------------------------
-caseup_str  - converts the given 0-terminated string into the upper case
-casedn_str  - converts the given 0-terminated string into the lower case
-caseup      - converts the given string into the lower case using length
-casedn      - converts the given string into the lower case using length
+caseup_str  - converts the given 0-terminated string to uppercase
+casedn_str  - converts the given 0-terminated string to lowercase
+caseup      - converts the given string to lowercase using length
+casedn      - converts the given string to lowercase using length
 Number-to-string conversion routines
 ------------------------------------

@@ -193,7 +239,7 @@ snprintf()
 long10_to_str()
 longlong10_to_str()
-The names are pretty self-descripting.
+The names are pretty self-describing.
 String padding routines
 -----------------------

@@ -201,7 +247,7 @@ fill() - writes the given Unicode value into the given string
          with the given length. Used to pad the string, usually
          with space character, according to the given charset.
-String-to-numner conversion routines
+String-to-number conversion routines
 ------------------------------------
 strntol()
 strntoul()

@@ -209,10 +255,10 @@ strntoll()
 strntoull()
 strntod()
-These functions are almost for the same thing with their
-STDLIB counterparts, but also:
+These functions are almost the same as their STDLIB counterparts,
+but also:
    - accept length instead of 0-terminator
-   - and are character set dependant
+   - are character set dependent
 Simple scanner routines
 -----------------------

@@ -230,8 +276,8 @@ strnxfrm() - makes a sort key suitable for memcmp() corresponding
 like_range() - creates a LIKE range, for optimizer
 wildcmp()    - wildcard comparison, for LIKE
 strcasecmp() - 0-terminated string comparison
-instr()      - finds the first substring appearence in the string
-hash_sort()  - calculates hash value taking in account
+instr()      - finds the first substring appearance in the string
+hash_sort()  - calculates hash value taking into account
                the collation rules, e.g. case-insensitivity,
                accent sensitivity, etc.
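The relationship between sort_order, strnxfrm() and hash_sort() described in these hunks can be illustrated with a small Python sketch. It is not MySQL's implementation: build_sort_order, make_sort_key and collation_hash are hypothetical names, the table only mimics the layout described for the simple Asian *_ci collations (ASCII folded case insensitively, bytes 0x80..0xFF left in binary order), and CRC32 is just a stand-in hash.

```python
import zlib


def build_sort_order() -> bytes:
    """Build a 256-entry table: ASCII letters fold to upper case,
    every other byte (including 0x80..0xFF) maps to itself."""
    table = bytearray(range(256))
    for code in range(ord("a"), ord("z") + 1):
        table[code] = code - 0x20  # 'a'..'z' -> 'A'..'Z'
    return bytes(table)


SORT_ORDER = build_sort_order()


def make_sort_key(src: bytes, sort_order: bytes = SORT_ORDER) -> bytes:
    """Map each byte through sort_order, giving a key that can be
    compared bytewise (the role memcmp() plays in the text)."""
    return src.translate(sort_order)


def collation_hash(src: bytes) -> int:
    """Hash the sort key rather than the raw bytes, so strings that
    compare equal under the collation also hash equally.
    CRC32 is an assumption made for this sketch only."""
    return zlib.crc32(make_sort_key(src))
```

Comparing make_sort_key(a) with make_sort_key(b) bytewise then gives a case-insensitive order for ASCII text, while multi-byte data keeps its plain code order.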
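The <map> syntax from the README hunk (hexadecimal words separated by arbitrary whitespace, 257 words for ctype and 256 for the other arrays) and the ctype[(char)+1] lookup mentioned in CHARSET_INFO.txt can be sketched together. This is an illustrative Python sketch, not MySQL's loader: parse_map_words and ctype_entry are hypothetical names, and reading the XML around the <map> element is left out.

```python
from typing import List


def parse_map_words(text: str, expected: int) -> List[int]:
    """Parse the body of a <map> element: hexadecimal words separated
    by arbitrary whitespace. `expected` is 257 for ctype and 256 for
    the other arrays, as stated in the README."""
    values = [int(word, 16) for word in text.split()]
    if len(values) != expected:
        raise ValueError(
            "expected %d words, found %d" % (expected, len(values))
        )
    return values


def ctype_entry(ctype: List[int], byte: int) -> int:
    """Look up the type bit mask for a byte value. The table has 257
    entries and is indexed with an offset of one, as in
    ctype[(char)+1]; entry 0 is therefore never reached by a byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte value out of range")
    return ctype[byte + 1]
```

A caller would pass 257 when reading the ctype map and 256 for the lower, upper and sort-order maps, so a truncated or overlong array is rejected before it is used.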