Unicode

Development must use Unicode. This is an easy change in that it only requires inclusion of both 'UNICODE' (for Win32 functions) and '_UNICODE' (for C runtime functions) in each build configuration's preprocessor definitions list.
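A build configuration that defines only one of the two symbols can be caught at compile time. The following is a minimal sketch, assuming a shared header that every module includes. The constant name kUnicodeBuild is hypothetical and is not part of any Microsoft header.

```cpp
// Hypothetical guard for a shared project header.
// UNICODE and _UNICODE normally come from the build configuration's
// preprocessor definitions list, not from source code.
#if defined(UNICODE) != defined(_UNICODE)
#error "Define both UNICODE and _UNICODE, or neither."
#endif

// Hypothetical constant so other code can inspect the build mode.
#if defined(UNICODE)
constexpr bool kUnicodeBuild = true;
#else
constexpr bool kUnicodeBuild = false;
#endif
```

The guard only checks that the two definitions agree. It does not force a Unicode build by itself.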
Base Character

The default C runtime string type is based upon char, an 8 bit data type (signed by default in Microsoft's compiler). This data type is inherently incapable of dealing with all of the characters required to display all human languages. The data type char and functions that use it are to be avoided.

Character Types

Unicode

On Windows, Unicode stores its character data in unsigned 16 bit units. This data type is referred to in multiple ways. In the end, the data types resolve to much the same thing.
Microsoft has supplied a type, TCHAR, that resolves to char when building for MBCS and to wchar_t when building for UNICODE. It is recommended that use of the character type char be replaced by TCHAR.

    TCHAR szString[_MAX_PATH+1];

UTF-8

While the intention is to have the entire product line use Unicode characters throughout, sometimes this is not possible. In particular, the non-Windows standard for handling big characters is UTF-8. UTF-8 is a multi-byte character set where each character is of variable length. This is a huge oversimplification, but if a byte's most significant bit is zero then that byte is a complete one byte character. If the most significant bit is set then the byte is part of a multi-byte character: the lead byte indicates how many bytes the character occupies, and every following byte of that character begins with the bits 10.

UTF-8 and Unicode strings are not compatible. If you see just the first letter of your string displayed then the most likely problem is that a Unicode string was passed into a function expecting UTF-8.

1.2.2 String Constants

There are three kinds of string constants that we care about.
To always get a string constant that is composed of 8 bit characters, enclose the string in naked quotes.

    "This is an 8 bit character set string constant."

To get a string constant that adapts its data type in a manner appropriate to the UNICODE state, use the Microsoft supplied macro _T("xx"). Any string within this macro will be narrow (8 bit) when not building in UNICODE and wide (16 bit) when building in UNICODE.

    _T("This is a really creative string example. Wow.")

When a string constant must always be wide character regardless of UNICODE build mode, use L"xx".

    L"This string constant is composed of wide characters."

Replace English Specific String Routines

All English specific string routines such as strlen, sprintf, etc., need to be replaced by an appropriate locale sensitive routine. See the section on Locale Sensitive Functions and TCHAR String Functions for details.

String Type

What type of string can be used?

Array of wchar_t

It is possible to declare a wchar_t array and to use the wide character versions of various string functions on such arrays. It is also possible to remove your own appendix with nothing more than a rusty spoon and good thoughts for anesthesia, but there are better techniques available.

std::string

STL's std::string is a string of chars, an 8 bit type, thus making it inappropriate for Unicode's requirement for a 16 bit unsigned data type. Although something of a kludge and somewhat misleading to boot, std::string is often used to encapsulate and transport UTF-8 data and arrays of bytes. Neither kind of data is exactly appropriate for a char array, but it's easy to do, and lazy programmers who have never had to deal with I18N issues often acquire this particular habit. This is a habit that is only slightly harder to kick than methamphetamine but, with love, persistence and lots of electric cattle prods, the offending programmer can be taught the error of his ways.

There are problems in using std::string to encapsulate UTF-8.
The normal character iterator is of no use because a UTF-8 character may span up to 4 bytes of data. Equally, it is no longer safe to search for a hard coded character using a strtok or find function, since the character may occupy more than one char. Windows does not directly support UTF-8 locale functions in the same manner as Unicode strings are supported. Depending upon the type of string function to be performed, it may be necessary to first convert the UTF-8 string to Unicode, perform the locale sensitive function and then convert back to UTF-8.

std::basic_string<TCHAR>

The std::basic_string<TCHAR> template provides the same functionality as std::string but uses TCHAR as its character type, which is a 16 bit wchar_t in a UNICODE build. This template will work admirably as a direct replacement for std::string. While it is possible to use basic_string<TCHAR> in conjunction with string resources, there are some extra considerations to take into account.
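A common convenience is to give the template a short name. The sketch below assumes a hypothetical typedef name, tstring, and a hypothetical helper, MakeBackupName. Neither is supplied by Microsoft or by the standard library. The stand-in definitions of TCHAR and _T are an assumption that exists only so the sketch also compiles away from Windows, where <tchar.h> is not available.

```cpp
#include <string>

#if defined(_WIN32)
#include <tchar.h>   // supplies TCHAR and _T()
#else
// Stand-ins (assumption) so the sketch compiles where <tchar.h> is missing.
typedef char TCHAR;
#define _T(x) x
#endif

// Hypothetical project-wide name for the TCHAR based string.
typedef std::basic_string<TCHAR> tstring;

// Hypothetical helper: appends a suffix without caring whether
// TCHAR is char or wchar_t in the current build.
tstring MakeBackupName(const tstring& fileName)
{
    return fileName + _T(".bak");
}
```

Because both the string type and the literal follow TCHAR, the same source builds in either mode without casts.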
CString

The MFC string class CString natively supports Unicode and manages its own storage, which makes it immune to the usual buffer overrun problems. Additionally, its own member functions such as Format, MakeUpper, etc., are already locale sensitive. It operates within the MFC umbrella and so resource load operations will be sensitive to the resource tree. While there are certainly workable alternatives to the CString class, there are none that provide access to resource strings with anywhere near the same ease of effort. The primary difficulty in the use of CString is that it is part of MFC and, not surprisingly, cannot be used in a module that doesn't support MFC. Services and out-of-process ATL COM servers do not normally support MFC.
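Looking back at the UTF-8 discussion earlier in this section, the gap between bytes and characters in a std::string can be made concrete. The following is a minimal sketch, not a validating decoder: it assumes well-formed UTF-8 input, and the function names IsUtf8ContinuationByte and CountUtf8Characters are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Returns true when the byte is a UTF-8 continuation byte,
// that is, its top two bits are 10.
inline bool IsUtf8ContinuationByte(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Counts characters in a UTF-8 string by counting every byte that is
// not a continuation byte. Assumes well-formed input; no validation.
std::size_t CountUtf8Characters(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i)
    {
        if (!IsUtf8ContinuationByte(static_cast<unsigned char>(utf8[i])))
            ++count;
    }
    return count;
}
```

For the five byte string "caf\xC3\xA9" the function reports four characters, while size() reports five. That mismatch is why a plain char iterator cannot be trusted on UTF-8 data.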