Kelley, Migrating to Unicode, Part II Volume 14, Number 5—May 2010 (Special Issue)
Migrating to Unicode, Part II
By Josh Kelley
Part I introduced Unicode and covered the various options for working with Unicode text in C, C++, the Windows API, and the VCL. Now, in Part II, I will specifically discuss how to migrate a C++Builder application to Unicode.

Migrating C++Builder applications to Unicode

There are two basic approaches to handling Unicode issues when migrating a pre-2009 C++Builder application to C++Builder 2009 and above.

1. You can do a complete Unicode migration. Use the Windows Unicode APIs instead of ANSI APIs, replace char with wchar_t, replace std::string with std::wstring, and use UnicodeString instead of AnsiString. Depending on how your C and C++ code is written, this could be a major undertaking.
2. You can leave your application using ANSI and convert to Unicode only where it's absolutely necessary. Because only the VCL portion of C++Builder requires the use of Unicode, your C and C++ string manipulation and Windows API interaction can continue to use char and std::string. Even the portions of your code that interact with the VCL can often get away with continuing to use ANSI strings, thanks to the implicit conversions between AnsiString and UnicodeString that the C++ VCL provides (described in more detail below).

Of course, these two approaches aren't mutually exclusive. You can do an initial, "quick-and-dirty" migration using the minimal approach, and then gradually implement the complete approach for the portions of your application that would most benefit from Unicode support. And, even a completely migrated application may still need to deal with ANSI or UTF-8 when interfacing with legacy file formats or APIs, or when reading or writing data from or to the disk or network.

Regardless of which approach you choose, there are a number of C and C++ techniques that can be used to help with a Unicode migration:

- The Windows API includes functions for converting between ANSI and Unicode, and the VCL provides conversion constructors to easily convert between AnsiString and UnicodeString values.
- The standard Windows header tchar.h includes macros designed to let you write code that compiles as either ANSI or Unicode. This can help when converting code one portion at a time.
- C++-specific typedefs, and the use of C++ features such as function overloading, can fill in the gaps left by tchar.h in writing code that compiles as either ANSI or Unicode.
- The C++ concept known as "shims" (as described in Matthew Wilson's Imperfect C++ [1] and as used in the STLSoft library [2]), combined with the use of C++ templates, can make it simple to write generic code that works with both AnsiString and UnicodeString (and C-style strings, and std::string, and anything else you care to support).
- The use of variadic functions such as printf() and sprintf() presents a special challenge for migrating to Unicode, since the compiler is unable to catch ANSI-versus-Unicode issues with these.
  Standalone scripts can be used to transform these variadic function calls into a format that the compiler can check and then revert them to their normal format after all issues are addressed.

Converting text to and from Unicode

Before any migration can proceed, you need to know how to convert between the various Unicode encodings and the various ANSI encodings. The two easiest ways are using the Windows API and using the VCL.

The relevant Windows API functions are WideCharToMultiByte() [3], which, despite its name, converts from UTF-16 to the encoding of your choice (UTF-8 or any of the various ANSI encodings); and MultiByteToWideChar() [4], which converts from the encoding of your choice to UTF-16. MSDN has full documentation on using these functions.

Converting using the VCL is even easier. The VCL provides C++ conversion constructors (constructors that can be called with only a single argument) so that you can construct a UnicodeString from an AnsiString or UTF8String, or vice versa. Because C++ conversion constructors are implicitly invoked as needed, this also lets you provide an AnsiString wherever a UnicodeString is needed, or vice versa. (For example, this lets you assign a UnicodeString to an AnsiString.) See Listing 1 for example code.

The ease with which conversions can be done in the VCL can have drawbacks. Because the assignment operators look just like regular assignment and the conversion constructors can be implicitly invoked, your code may be converting between ANSI and UTF-16 without your even being aware of it. This can add runtime overhead, but more importantly, it can result in loss of data when converting from UTF-16 to an ANSI encoding that cannot represent all of the Unicode characters. Delphi includes a compiler warning when this happens ("W1058: Implicit string cast with potential data loss from 'string' to 'AnsiString'"), but C++Builder will silently accept it.

Ideally there would be an option to have the C++Builder compiler emit a warning any time these ANSI-Unicode conversions are implicitly invoked, but as far as I can tell, no such option exists. If having these functions implicitly invoked is a concern for you, then the only solution is to modify C++Builder's header files.

To do this, open the file "include\vcl\dstring.h" and find the following lines:

Listing 1: Converting to and from Unicode

    // Sample C++ functions for doing
    // Unicode<->ANSI conversions using the
    // Windows API. Note the use of
    // boost::scoped_array to dynamically
    // allocate memory and automatically clean
    // it up once we're done.

    std::wstring AnsiToUnicode(const char *s)
    {
      DWORD size = MultiByteToWideChar(CP_ACP,
        0, s, -1, NULL, 0);
      if (size == 0) {
        return std::wstring();
      }
      boost::scoped_array<wchar_t> buffer(
        new wchar_t[size]);
      MultiByteToWideChar(CP_ACP, 0, s, -1,
        buffer.get(), size);
      return std::wstring(buffer.get());
    }

    std::string UnicodeToAnsi(const wchar_t *s)
    {
      DWORD size = WideCharToMultiByte(CP_ACP,
        0, s, -1, NULL, 0, NULL, NULL);
      if (size == 0) {
        return std::string();
      }
      boost::scoped_array<char> buffer(
        new char[size]);
      WideCharToMultiByte(CP_ACP, 0, s, -1,
        buffer.get(), size, NULL, NULL);
      return std::string(buffer.get());
    }

    __fastcall TForm1::TForm1(TComponent* Owner)
      : TForm(Owner)
    {
      // Implicit Unicode-to-ANSI conversion:
      AnsiString s1 = L"Hello, world!";
      // Implicit ANSI-to-Unicode conversion:
      UnicodeString s2 = "Hello, world!";

      // This works without modification in
      // C++Builder 2009, even though Caption
      // is Unicode and s1 is ANSI.
      Label1->Caption = s1;
      // An implicit conversion from Unicode to
      // ANSI. NOTE: This could lose data.
      s1 = Label2->Caption;
      // Identical to the above, but explicit.
      s1 = AnsiString(Label2->Caption);
      // We can also use a temporary AnsiString
      // or UnicodeString to do Unicode<->ANSI
      // conversions.
      MessageBoxA(Handle,
        AnsiString(Label1->Caption).c_str(),
        "Demo", MB_OK);
    }
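The silent conversions discussed in this section come from converting constructors that are not marked explicit. The following is only a small, VCL-independent sketch of that language mechanism. NarrowText and WideText are hypothetical stand-ins (they are not AnsiString and UnicodeString, and nothing here is taken from dstring.h), and the byte-by-byte widening is an assumption that is valid for ASCII text only.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for an ANSI string class (not the VCL AnsiString).
class NarrowText {
public:
    explicit NarrowText(const std::string& s) : data_(s) {}
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

// Hypothetical stand-in for a UTF-16 string class (not the VCL UnicodeString).
class WideText {
public:
    explicit WideText(const std::wstring& s) : data_(s) {}

    // Because this converting constructor is explicit, a NarrowText is never
    // turned into a WideText unless the call site spells the conversion out.
    explicit WideText(const NarrowText& n) {
        for (char c : n.str()) {
            // Assumption: ASCII-only input; real code would use a proper
            // code-page conversion instead of widening each byte.
            assert(static_cast<unsigned char>(c) < 0x80);
            data_.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
        }
    }
    const std::wstring& str() const { return data_; }
private:
    std::wstring data_;
};

// The implicit form (WideText w = someNarrowText;) is rejected at compile time...
static_assert(!std::is_convertible<NarrowText, WideText>::value,
              "implicit narrow-to-wide conversion should not compile");
// ...while the explicit form (WideText w(someNarrowText);) is still available.
static_assert(std::is_constructible<WideText, NarrowText>::value,
              "explicit narrow-to-wide conversion should compile");
```

With types built this way, every ANSI-to-Unicode conversion is visible in the source, at the cost of more typing at each call site.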
takes two arguments (a message and a caption), and so a TApplication::MessageBox-style function would need four overloads (ANSI message and caption; Unicode message and caption; ANSI message and Unicode caption; Unicode message and ANSI caption).

And this is only for a bare bones TApplication::MessageBox-style function. Most other VCL functions take Strings, not wchar_t*, as parameters; it would be convenient if we had overloads to do the same for our hypothetical MessageBox() replacement, but that adds even more overloads. It would be even more convenient if we could also support C++ Standard Library types like std::string or COM-related types like WideString or BSTR. The number of overloads required to cover all of these combinations of parameters for even a single function quickly becomes prohibitive. Clearly, a better approach is needed.

The concept of shims, as promoted by C++ author and developer Matthew Wilson, offers a solution. Shims "are small, lightweight (in most all cases having zero runtime cost) components that help types 'fit or align' into algorithms and client code" [6]. For example, suppose you had a function that, if given any string-like object, gave you a pointer to a C-style string. (Since this function gets a pointer to a wide C-style string, and following the convention of Matthew Wilson's STLSoft library, we'll call this function c_str_ptr_w().) Then you could write the following TApplication::MessageBox() replacement:

    template <typename T1, typename T2>
    int AppMessageBox(const T1& Text,
      const T2& Caption, int Flags = MB_OK)
    {
      return Application->MessageBox(
        c_str_ptr_w(Text),
        c_str_ptr_w(Caption), Flags);
    }

Now we simply need to make sure that c_str_ptr_w() results in a valid argument for every parameter type that we use for AppMessageBox(). The first few parameter types are easy:

    inline const wchar_t *c_str_ptr_w(
      const wchar_t *s)
    {
      return s;
    }

    inline const wchar_t *c_str_ptr_w(
      const UnicodeString& s)
    {
      return s.c_str();
    }

    inline const wchar_t *c_str_ptr_w(
      const std::wstring& s)
    {
      return s.c_str();
    }

So far we're following the practice described by Matthew Wilson in [6] and [1] and implemented in the STLSoft library [2]. We have a replacement for TApplication::MessageBox() that we can switch to with a simple search-and-replace (just replace "Application->MessageBox" with "AppMessageBox") and that takes any of several types of Unicode arguments without excessive overloads or extra function calls.

For the purpose of quickly migrating to Unicode, however, it's useful to have a TApplication::MessageBox() replacement that can also take ANSI arguments. In his article on shims and in his work on the STLSoft library, Matthew Wilson explicitly avoids providing shims that convert between ANSI and Unicode, since those introduce (in his words) "semi-implicit" conversion operations that incur a performance penalty and violate the expectation that shims be lightweight. However, as part of a C++Builder 2009 or 2010 Unicode migration, it's more useful to accept a (possibly negligible) performance penalty in order to complete the initial migration as soon as possible, then address performance and "proper" Unicode handling as needed.

Therefore, we need to provide c_str_ptr_w shims that take ANSI arguments (const char *, AnsiString, and std::string). This is harder than the previous cases. Our code will have to take the following approach:

- We need to somehow provide a const wchar_t pointer. We can't simply return a const wchar_t* from c_str_ptr_w(), because we need to allocate memory to store the results of the ANSI-to-Unicode conversion, and returning a raw pointer to that allocated memory would constitute a memory leak.
- We can, however, define a class that contains the allocated memory and return a copy of (not a reference to nor a pointer to) that class. The C++ language guarantees that it will properly clean up that temporary object, and with it the allocated memory, at the end of the full expression in which it was created.
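The two points above can be sketched as follows. This is a minimal, portable illustration rather than the article's own implementation: wide_c_str_proxy and fake_message_box are hypothetical names, std::mbstowcs is assumed as a stand-in for MultiByteToWideChar so that the code builds without the Windows headers, and AppMessageBox returns a string here (instead of calling Application->MessageBox()) purely so the result can be inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical proxy returned by value from the ANSI shims. It owns the
// converted text, so the buffer lives exactly as long as the temporary does.
class wide_c_str_proxy {
public:
    explicit wide_c_str_proxy(const char* s) {
        assert(s != nullptr);
        // Assumption: std::mbstowcs replaces MultiByteToWideChar here; it
        // converts according to the current C locale.
        std::vector<wchar_t> buffer(std::strlen(s) + 1);
        const std::size_t n = std::mbstowcs(buffer.data(), s, buffer.size());
        if (n != static_cast<std::size_t>(-1)) {
            text_.assign(buffer.data(), n);
        }
        // On a conversion failure text_ simply stays empty.
    }
    // Lets the proxy be passed wherever a const wchar_t* is expected.
    operator const wchar_t*() const { return text_.c_str(); }
private:
    std::wstring text_;
};

// Unicode shims: no conversion, no allocation.
inline const wchar_t* c_str_ptr_w(const wchar_t* s) { return s; }
inline const wchar_t* c_str_ptr_w(const std::wstring& s) { return s.c_str(); }

// ANSI shims: return the owning proxy by value, never a raw pointer.
inline wide_c_str_proxy c_str_ptr_w(const char* s) {
    return wide_c_str_proxy(s);
}
inline wide_c_str_proxy c_str_ptr_w(const std::string& s) {
    return wide_c_str_proxy(s.c_str());
}

// Hypothetical stand-in for Application->MessageBox(): it only reports
// what it was given.
inline std::wstring fake_message_box(const wchar_t* text,
                                     const wchar_t* caption) {
    return std::wstring(caption) + L": " + text;
}

template <typename T1, typename T2>
std::wstring AppMessageBox(const T1& text, const T2& caption) {
    // Any proxies created here are destroyed only after the full
    // expression, that is, after fake_message_box() has returned.
    return fake_message_box(c_str_ptr_w(text), c_str_ptr_w(caption));
}
```

A call such as AppMessageBox("Hello", L"Demo") then mixes an ANSI message with a Unicode caption without any extra overloads. Because the proxy is destroyed at the end of the full expression, the pointer it yields should be used immediately and not stored for later.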
Reading and writing external data now requires attention both to in-memory storage (RawByteString, AnsiString, or UTF8String) and to encodings. (Using the system default ANSI encoding may lose data when transferring from UTF-16. UTF-8 is often preferable.) Code that writes external data also needs to consider writing a Byte Order Mark (BOM), a special sequence of bytes at the beginning of a file that indicates the file's endianness and Unicode encoding.

Finally, database tools and database interactions may require additional attention, depending on your database's capabilities.

Some of these issues are discussed in more detail in [9].

Putting it all together

Unicode is a very broad topic, and even the sub-topic of migrating to Unicode for C++Builder 2009 and 2010 touches upon many techniques. As a review, here's an overview of one approach to migrating your application to C++Builder 2009 or 2010:

First, decide on whether you're going to do a complete migration (to gain the full benefits of Unicode) or a minimal migration (to get up and running in the new IDE as soon as possible).

If you're doing a complete migration:

1. Check your third-party libraries and make sure that they're compatible with C++Builder 2009 and 2010.
2. Before switching to C++Builder 2009 or 2010:
   a. Replace AnsiString with String.
   b. Mark string literals ("Hello") with tchar.h's _T macro.
   c. Replace C library routines with their tchar.h equivalents.
3. Add C++ typedefs such as tstring so that C++ string manipulation will work after the switch to Unicode.
4. Convert your project to C++Builder 2009 or 2010. Under Project, Options, Directories and Conditionals, make sure that "_TCHAR maps to" is set to "wchar_t."
5. Introduce AnsiString, RawByteString, and UTF8String in places where you need to continue to manipulate narrow character text.
6. Use string shims, C++ overloading, and similar techniques as needed to handle remaining ANSI versus Unicode issues.
7. Run the type-safe printf() transformer on your code to catch any issues with variadic functions.
8. Review your code for places where you assume that strings can be arbitrarily indexed or split; this is no longer the case with Unicode.

If you're doing a minimal migration:

1. Check your third-party libraries and make sure that they're compatible with C++Builder 2009 and 2010.
2. Convert your project to C++Builder 2009 or 2010. Under Project, Options, Directories and Conditionals, make sure that "_TCHAR maps to" is set to "char."
3. Replace String with AnsiString.
4. Use string shims, C++ overloading, and similar techniques to handle interactions between Unicode VCL code and your ANSI application code.
5. Run the type-safe printf() transformer on your code to catch any issues with variadic functions.

Gradually switch to Unicode, as time and business cases permit, to gain the full benefits of Unicode.

Contact Josh at joshkel@gmail.com.

References

1. Matthew Wilson, Imperfect C++. Addison-Wesley, 2004.
2. Matthew Wilson et al. "STLSoft – Robust, Lightweight, Cross-platform, Template Software." http://www.stlsoft.org/.
3. WideCharToMultiByte. http://msdn.microsoft.com/en-us/library/dd374130%28VS.85%29.aspx.
4. MultiByteToWideChar. http://msdn.microsoft.com/en-us/library/dd319072%28VS.85%29.aspx.