Contents of this lesson:

- Unicode versus ANSI
- Why UNICODE should be defined in the source code
- How to include the [windows.h] header file in C++
This lesson’s main points:

- The Windows API is based on UTF-16 encoded text, called “Unicode”.
- In C++ that means using the wchar_t type, & friends.
- Microsoft’s T-macros support Windows 9x, if you want that.
> Unicode versus ANSI
About the simplest GUI application you can make is one that just presents a message box, consisting of a single function call in main. Almost every GUI framework offers a message box function, and so, of course, does the Windows API. The official online documentation of the Windows API’s MessageBox shows its signature as …
```cpp
int WINAPI MessageBox(
    __in_opt HWND hWnd,
    __in_opt LPCTSTR lpText,
    __in_opt LPCTSTR lpCaption,
    __in UINT uType
    );
```
WINAPI is a macro that originally depended on the platform, and that expands to a non-standard specification of the machine code level calling convention. __in_opt and __in are macros that expand to nothing, but that purportedly aid certain C level tools and programming techniques. I write “purportedly” because they’re frequently wrong, as they are above, and then it’s difficult to see how they can really help with anything. This visual noise is called header annotations or SAL annotations, where SAL is short for “Standard Annotation Language”, Microsoft’s pretentious name for this “language”. They’re best ignored.
Ignoring the macro-based visual noise, then, and also ignoring the WINAPI, the declaration becomes …
```cpp
int MessageBox(
    HWND hWnd,
    LPCTSTR lpText,
    LPCTSTR lpCaption,
    UINT uType
    );
```
which looks pretty much like a standard C++ function declaration!
However, there’s more macro-based obfuscation here. Both MessageBox and LPCTSTR are names that depend on whether the UNICODE symbol is defined when the [windows.h] header is included. MessageBox is directly a macro (as is almost every Windows API function name!), while LPCTSTR is defined in terms of a conditionally compiled definition:
| Symbol: | UNICODE defined: | UNICODE not defined: |
|---|---|---|
| MessageBox | MessageBoxW | MessageBoxA |
| LPCTSTR | wchar_t const* | char const* |
So depending on whether UNICODE is defined when [windows.h] is included, the documentation’s single declaration expands to one of the two declarations …
| “Unicode” version: | “ANSI” version: |
|---|---|
| int MessageBoxW( HWND hWnd, wchar_t const* lpText, wchar_t const* lpCaption, UINT uType ); | int MessageBoxA( HWND hWnd, char const* lpText, char const* lpCaption, UINT uType ); |
Both versions are provided by the [user32.dll] DLL, and both versions can be called in the same program. The wchar_t based MessageBoxW version is called the Unicode version, because the strings are encoded as UTF-16 (a 16-bit encoding of Unicode). From which you can infer that the wchar_t type in Windows is 16 bits, guaranteed. The Unicode version is the basic version, and the char based MessageBoxA version is just a wrapper that translates the strings from Windows ANSI encoding to UTF-16, and calls MessageBoxW. Most Unicode based API functions (but not all) have such an ANSI wrapper.
OK, let’s try this, with this source code:
```cpp
#undef  UNICODE
#define UNICODE
#undef  STRICT
#define STRICT
#include <windows.h>

int main()
{
    DWORD const infoboxOptions = MB_OK | MB_ICONINFORMATION | MB_SETFOREGROUND;

    char const* const    narrowText = "It's a 日本国 кошка, LOL!";
    wchar_t const* const wideText   = L"It's a 日本国 кошка, LOL!";

    MessageBox( 0, wideText, L"Unicode (wide) text:", infoboxOptions );
    MessageBoxA( 0, narrowText, "ANSI (narrow) text:", infoboxOptions );
}
```
Detailed instructions:
TODO #1:
- Run Visual C++ Express.
- Bring up the New Project dialog, e.g. via the menus [File → New → Project…].
- Choose a Win32 Project:
- Type in a project name like “unicode_versus_ansi”.
- Click the OK button, but DO NOT click Finish yet.
- In the Application Wizard, choose Application Settings (it’s up at the left):
- Still in the Application Wizard, tick off that you want an Empty Project. This also prevents precompiled headers (non-standard preprocessor behavior), even though the precompiled header check box is disabled and has a tick mark in it. The dialog has a bug: it prevents you from changing the precompiled header choice as long as the selection is “Windows Application”, and it fails to remove the check mark, but the Empty Project choice works in spite of this bug:
- Click the Finish button in the Application Wizard; the new project is created.
- To add a main source code file: right-click the project name or e.g. the Source Files project folder and [Add → New Item…], choose type C++ File (.cpp), name it e.g. [main.cpp] (without the square brackets), and add the source code shown above, including the Russian and Chinese characters (you can just copy and paste them):
- Save, even if you have red underlinings as shown above (it’s just Microsoft’s “Intellisense”™ code completion feature going amok). Most probably a dialog will now pop up asking whether you want to save as Unicode, as shown below, due to the Russian and Chinese characters. Just click Yes in order to save as UTF-8 with BOM:
- Build. You should get a number of warnings about characters that cannot be represented, and unless you have already fixed the project settings you should get a linking error about WinMain, which means that the silly beast failed to recognize the standard main function.
- Fix the project settings to support standard main, by setting entry point mainCRTStartup, as in the first lesson.
- Build and run:
As you can see in the second message box, the char based literal ended up as a Windows ANSI string value in the executable, with question marks for the characters that could not be represented, in spite of the UTF-8 encoded C++ source code. That’s because Visual C++ has Windows ANSI as its (as of this writing still undocumented) C++ execution character set. In order to deal correctly with e.g. international file names, it’s a good idea to base modern Windows programs entirely on Unicode; it’s also more efficient.
> Why UNICODE should be defined in the source code
The UNICODE-dependent declarations in [windows.h] were originally meant to allow you to compile the same code “as Unicode” (i.e. using Unicode functions and wchar_t based strings) and “as ANSI” (i.e. using ANSI functions and char based strings). For example, the same source that would be compiled as Unicode for Windows NT 4.0 and Windows 2000 could be compiled as ANSI for Windows 9x, since Windows 9x was a kind of hybrid Windows without the Unicode API. The Windows 9x version would then have reduced functionality, but it would at least exist!
For that target-switching usage it makes sense to have UNICODE defined (or not) in the project settings, instead of in the code.
But how well does that practice hold up for modern programming, as of 2011? Not very well. For it is very difficult to avoid introducing direct wchar_t dependencies, so that without UNICODE defined the compilation simply fails, as illustrated below:
TODO #2:
- Open the Visual C++ Express solution from lesson 1 (probably named “default_gui_project”).
- In the project settings dialog, drill down to [Configuration Properties → General], and click in the edit field for the Character Set value:
- In the drop down list for the field, select e.g. Use Multi-Byte Character Set, which does not make UNICODE defined.
- Click OK in the project settings dialog.
- Build, using e.g. keypress [F7] or via the menus [Build → Build Solution], and note that (if your code is as described in lesson 1) you now get a compilation error!
```
1>d:\winfolders\alf\my documents\visual studio 2010\projects\default_gui_project\default_gui_project.cpp(165): error C2664: 'DrawTextA' : cannot convert parameter 2 from 'const wchar_t *' to 'LPCSTR'
1>          Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast
```
DrawTextA is the ANSI version of DrawText, selected because UNICODE is now not defined. And DrawTextA is not very happy with a wchar_t based argument! So, even for that minimal program a dependency on wchar_t had sneaked in, and the simply-compile-with-or-without-UNICODE-defined scheme no longer worked!
Point: defining the UNICODE symbol, or not, via the project settings will in general just introduce a way to make compilation fail, like above. Perhaps it will fail for someone who inadvertently changes this setting. Or perhaps it will fail for someone building the software without the Visual C++ Express solution, for example building it with compilation from the command line. The UNICODE symbol should therefore instead be defined in the source code. For example, you can create a wrapper for [windows.h] that defines UNICODE before including [windows.h].
> How to include the [windows.h] header file in C++
The previous section showed that a simple include of [windows.h], like …
```cpp
#include <windows.h>
```
is not good enough. One should as a minimum define the UNICODE symbol before the include, like …
```cpp
#define UNICODE
#include <windows.h>
```
But then the compiler may issue some silly-warning about UNICODE already having been defined (e.g. via project settings), so that one should ideally either put that definition within an #ifndef, or more concisely just undefine the symbol before defining it:
```cpp
#undef  UNICODE
#define UNICODE
#include <windows.h>
```
For consistency one should then also define the runtime library’s corresponding symbol _UNICODE in the same way. The UNICODE symbol that we’ve so far dealt with affects the declarations that you get from the Windows API [windows.h] header, while the _UNICODE symbol affects the declarations that you get from the (non-standard) [tchar.h] runtime library header. Then the including code can look like this:
```cpp
#undef  _UNICODE
#define _UNICODE        // For [tchar.h]
#undef  UNICODE
#define UNICODE         // For [windows.h]
#include <windows.h>
```
But hey, the [tchar.h] header depends not only on _UNICODE but also on a symbol _MBCS which, if defined, says that declarations suitable for a non-Unicode “multi-byte character set” are desired. For example, the Use Multi-Byte Character Set setting chosen in the previous section defines _MBCS. Since _MBCS is in direct conflict with _UNICODE it’s best to detect it and err out as early as possible, instead of letting the compilation go blithely on to some arbitrary later error:
```cpp
#ifdef _MBCS
#   error "_MBCS (multi-byte character set) is defined, but only Unicode is supported"
#endif
#undef  _UNICODE
#define _UNICODE        // For [tchar.h]
#undef  UNICODE
#define UNICODE         // For [windows.h]
#include <windows.h>
```
That takes care of the Windows API and runtime library declarations that depend on the chosen character encoding. Well, except the choice of startup function, which for standards-compliance should just be main regardless of the chosen encoding. And the encoding should be UTF-16 Unicode anyway… 🙂
In addition to UNICODE, C++ code should best define the NOMINMAX symbol before including [windows.h]. Otherwise two lowercase macros min and max will be defined, and will then wreak havoc with code using these names from the standard C++ library. In the 1990’s it was also generally necessary to define STRICT before including [windows.h] in C++ code, so as to get type safe declarations that could be used from C++, but somewhere along the road STRICT definitions became the default. Still, it does not hurt to also define STRICT. A minimal general inclusion of [windows.h] for C++ can therefore look like this:
```cpp
#ifdef _MBCS
#   error "_MBCS (multi-byte character set) is defined, but only Unicode is supported"
#endif
#undef  _UNICODE
#define _UNICODE        // For [tchar.h]
#undef  UNICODE
#define UNICODE         // For [windows.h]
#undef  NOMINMAX
#define NOMINMAX        // C++ standard library compatibility
#undef  STRICT
#define STRICT          // C++ type-checking compatibility
#include <windows.h>
```
To be practically useful the assumed minimum version of Windows should also be specified. This serves both to include declarations of functions introduced with that version, and to exclude declarations of functions introduced after that version. The main problem is that there are a host of different preprocessor symbols used for this version selection: WINVER, _WIN32_WINNT, _WIN32_WINDOWS, _WIN32_IE, and, as of this writing the latest most shiny and new one, NTDDI_VERSION …
The documentation is unfortunately rather vague & misleading, but happily the generated code for the 1st lesson’s program contained this clear advice, in the [targetver.h] file:
“Including [SDKDDKVer.h] defines the highest available Windows platform. If you wish to build your application for a previous Windows platform, include [WinSDKVer.h] and set the _WIN32_WINNT macro to the platform you wish to support before including [SDKDDKVer.h].”
Relevant possible values for _WIN32_WINNT include _WIN32_WINNT_WIN2K, _WIN32_WINNT_WINXP, _WIN32_WINNT_WS03, _WIN32_WINNT_VISTA, _WIN32_WINNT_WS08 and _WIN32_WINNT_WIN7, where e.g. _WIN32_WINNT_WIN2K is defined as 0x0500, signifying Windows NT version 5.00. As I’m writing this many people are still running Windows XP, but in just half a year or so (mid 2012) I expect nearly all those people and institutions to have upgraded, with everybody then running either Windows 7 or Windows 8. Or Mac or Linux.
```cpp
#ifdef _MBCS
#   error "_MBCS (multi-byte character set) is defined, but only Unicode is supported"
#endif
#undef  _UNICODE
#define _UNICODE        // For [tchar.h]
#undef  UNICODE
#define UNICODE         // For [windows.h]
#undef  NOMINMAX
#define NOMINMAX        // C++ standard library compatibility
#undef  STRICT
#define STRICT          // C++ type-checking compatibility
#ifndef _WIN32_WINNT
#   define _WIN32_WINNT _WIN32_WINNT_WINXP      // Minimum version as per 2011/2012.
#endif
#include <SDKDDKVer.h>
#include <windows.h>
```
Finally, [windows.h] will by default include a lot of headers that one will normally not need. It’s easy to explicitly include such a header where it’s actually needed, but it’s not easy to make sure that code which needs a header actually includes it itself, when [windows.h] is also including it. Also, all these headers may affect build times negatively, to some extent, and they may wreak havoc with your code.
To avoid all that, just define WIN32_LEAN_AND_MEAN before including [windows.h].
OK, it’s not really as easy as “just” indicates. If you do define WIN32_LEAN_AND_MEAN then you might get in trouble with e.g. the GDI+ headers, but if you don’t define WIN32_LEAN_AND_MEAN then you might get in trouble with e.g. the Winsock v2 headers. For that matter, just defining NOMINMAX, which is necessary to avoid trouble with the standard C++ library, can land you in trouble with the GDI+ headers. Given that, it’s probably best to just ignore the issues with low quality headers. Then the including code can look like this:
```cpp
#pragma once
#ifdef _MBCS
#   error "_MBCS (multi-byte character set) is defined, but only Unicode is supported"
#endif
#undef  _UNICODE
#define _UNICODE        // For [tchar.h]
#undef  UNICODE
#define UNICODE         // For [windows.h]
#undef  NOMINMAX
#define NOMINMAX        // C++ standard library compatibility
#undef  STRICT
#define STRICT          // C++ type-checking compatibility
#ifndef _WIN32_WINNT
#   define _WIN32_WINNT _WIN32_WINNT_WINXP      // Minimum version as per 2011/2012.
#endif
#ifdef _MSC_VER         // Visual C++ only (the SDK is Visual C++ specific).
#   include <SDKDDKVer.h>       // E.g. g++ 4.4.1 does not support this file/scheme.
#endif
#undef  WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
```
This is, of course, best put in a reusable wrapper! 🙂
TODO #3:
- Define a reusable C++ wrapper for [windows.h], and test it in e.g. a program that displays a message box.
Acknowledgments:
- @DeadMG over at Stack Overflow encouraged me to not just ignore WINAPI. At first I decided to show what it expanded to. But then I ended up adding just “and also ignoring the WINAPI”… 🙂
Cheers, & enjoy!,
– Alf
Thank you for your lessons. Up to now I have only done MFC-GUI-programming and I hope your lessons will help me in understanding “Naked”-API-GUI-programming for Windows a bit better.
ERROR:
I discovered that the MinGW g++ 4.4.1 compiler does not support the [SDKDDKVer.h] header. The Microsoft SDK is after all written for Visual C++. So I have now made the inclusion of that header conditional on _MSC_VER, i.e. only when the compiler is Visual C++.

You said:
“…But how well does that practice hold up, for modern programming as of 2011? Not very well. For it is very difficult to avoid introducing direct wchar_t dependencies, so that without UNICODE defined the compilation simply fails, as illustrated below…”
and
“…DrawTextA is the ANSI version of DrawText, selected because UNICODE is now not defined. And DrawTextA is not very happy with a wchar_t based argument! So, even for that minimal program a dependency on wchar_t had sneaked in, and the simply-compile-with-or-without-UNICODE-defined scheme no longer worked!…”
Your statement is only correct, in that it wouldn’t work, because your sample code describes exactly how it should *NOT* be done. Your sample code, as you have mentioned, only works with Unicode builds. That’s because you’re supposed to use the TCHAR and TEXT macros so that the same code can compile properly regardless of whether the project is ANSI or Unicode. I understand you were trying to make a point, but your statement is saying that it won’t work, and you provide sample code which supports this, but your sample code is showing the worst possible example of how to use ANSI/Unicode specific strings. However, nowhere in your article do I see it mentioned that it could very easily work if you were to do things the right way.
The proper way of using strings for ANSI and/or Unicode:

- Use TCHAR instead of char or wchar_t.
- Use the TEXT macro (or _T) for all strings, instead of specifically the L prefix or no prefix for ANSI strings.

```cpp
TCHAR const* const myText = _T("This works, LOL");
MessageBox( NULL, myText, _T("ANSI and/or UNICODE text"), MB_OK );
```

The above code will compile properly regardless of whether the project is set to ANSI or Unicode. Technically, this is how strings and the API should always be used.
For someone who wants to support Windows 9x while using MFC in DLLs, and unwilling to recompile MFC or not knowing about Microsoft’s Layer for Unicode, using the T macros is indeed a possible solution. Not “proper” as you write, but one possible solution for supporting that archaic technology.

However, other than that it’s a lot of work and complication and sheer ugliness, for negative gain. For, exactly who constitute the market for crippled applications that don’t handle international characters? That’s what you get with T macros when you build for ANSI. I don’t think any modern users will be impressed by your app’s ability to run on Windows 9x, if you managed to tackle all the other problems involved in targeting Windows 9x…

Cheers & hth.,

– Alf

PS: Upshot: do not use the Microsoft T macros. Since the year 2000, when Microsoft introduced the Layer for Unicode, they’ve been a very ugly solution looking for some problem to solve. And that problem has as of 2012 not yet surfaced.

My suggestion was only for developing code which would compile properly for both ANSI and Unicode builds.
Using TCHAR instead of char/wchar_t, and using the T macro instead of the L prefix/no prefix, is pretty easy, as a solution to this specific issue.
I was not recommending for anyone do it this way for “normal” (ansi-only or multibyte-only or unicode-only) development.
I have my reasons for occasionally developing for both ansi and unicode, and it has absolutely nothing to do with Windows 9x whatsoever.
If you have a more practical way to write code which would compile properly for both ansi and unicode, I would be very grateful to hear about it.
I agree with sheer ugliness, but unfortunately, I have yet to find a better way.
Thanks.
I’m learning a lot from these tutorials. I have a bit of time to experiment with this right now while I’m waiting for several machines to finish their Windows updates. 🙂
Strangely, on this particular machine, the Asian characters display fine in the source code, but not in the message box, yet the Cyrillic (I think) characters display fine both in the source and in the message box.
I’ve been thinking for a while about supporting international characters in apps that I work on, and this has further motivated me. One thing I’d like to find out is how to eliminate the “universal-character-name” warnings. I suppose I could use a “#pragma” safely, but I wonder if it’s possible to change the “current codepage” to accommodate these characters properly.
Thanks again for all your efforts.