//
// Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
// Distributed under the Boost Software License, Version 1.0. (See
// accompanying file LICENSE_1_0.txt or copy at
// http://www.boost.org/LICENSE_1_0.txt)
//
/*!
\mainpage Boost.Nowide
Table of Contents:
- \ref main
- \ref main_rationale
- \ref main_the_problem
- \ref main_the_solution
- \ref main_wide
- \ref main_reading
- \ref using
- \ref using_standard
- \ref using_custom
- \ref using_integration
- \ref technical
- \ref technical_imple
- \ref technical_cio
- \ref qna
- \ref standalone_version
- \ref sources
\section main What is Boost.Nowide
Boost.Nowide is a library implemented by Artyom Beilis
that makes cross-platform Unicode-aware programming
easier.
The library provides implementations of standard C and C++ library
functions that accept UTF-8 input on Windows, without
requiring the use of the Wide API.
\section main_rationale Rationale
\subsection main_the_problem The Problem
Consider a simple application that splits a big file into chunks, such that
they can be sent by e-mail. It requires doing a few very simple tasks:
- Access command line arguments: int main(int argc,char **argv)
- Open an input file, open several output files: std::fstream::open(char const *,std::ios::openmode m)
- Remove the files in case of fault: std::remove(char const *file)
- Print a progress report onto the console: std::cout << file_name
Unfortunately it is impossible to implement this simple task in plain C++
if the file names contain non-ASCII characters.
A simple program using this API would work on systems that use UTF-8
internally -- the vast majority of Unix-like operating systems: Linux, Mac OS X,
Solaris, BSD. But it would fail on files like "War and Peace - Война и мир - מלחמה ושלום.zip"
under Microsoft Windows, because the native Windows Unicode-aware API is the Wide API -- UTF-16.
Thus even this trivial task is hard to implement in a cross-platform manner.
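The failure can be demonstrated with a plain standard-library round trip. The following sketch (the helper and the file name are illustrative) creates a file with a UTF-8 encoded name and tries to reopen it; it succeeds on POSIX systems, where file names are raw bytes, but the same narrow-character calls go through the ANSI code page on Windows:
\code
#include <fstream>
#include <string>

// Create a file whose name is UTF-8 encoded and try to reopen it.
// POSIX passes the bytes through verbatim, so this works in any locale;
// on Windows the narrow-character overload interprets the bytes in the
// local ANSI code page, mangling the Cyrillic name.
bool utf8_name_roundtrip(std::string const &utf8_name)
{
    std::ofstream out(utf8_name.c_str());
    if(!out)
        return false;
    out << "chunk 1\n";
    out.close();
    std::ifstream in(utf8_name.c_str());
    return in.good();
}
\endcode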
\subsection main_the_solution The Solution
Boost.Nowide provides a set of standard library functions that are UTF-8 aware,
making Unicode-aware programming easier.
The library provides:
- Easy to use functions for converting UTF-8 to/from UTF-16
- A class to make the \c argc, \c argv and \c env parameters of \c main use UTF-8
- UTF-8 aware functions
- \c stdio.h functions:
- \c fopen
- \c freopen
- \c remove
- \c rename
- \c stdlib.h functions:
- \c system
- \c getenv
- \c setenv
- \c unsetenv
- \c putenv
- \c fstream
- \c filebuf
- \c fstream/ofstream/ifstream
- \c iostream
- \c cout
- \c cerr
- \c clog
- \c cin
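To illustrate what the conversion functions do, here is a minimal, validation-free sketch of a UTF-8 to UTF-16 decoder. This is an illustration only, not the library's implementation -- real code such as \c boost::nowide::widen must also reject malformed sequences:
\code
#include <cstdint>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units.  Illustration only:
// malformed input is not detected here.
std::u16string utf8_to_utf16(std::string const &s)
{
    std::u16string out;
    std::size_t i = 0;
    while(i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i++]);
        std::uint32_t cp;
        int extra;
        if(c < 0x80)      { cp = c;        extra = 0; } // 1-byte sequence
        else if(c < 0xE0) { cp = c & 0x1F; extra = 1; } // 2-byte sequence
        else if(c < 0xF0) { cp = c & 0x0F; extra = 2; } // 3-byte sequence
        else              { cp = c & 0x07; extra = 3; } // 4-byte sequence
        for(int k = 0; k < extra && i < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        if(cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        }
        else { // outside the BMP: encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
\endcode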
\subsection main_wide Why Not Narrow and Wide?
Why not provide both Wide and Narrow implementations so the
developer can choose to use Wide characters on Unix-like platforms?
Several reasons:
- \c wchar_t is not really portable; it can be 2 bytes, 4 bytes or even 1 byte, making Unicode-aware programming harder
- The C and C++ standard libraries use narrow strings for OS interactions. This library follows the same general rule. There is
no such thing as fopen(wchar_t const *,wchar_t const *)
in the standard library, so it is better
to stick to the standards rather than re-implement the Wide API in "Microsoft Windows style"
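The portability issue is easy to observe; this small program prints the platform's \c wchar_t width:
\code
#include <cstdio>

int main()
{
    // Typically prints 2 on Windows (UTF-16 code units) and
    // 4 on Linux and Mac OS X (UTF-32 code units)
    std::printf("sizeof(wchar_t) = %u\n",
                static_cast<unsigned>(sizeof(wchar_t)));
    return 0;
}
\endcode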
\subsection main_reading Further Reading
- www.utf8everywhere.org
- Windows console i/o approaches
\section using Using The Library
\subsection using_standard Standard Features
The library is mostly header-only; only console I/O requires separate compilation under Windows.
As a developer you are expected to use \c boost::nowide functions instead of the functions available in the
\c std namespace.
For example, here is a Unicode unaware implementation of a line counter:
\code
#include <fstream>
#include <iostream>
int main(int argc,char **argv)
{
if(argc!=2) {
std::cerr << "Usage: file_name" << std::endl;
return 1;
}
std::ifstream f(argv[1]);
if(!f) {
std::cerr << "Can't open " << argv[1] << std::endl;
return 1;
}
int total_lines = 0;
while(f) {
if(f.get() == '\n')
total_lines++;
}
f.close();
std::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
return 0;
}
\endcode
To make this program handle Unicode properly, we make the following changes:
\code
#include <boost/nowide/args.hpp>
#include <boost/nowide/fstream.hpp>
#include <boost/nowide/iostream.hpp>
int main(int argc,char **argv)
{
boost::nowide::args a(argc,argv); // Fix arguments - make them UTF-8
if(argc!=2) {
boost::nowide::cerr << "Usage: file_name" << std::endl; // Unicode aware console
return 1;
}
boost::nowide::ifstream f(argv[1]); // argv[1] - is UTF-8
if(!f) {
// the console can display UTF-8
boost::nowide::cerr << "Can't open " << argv[1] << std::endl;
return 1;
}
int total_lines = 0;
while(f) {
if(f.get() == '\n')
total_lines++;
}
f.close();
// the console can display UTF-8
boost::nowide::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
return 0;
}
\endcode
This simple and straightforward approach makes it easy to write Unicode-aware programs.
\subsection using_custom Custom API
Of course, this simple set of functions does not cover all needs. If you need
to access Wide API from a Windows application that uses UTF-8 internally you can use
functions like \c boost::nowide::widen and \c boost::nowide::narrow.
For example:
\code
CopyFileW( boost::nowide::widen(existing_file).c_str(),
boost::nowide::widen(new_file).c_str(),
TRUE);
\endcode
The conversion is done at the last stage, and you continue using UTF-8
strings everywhere else. You only switch to the Wide API at glue points.
\c boost::nowide::widen returns \c std::wstring. Sometimes
it is useful to avoid the memory allocation and use on-stack buffers
instead. Boost.Nowide provides the \c boost::nowide::basic_stackstring
class for this purpose.
The example above could be rewritten as:
\code
boost::nowide::wstackstring wexisting_file,wnew_file;
if(!wexisting_file.convert(existing_file) || !wnew_file.convert(new_file)) {
// invalid UTF-8
return -1;
}
CopyFileW(wexisting_file.c_str(),wnew_file.c_str(),TRUE);
\endcode
\note There are a few convenience typedefs: \c stackstring and \c wstackstring using
256-character buffers, and \c short_stackstring and \c wshort_stackstring using 16-character
buffers. If the string is longer, they fall back to memory allocation.
\subsection using_windows_h The windows.h header
The library does not include \c windows.h in order to prevent namespace pollution with numerous
macros and types. Instead, the library declares the prototypes of the Win32 API functions it needs.
However, you may request the use of the \c windows.h header by defining \c BOOST_NOWIDE_USE_WINDOWS_H
before including any of the Boost.Nowide headers.
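For example (the header included afterwards is just one of the library's headers, shown for illustration):
\code
// Ask Boost.Nowide to use the real <windows.h> declarations
// instead of its own Win32 prototypes
#define BOOST_NOWIDE_USE_WINDOWS_H
#include <boost/nowide/convert.hpp>
\endcode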
\subsection using_integration Integration with Boost.Filesystem
Boost.Filesystem supports selection of the narrow encoding. Unfortunately, the default narrow encoding on Windows is not UTF-8. You can make UTF-8 the default encoding for Boost.Filesystem
by calling \c boost::nowide::nowide_filesystem() at the beginning of your program.
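A minimal setup sketch, assuming the integration header is \c boost/nowide/integration/filesystem.hpp (check your Boost.Nowide distribution for the exact path):
\code
#include <boost/nowide/integration/filesystem.hpp>
#include <boost/filesystem/path.hpp>

int main()
{
    // Imbue UTF-8 into boost::filesystem::path conversions
    boost::nowide::nowide_filesystem();
    // From here on, narrow path strings are treated as UTF-8 on Windows
    boost::filesystem::path p("Война и мир.zip");
    return 0;
}
\endcode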
\section technical Technical Details
\subsection technical_imple Windows vs POSIX
For Microsoft Windows, the library provides UTF-8 aware variants of some \c std:: functions in the \c boost::nowide namespace.
For example, \c std::fopen becomes \c boost::nowide::fopen.
On POSIX platforms, the functions in \c boost::nowide are aliases for their standard library counterparts:
\code
namespace boost {
namespace nowide {
#ifdef BOOST_WINDOWS
inline FILE *fopen(char const *name,char const *mode)
{
...
}
#else
using std::fopen;
#endif
} // nowide
} // boost
\endcode
\subsection technical_cio Console I/O
Console I/O is implemented as a wrapper around ReadConsoleW/WriteConsoleW
(used when the stream is connected to a "real" console) and ReadFile/WriteFile
(used when the stream is piped or redirected).
This approach eliminates the need for manual code page handling. If TrueType
fonts are used, Unicode-aware input and output work as intended.
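A sketch of how an implementation can distinguish the two cases (an assumption about the approach, not the library's actual code):
\code
#ifdef _WIN32
#include <windows.h>
// GetConsoleMode() succeeds only for genuine console handles, so a
// failure means the handle is a pipe or a file, and plain
// ReadFile/WriteFile with UTF-8 bytes should be used instead of
// ReadConsoleW/WriteConsoleW.
bool is_real_console(HANDLE h)
{
    DWORD mode = 0;
    return GetConsoleMode(h, &mode) != 0;
}
#else
#include <unistd.h>
// On POSIX the same question is answered by isatty()
bool is_real_console(int fd)
{
    return isatty(fd) != 0;
}
#endif
\endcode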
\section qna Q & A
Q: Why doesn't the library convert the string to/from the locale's encoding (instead of UTF-8) on POSIX systems?
A: It is inherently incorrect
to convert strings to/from locale encodings on POSIX platforms.
You can create a file named "\xFF\xFF.txt" (invalid UTF-8), remove it, or pass its name as a parameter to a program,
and it would work whether the current locale is UTF-8 or not.
Also, changing the locale from, say, \c en_US.UTF-8 to \c en_US.ISO-8859-1 would not magically re-encode all
file names in the OS or the strings a user may pass to the program (unlike on Windows).
POSIX operating systems treat file names as \c NULL terminated cookies,
so altering their content according to the locale would
actually lead to incorrect behavior.
For example, here is a naive implementation of the standard program "rm":
\code
#include <cstdio>
int main(int argc,char **argv)
{
    for(int i=1;i<argc;i++)
        std::remove(argv[i]);
    return 0;
}
\endcode
It works on POSIX platforms with any file name, regardless of the current locale.
\section standalone_version Standalone Version
Boost.Nowide is also available as a standalone version that does not depend on Boost.
It uses the namespace \c nowide instead of \c boost::nowide, and its headers live under
\c nowide/ instead of \c boost/nowide/. For example, the line counter shown earlier becomes:
\code
#include <nowide/args.hpp>
#include <nowide/fstream.hpp>
#include <nowide/iostream.hpp>
int main(int argc,char **argv)
{
nowide::args a(argc,argv); // Fix arguments - make them UTF-8
if(argc!=2) {
nowide::cerr << "Usage: file_name" << std::endl; // Unicode aware console
return 1;
}
nowide::ifstream f(argv[1]); // argv[1] - is UTF-8
if(!f) {
// the console can display UTF-8
nowide::cerr << "Can't open a file " << argv[1] << std::endl;
return 1;
}
int total_lines = 0;
while(f) {
if(f.get() == '\n')
total_lines++;
}
f.close();
// the console can display UTF-8
nowide::cout << "File " << argv[1] << " has " << total_lines << " lines" << std::endl;
return 0;
}
\endcode
\section sources Sources and Downloads
The upstream sources can be found at GitHub: https://github.com/artyom-beilis/nowide
You can download the latest sources there:
- Standard version: nowide-master.zip
- Standalone, Boost-independent version: nowide_standalone.zip
*/
// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4 filetype=cpp.doxygen