Wide Strings

Character and string types for wide character sets, including wchar_t, std::wstring, and Unicode-aware character types for international text.

Since C++98

Strings using wide character types (wchar_t or Unicode-aware char8_t, char16_t, char32_t) to represent character sets that don't fit in a single byte, particularly for international scripts and Unicode text.

Overview

Wide strings support character sets larger than ASCII, such as Japanese kanji, Cyrillic, and other international scripts. The fundamental wide character type is wchar_t, a distinct built-in type whose size and encoding are implementation-defined (typically 2 or 4 bytes). The Standard Library defines std::wstring as a type alias for std::basic_string<wchar_t>.

Since C++11, character types with explicit Unicode encoding semantics have been added to address wchar_t's portability limitations:

  • char8_t (C++20): UTF-8; std::u8string
  • char16_t (C++11): UTF-16; std::u16string
  • char32_t (C++11): UTF-32; std::u32string

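These types have the same minimum width on every platform (char16_t at least 16 bits, char32_t at least 32 bits, and exactly that on mainstream implementations) and name their encoding explicitly, which makes them the preferred choice for portable Unicode handling. wchar_t, by contrast, varies: it is 2 bytes on Windows, where it holds UTF-16 code units, and 4 bytes on most POSIX systems, where it usually holds UTF-32 code points. A 2-byte wchar_t cannot hold a character outside the basic multilingual plane in a single element, so code that assumes one wchar_t per character is not portable.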
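
The width differences described above can be checked at compile time. The following is a minimal sketch; the constant names and the helper `wchar_holds_any_code_point` are hypothetical names chosen for illustration, not part of the Standard Library.

```cpp
#include <climits>
#include <cstddef>

// Width of each character type in bits on the current platform.
constexpr std::size_t wchar_bits  = sizeof(wchar_t)  * CHAR_BIT;
constexpr std::size_t char16_bits = sizeof(char16_t) * CHAR_BIT;
constexpr std::size_t char32_bits = sizeof(char32_t) * CHAR_BIT;

// The standard guarantees these minimum widths for the Unicode character types.
static_assert(char16_bits >= 16, "char16_t must hold any UTF-16 code unit");
static_assert(char32_bits >= 32, "char32_t must hold any Unicode code point");

// Hypothetical helper: true when a single wchar_t can hold any code point.
// U+0000 to U+10FFFF needs 21 bits, so this is typically false on Windows
// (16-bit wchar_t) and true on Linux and macOS (32-bit wchar_t).
constexpr bool wchar_holds_any_code_point() {
    return wchar_bits >= 21;
}
```

Code that depends on one element per code point could guard itself with `static_assert(wchar_holds_any_code_point(), "...")`, or simply use char32_t.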

Syntax

Character and String Literals

Wide character literals and wide string literals use the L prefix:

cpp
wchar_t ch = L'Ω';           // wide character literal
std::wstring str = L"Москва"; // wide string literal

C++11 and later introduce prefixes for explicit Unicode encoding:

cpp
auto u8str = u8"hello";   // const char8_t* (C++20) or const char* (C++11-17)
auto u16str = u"Москва";  // const char16_t*; assign to std::u16string to own the text
auto u32str = U"hello";   // const char32_t*; assign to std::u32string to own the text
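
Note that `auto` applied to a prefixed literal deduces a pointer to the character type, not a string class. The sketch below illustrates the difference, assuming C++14 or later for the `s` literal suffix; the function name `make_owned_utf16` is hypothetical.

```cpp
#include <string>
#include <type_traits>

// Hypothetical demo function: contrasts a raw literal with an owning string.
inline std::u16string make_owned_utf16() {
    using namespace std::string_literals;

    auto raw = u"hello";     // decays to const char16_t*
    auto owned = u"hello"s;  // the s suffix yields std::u16string

    static_assert(std::is_same<decltype(raw), const char16_t*>::value,
                  "a u\"\" literal deduces to a pointer");
    static_assert(std::is_same<decltype(owned), std::u16string>::value,
                  "the s suffix produces an owning string");

    (void)raw;  // only inspected at compile time
    return owned;
}
```

The same pattern applies to the `L` and `U` prefixes, which give std::wstring and std::u32string when combined with the `s` suffix.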

String Type Definitions

cpp
#include <string>

std::wstring ws;                          // wchar_t-based string
std::wstring ws2 = L"wide string";
std::u8string u8s = u8"UTF-8";           // char8_t (C++20 only; the type does not exist earlier)
std::u16string u16s = u"UTF-16";         // char16_t (C++11)
std::u32string u32s = U"UTF-32";         // char32_t (C++11)

String Views

Since C++17, non-owning view types are available:

cpp
#include <string_view>

std::wstring_view wsv = L"wide";          // wchar_t
std::u16string_view u16sv = u"UTF-16";   // char16_t
std::u32string_view u32sv = U"UTF-32";   // char32_t
std::u8string_view u8sv = u8"UTF-8";     // C++20: char8_t

Examples

Basic Wide String Operations

cpp
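#include <iostream>
#include <locale>
#include <string>

int main() {
    // Use the environment's locale so std::wcout can encode non-ASCII text.
    // Assumes that locale is valid and supports these characters; with the
    // default "C" locale the output usually fails.
    std::locale::global(std::locale(""));
    std::wcout.imbue(std::locale());

    std::wstring greeting = L"こんにちは"; // Japanese "hello"
    std::wstring name = L"世界";            // "world"
    
    std::wstring message = greeting + L" " + name;
    std::wcout << message << std::endl;
    
    return 0;
}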

Numeric Conversions with Wide Strings

cpp
#include <string>
#include <iostream>

int main() {
    // to_wstring: convert numeric types to wide string (C++11)
    int value = 42;
    double pi = 3.14159;
    
    std::wstring wstr_int = std::to_wstring(value);    // L"42"
    std::wstring wstr_double = std::to_wstring(pi);    // L"3.141590"
    
    // Reverse: string to number (C++11)
    std::wstring wnum = L"123";
    int result = std::stoi(wnum);      // 123
    double dval = std::stod(L"3.14");  // 3.14
    
    return 0;
}
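
The std::stoi family throws std::invalid_argument when no digits are found and std::out_of_range when the value does not fit the result type. A minimal sketch of a non-throwing wrapper follows, assuming C++17 for std::optional; the name `parse_int` is hypothetical.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: parse an entire wide string as an int.
// Returns std::nullopt instead of throwing when parsing fails.
std::optional<int> parse_int(const std::wstring& text) {
    try {
        std::size_t consumed = 0;
        int value = std::stoi(text, &consumed);
        if (consumed != text.size()) {
            return std::nullopt;  // trailing characters after the number
        }
        return value;
    } catch (const std::invalid_argument&) {
        return std::nullopt;      // no digits to convert
    } catch (const std::out_of_range&) {
        return std::nullopt;      // value does not fit in int
    }
}
```

Checking `consumed` rejects inputs such as L"12abc", which std::stoi alone would accept as 12.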

Wide String Streams

cpp
#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::wstringstream wss;
    
    wss << L"Temperature: " << 25 << L"°C";
    
    std::wstring result = wss.str();
    // Wide character output; whether '°' prints correctly depends on the stream's locale
    std::wcout << result << std::endl;
    
    return 0;
}

UTF-32 for Complete Unicode Coverage

cpp
#include <string>
#include <iostream>

int main() {
    // UTF-32 can represent any Unicode code point directly
    std::u32string emojis = U"🌍🌎🌏";  // All representable in single char32_t
    
    for (char32_t ch : emojis) {
        // Each iteration: one complete Unicode character
    }
    
    // Equivalent with wchar_t on some platforms may require surrogate pairs
    // or fail entirely, depending on platform and locale
    
    return 0;
}
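
To make the code unit versus code point distinction concrete, the sketch below stores the same emoji (U+1F30D) in each string type and reports its length. The function names are hypothetical, and the character is written as a universal character name so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// UTF-16 needs a surrogate pair for U+1F30D: two code units.
inline std::size_t utf16_units() {
    return std::u16string(u"\U0001F30D").size();
}

// UTF-32 stores every code point in one element.
inline std::size_t utf32_units() {
    return std::u32string(U"\U0001F30D").size();
}

// wchar_t depends on the platform: typically 2 elements where wchar_t
// is 16 bits (Windows) and 1 element where it is 32 bits (Linux, macOS).
inline std::size_t wide_units() {
    return std::wstring(L"\U0001F30D").size();
}
```

Here `utf16_units()` returns 2 and `utf32_units()` returns 1 on every conforming implementation, while `wide_units()` varies by platform.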

String View for Zero-Copy Passing

cpp
#include <string>
#include <string_view>

void process_unicode(std::u32string_view sv) {
    // No allocation, no copy
    for (char32_t ch : sv) {
        // Process each code point
    }
}

int main() {
    std::u32string text = U"Hello";
    process_unicode(text);  // Implicit conversion to string_view
    
    return 0;
}

Best Practices

  1. Prefer Fixed-Size Unicode Types for New Code: Use std::u8string, std::u16string, or std::u32string instead of std::wstring. They name their encoding explicitly and have the same element width on all mainstream platforms. Reserve std::wstring for APIs that mandate wchar_t (e.g., Windows Unicode APIs).

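  2. Use String Views for Function Parameters: Accept read-only wide strings as std::u32string_view or std::wstring_view (C++17+). A view binds to literals, string objects, and substrings without copying. Do not store a view beyond the lifetime of the string it refers to.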

  3. Choose the Right Encoding: Use UTF-32 (std::u32string) for algorithms that iterate or index characters by code point; use UTF-8 (std::u8string, C++20) for storage and network transmission; use UTF-16 only when interfacing with platform APIs that require it.

  4. Be Explicit with Locale: Character classification functions (iswalpha, iswupper) are locale-dependent. For consistent results across platforms, use Unicode property algorithms from a dedicated library (e.g., ICU) or avoid implicit locale dependence.

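  5. Handle Conversion Explicitly: When converting between encodings or between narrow and wide strings, use a real transcoding routine: the <cuchar> functions (mbrtoc16, mbrtoc32, c16rtomb, c32rtomb), a platform API, or a dedicated library such as ICU. std::wstring_convert and <codecvt> are deprecated since C++17. Copying elements from one string type to another changes their width but does not transcode them.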
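
As an illustration of explicit conversion, the following is a minimal sketch of a hand-written UTF-32 to UTF-8 encoder. The function name `to_utf8` is hypothetical, the result is returned as bytes in a std::string so the code also compiles as C++17, and replacing invalid code points with U+FFFD is an assumption of this sketch rather than a required policy. Production code would more likely rely on a tested library.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
// Surrogates and values above U+10FFFF are replaced with U+FFFD.
std::string to_utf8(std::u32string_view input) {
    std::string out;
    for (char32_t cp : input) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;  // replacement character
        }
        if (cp < 0x80) {            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Taking a std::u32string_view lets callers pass a literal, a std::u32string, or a substring without copying, in line with the string view advice in item 2.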

Common Pitfalls

  1. Assuming wchar_t Portability: The size and encoding of wchar_t are implementation-defined. Code that works on Windows (2 bytes, UTF-16 code units) may fail on Linux (typically 4 bytes, UTF-32 code points), and vice versa. Use fixed-size types for portable code.

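  2. Silent Character Corruption: A narrow string cannot be assigned to a wide string directly, and the common workaround std::wstring(str.begin(), str.end()) only widens each char individually. That is correct for ASCII but corrupts multi-byte encodings such as UTF-8, because every byte becomes its own wide character. Use a real conversion routine instead.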

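  3. Surrogate Pair Handling: On systems where wchar_t is 2 bytes, characters outside the basic multilingual plane (e.g., emojis, rare scripts) require surrogate pairs, so indexing and iteration operate on code units rather than characters and can split a pair. UTF-32 avoids this at the code point level, although a user-perceived character may still span several code points.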

  4. Incomplete Locale Support: Character classification and collation depend on the active locale. The same character may classify differently on different systems. Never assume classification is deterministic without fixing the locale.

  5. Memory and Performance Cost: For ASCII-heavy text, std::wstring and std::u32string use 2–4× the memory of UTF-8, which also increases cache and bandwidth pressure during comparison and iteration. Use wide strings only when necessary, and prefer UTF-8 for storage.
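
The surrogate pair pitfall (item 3) can be handled by decoding UTF-16 into code points before indexing. The following is a minimal sketch; the function name `utf16_to_utf32` is hypothetical, and mapping unpaired surrogates to U+FFFD is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: decode UTF-16 code units into UTF-32 code points.
// An unpaired surrogate is replaced with U+FFFD.
std::u32string utf16_to_utf32(std::u16string_view input) {
    std::u32string out;
    for (std::size_t i = 0; i < input.size(); ++i) {
        const char32_t unit = input[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool is_low  = unit >= 0xDC00 && unit <= 0xDFFF;

        if (is_high && i + 1 < input.size() &&
            input[i + 1] >= 0xDC00 && input[i + 1] <= 0xDFFF) {
            // Valid pair: combine the two 10-bit halves and add 0x10000.
            const char32_t low = input[i + 1];
            out.push_back(static_cast<char32_t>(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)));
            ++i;  // skip the low surrogate just consumed
        } else if (is_high || is_low) {
            out.push_back(U'\uFFFD');  // unpaired surrogate
        } else {
            out.push_back(unit);       // BMP character, one code unit
        }
    }
    return out;
}
```

After decoding, each element of the result is one code point, so indexing and iteration no longer risk splitting a pair.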

See Also