What is Text Encoding in C++?

Text encoding in C++ covers the character types, encoding-aware string types, and conversion facilities for UTF-8, UTF-16, UTF-32, and locale-dependent encodings.

Which C++ standard introduced Text Encoding?

The explicit Unicode character types char16_t and char32_t were introduced in C++11; char8_t followed in C++20.

What is the difficulty level of Text Encoding?

Text Encoding is considered Advanced-level C++ material.

Text Encoding (since C++11)

C++ text encoding encompasses the character types, string classes, literal prefixes, and conversion facilities that govern how sequences of bytes are interpreted as human-readable text across UTF-8, UTF-16, UTF-32, and legacy locale-dependent encodings.

Overview

C++ has always had an uneasy relationship with text encoding. The language predates widespread Unicode adoption, leaving char with no guaranteed encoding, wchar_t with an implementation-defined width, and the runtime locale as the primary abstraction for non-ASCII text.

C++11 introduced the first explicit Unicode character types — char16_t and char32_t — along with corresponding string types and literal prefixes. C++20 completed the picture with char8_t, giving UTF-8 a distinct type that separates it from the ambiguous char. Understanding how these types compose, and where the historical facilities still reach, is essential for writing portable, correct text-handling code.

Character Types

Type	Width	Encoding guarantee	Introduced
`char`	≥8 bits	Execution character set (unspecified)	C++98
`unsigned char`	≥8 bits	Raw bytes, no encoding assumption	C++98
`wchar_t`	Implementation-defined	Wide execution character set	C++98
`char16_t`	≥16 bits (same as `std::uint_least16_t`)	UTF-16 code units	C++11
`char32_t`	≥32 bits (same as `std::uint_least32_t`)	UTF-32 code points	C++11
`char8_t`	≥8 bits (same as `unsigned char`)	UTF-8 code units	C++20

wchar_t is 16 bits on Windows (UTF-16) and 32 bits on most POSIX systems (UTF-32). This width ambiguity makes wchar_t non-portable for Unicode interchange; prefer char16_t or char32_t for explicit cross-platform Unicode.
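The width difference can be checked at compile time. The following is a minimal sketch; the helper names `wchar_bits` and `wchar_can_hold_any_code_point` are hypothetical and exist only for illustration.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <cwchar>    // WCHAR_MAX

// The Unicode character types are only guaranteed a minimum width.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds a UTF-16 code unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds any code point");

// Hypothetical helper: width of wchar_t in bits on the current platform.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true when one wchar_t can store every code point
// up to U+10FFFF (typical on POSIX, false with 16-bit wchar_t on Windows).
constexpr bool wchar_can_hold_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}
```

Code that depends on one wchar_t per code point could guard itself with `static_assert(wchar_can_hold_any_code_point(), "...")` so that a port to a 16-bit platform fails loudly rather than silently mishandling supplementary characters.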

String Types

cpp

#include <string>

std::string    s1 = "hello";       // char, encoding unspecified
std::wstring   s2 = L"hello";      // wchar_t
std::u16string s3 = u"hello";      // char16_t, UTF-16  // C++11
std::u32string s4 = U"hello";      // char32_t, UTF-32  // C++11
std::u8string  s5 = u8"hello";     // char8_t, UTF-8    // C++20

Prior to C++20, a u8"..." literal was an array of const char (decaying to const char*). Since C++20 it is an array of const char8_t (decaying to const char8_t*), which is a breaking change: code that assigned a u8 literal to const char*, or initialized a std::string from one, no longer compiles and needs an explicit cast or a migration to char8_t.
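One way to keep a single code base compiling under both C++17 and C++20 is a small overloaded bridge function. This is a sketch, not a recommended library API: `as_char` and `greeting` are hypothetical names, and the feature-test macro `__cpp_char8_t` is assumed to be defined by the compiler whenever char8_t is enabled.

```cpp
#include <string>

#if defined(__cpp_char8_t)
// C++20 path: u8 literals are char8_t. Reading the bytes through a
// const char* is permitted because char may alias any object type.
inline const char* as_char(const char8_t* s) {
    return reinterpret_cast<const char*>(s);
}
#endif

// Pre-C++20 path: u8 literals are already char, so this is the identity.
inline const char* as_char(const char* s) {
    return s;
}

// Compiles unchanged in both language modes and yields the same UTF-8 bytes.
std::string greeting() {
    return as_char(u8"gr\u00FC\u00DF dich");
}
```

The overload set keeps the cast in one place, so call sites do not need their own `#if` blocks.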

Syntax

Literal Prefixes

cpp

// No prefix: execution character set (often UTF-8 on modern systems, but not guaranteed)
const char*     a = "café";

// L: wide, implementation-defined encoding
const wchar_t*  b = L"café";

// u8: UTF-8 code units
// Type is const char* before C++20, const char8_t* since C++20
const char8_t*  c = u8"café";      // C++20

// u: UTF-16 code units (surrogate pairs for code points above U+FFFF)
const char16_t* d = u"café";       // C++11

// U: UTF-32 code points (one element per code point, always)
const char32_t* e = U"café";       // C++11

Universal Character Names

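Any string or character literal can embed a Unicode scalar value with \u followed by exactly four hex digits (BMP only) or \U followed by exactly eight hex digits (full range). Surrogate code points (U+D800 to U+DFFF) cannot be named this way: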

cpp

const char32_t* snowman = U"\U00002603";   // ☃, U+2603
const char16_t* arrow   = u"\u2192";       // →
const char8_t*  euro    = u8"\u20ac";      // €, encoded as 3 UTF-8 bytes  // C++20
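The same universal character name produces a different number of code units under each prefix. The sketch below makes that visible at compile time; `code_units` is a hypothetical helper that reports the length of a literal without its terminator.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null element.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+20AC EURO SIGN lies in the BMP.
static_assert(code_units(u8"\u20ac") == 3, "three UTF-8 bytes");
static_assert(code_units(u"\u20ac") == 1, "one UTF-16 code unit");
static_assert(code_units(U"\u20ac") == 1, "one UTF-32 code unit");

// U+1F600 lies outside the BMP and needs the eight-digit \U form.
static_assert(code_units(u8"\U0001F600") == 4, "four UTF-8 bytes");
static_assert(code_units(u"\U0001F600") == 2, "a UTF-16 surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "one UTF-32 code unit");
```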

Examples

UTF-8 Round-trip (C++20)

C++20 char8_t lets the compiler enforce that UTF-8 strings do not accidentally mix with locale-encoded char data:

cpp

#include <cstdio>   // std::fwrite, stdout
#include <string>

// Accepts only well-typed UTF-8 — mixing with char requires an explicit cast
std::u8string append_nl(std::u8string text) {   // C++20
    text += u8'\n';
    return text;
}

// Interop with C APIs that expect char* requires reinterpret_cast
void write_utf8(const std::u8string& s) {
    std::fwrite(
        reinterpret_cast<const char*>(s.data()),  // C++20: explicit cast required
        1, s.size(), stdout
    );
}

Iterating Code Points over UTF-32

When you need iteration by Unicode code point rather than code unit, u32string is the simplest approach: every char32_t element is exactly one code point.

cpp

#include <string>
#include <algorithm>

std::size_t count_codepoints(const std::u32string& s) {   // C++11
    return s.size();  // one-to-one mapping: element == code point
}

bool contains_bmp(const std::u32string& s) {               // C++11
    return std::any_of(s.begin(), s.end(), [](char32_t cp) {
        return cp <= U'\uFFFF';
    });
}
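A u32string usually has to be built from UTF-8 input first. The following is a minimal decoding sketch, assuming the input arrives as UTF-8 bytes in a std::string; `utf8_to_utf32` is a hypothetical helper, not a standard library function, and throwing on malformed input is only one possible error policy.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on malformed input.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length and the first payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six payload bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, surrogates, and values above U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 code point");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this in place, `count_codepoints(utf8_to_utf32(bytes))` reports code points rather than bytes. Production code would more likely rely on ICU or a platform API for the same job.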

Converting Between Encodings via `<cuchar>` (C++11)

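std::c32rtomb and std::mbrtoc32 provide low-level conversion between UTF-32 and the multibyte encoding of the current C locale. The multibyte side is UTF-8 only when a UTF-8 locale is active. The conversion state lives in a caller-supplied mbstate_t, which keeps the functions re-entrant as long as you carry that state across calls: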

cpp

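#include <climits>    // MB_LEN_MAX
#include <cuchar>     // std::c32rtomb (C++11)
#include <stdexcept>  // std::runtime_error
#include <string>

// Convert a single UTF-32 code point to its multibyte sequence in the current C locale.
// The result is UTF-8 only if the program has selected a UTF-8 locale,
// for example with std::setlocale(LC_ALL, "") on a UTF-8 system.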
std::string cp_to_utf8(char32_t cp) {
    char buf[MB_LEN_MAX];
    std::mbstate_t state{};
    std::size_t n = std::c32rtomb(buf, cp, &state);  // C++11
    if (n == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid code point");
    return {buf, n};
}

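`<codecvt>` and `std::wstring_convert` (C++11, deprecated C++17, removed C++26)

The std::codecvt facet itself dates from C++98 and remains part of <locale>. What most code actually used for Unicode conversion was added in C++11: the <codecvt> header (std::codecvt_utf8, std::codecvt_utf16, std::codecvt_utf8_utf16) together with std::wstring_convert and std::wbuffer_convert. These were deprecated in C++17 and removed in C++26. The char16_t and char32_t specializations of std::codecvt that convert to char were additionally deprecated in C++20. Avoid all of these in new code; existing code should migrate to platform APIs, ICU, or <cuchar>.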

cpp

// DEPRECATED since C++17, removed in C++26: do not use in new code
#include <codecvt>  // deprecated header
#include <locale>   // std::wstring_convert
#include <string>

std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;  // deprecated C++17
std::string utf8 = conv.to_bytes(U"hello");

There is no drop-in standard replacement. Without a third-party dependency, the practical alternatives are the <cuchar> functions or platform APIs (MultiByteToWideChar / WideCharToMultiByte on Windows, iconv on POSIX).
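For the specific case of UTF-32 to UTF-8, the encoding rules are simple enough to implement by hand, which avoids both the deprecated facets and any locale dependence. The sketch below is a swapped-in technique rather than a standard facility; `append_utf8` and `utf32_to_utf8` are hypothetical names.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point.
// Throws std::invalid_argument for surrogates and values above U+10FFFF.
void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by 3 continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: locale-independent counterpart of the
// wstring_convert::to_bytes call shown above.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    out.reserve(in.size());  // lower bound: at least one byte per code point
    for (char32_t cp : in) append_utf8(out, cp);
    return out;
}
```

Unlike std::c32rtomb, this always produces UTF-8 regardless of the active locale, at the cost of maintaining the encoder yourself.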

Best Practices

Treat char as UTF-8 at your boundaries. On Linux and macOS the execution charset is nearly always UTF-8. Adopt UTF-8 as your internal encoding convention and validate at ingestion points (file I/O, network, user input). Do not mix encodings silently inside the same std::string.

Use char8_t / u8string for strongly-typed UTF-8 (C++20). The distinct type prevents accidental concatenation with locale-encoded char data and documents intent clearly. The required reinterpret_cast to interface with C APIs is a deliberate friction that makes encoding boundaries visible.

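Prefer u32string for code-point-level algorithms. Counting, indexing, and searching by code point are trivially correct on char32_t sequences. A code point is still not a user-perceived character: combining marks and emoji sequences span several code points, so reversing or truncating text needs grapheme-cluster segmentation from a library such as ICU. Convert to UTF-8 only at output boundaries.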

Normalise before comparing. Two char32_t sequences can represent the same rendered text with different code points (NFC vs NFD). The standard library does not provide normalization; use ICU's unorm2_normalize or similar for correctness under Unicode normalization rules.

Set locale once at startup, then use explicit encoding types. std::locale::global(std::locale("")) configures the C++ runtime to the system locale. Beyond that entry point, pass explicit char8_t, char16_t, or char32_t strings rather than relying on implicit locale-dependent behaviour.
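To make the normalisation advice above concrete, the sketch below shows two spellings of "é" that render identically but compare unequal element by element. The names `composed`, `decomposed`, and `naive_equal` are chosen for illustration only.

```cpp
#include <string>

// Precomposed form (NFC): a single code point, U+00E9.
const std::u32string composed = U"\u00E9";

// Decomposed form (NFD): 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const std::u32string decomposed = U"e\u0301";

// Element-wise comparison, which is all that operator== on
// std::u32string performs. It knows nothing about canonical equivalence.
bool naive_equal(const std::u32string& a, const std::u32string& b) {
    return a == b;
}
```

Here `naive_equal(composed, decomposed)` is false even though both strings display as the same character; normalising both sides to one form before comparing removes the mismatch.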

Common Pitfalls

strlen gives bytes, not characters. On a UTF-8 std::string, both std::strlen(s.c_str()) and s.size() count char elements (bytes), not Unicode code points. A five-code-point string of CJK characters typically occupies 15 bytes.

Surrogate pairs in u16string. UTF-16 requires two char16_t elements (a surrogate pair) for code points above U+FFFF. Iterating u16string by index does not give code points — it gives code units. Use char32_t if you need code-point iteration.

Mixing u8 literals with const char* breaks in C++20. Code that compiled cleanly in C++17 may fail with -std=c++20 if it assigns u8"..." to const char*. Audit call sites and decide: cast with reinterpret_cast<const char*>, or migrate to char8_t throughout.

wchar_t is not Unicode. On Windows, wchar_t is 16-bit and the encoding is UTF-16 — which means surrogate pairs for supplementary characters. On POSIX, it is usually 32-bit. Writing wchar_t-based cross-platform Unicode code that handles supplementary characters correctly requires runtime width checks or a deliberate choice to target only one platform.

<codecvt> removal in C++26. Code that includes <codecvt> or uses std::wstring_convert or std::wbuffer_convert is not valid C++26 and may fail to compile once a standard library drops these facilities. Plan migration ahead of compiler upgrades.
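The byte-count and surrogate-pair pitfalls above can be illustrated with two small counting functions. This is a sketch that assumes well-formed input and does no validation; `utf8_code_points` and `utf16_code_points` are hypothetical helpers.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

// Hypothetical helper: count code points in well-formed UTF-16 by skipping
// low (trailing) surrogates, so each surrogate pair is counted once.
std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s)
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    return n;
}
```

For a three-character CJK string stored as UTF-8, `size()` reports 9 while `utf8_code_points` reports 3; for a single supplementary character stored as UTF-16, `size()` reports 2 while `utf16_code_points` reports 1.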