Text Encoding
C++ character types, encoding-aware string types, and conversion facilities for UTF-8, UTF-16, UTF-32, and locale-dependent encodings.
Since C++11
C++ text encoding encompasses the character types, string classes, literal prefixes, and conversion facilities that govern how sequences of bytes are interpreted as human-readable text across UTF-8, UTF-16, UTF-32, and legacy locale-dependent encodings.
Overview
C++ has always had an uneasy relationship with text encoding. The language predates widespread Unicode adoption, leaving char with no guaranteed encoding, wchar_t with an implementation-defined width, and the runtime locale as the primary abstraction for non-ASCII text.
C++11 introduced the first explicit Unicode character types — char16_t and char32_t — along with corresponding string types and literal prefixes. C++20 completed the picture with char8_t, giving UTF-8 a distinct type that separates it from the ambiguous char. Understanding how these types compose, and where the historical facilities still reach, is essential for writing portable, correct text-handling code.
Character Types
| Type | Width | Encoding guarantee | Introduced |
|---|---|---|---|
| char | ≥8 bits | Execution character set (implementation-defined) | C++98 |
| unsigned char | ≥8 bits | Raw bytes, no encoding assumption | C++98 |
| wchar_t | Implementation-defined | Wide execution character set | C++98 |
| char16_t | ≥16 bits (same as std::uint_least16_t) | UTF-16 code units | C++11 |
| char32_t | ≥32 bits (same as std::uint_least32_t) | UTF-32 code units (one per code point) | C++11 |
| char8_t | ≥8 bits (same as unsigned char) | UTF-8 code units | C++20 |
wchar_t is 16 bits on Windows (UTF-16) and 32 bits on most POSIX systems (UTF-32). This width ambiguity makes wchar_t non-portable for Unicode interchange; prefer char16_t or char32_t for explicit cross-platform Unicode.
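The width differences described above can be checked at compile time. The following is a minimal sketch; `bit_width_of` and `wchar_holds_any_code_point` are hypothetical helper names introduced here for illustration, not standard facilities.

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>  // std::size_t

// Hypothetical helper: width of a type in bits on the current platform.
template <typename T>
constexpr std::size_t bit_width_of() {
    return sizeof(T) * CHAR_BIT;
}

// These hold on every conforming implementation.
static_assert(bit_width_of<char>() >= 8, "char has at least 8 bits");
static_assert(bit_width_of<char16_t>() >= 16, "char16_t holds a UTF-16 code unit");
static_assert(bit_width_of<char32_t>() >= 32, "char32_t holds any code point");

// This one is platform-dependent: U+10FFFF needs 21 bits, so a 16-bit
// wchar_t (as on Windows) cannot store every code point in one element.
constexpr bool wchar_holds_any_code_point() {
    return bit_width_of<wchar_t>() >= 21;
}
```

Code that relies on one `wchar_t` element per code point could guard that assumption with `static_assert(wchar_holds_any_code_point(), "...")` so that the mismatch surfaces at build time rather than at run time.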
String Types
#include <string>
std::string s1 = "hello"; // char, encoding unspecified
std::wstring s2 = L"hello"; // wchar_t
std::u16string s3 = u"hello"; // char16_t, UTF-16 // C++11
std::u32string s4 = U"hello"; // char32_t, UTF-32 // C++11
std::u8string s5 = u8"hello"; // char8_t, UTF-8 // C++20
Prior to C++20, u8"..." literals were arrays of const char and decayed to const char*. Since C++20 they are arrays of const char8_t, which is a breaking change: code that initialized a const char* or a std::string directly from a u8 literal needs either an explicit cast or a migration to char8_t.
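One way to keep a single code path compiling both before and after the C++20 change is to funnel u8 literals through a small adapter. This is only a sketch under that assumption; `as_char_ptr` and `to_std_string` are hypothetical names, and a full migration to char8_t may be the better long-term choice.

```cpp
#include <string>

// Hypothetical helper: view UTF-8 code units as plain char.
// CharT deduces to char before C++20 and to char8_t since C++20.
template <typename CharT>
const char* as_char_ptr(const CharT* utf8) {
    static_assert(sizeof(CharT) == 1, "expects a UTF-8 code unit type");
    // Accessing any object through a char pointer is permitted by the aliasing rules.
    return reinterpret_cast<const char*>(utf8);
}

// Hypothetical helper: copy a null-terminated u8 literal into a std::string.
template <typename CharT>
std::string to_std_string(const CharT* utf8) {
    return std::string(as_char_ptr(utf8));
}
```

For example, `to_std_string(u8"caf\u00e9")` yields the same five bytes in either language mode, and the cast stays confined to one place.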
Syntax
Literal Prefixes
// No prefix: execution character set (often UTF-8 on modern systems, but not guaranteed)
const char* a = "café";
// L: wide, implementation-defined encoding
const wchar_t* b = L"café";
// u8: UTF-8 code units
// Type is const char* before C++20, const char8_t* since C++20
const char8_t* c = u8"café"; // C++20
// u: UTF-16 code units (surrogate pairs for code points above U+FFFF)
const char16_t* d = u"café"; // C++11
// U: UTF-32 code points (one element per code point, always)
const char32_t* e = U"café"; // C++11
Universal Character Names
Any character or string literal can embed a Unicode scalar value with \u followed by exactly four hex digits (BMP code points only) or \U followed by exactly eight hex digits (any code point):
const char32_t* snowman = U"\U00002603"; // ☃, U+2603
const char16_t* arrow = u"\u2192"; // →
const char8_t* euro = u8"\u20ac"; // €, encoded as 3 UTF-8 bytes // C++20
Examples
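To make the universal-character-name rules from the Syntax section concrete, the sketch below counts the code units each literal prefix produces for the same code point. `code_units` is a hypothetical helper introduced only for this illustration.

```cpp
#include <cstddef>  // std::size_t

// Hypothetical helper: number of code units in a string literal,
// excluding the terminating null element.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+20AC (euro sign) lies inside the BMP, so the four-digit \u form suffices.
static_assert(code_units(u8"\u20ac") == 3, "three UTF-8 code units");
static_assert(code_units(u"\u20ac") == 1, "one UTF-16 code unit");
static_assert(code_units(U"\u20ac") == 1, "one UTF-32 code unit");

// U+1F600 lies outside the BMP, so it needs the eight-digit \U form.
static_assert(code_units(u8"\U0001F600") == 4, "four UTF-8 code units");
static_assert(code_units(u"\U0001F600") == 2, "a UTF-16 surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "one UTF-32 code unit");
```

The same escape sequence therefore expands to a different number of elements depending on the prefix, which is why element counts should not be treated as character counts.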
UTF-8 Round-trip (C++20)
C++20 char8_t lets the compiler enforce that UTF-8 strings do not accidentally mix with locale-encoded char data:
#include <string>
#include <cstdio> // std::fwrite, stdout
// Accepts only well-typed UTF-8 — mixing with char requires an explicit cast
std::u8string append_nl(std::u8string text) { // C++20
text += u8'\n';
return text;
}
// Interop with C APIs that expect char* requires reinterpret_cast
void write_utf8(const std::u8string& s) {
std::fwrite(
reinterpret_cast<const char*>(s.data()), // C++20: explicit cast required
1, s.size(), stdout
);
}
Iterating Code Points over UTF-32
When you need iteration by Unicode code point rather than code unit, u32string is the simplest approach: every char32_t element is exactly one code point.
#include <string>
#include <algorithm>
std::size_t count_codepoints(const std::u32string& s) { // C++11
return s.size(); // one-to-one mapping: element == code point
}
bool contains_bmp(const std::u32string& s) { // C++11
return std::any_of(s.begin(), s.end(), [](char32_t cp) {
return cp <= U'\uFFFF';
});
}
Converting Between Encodings via <cuchar> (C++11)
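std::c32rtomb and std::mbrtoc32 provide low-level conversion between UTF-32 and the multibyte encoding of the current C locale (LC_CTYPE). That encoding is UTF-8 only if the program has selected a UTF-8 locale, for example with std::setlocale; in the default "C" locale, non-ASCII code points generally fail to convert. The functions are stateful, and re-entrant when you carry your own mbstate_t across calls: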
#include <climits> // MB_LEN_MAX
#include <cuchar> // C++11: std::c32rtomb, std::mbstate_t
#include <stdexcept> // std::runtime_error
#include <string>
// Convert a single UTF-32 code point to its UTF-8 byte sequence.
// Assumes the locale's multibyte encoding is UTF-8.
std::string cp_to_utf8(char32_t cp) {
char buf[MB_LEN_MAX];
std::mbstate_t state{};
std::size_t n = std::c32rtomb(buf, cp, &state); // C++11
if (n == static_cast<std::size_t>(-1))
throw std::runtime_error("invalid code point");
return {buf, n};
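}
std::codecvt and <codecvt> (deprecated C++17, removed C++26)
std::codecvt facets have been the standard library's encoding-conversion mechanism since C++98. The convenience layer added in C++11, namely the <codecvt> header (std::codecvt_utf8, std::codecvt_utf16, std::codecvt_utf8_utf16) together with std::wstring_convert and std::wbuffer_convert, was deprecated in C++17 and removed in C++26. Avoid it in new code; existing code should migrate to platform APIs, ICU, or <cuchar>.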
// DEPRECATED since C++17 — do not use in new code
#include <locale>
#include <codecvt> // deprecated header
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv; // deprecated C++17
std::string utf8 = conv.to_bytes(U"hello");
If you need a replacement without a third-party dependency, the <cuchar> functions or platform APIs (MultiByteToWideChar / WideCharToMultiByte on Windows, iconv on POSIX) are the practical alternatives. None of them is a drop-in substitute, so expect to adapt call sites.
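For the specific UTF-32 to UTF-8 case shown in the deprecated snippet, another dependency-free option is a small hand-written encoder. This is a swapped-in technique rather than one of the alternatives named above, and the sketch below uses hypothetical names (`append_utf8`, `to_utf8`). Unlike <cuchar>, it does not depend on the current locale.

```cpp
#include <stdexcept>
#include <string>

// Append the UTF-8 encoding of one Unicode scalar value to `out`.
// Throws std::invalid_argument for surrogates and values above U+10FFFF.
void append_utf8(std::string& out, char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
        throw std::invalid_argument("surrogate code point");
    if (cp > 0x10FFFF)
        throw std::invalid_argument("code point out of range");

    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole UTF-32 string as UTF-8 bytes held in a std::string.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    out.reserve(text.size());  // lower bound: at least one byte per code point
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}
```

With this in place, `to_utf8(U"hello")` plays the role of `conv.to_bytes(U"hello")`. The encoder rejects invalid input by throwing; substituting U+FFFD instead would be an equally reasonable policy.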
Best Practices
Treat char as UTF-8 at your boundaries. On Linux and macOS the execution charset is nearly always UTF-8. Adopt UTF-8 as your internal encoding convention and validate at ingestion points (file I/O, network, user input). Do not mix encodings silently inside the same std::string.
Use char8_t / u8string for strongly-typed UTF-8 (C++20). The distinct type prevents accidental concatenation with locale-encoded char data and documents intent clearly. The required reinterpret_cast to interface with C APIs is a deliberate friction that makes encoding boundaries visible.
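Prefer u32string for code-point-level algorithms. Counting and indexing by code point are trivially correct on char32_t sequences. A code point is not always a user-perceived character, though: combining marks and emoji sequences span several code points, so reversing or truncating text still needs grapheme-cluster awareness (for example from ICU). Convert to UTF-8 only at output boundaries.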
Normalise before comparing. Two char32_t sequences can represent the same rendered text with different code points (NFC vs NFD). The standard library does not provide normalization; use ICU's unorm2_normalize or similar for correctness under Unicode normalization rules.
Set locale once at startup, then use explicit encoding types. std::locale::global(std::locale("")) configures the C++ runtime to the system locale. Beyond that entry point, pass explicit char8_t, char16_t, or char32_t strings rather than relying on implicit locale-dependent behaviour.
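The advice above to validate at ingestion points can be sketched as a well-formedness check on incoming bytes. `is_valid_utf8` is a hypothetical function written for this illustration; production code may prefer a vetted library routine.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8: no stray continuation bytes,
// no truncated sequences, no overlong forms, no surrogates, and nothing
// above U+10FFFF.
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len;  // total length of this sequence in bytes
        char32_t cp;      // code point being assembled
        char32_t min_cp;  // smallest code point allowed for this length
        if (lead < 0x80)                { len = 1; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence runs past the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

A typical use is to call this once on data read from a file or socket and reject or repair the input before it enters the rest of the program, so that internal code can assume well-formed UTF-8.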
Common Pitfalls
strlen gives bytes, not characters. On a UTF-8 std::string, both std::strlen and s.size() count char elements (bytes), not Unicode code points. A string of five CJK characters typically occupies 15 bytes, because each of those code points takes three bytes in UTF-8.
Surrogate pairs in u16string. UTF-16 requires two char16_t elements (a surrogate pair) for code points above U+FFFF. Iterating u16string by index does not give code points — it gives code units. Use char32_t if you need code-point iteration.
Mixing u8 literals with const char* breaks in C++20. Code that compiled cleanly in C++17 may fail with -std=c++20 if it assigns u8"..." to const char*. Audit call sites and decide: cast with reinterpret_cast<const char*>, or migrate to char8_t throughout.
wchar_t is not Unicode. On Windows, wchar_t is 16-bit and the encoding is UTF-16 — which means surrogate pairs for supplementary characters. On POSIX, it is usually 32-bit. Writing wchar_t-based cross-platform Unicode code that handles supplementary characters correctly requires runtime width checks or a deliberate choice to target only one platform.
<codecvt> removal in C++26. The <codecvt> header and std::wstring_convert are no longer part of the standard as of C++26, so any codebase that uses them may stop compiling once its standard library drops them. Plan migration ahead of compiler upgrades.
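The first two pitfalls above, bytes versus code points in UTF-8 and code units versus code points in UTF-16, can be illustrated with a short sketch. Both function names are hypothetical, and replacing unpaired surrogates with U+FFFD is an assumed policy rather than the only valid one.

```cpp
#include <cstddef>
#include <string>

// Count code points in UTF-8 text by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes `utf8` is already well-formed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Decode UTF-16 code units into code points, combining surrogate pairs.
// Unpaired surrogates are replaced with U+FFFD (assumed policy).
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t unit = in[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        const bool is_low = unit >= 0xDC00 && unit <= 0xDFFF;
        const bool next_is_low = i + 1 < in.size() &&
                                 in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;
        if (is_high && next_is_low) {
            const char32_t low = in[++i];  // consume the low surrogate
            out.push_back(static_cast<char32_t>(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)));
        } else if (is_high || is_low) {
            out.push_back(static_cast<char32_t>(0xFFFD));
        } else {
            out.push_back(unit);
        }
    }
    return out;
}
```

For a string such as u"a\U0001F600", `size()` on the std::u16string reports three code units, while the decoded std::u32string holds two code points.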
See Also
<cuchar>: mbrtoc32, c32rtomb, mbrtoc16, c16rtomb for stateful multibyte/Unicode conversion (C++11)
<locale>: std::locale, std::ctype, and the remaining (non-deprecated) codecvt infrastructure
reference/library/utilities/charconv: std::from_chars / std::to_chars for locale-independent number parsing and formatting