Phases of Translation
The nine sequential phases the C++ standard mandates to transform raw source text into a linked executable.
Since C++98
The C++ standard defines nine sequential phases that transform raw source bytes into a linked executable. Each phase's output feeds directly into the next, and conforming implementations must behave as if the phases ran in strict order.
Overview
The standard does not mandate how a compiler is implemented, only that observable behaviour matches the nine-phase model. Real toolchains interleave phases for performance, but correctness is judged against the sequential model. Internalising the phase order resolves a class of otherwise mysterious bugs: why __LINE__ is replaced in Phase 4, long before any expression is evaluated, why string literal concatenation happens after preprocessing, why an undefined reference is a linker error rather than a compiler error, and why dependent names in templates resolve differently from non-dependent ones.
Phase 1: Character Mapping
The source file's physical bytes are mapped to the translation character set. Before C++23, characters outside the basic source character set were conceptually replaced by universal character names (\uNNNN / \UNNNNNNNN), and the mapping from physical source characters was implementation-defined. Since C++23 the translation character set is defined in terms of Unicode and implementations must accept UTF-8 source files, which makes encoding behaviour significantly more portable.
Practical consequence: a file saved as UTF-8 can contain non-ASCII identifiers and string content; the compiler maps them in Phase 1 before anything else runs.
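As a small illustration of universal character names, the sketch below spells one character (U+00E9) with a \u escape and checks the resulting code point. The names kEAcute, kCafe and ucn_matches_code_point are hypothetical, and only escapes are used so that the example does not depend on how the source file itself is encoded.

```cpp
#include <cassert>

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE) written as a universal character name.
// kEAcute is an illustrative name, not something defined by the standard.
constexpr char32_t kEAcute = U'\u00E9';

// A UTF-32 string literal containing the same universal character name:
// 'c', 'a', 'f', U+00E9, and the terminating null.
constexpr char32_t kCafe[] = U"caf\u00E9";

// Returns true when the escape produced exactly one UTF-32 code unit
// with the expected value.
bool ucn_matches_code_point() {
    return kEAcute == 0x00E9
        && kCafe[3] == kEAcute
        && sizeof(kCafe) / sizeof(char32_t) == 5;
}
```

Calling ucn_matches_code_point() should return true on any implementation that supports char32_t literals (C++11 and later).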
Phase 2: Line Splicing
Every line whose last character before the newline is \ is joined with the following line; the backslash-newline pair is deleted. This occurs before tokenisation, meaning the splice can break or join tokens:
#define LONG_\
MACRO 42 // defines LONG_MACRO after splicing
const char* msg = "hel\
lo"; // valid: same as "hello"
A common trap: a backslash at the end of a // comment continues the comment onto the next line, silently swallowing the following statement:
// do nothing here \
int x = 0; // this line is inside the comment, so x is never declared
Phase 3: Tokenisation into Preprocessing Tokens
The spliced character stream is decomposed into preprocessing tokens and whitespace. Categories are: header names, identifiers, preprocessing numbers, character literals, string literals, operators/punctuators, and a catch-all for individual characters that fit no other category.
The governing rule is maximal munch: the lexer always forms the longest valid token. This is why a+++b parses as (a++) + b rather than a + (++b). It is also why, prior to C++11, vector<vector<int>> was a compilation error: >> was tokenised as the right-shift operator. C++11 introduced context-sensitive angle-bracket handling to fix this.
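The maximal munch rule can be checked directly. The sketch below uses hypothetical helper names (munch_sum, make_grid); it assumes a C++11 or later compiler so that the nested >> is accepted.

```cpp
#include <cassert>
#include <vector>

// Returns the value of a+++b and reports the final value of a through out_a.
int munch_sum(int a, int b, int& out_a) {
    int sum = a+++b;   // lexed as: a ++ + b, i.e. (a++) + b
    out_a = a;         // a has been incremented once; b is untouched
    return sum;
}

// Since C++11 the closing >> of nested template argument lists is accepted.
// In C++03 this had to be written with a space: vector<vector<int> >.
std::vector<std::vector<int>> make_grid() {
    return std::vector<std::vector<int>>(2, std::vector<int>(3, 0));
}
```

With a = 1 and b = 10, munch_sum returns 11 and leaves a at 2, which is what (a++) + b predicts; a + (++b) would instead have returned 12 and left a unchanged.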
Preprocessing numbers are intentionally a superset of integer and floating-point literals: 1.5e+10 and 0x1p+3 are valid preprocessing numbers, but so is 0xe+1, which is lexed as a single token even though it is not a valid literal. Types and values are assigned only in Phase 7.
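One way to see that preprocessing numbers are purely lexical until Phase 7 is to build one by token pasting. This is a sketch under stated assumptions: PASTE, PASTE_IMPL and pasted_literal are hypothetical names introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical two-level paste helper, so that arguments are macro-expanded
// before the ## operator is applied.
#define PASTE_IMPL(a, b) a##b
#define PASTE(a, b) PASTE_IMPL(a, b)

// "1" is a preprocessing number and "e3" is an identifier. Pasting them
// yields the single preprocessing number 1e3. Only in Phase 7 is that
// token converted into a floating-point literal of type double.
double pasted_literal() {
    return PASTE(1, e3);
}
```

The preprocessor never computes a value here; it only checks that the pasted spelling forms a valid preprocessing token.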
Phase 4: Preprocessing
Directives are executed and macros are expanded: #include directives splice in other files (each included file is itself taken through Phases 1 to 4, recursively), macro definitions are recorded, macro invocations are expanded, and #if/#ifdef/#ifndef/#elif/#else/#endif prune inactive branches. #pragma, _Pragma, and #line are also consumed here.
#define MAX(a, b) ((a) > (b) ? (a) : (b))
// After Phase 4 the compiler sees no directives, only the expansion:
int x = MAX(foo(), bar());
// becomes: int x = ((foo()) > (bar()) ? (foo()) : (bar()));
The entire #include graph, often hundreds of headers, is flattened into a single preprocessing token sequence. This is why inclusion order can silently change behaviour and why headers lacking include guards cause redefinition errors.
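The MAX expansion shown earlier in this section has a consequence worth seeing in action: because the replacement is purely textual, the winning argument is evaluated twice. The sketch below assumes a hypothetical helper next_value() with a visible side effect; g_calls and max_with_side_effect are likewise illustrative names.

```cpp
#include <cassert>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

// Hypothetical helper with a visible side effect: counts how often it runs
// and returns the new count.
static int g_calls = 0;
int next_value() { return ++g_calls; }

// Resets the counter, applies MAX once, and returns the macro's result.
int max_with_side_effect() {
    g_calls = 0;
    // Expands to: ((next_value()) > (0) ? (next_value()) : (0))
    // The first call returns 1 for the comparison; the second returns 2.
    return MAX(next_value(), 0);
}
```

After one call, the function has returned 2 rather than 1 and g_calls is 2. A function template such as std::max evaluates each argument exactly once, which is one reason to prefer it over a macro.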
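Since C++20, import declarations make a module's exported declarations available without textual inclusion. Macros defined in a named module unit do not cross module boundaries, which is one of the key hygiene guarantees modules provide; importing a header unit, by contrast, does make that header's macros visible.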
Phase 5: String Literal Encoding
The encoding of each character and string literal is resolved. Source-file characters within literals are converted to the execution character set. The full range of encoding prefixes is handled here:
const char* a = "hello"; // execution narrow encoding (since C++98)
const wchar_t* b = L"hello"; // execution wide encoding (since C++98)
const char16_t* c = u"hello"; // UTF-16 (since C++11)
const char32_t* d = U"hello"; // UTF-32 (since C++11)
const char8_t* e = u8"hello"; // UTF-8, type char8_t (since C++20)
Phase 6: String Literal Concatenation
Adjacent string literal tokens are concatenated into a single literal. This is the mechanism behind multi-line string idioms:
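const char* sql =
"SELECT id, name "
"FROM users "
"WHERE active = 1"; // three tokens merged into one here
Because Phase 6 follows Phase 4, macro expansion happens first. A macro that expands to a string literal can therefore supply one half of a concatenated pair, but a macro that expands to anything else (an identifier or a number, say) cannot: both halves must already be string literal tokens before Phase 6 runs, which is why the # stringification operator is needed to turn other tokens into literals.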
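Concatenating literals with differing encoding prefixes follows defined compatibility rules. An unprefixed literal adopts the prefix of its neighbour, so mixing a narrow literal with u8 is well-defined. Mixing two different prefixes, such as u with U, is ill-formed since C++23 and was only conditionally supported before that.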
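A minimal sketch of both points in this section, using hypothetical names (APP_NAME, APP_VERSION, STRINGIFY, banner, kUtf16Units, utf16_text): what matters for Phase 6 is that each piece is a string literal token once Phase 4 has finished, and an unprefixed literal takes on the prefix of its neighbour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#define APP_NAME "phases"   // expands to a string literal token
#define APP_VERSION 3       // expands to a number, not a string literal

// Two-level stringification so APP_VERSION is expanded before # is applied.
#define STRINGIFY_IMPL(x) #x
#define STRINGIFY(x) STRINGIFY_IMPL(x)

// After Phase 4 the tokens are: "Welcome to " "phases" " v" "3"
// Phase 6 then merges them into one literal.
const char* banner() {
    return "Welcome to " APP_NAME " v" STRINGIFY(APP_VERSION);
}

// u + unprefixed: the whole result is the UTF-16 literal u"abcd",
// i.e. four code units plus the terminating null.
constexpr std::size_t kUtf16Units = sizeof(u"ab" "cd") / sizeof(char16_t);

// The result converts to const char16_t*, confirming the element type.
const char16_t* utf16_text() { return u"ab" "cd"; }

// const char32_t* bad = u"ab" U"cd";  // two different prefixes: not portable
```

Writing " v" APP_VERSION without STRINGIFY would not compile, because the number 3 is not a string literal token and is therefore never considered for concatenation.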
Phase 7: Compilation
The preprocessed, concatenated token sequence is parsed, type-checked, and translated into an object file (or equivalent). Semantic analysis, overload resolution, unqualified and qualified name lookup, template parsing (distinct from full instantiation), constant evaluation, and code generation all occur here. This phase produces the vast majority of diagnostics developers encounter.
// The extern reference is left unresolved here; the linker handles it in Phase 9.
extern int external_counter;
int read_counter() { return external_counter; }
Since C++11, constexpr functions may be evaluated here for constant expressions. Since C++20, consteval functions must be fully evaluated here: any consteval call that cannot be resolved at compile time is a hard error in Phase 7.
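Constant evaluation during Phase 7 can be made visible with static_assert. The sketch below sticks to C++11 constexpr rules (a single return statement); factorial and factorial_at_runtime are hypothetical names chosen for illustration.

```cpp
#include <cassert>

// C++11-style constexpr function: a single return statement.
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// Forces evaluation during translation: a wrong value here is a
// compile-time error, not a run-time failure.
static_assert(factorial(5) == 120, "factorial(5) is computed in Phase 7");

// The same function can still be called with a run-time argument.
int factorial_at_runtime(int n) {
    return factorial(n);
}
```

A consteval function (C++20) would allow only the first kind of use; factorial_at_runtime would then fail to compile because n is not a constant.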
Phase 8: Template Instantiation
Template specialisations required by the translation unit are instantiated. Non-dependent names were resolved in Phase 7; dependent names are looked up here, at the point of instantiation. This is two-phase lookup, mandated since C++98 and a frequent source of portability bugs when code was only tested on compilers that deferred all lookup to instantiation time.
#include <iostream>
template <typename T>
void print(const T& v) {
    // std::cout: non-dependent, looked up in Phase 7
    // v.describe(): dependent, looked up in Phase 8 as a member of T
    std::cout << v.describe() << '\n';
}
SFINAE (since C++98, formalised further in C++11) and Concepts (since C++20) constrain which instantiations are attempted, so that non-viable candidates are dropped from the overload set instead of causing hard errors in Phase 8.
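To show how SFINAE keeps a non-viable candidate from being instantiated, here is a minimal C++11 sketch built around the describe() member used above. The names describe_impl, describe_or_default and Widget are hypothetical; the int/long tag parameter is one common way to rank overloads, not the only one.

```cpp
#include <cassert>
#include <string>

// Preferred overload: only viable when v.describe() is a valid expression.
// If substitution into the return type fails, this overload is silently
// removed from the candidate set rather than producing an error.
template <typename T>
auto describe_impl(const T& v, int) -> decltype(v.describe()) {
    return v.describe();
}

// Fallback overload: always viable, but a worse match for the argument 0
// because it needs an int-to-long conversion.
template <typename T>
std::string describe_impl(const T&, long) {
    return "<no description>";
}

// Dispatches to whichever overload survives substitution.
template <typename T>
std::string describe_or_default(const T& v) {
    return describe_impl(v, 0);
}

// Example type that provides the member the preferred overload needs.
struct Widget {
    std::string describe() const { return "widget"; }
};
```

describe_or_default(Widget{}) selects the first overload, while describe_or_default(42) falls back to the second because int has no describe() member. With C++20, a requires-clause on the first overload would express the same constraint more directly.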
Phase 9: Linking
Translation units and library artefacts are combined. The linker resolves external references, merges sections, applies link-time optimisation (LTO) if enabled, and produces the final executable or shared object. Errors that survive to Phase 9 (undefined reference, multiple definition, ODR violations) indicate mismatched declarations across translation units or missing library dependencies.
// translation_a.cpp
void helper() { /* definition */ }
// translation_b.cpp
extern void helper(); // declaration only; the reference is resolved in Phase 9
void caller() { helper(); }
Common Pitfalls
Macro ordering relative to #include. Phase 4 processes directives sequentially within each translation unit. A macro intended to guard or configure a header must be #defined before the #include, not after.
Line continuation inside comments. A \ at the end of a // comment (consumed in Phase 2) silently absorbs the next line. GCC and Clang can warn about this with -Wcomment.
>> in nested templates before C++11. map<string, vector<int>> requires C++11. In C++03 a space between the two closing angle brackets is mandatory: map<string, vector<int> >.
ODR violations across translation units. If two TUs define the same class or inline function with differing bodies, the program is ill-formed with no diagnostic required, so in practice the behaviour is undefined; the linker may silently pick one definition. C++20 named modules reduce this risk for the entities they own.
Two-phase lookup in derived templates. Accessing a member of a dependent base class through an unqualified name inside a template fails on conforming compilers, because dependent bases are not searched during the Phase 7 lookup at the template's definition. Use this->member or Base<T>::member to make the name dependent so that lookup is deferred to instantiation.
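The dependent-base pitfall in the last item can be sketched as follows. Base, Derived and their members are hypothetical names; the commented-out line shows the form that conforming compilers reject.

```cpp
#include <cassert>

template <typename T>
struct Base {
    T value{};
    T get() const { return value; }
};

template <typename T>
struct Derived : Base<T> {
    explicit Derived(T v) { this->value = v; }

    T doubled() const {
        // return value * 2;      // rejected: 'value' is not found, because
        //                        // the dependent base Base<T> is not searched
        return this->value * 2;   // this-> makes the name dependent
    }

    T via_base() const {
        return Base<T>::get();    // a qualified name is dependent as well
    }
};
```

Both spellings defer the lookup to the point of instantiation, when Base<T> is a complete, known type.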
See Also
reference/language/dependent-names
reference/language/constant-expressions
reference/language/adl
reference/language/attributes