Regex Advanced: Iterators, Token Splitting, and Replacement
Deep dive into std::regex_iterator, regex_token_iterator, regex_replace, capture groups, and regex grammar flags introduced in C++11.
std::regex_iterator / std::regex_token_iterator (since C++11)
Read-only forward iterators that traverse successive matches (or sub-matches) of a compiled std::basic_regex within a character sequence, so that every occurrence of a pattern in a string can be visited in turn.
Overview
The C++11 <regex> facility is structured in three algorithmic layers: single-match primitives (regex_match, regex_search), iterative traversal (regex_iterator, regex_token_iterator), and replacement (regex_replace, which returns a new string or writes to an output iterator and never modifies its input). The iterator layer is what makes practical text processing viable: finding every occurrence of a pattern, extracting structured fields from log lines, or splitting on arbitrary delimiters.
Type aliases: the library provides concrete aliases for the common iterator types:
| Alias | Underlying type |
|---|---|
| std::sregex_iterator | regex_iterator<string::const_iterator> |
| std::cregex_iterator | regex_iterator<const char*> |
| std::sregex_token_iterator | regex_token_iterator<string::const_iterator> |
| std::cregex_token_iterator | regex_token_iterator<const char*> |
Grammar flags: std::regex_constants::syntax_option_type selects which grammar the engine parses. Six grammars are available since C++11: ECMAScript (default), basic (POSIX BRE), extended (POSIX ERE), awk, grep, and egrep. The ECMAScript grammar is the most expressive: it supports lookahead, non-greedy quantifiers, non-capturing groups ((?:...)), and backreferences. The POSIX grammars exist for interoperability with POSIX utilities and should rarely be chosen in new code.
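To make the grammar choice concrete, here is a minimal sketch that full-matches one input under a caller-selected grammar. The helper name full_match is hypothetical and not part of <regex>; the flag constants are the standard std::regex members.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: compile `pattern` under the given grammar flag and
// require the whole of `text` to match it.
bool full_match(const std::string& text, const std::string& pattern,
                std::regex::flag_type grammar) {
    const std::regex re(pattern, grammar);
    return std::regex_match(text, re);
}
```

With this helper, full_match("abab", "(?:ab)+", std::regex::ECMAScript) and full_match("abab", "(ab)+", std::regex::extended) should both succeed. The POSIX basic spelling of the same group is R"(\(ab\)*)", because grouping parentheses must be escaped in BRE syntax.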
The key behavioral difference from the single-match functions: regex_iterator advances through the input sequence matching each non-overlapping occurrence, while regex_token_iterator can yield a configurable set of capture-group indices per match, or yield the gaps between matches when passed -1.
Syntax
// regex_iterator: each dereference yields const match_results<BidirIt>&
std::regex_iterator<BidirIt> it(first, last, re);
std::regex_iterator<BidirIt> end; // default-constructed = end sentinel
// regex_token_iterator: each dereference yields const sub_match<BidirIt>&
// submatch: int, initializer_list<int>, or vector<int>
// submatch = -1 yields the gaps between matches (splitting)
std::regex_token_iterator<BidirIt> it(first, last, re, submatch);
std::regex_token_iterator<BidirIt> end;
// regex_replace: string overload
std::string result = std::regex_replace(input, re, fmt);
std::string result = std::regex_replace(input, re, fmt, flags);
// regex_replace: output-iterator overload (avoids intermediate allocation)
std::regex_replace(out_it, first, last, re, fmt);
Examples
1. Word-frequency map with sregex_iterator
#include <regex>
#include <string>
#include <unordered_map>
#include <iostream>
int main() {
    const std::string text =
        "the quick brown fox jumps over the lazy dog the fox";
    const std::regex word_re(R"(\b\w+\b)"); // C++11 raw string literal
    std::unordered_map<std::string, int> freq;
    for (std::sregex_iterator it(text.cbegin(), text.cend(), word_re), end;
         it != end; ++it)
    {
        ++freq[(*it)[0].str()]; // sub_match[0] is always the full match
    }
    for (const auto& [word, count] : freq) // C++17 structured bindings
        std::cout << word << ": " << count << '\n';
}
2. Extracting structured fields with capture-group indices
regex_token_iterator accepts an initializer list of group indices. For each match it cycles through them in order, so you get groups 1, 2, 3 for match 1, then 1, 2, 3 for match 2, and so on.
#include <regex>
#include <string>
#include <iostream>
int main() {
    const std::string log =
        "2024-01-15 ERROR kernel: segfault at 0x0\n"
        "2024-01-16 WARN disk: slow I/O\n"
        "2024-01-16 INFO net: link up\n";
    // Groups: 1=date, 2=level, 3=message
    const std::regex entry_re(
        R"((\d{4}-\d{2}-\d{2})\s+(ERROR|WARN|INFO)\s+(.+))");
    std::sregex_token_iterator it(
        log.cbegin(), log.cend(), entry_re, {1, 2, 3});
    std::sregex_token_iterator end;
    while (it != end) {
        std::string date = (it++)->str();
        std::string level = (it++)->str();
        std::string msg = (it++)->str();
        std::cout << '[' << level << "] " << date << " - " << msg << '\n';
    }
}
Each logical record requires exactly three iterator advances, one per group index in the list. Failing to advance in lockstep with the group count is the most common bug when using this overload.
3. Splitting on a delimiter with submatch = -1
Passing -1 as the sub-match index yields the regions between matches, turning regex_token_iterator into a flexible string splitter that handles multi-character and regex-based delimiters.
#include <regex>
#include <string>
#include <vector>
#include <iostream>
std::vector<std::string> split(const std::string& s, const std::string& sep) {
    const std::regex sep_re(sep);
    std::sregex_token_iterator it(s.cbegin(), s.cend(), sep_re, -1);
    return {it, std::sregex_token_iterator{}};
}
int main() {
    for (const auto& tok : split("one::two:::three::four", R"(:{2,})"))
        std::cout << tok << '\n'; // one / two / three / four
}
4. regex_replace with back-references in format strings
Format strings use $1, $2, … for numbered capture groups, $& for the entire match, $` for the prefix (the text before the match), $' for the suffix (the text after it), and $$ for a literal dollar sign.
#include <regex>
#include <string>
#include <iostream>
int main() {
    // Reformat ISO dates YYYY-MM-DD -> DD/MM/YYYY
    const std::string input = "Filed: 2024-01-15, expires 2024-12-31.";
    const std::regex date_re(R"((\d{4})-(\d{2})-(\d{2}))");
    std::cout << std::regex_replace(input, date_re, "$3/$2/$1") << '\n';
    // Filed: 15/01/2024, expires 31/12/2024.
    // Replace only the first match using the format_first_only flag (C++11).
    // The flag constants are declared in namespace std::regex_constants,
    // so they are named there rather than through match_flag_type.
    std::cout << std::regex_replace(input, date_re, "$3/$2/$1",
                                    std::regex_constants::format_first_only) << '\n';
    // Filed: 15/01/2024, expires 2024-12-31.
}
5. Streaming output with the output-iterator overload
#include <regex>
#include <string>
#include <iterator>
#include <iostream>
int main() {
    const std::string src = "user=alice&pass=s3cr3t&token=abc123";
    // Lazy \S+? stops at the first position followed by '&' or end of input.
    const std::regex secret_re(R"((pass|token)=\S+?(?=&|$))");
    // Write directly to stdout: no intermediate string allocation
    std::regex_replace(
        std::ostreambuf_iterator<char>(std::cout),
        src.cbegin(), src.cend(),
        secret_re, "$1=[REDACTED]");
    // user=alice&pass=[REDACTED]&token=[REDACTED]
}
Best Practices
Compile the regex exactly once. std::basic_regex construction parses and compiles the pattern, which is typically far more expensive than running a single match. Constructing a regex inside a hot loop or a frequently called function is a common <regex> performance mistake. Use static const locals (their initialization is thread-safe since C++11) or class members:
bool is_ipv4(const std::string& s) {
    // Checks the dotted-quad shape only; octets above 255 are not rejected.
    static const std::regex re(R"(^(\d{1,3}\.){3}\d{1,3}$)"); // compiled once
    return std::regex_match(s, re);
}
Use std::regex::optimize when the same compiled pattern will be matched against many inputs. It asks the engine to favor matching speed over construction speed; how much it helps depends on the implementation:
const std::regex re(R"(\bERROR\b)",
                    std::regex::ECMAScript | std::regex::optimize); // C++11
Use raw string literals. R"(...)" eliminates double-escaping. The pattern "\\b\\w+\\.\\w+" becomes the far more readable R"(\b\w+\.\w+)".
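Keep the regex alive for the lifetime of all iterators derived from it. regex_iterator and regex_token_iterator store a pointer to the std::basic_regex, not a copy, so destroying the regex while iterators are still live is undefined behavior. The same holds for the searched character sequence: the iterators and the sub_match objects they yield refer into it rather than copying it.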
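A minimal sketch of one way to respect the lifetime rule: keep the regex in a function-local static so it outlives every iterator built from it, and copy matches out as std::string before returning. The function name find_numbers and its pattern are illustrative assumptions.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every run of digits in `text`.
std::vector<std::string> find_numbers(const std::string& text) {
    // Static storage: the regex outlives all iterators created below.
    static const std::regex number_re(R"(\d+)");
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.cbegin(), text.cend(), number_re), end;
         it != end; ++it)
    {
        out.push_back(it->str()); // copy the match; no reference into `text` escapes
    }
    return out;
}
```

Returning owned strings rather than sub_match objects means callers do not have to keep either the regex or the input alive.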
Common Pitfalls
Catastrophic backtracking. Patterns such as (a+)+$ exhibit exponential time on near-matching strings. Any pattern that allows multiple overlapping ways to match the same prefix is a candidate. Test patterns against adversarial inputs before deploying, and be especially careful with patterns derived from user input.
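As a sketch of one mitigation, assuming the capture group is not needed: (a+)+$ and a+$ accept the same strings, but the flattened form leaves the engine far fewer ways to partition a run of a's before it gives up. The function name ends_with_a_run is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` ends with one or more 'a' characters.
// Uses the flattened pattern a+$ in place of the nested (a+)+$.
bool ends_with_a_run(const std::string& s) {
    static const std::regex re("a+$");
    return std::regex_search(s, re);
}
```

Timing both patterns against inputs such as a long run of a's followed by a single b is a quick way to check whether a rewrite actually removed the blow-up on a given implementation.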
regex_match vs regex_search confusion. regex_match requires the entire string to match; regex_search finds a match anywhere in the sequence. Using regex_match(s, re) where re matches only a sub-string will silently return false, a common and hard-to-diagnose bug.
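The difference is easiest to see with one pattern used both ways. This is a small sketch; the helper names digits_re, is_all_digits, and contains_digits are made up for illustration.

```cpp
#include <regex>
#include <string>

// Shared pattern, compiled once on first use.
const std::regex& digits_re() {
    static const std::regex re(R"(\d+)");
    return re;
}

// regex_match: the whole string must consist of digits.
bool is_all_digits(const std::string& s) {
    return std::regex_match(s, digits_re());
}

// regex_search: digits may appear anywhere in the string.
bool contains_digits(const std::string& s) {
    return std::regex_search(s, digits_re());
}
```

For the input "id 12345", is_all_digits returns false while contains_digits returns true, even though both use the same pattern.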
Mistaken sub-match cycling. When regex_token_iterator is constructed with multiple group indices, each call to ++it advances to the next group index, not the next match. Code that treats a multi-group token iterator as if it advances one match per increment will misalign groups after the first record.
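One way to sidestep the cycling issue entirely, sketched here as an alternative rather than a required approach, is to use regex_iterator: each increment advances one whole match, and groups are read by index from the match_results. LogEntry and parse_log are hypothetical names; the pattern is the one from the log-parsing example earlier in this article.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type for one parsed log line.
struct LogEntry {
    std::string date;
    std::string level;
    std::string message;
};

// Hypothetical helper: one LogEntry per match, groups addressed by index.
std::vector<LogEntry> parse_log(const std::string& log) {
    static const std::regex entry_re(
        R"((\d{4}-\d{2}-\d{2})\s+(ERROR|WARN|INFO)\s+(.+))");
    std::vector<LogEntry> entries;
    for (std::sregex_iterator it(log.cbegin(), log.cend(), entry_re), end;
         it != end; ++it)
    {
        const std::smatch& m = *it; // one increment == one full record
        entries.push_back({m[1].str(), m[2].str(), m[3].str()});
    }
    return entries;
}
```

Because every field comes from the same match object, a record can never be assembled from groups belonging to two different matches.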
Locale-sensitive character classes. [[:alpha:]] and the meaning of \w depend on the locale imbued in the std::regex object. For portable ASCII-only matching, use explicit character ranges ([a-zA-Z]) or stick with the default ECMAScript grammar with the default locale.
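As an illustration of the explicit-range approach, the following sketch validates an ASCII-only identifier without relying on \w or [[:alpha:]]. The function name is_ascii_identifier is an assumption for this example, and the regex is built without the collate flag, so the ranges are meant as plain character-code ranges.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: accepts identifiers made only of ASCII letters,
// digits, and underscores, not starting with a digit.
bool is_ascii_identifier(const std::string& s) {
    static const std::regex re("[A-Za-z_][A-Za-z0-9_]*");
    return std::regex_match(s, re);
}
```

Spelling out the ranges makes the accepted character set visible in the pattern itself, independent of which locale is in effect when the regex is constructed.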
See Also
std::regex_match / std::regex_search: single-match entry points
std::match_results / std::smatch: the match container type yielded by regex_iterator
std::sub_match: the half-open iterator pair type representing each capture group
std::regex_error: exception thrown for malformed patterns, with a code() member from std::regex_constants::error_type