Design#
There are two sides to pw_tokenizer, which we call tokenization and detokenization.
Tokenization converts string literals in the source code to binary tokens at compile time. If the string has printf-style arguments, these are encoded to compact binary form at runtime.
Detokenization converts tokenized strings back to the original human-readable strings.
Here’s an overview of what happens when pw_tokenizer is used:
During compilation, the pw_tokenizer module hashes string literals to generate stable 32-bit tokens. The tokenization macro removes these strings by declaring them in an ELF section that is excluded from the final binary.
After compilation, strings are extracted from the ELF to build a database of tokenized strings for use by the detokenizer. The ELF file may also be used directly.
During operation, the device encodes the string token and its arguments, if any.
The encoded tokenized strings are sent off-device or stored.
Off-device, the detokenizer tools use the token database to decode the strings to human-readable form.
Encoding#
The token is a 32-bit hash calculated during compilation. The string is encoded little-endian with the token followed by arguments, if any. For example, the 31-byte string You can go about your business. hashes to 0xdac9a244. This is encoded as 4 bytes: 44 a2 c9 da.
Arguments are encoded as follows:
Integers (1–10 bytes) – ZigZag and varint encoded, similarly to Protocol Buffers. Smaller values take fewer bytes.
Floating point numbers (4 bytes) – Single precision floating point.
Strings (1–128 bytes) – Length byte followed by the string contents. The top bit of the length indicates whether the string was truncated or not. The remaining 7 bits encode the string length, with a maximum of 127 bytes.
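For illustration, here is a minimal Python sketch of this wire format, assuming a single signed integer argument encoded as a ZigZag varint (float and string arguments follow the same token-then-arguments layout). This is a sketch of the described format, not pw_tokenizer's implementation:

import struct

def zigzag_varint(value: int) -> bytes:
    # ZigZag-map the signed value to unsigned, then varint-encode it
    # (protobuf style): 7 bits per byte, high bit set while more bytes follow.
    zigzag = (value << 1) ^ (value >> 63)  # Assumes a 64-bit argument width.
    out = bytearray()
    while True:
        byte = zigzag & 0x7F
        zigzag >>= 7
        if zigzag:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_tokenized(token: int, int_args=()) -> bytes:
    # Token as a little-endian uint32, followed by the encoded arguments.
    return struct.pack('<I', token) + b''.join(zigzag_varint(a) for a in int_args)

# "You can go about your business." hashes to 0xdac9a244 and has no arguments.
assert encode_tokenized(0xdac9a244) == bytes.fromhex('44a2c9da')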
Tip
%s arguments can quickly fill a tokenization buffer. Keep %s arguments short or avoid encoding them as strings (e.g. encode an enum as an integer instead of a string). See also Tokenized strings as %s arguments.
Token generation: fixed length hashing at compile time#
String tokens are generated using a modified version of the x65599 hash used by the SDBM project. All hashing is done at compile time.
In C code, strings are hashed with a preprocessor macro. For compatibility with macros, the hash must be limited to a fixed maximum number of characters. This value is set by PW_TOKENIZER_CFG_C_HASH_LENGTH. Increasing PW_TOKENIZER_CFG_C_HASH_LENGTH increases the compilation time for C due to the complexity of the hashing macros.
C++ tokenization uses a constexpr hash function instead of a preprocessor macro. This function works with any length of string and has a lower compilation time impact than the C macros. For consistency, C++ tokenization uses the same hash algorithm, but the calculated values will differ between C and C++ for strings longer than PW_TOKENIZER_CFG_C_HASH_LENGTH characters.
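The authoritative hash lives in pw_tokenizer itself; as a rough illustration of an x65599-style hash consistent with the description above (seeded here with the string length and truncated at a configurable maximum for C compatibility), a hedged Python sketch might look like this:

def fixed_length_65599_hash(string: str, max_length: int) -> int:
    # Accumulate char * 65599**i over at most max_length characters, modulo
    # 2**32, starting from the full string length. Illustrative only; the
    # exact modifications pw_tokenizer makes are defined by its own code.
    hash_value = len(string) & 0xFFFFFFFF
    coefficient = 65599
    for char in string[:max_length]:
        hash_value = (hash_value + coefficient * ord(char)) & 0xFFFFFFFF
        coefficient = (coefficient * 65599) & 0xFFFFFFFF
    return hash_value

# max_length plays the role of PW_TOKENIZER_CFG_C_HASH_LENGTH.
print(hex(fixed_length_65599_hash("You can go about your business.", 128)))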
Tokenized fields in protocol buffers#
Text may be represented in a few different ways:
  - Plain ASCII or UTF-8 text (This is plain text)
  - Base64-encoded tokenized message ($ibafcA==)
  - Binary-encoded tokenized message (89 b6 9f 70)
  - Little-endian 32-bit integer token (0x709fb689)
pw_tokenizer provides tools for working with protobuf fields that may contain tokenized text.
See Protobuf tokenization library for guidance on tokenizing protobufs in Python.
Tokenized field protobuf option#
pw_tokenizer provides the pw.tokenizer.format protobuf field option. This option may be applied to a protobuf field to indicate that it may contain a tokenized string. A string that is optionally tokenized is represented with a single bytes field annotated with (pw.tokenizer.format) = TOKENIZATION_OPTIONAL.
For example, the following protobuf has one field that may contain a tokenized string.
message MessageWithOptionallyTokenizedField {
  bytes just_bytes = 1;
  bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
  string just_text = 3;
}
Decoding optionally tokenized strings#
The encoding used for an optionally tokenized field is not recorded in the protobuf. Despite this, the text can reliably be decoded. This is accomplished by attempting to decode the field as binary or Base64 tokenized data before treating it like plain text.
In detail, the decoding process attempts binary detokenization first, then $-prefixed Base64 detokenization, and only then falls back to treating the field as plain text.
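A rough Python sketch of that order of operations follows; the lookup callable stands in for whatever detokenizer and token database a project uses, and is an assumption of this sketch rather than a pw_tokenizer API:

import base64

def decode_optionally_tokenized(data: bytes, lookup) -> str:
    # 1. Try the field as a binary tokenized message.
    decoded = lookup(data)
    if decoded is not None:
        return decoded

    # 2. Try the field as a $-prefixed Base64 tokenized message.
    if data.startswith(b'$'):
        try:
            binary = base64.b64decode(data[1:], validate=True)
        except ValueError:
            binary = None
        if binary is not None:
            decoded = lookup(binary)
            if decoded is not None:
                return decoded

    # 3. Fall back to plain text. If the bytes are not valid UTF-8, show them
    #    as tokenized Base64 so they can be decoded later.
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return '$' + base64.b64encode(data).decode('ascii')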
Potential decoding problems#
The decoding process for optionally tokenized fields will yield correct results in almost every situation. In rare circumstances, it is possible for it to fail, but these can be avoided with a low-overhead mitigation if desired.
There are two ways in which the decoding process may fail.
Accidentally interpreting plain text as tokenized binary#
If a plain-text string happens to decode as a binary tokenized message, the incorrect message could be displayed. This is very unlikely to occur. While many tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely that a device will happen to log one of these strings as plain text. The overwhelming majority of these strings will be nonsense.
If an implementation wishes to guard against this extremely improbable situation, it can be prevented by appending 0xFF (or another byte never valid in UTF-8) to binary tokenized data that happens to be valid UTF-8 (or to all binary tokenized messages, if desired). When decoding, if there is an extra 0xFF byte, it is discarded.
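A small Python sketch of this mitigation, assuming the project applies it only when the encoded bytes happen to be valid UTF-8:

def add_utf8_guard(encoded: bytes) -> bytes:
    # Append 0xFF (never valid in UTF-8) only when the binary tokenized
    # message would otherwise pass for UTF-8 text.
    try:
        encoded.decode('utf-8')
    except UnicodeDecodeError:
        return encoded
    return encoded + b'\xff'

def remove_utf8_guard(encoded: bytes) -> bytes:
    # The decoder discards a trailing 0xFF before detokenizing.
    return encoded[:-1] if encoded.endswith(b'\xff') else encoded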
Displaying undecoded binary as plain text instead of Base64#
If a message fails to decode as binary tokenized and it is not valid UTF-8, it is displayed as tokenized Base64. This makes it easily recognizable as a tokenized message and makes it simple to decode later from the text output (for example, with an updated token database).
A binary message for which the token is not known may coincidentally be valid UTF-8 or ASCII. 6.25% of 4-byte sequences are composed only of ASCII characters. When decoding with an out-of-date token database, it is possible that some binary tokenized messages will be displayed as plain text rather than tokenized Base64.
This situation is likely to occur, but should be infrequent. Even if it does happen, it is not a serious issue. A very small number of strings will be displayed incorrectly, but these strings cannot be decoded anyway. One nonsense string (e.g. a-D1) would be displayed instead of another ($YS1EMQ==). Updating the token database would resolve the issue, though the non-Base64 logs would be difficult to decode later from a log file.
This situation can be avoided with the same approach described in Accidentally interpreting plain text as tokenized binary. Appending an invalid UTF-8 character prevents the undecoded binary message from being interpreted as plain text.
Base64 format#
The tokenizer encodes messages to a compact binary representation. Applications may desire a textual representation of tokenized strings. This makes it easy to use tokenized messages alongside plain text messages, but comes at a small efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory as binary messages.
The Base64 format consists of a $ character followed by the Base64-encoded contents of the tokenized message. For example, consider tokenizing the string This is an example: %d! with the argument -1. The string’s token is 0x4b016e66.
Source code: PW_LOG("This is an example: %d!", -1);
Plain text: This is an example: -1! [23 bytes]
Binary: 66 6e 01 4b 01 [ 5 bytes]
Base64: $Zm4BSwE= [ 9 bytes]
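The prefixed Base64 form can be reproduced with the Python standard library applied to the binary encoding shown above:

import base64

binary_message = bytes.fromhex('666e014b01')  # Token 0x4b016e66 + argument -1.
base64_message = '$' + base64.b64encode(binary_message).decode('ascii')
assert base64_message == '$Zm4BSwE='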
See Base64 guides for guidance on encoding and decoding Base64 messages.
Token databases#
Token databases store a mapping of tokens to the strings they represent. An ELF file can be used as a token database, but it only contains the strings for its exact build. A token database file aggregates tokens from multiple ELF files, so that a single database can decode tokenized strings from any known ELF.
Token databases contain the token, removal date (if any), and string for each tokenized string.
For help with using token databases, see Managing token databases.
Token database formats#
Three token database formats are supported: CSV, binary, and directory. Tokens may also be read from ELF files or .a archives, but cannot be written to these formats.
CSV database format#
The CSV database format has three columns: the token in hexadecimal, the removal date (if any) in year-month-day format, and the string literal, surrounded by quotes. Quote characters within the string are represented as two quote characters.
This example database contains six strings, three of which have removal dates.
141c35d5, ,"The answer: ""%s"""
2e668cd6,2019-12-25,"Jello, world!"
7b940e2a, ,"Hello %s! %hd %e"
851beeb6, ,"%u %d"
881436a0,2020-01-01,"The answer is: %s"
e13b0f94,2020-04-01,"%llu"
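Because the format is plain CSV, it is easy to load into a lookup table. Here is a hedged Python sketch using only the standard library (the real pw_tokenizer tooling should be preferred in practice):

import csv

def load_csv_database(path: str) -> dict[int, str]:
    # Map token -> string literal; the removal-date column is ignored here.
    database = {}
    with open(path, newline='') as csv_file:
        for token_hex, _removal_date, string in csv.reader(csv_file):
            database[int(token_hex, 16)] = string
    return database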
Binary database format#
The binary database format consists of a 16-byte header followed by a series of 8-byte entries. Each entry stores the token and the removal date, which is 0xFFFFFFFF if there is none. The string literals are stored next in the same order as the entries. Strings are stored with null terminators. See token_database.h for full details.
The binary form of the CSV database is shown below. It contains the same information, but in a more compact and easily processed form. It takes 141 B compared with the CSV database’s 211 B.
[header]
0x00: 454b4f54 0000534e TOKENS..
0x08: 00000006 00000000 ........
[entries]
0x10: 141c35d5 ffffffff .5......
0x18: 2e668cd6 07e30c19 ..f.....
0x20: 7b940e2a ffffffff *..{....
0x28: 851beeb6 ffffffff ........
0x30: 881436a0 07e40101 .6......
0x38: e13b0f94 07e40401 ..;.....
[string table]
0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22 The answer: "%s"
0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48 .Jello, world!.H
0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00 ello %s! %hd %e.
0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72 %u %d.The answer
0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00 is: %s.%llu.
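As an illustration only, the dump above can be parsed with a sketch like the following. The field layout here is inferred from the dump (magic, entry count, 8-byte token/date entries, then null-terminated strings); token_database.h remains the authoritative definition:

import struct

def parse_binary_database(data: bytes):
    # 16-byte header: magic string plus the number of entries.
    magic, entry_count = struct.unpack_from('<8sI', data, 0)
    assert magic.rstrip(b'\0') == b'TOKENS'

    entries = []
    offset = 16
    for _ in range(entry_count):
        token, date = struct.unpack_from('<II', data, offset)
        offset += 8
        # 0xFFFFFFFF means no removal date; otherwise year/month/day appear to
        # occupy the upper 16, middle 8, and lower 8 bits (as in the dump above).
        removal = None if date == 0xFFFFFFFF else (date >> 16, (date >> 8) & 0xFF, date & 0xFF)
        entries.append((token, removal))

    # Null-terminated strings follow, in the same order as the entries.
    strings = data[offset:].split(b'\0')
    return [(token, removal, s.decode()) for (token, removal), s in zip(entries, strings)]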
Directory database format#
pw_tokenizer can consume directories of CSV databases. A directory database will be searched recursively for files with a .pw_tokenizer.csv suffix, all of which will be used for subsequent detokenization lookups.
An example directory database might look something like this:
token_database
├── chuck_e_cheese.pw_tokenizer.csv
├── fungi_ble.pw_tokenizer.csv
└── some_more
└── arcade.pw_tokenizer.csv
This format is optimized for storage in a Git repository alongside source code. The token database commands randomly generate unique file names for the CSVs in the database to prevent merge conflicts. Running mark_removed or purge commands in the database CLI consolidates the files to a single CSV.
The database command line tool supports a --discard-temporary <upstream_commit> option for add. In this mode, the tool attempts to discard temporary tokens. It identifies the latest CSV not present in the provided <upstream_commit>, and tokens present in that CSV that are not among the newly added tokens are discarded. This helps keep temporary tokens (e.g. from debug logs) out of the database.
JSON support#
While pw_tokenizer doesn’t specify a JSON database format, a token database can be created from a JSON formatted array of strings. This is useful for side-band token database generation for strings that are not embedded as parsable tokens in compiled binaries. See Create a database for instructions on generating a token database from a JSON file.
Token collisions#
Tokens are calculated with a hash function. It is possible for different strings to hash to the same token. When this happens, multiple strings will have the same token in the database, and it may not be possible to unambiguously decode a token.
The detokenization tools attempt to resolve collisions automatically. Collisions are resolved based on two things:
whether the tokenized data matches the string’s arguments (if any), and
if / when the string was marked as having been removed from the database.
See Working with token collisions for guidance on how to fix collisions.
Probability of collisions#
Hashes of any size have a collision risk. The probability of at least one collision occurring for a given number of strings is unintuitively high (this is known as the birthday problem). If fewer than 32 bits are used for tokens, the probability of collisions increases substantially.
This table shows the approximate number of strings that can be hashed to have a 1% or 50% probability of at least one collision (assuming a uniform, random hash).
Token bits | Strings for 50% collision probability | Strings for 1% collision probability
-----------|---------------------------------------|-------------------------------------
32         | 77000                                 | 9300
31         | 54000                                 | 6600
24         | 4800                                  | 580
16         | 300                                   | 36
8          | 19                                    | 3
Keep this table in mind when masking tokens (see Smaller tokens with masking). 16 bits might be acceptable when tokenizing a small set of strings, such as module names, but won’t be suitable for large sets of strings, like log messages.
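These figures follow from the standard birthday-problem approximation n ≈ sqrt(2 · 2^bits · ln(1/(1 − p))), which is easy to check:

import math

def strings_for_collision_probability(token_bits: int, probability: float) -> int:
    # Approximate number of uniformly hashed strings needed for the given
    # probability of at least one collision (birthday-problem approximation).
    return round(math.sqrt(2 * 2**token_bits * math.log(1 / (1 - probability))))

print(strings_for_collision_probability(32, 0.50))  # ~77000
print(strings_for_collision_probability(32, 0.01))  # ~9300
print(strings_for_collision_probability(16, 0.50))  # ~300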
Detokenization#
Detokenization is the process of expanding a token to the string it represents and decoding its arguments. pw_tokenizer provides Python, C++, and TypeScript detokenization libraries.
Example: decoding tokenized logs
A project might tokenize its log messages with the Base64 format. Consider the following log file, which has four tokenized logs and one plain text log:
20200229 14:38:58 INF $HL2VHA==
20200229 14:39:00 DBG $5IhTKg==
20200229 14:39:20 DBG Crunching numbers to calculate probability of success
20200229 14:39:21 INF $EgFj8lVVAUI=
20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=
The project’s log strings are stored in a database like the following:
1c95bd1c, ,"Initiating retrieval process for recovery object"
2a5388e4, ,"Determining optimal approach and coordinating vectors"
3743540c, ,"Recovery object retrieval failed with status %s"
f2630112, ,"Calculated acceptable probability of success (%.2f%%)"
Using the detokenizing tools with the database, the logs can be decoded:
20200229 14:38:58 INF Initiating retrieval process for recovery object
20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
20200229 14:39:20 DBG Crunching numbers to calculate probability of success
20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY
Note
This example uses the Base64 format, which occupies about 4/3 (133%) as much space as the default binary format when encoded. For projects that wish to interleave tokenized messages with plain text, using Base64 is a worthwhile tradeoff.
See Detokenization guides for detailed instructions on how to do detokenization in different programming languages.
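As a rough end-to-end illustration, argument-free Base64 logs like those above can be decoded with just the CSV database and the Python standard library. This is only a sketch: pw_tokenizer's Python Detokenizer is the real tool, and it also handles arguments and collisions.

import base64
import re

def detokenize_base64_line(line: str, database: dict) -> str:
    # Replace $-prefixed Base64 tokens that carry no arguments.
    def replace(match):
        binary = base64.b64decode(match.group(0)[1:])
        if len(binary) == 4:  # Token only; no encoded arguments.
            token = int.from_bytes(binary, 'little')
            return database.get(token, match.group(0))
        return match.group(0)  # Leave argument-bearing messages to real tooling.
    return re.sub(r'\$[A-Za-z0-9+/=]+', replace, line)

database = {0x1c95bd1c: 'Initiating retrieval process for recovery object'}
print(detokenize_base64_line('20200229 14:38:58 INF $HL2VHA==', database))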
Python detokenization: C99 printf compatibility notes#
This implementation is designed to align with the C99 specification, section 7.19.6. Notably, this specification is slightly different from what most compilers implement, because each compiler interprets undefined behavior in slightly different ways. Treat the following description as the source of truth.
This implementation supports:
Overall Format: %[flags][width][.precision][length][specifier]
- Flags (Zero or More)
  - -: Left-justify within the given field width; right justification is the default (see Width modifier).
  - +: Forces the result to be preceded with a plus or minus sign (+ or -) even for positive numbers. By default, only negative numbers are preceded with a - sign.
  - (space): If no sign is going to be written, a blank space is inserted before the value.
  - #: Specifies that an alternative print syntax should be used.
    - Used with o, x, or X specifiers, the value is preceded with 0, 0x, or 0X, respectively, for values different from zero.
    - Used with a, A, e, E, f, F, g, or G, it forces the written output to contain a decimal point even if no more digits follow. By default, if no digits follow, no decimal point is written.
  - 0: Left-pads the number with zeroes (0) instead of spaces when padding is specified (see width sub-specifier).
- Width (Optional)
  - (number): Minimum number of characters to be printed. If the value to be printed is shorter than this number, the result is padded with blank spaces or 0 if the 0 flag is present. The value is not truncated even if the result is larger. If the value is negative and the 0 flag is present, the 0s are padded after the - symbol.
  - *: The width is not specified in the format string, but as an additional integer value argument preceding the argument that has to be formatted.
- Precision (Optional)
  - .(number):
    - For d, i, o, u, x, X, specifies the minimum number of digits to be written. If the value to be written is shorter than this number, the result is padded with leading zeros. The value is not truncated even if the result is longer.
      - A precision of 0 means that no character is written for the value 0.
    - For a, A, e, E, f, and F, specifies the number of digits to be printed after the decimal point. By default, this is 6.
    - For g and G, specifies the maximum number of significant digits to be printed.
    - For s, specifies the maximum number of characters to be printed. By default all characters are printed until the ending null character is encountered.
    - If the period is specified without an explicit value for precision, 0 is assumed.
  - .*: The precision is not specified in the format string, but as an additional integer value argument preceding the argument that has to be formatted.
- Length (Optional)
  - hh: Usable with d, i, o, u, x, or X specifiers to convey that the argument will be a signed char or unsigned char. However, this is largely ignored in the implementation because it is not necessary for Python or argument decoding (the argument is always encoded as at least a 32-bit integer).
  - h: Usable with d, i, o, u, x, or X specifiers to convey that the argument will be a signed short int or unsigned short int. However, this is largely ignored in the implementation because it is not necessary for Python or argument decoding (the argument is always encoded as at least a 32-bit integer).
  - l: Usable with d, i, o, u, x, or X specifiers to convey that the argument will be a signed long int or unsigned long int. It is also usable with c and s to specify that the arguments will be encoded with wchar_t values (which is no different from normal char values). However, this is largely ignored in the implementation because it is not necessary for Python or argument decoding (the argument is always encoded as at least a 32-bit integer).
  - ll: Usable with d, i, o, u, x, or X specifiers to convey that the argument will be a signed long long int or unsigned long long int. This is required to properly decode the argument as a 64-bit integer.
  - L: Usable with a, A, e, E, f, F, g, or G conversion specifiers to indicate a long double argument. However, this is ignored in the implementation because the encoding of floating point values is unaffected by bit width.
  - j: Usable with d, i, o, u, x, or X specifiers to convey that the argument will be an intmax_t or uintmax_t.
  - z: Usable with d, i, o, u, x, or X specifiers to convey that the argument will be a size_t. This will force the argument to be decoded as an unsigned integer.
  - t: Usable with d, i, o, u, x, or X specifiers to convey that the argument will be a ptrdiff_t.
  - If a length modifier is provided for an incorrect specifier, it is ignored.
- Specifier (Required)
  - d / i: Used for signed decimal integers.
  - u: Used for unsigned decimal integers.
  - o: Used for unsigned decimal integers and specifies formatting should be as an octal number.
  - x: Used for unsigned decimal integers and specifies formatting should be as a hexadecimal number using all lowercase letters.
  - X: Used for unsigned decimal integers and specifies formatting should be as a hexadecimal number using all uppercase letters.
  - f: Used for floating-point values and specifies to use lowercase, decimal floating point formatting.
    - Default precision is 6 decimal places unless explicitly specified.
  - F: Used for floating-point values and specifies to use uppercase, decimal floating point formatting.
    - Default precision is 6 decimal places unless explicitly specified.
  - e: Used for floating-point values and specifies to use lowercase, exponential (scientific) formatting.
    - Default precision is 6 decimal places unless explicitly specified.
  - E: Used for floating-point values and specifies to use uppercase, exponential (scientific) formatting.
    - Default precision is 6 decimal places unless explicitly specified.
  - g: Used for floating-point values and specifies to use f or e formatting depending on which would be the shortest representation.
    - Precision specifies the number of significant digits, not just digits after the decimal place.
    - If the precision is specified as 0, it is interpreted to mean 1.
    - e formatting is used if the exponent would be less than -4 or is greater than or equal to the precision.
    - Trailing zeros are removed unless the # flag is set.
    - A decimal point only appears if it is followed by a digit.
    - NaN or infinities always follow f formatting.
  - G: Used for floating-point values and specifies to use F or E formatting depending on which would be the shortest representation.
    - Precision specifies the number of significant digits, not just digits after the decimal place.
    - If the precision is specified as 0, it is interpreted to mean 1.
    - E formatting is used if the exponent would be less than -4 or is greater than or equal to the precision.
    - Trailing zeros are removed unless the # flag is set.
    - A decimal point only appears if it is followed by a digit.
    - NaN or infinities always follow F formatting.
  - c: Used for formatting a char value.
  - s: Used for formatting a string of char values.
    - If width is specified, the null terminator character is included as a character for the width count.
    - If precision is specified, no more chars than that value will be written from the string (padding is used to fill additional width).
  - p: Used for formatting a pointer address.
  - %: Prints a single %. Only valid as %% (supports no flags, width, precision, or length modifiers).
Underspecified details:
  - If both + and (space) flags appear, the (space) is ignored.
  - The + and (space) flags will error if used with c or s.
  - The # flag will error if used with d, i, u, c, s, or p.
  - The 0 flag will error if used with c, s, or p.
  - Both + and (space) can work with the unsigned integer specifiers u, o, x, and X.
  - If a length modifier is provided for an incorrect specifier, it is ignored.
  - The z length modifier will decode arguments as signed as long as d or i is used.
  - p is implementation defined.
    - For this implementation, it will print with a 0x prefix and then the pointer value printed using %08X.
    - p supports the +, -, and (space) flags, but not the # or 0 flags.
    - None of the length modifiers are usable with p.
    - This implementation will try to adhere to user-specified width (assuming the width provided is larger than the guaranteed minimum of 10).
    - Specifying precision for p is considered an error.
  - Only %% is allowed with no other modifiers. Things like %+% will fail to decode. Some C stdlib implementations support any modifiers being present between the % characters, but ignore them for the output.
  - If a width is specified with the 0 flag for a negative value, the padded 0s will appear after the - symbol.
  - A precision of 0 for d, i, u, o, x, or X means that no character is written for the value 0.
  - Precision cannot be specified for c.
  - Using * or fixed precision with the s specifier still requires the string argument to be null-terminated. This is because argument encoding happens on the C/C++ side, while the precision value is not read or otherwise used until decoding happens in this Python code.
Non-conformant details:
  - n specifier: We do not support the n specifier, since it is impossible for us to retroactively tell the original program how many characters have been printed; this decoding happens a great deal of time after the device sent the message, usually on a separate processing device entirely.
Limitations and future work#
GCC bug: tokenization in template functions#
GCC incorrectly ignores the section attribute for template functions and variables. For example, the following won’t work when compiling with GCC and tokenized logging:
template <...>
void DoThings() {
  int value = GetValue();
  // This log won't work with tokenized logs due to the templated context.
  PW_LOG_INFO("Got value: %d", value);
  ...
}
The bug causes tokenized strings in template functions to be emitted into .rodata instead of the special tokenized string section. This causes two problems:
Tokenized strings will not be discovered by the token database tools.
Tokenized strings may not be removed from the final binary.
There are two workarounds.
Use Clang. Clang puts the string data in the requested section, as expected. No extra steps are required.
Move tokenization calls to a non-templated context. Creating a separate non-templated function and invoking it from the template resolves the issue. This enables tokenizing in most cases encountered in practice with templates.
// In .h file:

void LogThings(int value);

template <...>
void DoThings() {
  int value = GetValue();

  // This log will work: it calls the non-templated helper.
  LogThings(value);
  ...
}

// In .cc file:

void LogThings(int value) {
  // Tokenized logging works as expected in this non-templated context.
  PW_LOG_INFO("Got value %d", value);
}
There is a third option, which isn’t implemented yet, which is to compile the binary twice: once to extract the tokens, and once for the production binary (without tokens). If this is interesting to you please get in touch.
64-bit tokenization#
The Python and C++ detokenizing libraries currently assume that strings were tokenized on a system with 32-bit long, size_t, intptr_t, and ptrdiff_t. Decoding may not work correctly for these types if a 64-bit device performed the tokenization.
Supporting detokenization of strings tokenized on 64-bit targets would be simple. This could be done by adding an option to switch the 32-bit types to 64-bit. The tokenizer stores the sizes of these types in the .pw_tokenizer.info ELF section, so the sizes of these types can be verified by checking the ELF file, if necessary.
Tokenization in headers#
Tokenizing code in header files (inline functions or templates) may trigger warnings such as -Wlto-type-mismatch under certain conditions. That is because tokenization requires declaring a character array for each tokenized string. If the tokenized string includes macros that change value, the size of this character array changes, which means the same static variable is defined with different sizes. It should be safe to suppress these warnings, but, when possible, code that tokenizes strings with macros that can change value should be moved to source files rather than headers.
Tokenized strings as %s arguments#
Encoding %s string arguments is inefficient, since %s strings are encoded 1:1, with no tokenization. It would be better to send a tokenized string literal as an integer instead of a string argument, but this is not yet supported.
A string token could be sent by marking an integer % argument in a way recognized by the detokenization tools. The detokenizer would expand the argument to the string represented by the integer.
#define PW_TOKEN_ARG PRIx32 "<PW_TOKEN]"
constexpr uint32_t answer_token = PW_TOKENIZE_STRING("Uh, who is there");
PW_TOKENIZE_STRING("Knock knock: %" PW_TOKEN_ARG "?", answer_token);
Strings with arguments could be encoded to a buffer, but since printf strings are null-terminated, a binary encoding would not work. These strings can instead be encoded in the $-prefixed Base64 format and sent as %s. See Base64 format.
Another possibility: encode strings with arguments to a uint64_t and send them as an integer. This would be efficient and simple, but would only support a small number of arguments.
Deployment war story#
The tokenizer module was developed to bring tokenized logging to an in-development product. The product already had an established text-based logging system. Deploying tokenization was straightforward and had substantial benefits.
Results#
- Log contents shrunk by over 50%, even with Base64 encoding.
  - Significant size savings for encoded logs, even using the less-efficient Base64 encoding required for compatibility with the existing log system.
  - Freed valuable communication bandwidth.
  - Allowed storing many more logs in crash dumps.
- Substantial flash savings.
  - Reduced the size of firmware images by up to 18%.
- Simpler logging code.
  - Removed CPU-heavy snprintf calls.
  - Removed complex code for forwarding log arguments to a low-priority task.
This section describes the tokenizer deployment process and highlights key insights.
Firmware deployment#
In the project’s logging macro, calls to the underlying logging function were replaced with a tokenized log macro invocation.
The log level was passed as the payload argument to facilitate runtime log level control.
For this project, it was necessary to encode the log messages as text. In the handler function the log messages were encoded in the $-prefixed Base64 format, then dispatched as normal log messages.
Asserts were tokenized using a callback-based API that has since been removed (a custom macro is a better alternative).
Attention
Do not encode line numbers in tokenized strings. This results in a huge number of lines being added to the database, since every time code moves, new strings are tokenized. If pw_log_tokenized is used, line numbers are encoded in the log metadata. Line numbers may also be included by adding "%d" to the format string and passing __LINE__.
Database management#
The token database was stored as a CSV file in the project’s Git repo.
The token database was automatically updated as part of the build, and developers were expected to check in the database changes alongside their code changes.
A presubmit check verified that all strings added by a change were added to the token database.
The token database included logs and asserts for all firmware images in the project.
No strings were purged from the token database.
Tip
Merge conflicts may be a frequent occurrence with an in-source CSV database. Use the Directory database format instead.
Decoding tooling deployment#
The Python detokenizer in pw_tokenizer was deployed to two places:
  - Product-specific Python command line tools, using pw_tokenizer.Detokenizer.
  - Standalone script for decoding prefixed Base64 tokens in files or live output (e.g. from adb), using detokenize.py’s command line interface.
The C++ detokenizer library was deployed to two Android apps with a Java Native Interface (JNI) layer.
The binary token database was included as a raw resource in the APK.
In one app, the built-in token database could be overridden by copying a file to the phone.
Tip
Make the tokenized logging tools simple to use for your project.
  - Provide simple wrapper shell scripts that fill in arguments for the project. For example, point detokenize.py to the project’s token databases.
  - Use pw_tokenizer.AutoUpdatingDetokenizer to decode in continuously-running tools, so that users don’t have to restart the tool when the token database updates.
  - Integrate detokenization everywhere it is needed. Integrating the tools takes just a few lines of code, and token databases can be embedded in APKs or binaries.