pw_tokenizer - Pigweed

Stable | C11 | C++14 | Python | TypeScript | Code Size Impact: 50% reduction in binary log size

Logging is critical, but developers are often forced to choose between adding more logging and saving crucial flash space. The pw_tokenizer module helps address this by replacing printf-style strings with binary tokens during compilation. This enables extensive logging with substantially less memory usage.

Note

This usage of the term “tokenizer” is not related to parsing! The module is called tokenizer because it replaces a whole string literal with an integer token. It does not parse strings into separate tokens.

The most common application of pw_tokenizer is binary logging, and it is designed to integrate easily into existing logging systems. However, the tokenizer is general purpose and can be used to tokenize any strings, with or without printf-style arguments.
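To make the idea concrete, here is a minimal sketch of replacing a whole string with an integer token and recovering it later. It is an illustration only: the `tokenize` and `detokenize` helpers and the `token_database` dictionary are hypothetical names, and CRC-32 stands in for the hash simply because it ships with Python. It is not the hash that pw_tokenizer uses.

```python
import zlib

# Hypothetical stand-in for a token database: token -> original string.
token_database: dict[int, str] = {}


def tokenize(string: str) -> int:
    """Map a string to a 32-bit token and record it for later lookup.

    CRC-32 is an assumption made for this sketch; it is NOT the hash
    function used by pw_tokenizer.
    """
    token = zlib.crc32(string.encode("utf-8")) & 0xFFFFFFFF
    token_database[token] = string
    return token


def detokenize(token: int) -> str:
    """Recover the original string from its token."""
    return token_database[token]


fmt = "Battery state: %s; battery voltage: %d mV"
token = tokenize(fmt)

# Only the 4-byte token needs to travel with the log; the 41-character
# string stays in the database on the host side.
print(f"{len(fmt)} characters -> token 0x{token:08x}")
print(detokenize(token))
```

The same mapping works for strings without printf-style arguments, since the token identifies the whole literal rather than any part of it.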

Why tokenize strings?

Dramatically reduce binary size by removing string literals from binaries.
Reduce I/O traffic, RAM, and flash usage by sending and storing compact tokens instead of strings. We’ve seen over 50% reduction in encoded log contents.
Reduce CPU usage by replacing snprintf calls with simple tokenization code.
Remove potentially sensitive log, assert, and other strings from binaries.

See Design for a more detailed explanation of how pw_tokenizer works.

Example: tokenized logging

This example demonstrates using pw_tokenizer for logging. Here, tokenized logging saves ~90% in binary size (41 → 4 bytes) and ~70% in encoded size (49 → 15 bytes).
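The percentages can be checked with a few lines of arithmetic. This sketch assumes the tokenized message consists of a 4-byte token, a 1-byte length prefix plus the string argument, and a 2-byte varint, which matches the byte counts in the tables below. The `savings` helper is a hypothetical name used only here.

```python
fmt = "Battery state: %s; battery voltage: %d mV"
rendered = fmt % ("CHARGING", 3989)

plain_binary = len(fmt)      # string literal stored in the binary
plain_wire = len(rendered)   # formatted text sent by the device

tokenized_binary = 4         # 32-bit token replaces the literal
# token + (1-byte length prefix + "CHARGING") + 2-byte varint for 3989
tokenized_wire = 4 + (1 + len("CHARGING")) + 2


def savings(before: int, after: int) -> float:
    """Percentage reduction when going from `before` to `after` bytes."""
    return 100 * (before - after) / before


print(f"binary:  {plain_binary} -> {tokenized_binary} bytes, "
      f"{savings(plain_binary, tokenized_binary):.0f}% smaller")
print(f"encoded: {plain_wire} -> {tokenized_wire} bytes, "
      f"{savings(plain_wire, tokenized_wire):.0f}% smaller")
# binary:  41 -> 4 bytes, 90% smaller
# encoded: 49 -> 15 bytes, 69% smaller
```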

Before: plain text logging

Location	Logging Content	Size in bytes
Source contains	`LOG("Battery state: %s; battery voltage: %d mV", state, voltage);`
Binary contains	`"Battery state: %s; battery voltage: %d mV"`	41
	(log statement is called with `"CHARGING"` and `3989` as arguments)
Device transmits	`"Battery state: CHARGING; battery voltage: 3989 mV"`	49
When viewed	`"Battery state: CHARGING; battery voltage: 3989 mV"`

After: tokenized logging

Location	Logging Content	Size in bytes
Source contains	`LOG("Battery state: %s; battery voltage: %d mV", state, voltage);`
Binary contains	`d9 28 47 8e` (0x8e4728d9)	4
	(log statement is called with `"CHARGING"` and `3989` as arguments)
Device transmits	`d9 28 47 8e` (token), `08 43 48 41 52 47 49 4E 47` (`"CHARGING"` argument), `aa 3e` (`3989`, as varint)	15
When viewed	`"Battery state: CHARGING; battery voltage: 3989 mV"`
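The transmitted bytes in the table above can be reproduced with a short round-trip sketch. It rests on assumptions inferred from those bytes rather than on the module's specification: the token is sent as a little-endian 32-bit value, the string argument carries a 1-byte length prefix, and the integer is ZigZag-encoded before being written as a base-128 varint. The `encode`, `decode`, and `DATABASE` names are hypothetical, and only `%s` and `%d` are handled.

```python
import re
import struct

TOKEN = 0x8E4728D9  # token value taken from the table above
# Hypothetical host-side database: token -> original format string.
DATABASE = {TOKEN: "Battery state: %s; battery voltage: %d mV"}


def encode_varint(value: int) -> bytes:
    """ZigZag-encode a signed integer, then emit a base-128 varint."""
    zigzag = (value << 1) ^ (value >> 63)  # assumes value fits in 64 bits
    out = bytearray()
    while True:
        byte = zigzag & 0x7F
        zigzag >>= 7
        if zigzag:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int) -> tuple[int, int]:
    """Return (signed value, next position) for the varint at pos."""
    zigzag = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        zigzag |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return (zigzag >> 1) ^ -(zigzag & 1), pos


def encode(token: int, *args) -> bytes:
    """Build the message the device would transmit."""
    out = bytearray(struct.pack("<I", token))  # little-endian token
    for arg in args:
        if isinstance(arg, str):
            raw = arg.encode("utf-8")
            out.append(len(raw))  # 1-byte length prefix; assumes len < 128
            out += raw
        else:
            out += encode_varint(arg)
    return bytes(out)


def decode(message: bytes) -> str:
    """Look up the token and re-insert the decoded arguments."""
    (token,) = struct.unpack_from("<I", message)
    fmt = DATABASE[token]
    pos, args = 4, []
    for spec in re.findall(r"%([sd])", fmt):
        if spec == "s":
            length = message[pos]
            args.append(message[pos + 1 : pos + 1 + length].decode("utf-8"))
            pos += 1 + length
        else:
            value, pos = decode_varint(message, pos)
            args.append(value)
    return fmt % tuple(args)


message = encode(TOKEN, "CHARGING", 3989)
print(message.hex(" "))  # d9 28 47 8e 08 43 48 41 52 47 49 4e 47 aa 3e
print(decode(message))   # Battery state: CHARGING; battery voltage: 3989 mV
```

The device side only runs something like `encode`; the format string and the `decode` step live on the host, which is why the string never needs to be stored in the binary.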

pw_tokenizer

Compress strings to shrink logs by more than 75%
