API reference#
Compatibility#
C11
C++14
Python 3
Tokenization#
Tokenization converts a string literal to a token. If it’s a printf-style string, its arguments are encoded along with it. The results of tokenization can be sent off device or stored in place of a full string.
-
typedef uint32_t pw_tokenizer_Token#
The type of the 32-bit token used in place of a string. Also available as
pw::tokenizer::Token.
Tokenization macros#
Adding tokenization to a project is simple. To tokenize a string, include
pw_tokenizer/tokenize.h and invoke one of the PW_TOKENIZE_ macros.
Tokenize a string literal#
pw_tokenizer provides macros for tokenizing string literals with no
arguments.
-
PW_TOKENIZE_STRING(string_literal)#
Converts a string literal to a pw_tokenizer_Token (uint32_t) token in a
standalone statement. C and C++ compatible. In C++, the string may be a
literal or a constexpr char array, including function variables like
__func__. In C, the argument must be a string literal. In either case, the
string must be null terminated, but may contain any characters (including
‘\0’).

constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");
-
PW_TOKENIZE_STRING_DOMAIN(domain, string_literal)#
Tokenizes a string literal in a standalone statement using the specified domain. C and C++ compatible.
-
PW_TOKENIZE_STRING_MASK(domain, mask, string_literal)#
Tokenizes a string literal in a standalone statement using the specified domain and bit mask. C and C++ compatible.
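For example, a minimal sketch that tokenizes into an illustrative "metrics" domain, plus a masked variant that keeps only the bottom 16 bits (the domain name and mask value are assumptions for illustration):

constexpr uint32_t token =
    PW_TOKENIZE_STRING_DOMAIN("metrics", "Battery low");
constexpr uint32_t masked_token =
    PW_TOKENIZE_STRING_MASK("metrics", 0x0000FFFF, "Battery low");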
The tokenization macros above cannot be used inside other expressions.
Yes: Assign PW_TOKENIZE_STRING to a constexpr variable.
constexpr uint32_t kGlobalToken = PW_TOKENIZE_STRING("Wowee Zowee!");
void Function() {
constexpr uint32_t local_token = PW_TOKENIZE_STRING("Wowee Zowee?");
}
No: Use PW_TOKENIZE_STRING in another expression.
void BadExample() {
ProcessToken(PW_TOKENIZE_STRING("This won't compile!"));
}
Use PW_TOKENIZE_STRING_EXPR instead.
An alternate set of macros is provided for use inside expressions. These make
use of lambda functions, so while they can be used inside expressions, they
require C++ and cannot be assigned to constexpr variables or be used with
special function variables like __func__.
-
PW_TOKENIZE_STRING_EXPR(string_literal)#
Converts a string literal to a uint32_t token within an expression. Requires C++.

DoSomething(PW_TOKENIZE_STRING_EXPR("Succeed"));
-
PW_TOKENIZE_STRING_DOMAIN_EXPR(domain, string_literal)#
Tokenizes a string literal using the specified domain within an expression. Requires C++.
-
PW_TOKENIZE_STRING_MASK_EXPR(domain, mask, string_literal)#
Tokenizes a string literal using the specified domain and bit mask within an expression. Requires C++.
When to use these macros
Use PW_TOKENIZE_STRING and related macros to tokenize string
literals that do not need %-style arguments encoded.
Yes: Use PW_TOKENIZE_STRING_EXPR within other expressions.
void GoodExample() {
ProcessToken(PW_TOKENIZE_STRING_EXPR("This will compile!"));
}
No: Assign PW_TOKENIZE_STRING_EXPR to a constexpr variable.
constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR("This won't compile!");
Instead, use PW_TOKENIZE_STRING to assign to a constexpr variable.
No: Tokenize __func__ in PW_TOKENIZE_STRING_EXPR.
void BadExample() {
// This compiles, but __func__ will not be the outer function's name, and
// there may be compiler warnings.
constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR(__func__);
}
Instead, use PW_TOKENIZE_STRING to tokenize __func__ or similar macros.
Tokenize a message with arguments to a buffer#
-
PW_TOKENIZE_TO_BUFFER(buffer, buffer_size_pointer, format, ...)#
Encodes a tokenized string and arguments to the provided buffer. The size of
the buffer is passed via a pointer to a size_t. After encoding is complete,
the size_t is set to the number of bytes written to the buffer.

The macro’s arguments are equivalent to the following function signature:
TokenizeToBuffer(void* buffer, size_t* buffer_size_pointer, const char* format, ...); // printf-style arguments
For example, the following encodes a tokenized string with a temperature to a buffer. The buffer is passed to a function to send the message over a UART.
uint8_t buffer[32];
size_t size_bytes = sizeof(buffer);
PW_TOKENIZE_TO_BUFFER(
    buffer, &size_bytes, "Temperature (C): %0.2f", temperature_c);
MyProject_EnqueueMessageForUart(buffer, size_bytes);
While PW_TOKENIZE_TO_BUFFER is very flexible, it must be passed a buffer, which increases its code size footprint at the call site.
-
PW_TOKENIZE_TO_BUFFER_DOMAIN(domain, buffer, buffer_size_pointer, format, ...)#
Same as PW_TOKENIZE_TO_BUFFER, but tokenizes to the specified domain.
-
PW_TOKENIZE_TO_BUFFER_MASK(domain, mask, buffer, buffer_size_pointer, format, ...)#
Same as
PW_TOKENIZE_TO_BUFFER_DOMAIN, but applies a bit mask to the token.
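For example, a minimal sketch that encodes a message to a buffer in an illustrative "metrics" domain (the domain name, message, and battery_percent variable are assumptions for illustration):

uint8_t buffer[32];
size_t size_bytes = sizeof(buffer);
PW_TOKENIZE_TO_BUFFER_DOMAIN(
    "metrics", buffer, &size_bytes, "Battery: %d%%", battery_percent);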
Why use this macro
Encode a tokenized message for consumption within a function.
Encode a tokenized message into an existing buffer.
Avoid using PW_TOKENIZE_TO_BUFFER in widely expanded macros, such as a
logging macro, because it will result in larger code size than passing the
tokenized data to a function.
Tokenize a message with arguments in a custom macro#
Projects can leverage the tokenization machinery in whichever way best suits
their needs. The most efficient way to use pw_tokenizer is to pass tokenized
data to a global handler function. A project’s custom tokenization macro can
handle tokenized data in a function of its choosing.
pw_tokenizer provides two low-level macros that projects can use to create
custom tokenization macros.
-
PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...)#
Tokenizes a format string with optional arguments and sets the
_pw_tokenizer_token variable to the token. Must be used in its own scope,
since the same variable is used in every invocation.

The tokenized string uses the specified tokenization domain. Use
PW_TOKENIZER_DEFAULT_DOMAIN for the default. The token also may be masked;
use UINT32_MAX to keep all bits.

This macro checks that the printf-style format string matches the arguments,
stores the format string in a special section, and calculates the string’s
token at compile time.
-
PW_TOKENIZER_ARG_TYPES(...)#
Converts a series of arguments to a compact format that replaces the format
string literal. Evaluates to a pw_tokenizer_ArgTypes value.

Depending on the size of pw_tokenizer_ArgTypes, the bottom 4 or 6 bits store
the number of arguments and the remaining bits store the types, two bits per
type. The arguments are not evaluated; only their types are used.
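As a minimal sketch, a custom macro could forward the token and argument types to a variadic handler. MY_TOKENIZE_TO_HANDLER and MyHandleTokenizedMessage are hypothetical names, and the sketch assumes at least one format argument (handling the zero-argument case requires additional preprocessor support, such as Pigweed’s PW_COMMA_ARGS):

// Implemented by the project; encodes and transmits the message.
void MyHandleTokenizedMessage(pw_tokenizer_Token token,
                              pw_tokenizer_ArgTypes types,
                              ...);

// Tokenizes the format string, then passes the results to the handler.
#define MY_TOKENIZE_TO_HANDLER(format, ...)                              \
  do {                                                                   \
    PW_TOKENIZE_FORMAT_STRING(                                           \
        PW_TOKENIZER_DEFAULT_DOMAIN, UINT32_MAX, format, __VA_ARGS__);   \
    MyHandleTokenizedMessage(_pw_tokenizer_token,                        \
                             PW_TOKENIZER_ARG_TYPES(__VA_ARGS__),        \
                             __VA_ARGS__);                               \
  } while (0)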
The outputs of these macros are typically passed to an encoding function. That
function encodes the token, argument types, and argument data to a buffer using
helpers provided by pw_tokenizer/encode_args.h.
-
size_t pw::tokenizer::EncodeArgs(pw_tokenizer_ArgTypes types, va_list args, span<std::byte> output)#
Encodes a tokenized string’s arguments to a buffer. The pw_tokenizer_ArgTypes
parameter specifies the argument types, in place of a format string.

Most tokenization implementations should use the EncodedMessage class.
-
template<size_t kMaxSizeBytes = PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES>
class EncodedMessage#

Encodes a tokenized message to a fixed-size buffer. By default, the buffer
size is set by the PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES config macro.
This class is used to encode tokenized messages passed in from tokenization
macros.

To use pw::tokenizer::EncodedMessage, construct it with the token, argument
types, and va_list from the variadic arguments:

void SendLogMessage(span<std::byte> log_data);

extern "C" void TokenizeToSendLogMessage(pw_tokenizer_Token token,
                                         pw_tokenizer_ArgTypes types,
                                         ...) {
  va_list args;
  va_start(args, types);
  EncodedMessage encoded_message(token, types, args);
  va_end(args);

  SendLogMessage(encoded_message);  // EncodedMessage converts to span
}
-
size_t pw_tokenizer_EncodeArgs(pw_tokenizer_ArgTypes types, va_list args, void *output_buffer, size_t output_buffer_size)#
C function that encodes arguments to a tokenized buffer. Use the
pw::tokenizer::EncodeArgs() function from C++.
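For example, a C handler might write the 4-byte token followed by the encoded arguments. This is a minimal sketch: SendBytes is a hypothetical transport function, and copying the raw token bytes assumes a little-endian token layout in the output message.

#include <stdarg.h>
#include <stdint.h>
#include <string.h>

#include "pw_tokenizer/encode_args.h"

void SendBytes(const void* data, size_t size);  // hypothetical transport

void HandleTokenizedMessage(pw_tokenizer_Token token,
                            pw_tokenizer_ArgTypes types,
                            ...) {
  uint8_t buffer[32];
  memcpy(buffer, &token, sizeof(token));  // 4-byte token comes first

  va_list args;
  va_start(args, types);
  const size_t arg_bytes = pw_tokenizer_EncodeArgs(
      types, args, buffer + sizeof(token), sizeof(buffer) - sizeof(token));
  va_end(args);

  SendBytes(buffer, sizeof(token) + arg_bytes);
}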
Tokenizing function names#
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (constexpr const char[]). In GCC and Clang, the
special __func__ variable and __PRETTY_FUNCTION__ extension are declared
as static constexpr char[] in C++ instead of the standard static const
char[]. This means that __func__ and __PRETTY_FUNCTION__ can be
tokenized while compiling C++ with GCC or Clang.
// Tokenize the special function name variables.
constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);
Note that __func__ and __PRETTY_FUNCTION__ are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, printf(__func__ ": %d",
123); will not compile.
Buffer sizing helper#
-
template<typename ...ArgTypes>
constexpr size_t pw::tokenizer::MinEncodingBufferSizeBytes()#

Calculates the minimum buffer size to allocate that is guaranteed to support
encoding the specified arguments.
The contents of strings are NOT included in this total. The string’s length/status byte is guaranteed to fit, but the string contents may be truncated. Encoding is considered to succeed as long as the string’s length/status byte is written, even if the actual string is truncated.
Examples:
Message with no arguments: MinEncodingBufferSizeBytes() == 4

Message with an int argument: MinEncodingBufferSizeBytes<int>() == 9 (4 + 5)
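For example, a minimal sketch that sizes a stack buffer for a message with an int and a float argument (the argument types are illustrative):

constexpr size_t kBufferSize =
    pw::tokenizer::MinEncodingBufferSizeBytes<int, float>();
std::byte buffer[kBufferSize];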
Tokenization in Python#
The Python pw_tokenizer.encode module has limited support for encoding
tokenized messages with the encode_token_and_args function.
- pw_tokenizer.encode.encode_token_and_args(token: int, *args: Union[int, float, bytes, str]) → bytes#
Encodes a tokenized message given its token and arguments.
This function assumes that the token represents a format string with conversion specifiers that correspond with the provided argument types. Currently, only 32-bit integers are supported.
This function requires that a string’s token has already been calculated. Typically these tokens are provided by a database, but they can be manually created using the tokenizer hash.
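For example, a minimal sketch that computes a token with the tokenizer hash and encodes it with one integer argument (the format string 'Battery: %d' and the argument value are illustrative):

import pw_tokenizer.encode
import pw_tokenizer.tokens

token = pw_tokenizer.tokens.pw_tokenizer_65599_hash('Battery: %d')
encoded = pw_tokenizer.encode.encode_token_and_args(token, 95)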
- pw_tokenizer.tokens.pw_tokenizer_65599_hash(string: Union[str, bytes], *, hash_length: Optional[int] = None) → int#
Hashes the string with the hash function used to generate tokens in C++.
This hash function is used to calculate tokens from strings in Python. It is not used when extracting tokens from an ELF, since the token is stored in the ELF as part of tokenization.
This is particularly useful for offline token database generation in cases where tokenized strings in a binary cannot be embedded as parsable pw_tokenizer entries.
Note
In C, the hash length of a string has a fixed limit controlled by
PW_TOKENIZER_CFG_C_HASH_LENGTH. To match tokens produced by C (as opposed
to C++) code, pw_tokenizer_65599_hash() should be called with a matching
hash length limit. When creating an offline database, it’s a good idea to
generate tokens for both, and merge the databases.
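For example, a sketch that limits the hash length to match C-produced tokens (128 mirrors the default PW_TOKENIZER_CFG_C_HASH_LENGTH, which a project may override):

c_token = pw_tokenizer.tokens.pw_tokenizer_65599_hash(
    'Temperature: %d', hash_length=128)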
Protobuf tokenization library#
The pw_tokenizer.proto Python module defines functions that may be used to
detokenize protobuf objects in Python. The function
pw_tokenizer.proto.detokenize_fields() detokenizes all fields annotated as
tokenized, replacing them with their detokenized version. For example:
my_detokenizer = pw_tokenizer.Detokenizer(some_database)
my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)
assert my_message.tokenized_field == b'The detokenized string! Cool!'
pw_tokenizer.proto#
Utilities for working with tokenized fields in protobufs.
- pw_tokenizer.proto.decode_optionally_tokenized(detokenizer: Detokenizer, data: bytes, prefix: str = '$') → str#
Decodes data that may be plain text or binary / Base64 tokenized text.
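A minimal usage sketch, assuming some_database is a token database loaded elsewhere and the Base64 payload is illustrative:

detokenizer = pw_tokenizer.Detokenizer(some_database)
text = pw_tokenizer.proto.decode_optionally_tokenized(
    detokenizer, b'$YS1EMQ==')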
- pw_tokenizer.proto.detokenize_fields(detokenizer: Detokenizer, proto: Message, prefix: str = '$') → None#
Detokenizes fields annotated as tokenized in the given proto.
The fields are replaced with their detokenized version in the proto. Tokenized fields are bytes fields, so the detokenized string is stored as bytes. Call .decode() to convert the detokenized string from bytes to str.