API reference#
Compatibility#
C11
C++14
Python 3
Tokenization#
Tokenization converts a string literal to a token. If it’s a printf-style string, its arguments are encoded along with it. The results of tokenization can be sent off device or stored in place of a full string.
-
typedef uint32_t pw_tokenizer_Token#
The type of the 32-bit token used in place of a string. Also available as
pw::tokenizer::Token.
Tokenization macros#
Adding tokenization to a project is simple. To tokenize a string, include
pw_tokenizer/tokenize.h and invoke one of the PW_TOKENIZE_ macros.
Tokenize a string literal#
pw_tokenizer provides macros for tokenizing string literals with no
arguments.
-
PW_TOKENIZE_STRING(string_literal)#
Converts a string literal to a pw_tokenizer_Token (uint32_t) token in a
standalone statement. C and C++ compatible. In C++, the string may be a
literal or a constexpr char array, including function variables like
__func__. In C, the argument must be a string literal. In either case, the
string must be null terminated, but may contain any characters (including
‘\0’).

constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");
-
PW_TOKENIZE_STRING_DOMAIN(domain, string_literal)#
Tokenizes a string literal in a standalone statement using the specified domain. C and C++ compatible.
-
PW_TOKENIZE_STRING_MASK(domain, mask, string_literal)#
Tokenizes a string literal in a standalone statement using the specified domain and bit mask. C and C++ compatible.
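For example, a minimal sketch that tokenizes into an illustrative "metrics" domain, plus a masked variant that keeps only the bottom 16 bits (the domain name and mask value are assumptions for illustration):

constexpr uint32_t token =
    PW_TOKENIZE_STRING_DOMAIN("metrics", "Battery low");
constexpr uint32_t masked_token =
    PW_TOKENIZE_STRING_MASK("metrics", 0x0000FFFF, "Battery low");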
The tokenization macros above cannot be used inside other expressions.
Yes: Assign PW_TOKENIZE_STRING to a constexpr variable.
constexpr uint32_t kGlobalToken = PW_TOKENIZE_STRING("Wowee Zowee!");
void Function() {
constexpr uint32_t local_token = PW_TOKENIZE_STRING("Wowee Zowee?");
}
No: Use PW_TOKENIZE_STRING in another expression.
void BadExample() {
ProcessToken(PW_TOKENIZE_STRING("This won't compile!"));
}
Use PW_TOKENIZE_STRING_EXPR instead.
An alternate set of macros is provided for use inside expressions. These make
use of lambda functions, so while they can be used inside expressions, they
require C++ and cannot be assigned to constexpr variables or be used with
special function variables like __func__.
-
PW_TOKENIZE_STRING_EXPR(string_literal)#
Converts a string literal to a uint32_t token within an expression. Requires C++.

DoSomething(PW_TOKENIZE_STRING_EXPR("Succeed"));
-
PW_TOKENIZE_STRING_DOMAIN_EXPR(domain, string_literal)#
Tokenizes a string literal using the specified domain within an expression. Requires C++.
-
PW_TOKENIZE_STRING_MASK_EXPR(domain, mask, string_literal)#
Tokenizes a string literal using the specified domain and bit mask within an expression. Requires C++.
When to use these macros
Use PW_TOKENIZE_STRING and related macros to tokenize string
literals that do not need %-style arguments encoded.
Yes: Use PW_TOKENIZE_STRING_EXPR within other expressions.
void GoodExample() {
ProcessToken(PW_TOKENIZE_STRING_EXPR("This will compile!"));
}
No: Assign PW_TOKENIZE_STRING_EXPR to a constexpr variable.
constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR("This won't compile!");
Instead, use PW_TOKENIZE_STRING to assign to a constexpr variable.
No: Tokenize __func__ in PW_TOKENIZE_STRING_EXPR.
void BadExample() {
// This compiles, but __func__ will not be the outer function's name, and
// there may be compiler warnings.
constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR(__func__);
}
Instead, use PW_TOKENIZE_STRING to tokenize __func__ or similar macros.
Tokenize a message with arguments to a buffer#
-
PW_TOKENIZE_TO_BUFFER(buffer, buffer_size_pointer, format, ...)#
Encodes a tokenized string and arguments to the provided buffer. The size of
the buffer is passed via a pointer to a size_t. After encoding is complete,
the size_t is set to the number of bytes written to the buffer.

The macro’s arguments are equivalent to the following function signature:
TokenizeToBuffer(void* buffer, size_t* buffer_size_pointer, const char* format, ...); // printf-style arguments
For example, the following encodes a tokenized string with a temperature to a buffer. The buffer is passed to a function to send the message over a UART.
uint8_t buffer[32];
size_t size_bytes = sizeof(buffer);
PW_TOKENIZE_TO_BUFFER(
    buffer, &size_bytes, "Temperature (C): %0.2f", temperature_c);
MyProject_EnqueueMessageForUart(buffer, size_bytes);
While PW_TOKENIZE_TO_BUFFER is very flexible, it must be passed a buffer, which increases its code size footprint at the call site.
-
PW_TOKENIZE_TO_BUFFER_DOMAIN(domain, buffer, buffer_size_pointer, format, ...)#
Same as PW_TOKENIZE_TO_BUFFER, but tokenizes to the specified domain.
-
PW_TOKENIZE_TO_BUFFER_MASK(domain, mask, buffer, buffer_size_pointer, format, ...)#
Same as
PW_TOKENIZE_TO_BUFFER_DOMAIN, but applies a bit mask to the token.
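For example, a minimal sketch that encodes a message to a buffer in an illustrative "metrics" domain (the domain name, message, and battery_percent variable are assumptions for illustration):

uint8_t buffer[32];
size_t size_bytes = sizeof(buffer);
PW_TOKENIZE_TO_BUFFER_DOMAIN(
    "metrics", buffer, &size_bytes, "Battery: %d%%", battery_percent);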
Why use this macro
Encode a tokenized message for consumption within a function.
Encode a tokenized message into an existing buffer.
Avoid using PW_TOKENIZE_TO_BUFFER in widely expanded macros, such as a
logging macro, because it will result in larger code size than passing the
tokenized data to a function.
Tokenize a message with arguments in a custom macro#
Projects can leverage the tokenization machinery in whichever way best suits
their needs. The most efficient way to use pw_tokenizer is to pass tokenized
data to a global handler function. A project’s custom tokenization macro can
handle tokenized data in a function of its choosing.
pw_tokenizer provides two low-level macros that projects can use to create
custom tokenization macros.
-
PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...)#
Tokenizes a format string with optional arguments and sets the
_pw_tokenizer_token variable to the token. Must be used in its own scope,
since the same variable is used in every invocation.

The tokenized string uses the specified tokenization domain. Use
PW_TOKENIZER_DEFAULT_DOMAIN for the default. The token also may be masked;
use UINT32_MAX to keep all bits.

This macro checks that the printf-style format string matches the arguments,
stores the format string in a special section, and calculates the string’s
token at compile time.
-
PW_TOKENIZER_ARG_TYPES(...)#
Converts a series of arguments to a compact format that replaces the format
string literal. Evaluates to a pw_tokenizer_ArgTypes value.

Depending on the size of pw_tokenizer_ArgTypes, the bottom 4 or 6 bits store
the number of arguments and the remaining bits store the types, two bits per
type. The arguments are not evaluated; only their types are used.
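As a minimal sketch, a custom macro could forward the token and argument types to a variadic handler. MY_TOKENIZE_TO_HANDLER and MyHandleTokenizedMessage are hypothetical names, and the sketch assumes at least one format argument (handling the zero-argument case requires additional preprocessor support, such as Pigweed’s PW_COMMA_ARGS):

// Implemented by the project; encodes and transmits the message.
void MyHandleTokenizedMessage(pw_tokenizer_Token token,
                              pw_tokenizer_ArgTypes types,
                              ...);

// Tokenizes the format string, then passes the results to the handler.
#define MY_TOKENIZE_TO_HANDLER(format, ...)                              \
  do {                                                                   \
    PW_TOKENIZE_FORMAT_STRING(                                           \
        PW_TOKENIZER_DEFAULT_DOMAIN, UINT32_MAX, format, __VA_ARGS__);   \
    MyHandleTokenizedMessage(_pw_tokenizer_token,                        \
                             PW_TOKENIZER_ARG_TYPES(__VA_ARGS__),        \
                             __VA_ARGS__);                               \
  } while (0)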
The outputs of these macros are typically passed to an encoding function. That
function encodes the token, argument types, and argument data to a buffer using
helpers provided by pw_tokenizer/encode_args.h.
-
size_t pw::tokenizer::EncodeArgs(pw_tokenizer_ArgTypes types, va_list args, span<std::byte> output)#
Encodes a tokenized string’s arguments to a buffer. The pw_tokenizer_ArgTypes
parameter specifies the argument types, in place of a format string.

Most tokenization implementations should use the EncodedMessage class.
-
template<size_t kMaxSizeBytes = PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES>
class EncodedMessage#

Encodes a tokenized message to a fixed-size buffer. By default, the buffer
size is set by the PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES config macro.
This class is used to encode tokenized messages passed in from tokenization
macros.

To use pw::tokenizer::EncodedMessage, construct it with the token, argument
types, and va_list from the variadic arguments:

void SendLogMessage(span<std::byte> log_data);

extern "C" void TokenizeToSendLogMessage(pw_tokenizer_Token token,
                                         pw_tokenizer_ArgTypes types,
                                         ...) {
  va_list args;
  va_start(args, types);
  EncodedMessage encoded_message(token, types, args);
  va_end(args);

  SendLogMessage(encoded_message);  // EncodedMessage converts to span
}
-
size_t pw_tokenizer_EncodeArgs(pw_tokenizer_ArgTypes types, va_list args, void *output_buffer, size_t output_buffer_size)#
C function that encodes arguments to a tokenized buffer. Use the
pw::tokenizer::EncodeArgs() function from C++.
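For example, a C handler might write the 4-byte token followed by the encoded arguments. This is a minimal sketch: SendBytes is a hypothetical transport function, and copying the raw token bytes assumes a little-endian token layout in the output message.

#include <stdarg.h>
#include <stdint.h>
#include <string.h>

#include "pw_tokenizer/encode_args.h"

void SendBytes(const void* data, size_t size);  // hypothetical transport

void HandleTokenizedMessage(pw_tokenizer_Token token,
                            pw_tokenizer_ArgTypes types,
                            ...) {
  uint8_t buffer[32];
  memcpy(buffer, &token, sizeof(token));  // 4-byte token comes first

  va_list args;
  va_start(args, types);
  const size_t arg_bytes = pw_tokenizer_EncodeArgs(
      types, args, buffer + sizeof(token), sizeof(buffer) - sizeof(token));
  va_end(args);

  SendBytes(buffer, sizeof(token) + arg_bytes);
}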
Tokenizing function names#
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (constexpr const char[]). In GCC and Clang, the
special __func__ variable and __PRETTY_FUNCTION__ extension are declared
as static constexpr char[] in C++ instead of the standard static const
char[]. This means that __func__ and __PRETTY_FUNCTION__ can be
tokenized while compiling C++ with GCC or Clang.
// Tokenize the special function name variables.
constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);
Note that __func__ and __PRETTY_FUNCTION__ are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, printf(__func__ ": %d",
123); will not compile.
Buffer sizing helper#
-
template<typename ...ArgTypes>
constexpr size_t pw::tokenizer::MinEncodingBufferSizeBytes()#

Calculates the minimum buffer size to allocate that is guaranteed to support
encoding the specified arguments.
The contents of strings are NOT included in this total. The string’s length/status byte is guaranteed to fit, but the string contents may be truncated. Encoding is considered to succeed as long as the string’s length/status byte is written, even if the actual string is truncated.
Examples:
Message with no arguments: MinEncodingBufferSizeBytes() == 4

Message with an int argument: MinEncodingBufferSizeBytes<int>() == 9 (4 + 5)
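For example, a minimal sketch that sizes a stack buffer for a message with an int and a float argument (the argument types are illustrative):

constexpr size_t kBufferSize =
    pw::tokenizer::MinEncodingBufferSizeBytes<int, float>();
std::byte buffer[kBufferSize];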
Tokenization in Python#
The Python pw_tokenizer.encode module has limited support for encoding
tokenized messages with the encode_token_and_args function.
- pw_tokenizer.encode.encode_token_and_args(token: int, *args: Union[int, float, bytes, str]) → bytes#
Encodes a tokenized message given its token and arguments.
This function assumes that the token represents a format string with conversion specifiers that correspond with the provided argument types. Currently, only 32-bit integers are supported.
This function requires that a string’s token has already been calculated. Typically these tokens are provided by a database, but they can be manually created using the tokenizer hash.
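For example, a minimal sketch that computes a token with the tokenizer hash and encodes it with one integer argument (the format string 'Battery: %d' and the argument value are illustrative):

import pw_tokenizer.encode
import pw_tokenizer.tokens

token = pw_tokenizer.tokens.pw_tokenizer_65599_hash('Battery: %d')
encoded = pw_tokenizer.encode.encode_token_and_args(token, 95)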
- pw_tokenizer.tokens.pw_tokenizer_65599_hash(string: Union[str, bytes], *, hash_length: Optional[int] = None) → int#
Hashes the string with the hash function used to generate tokens in C++.
This hash function is used to calculate tokens from strings in Python. It is not used when extracting tokens from an ELF, since the token is stored in the ELF as part of tokenization.
This is particularly useful for offline token database generation in cases where tokenized strings in a binary cannot be embedded as parsable pw_tokenizer entries.
Note
In C, the hash length of a string has a fixed limit controlled by
PW_TOKENIZER_CFG_C_HASH_LENGTH. To match tokens produced by C (as opposed
to C++) code, pw_tokenizer_65599_hash() should be called with a matching
hash length limit. When creating an offline database, it’s a good idea to
generate tokens for both, and merge the databases.
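For example, a sketch that limits the hash length to match C-produced tokens (128 mirrors the default PW_TOKENIZER_CFG_C_HASH_LENGTH, which a project may override):

c_token = pw_tokenizer.tokens.pw_tokenizer_65599_hash(
    'Temperature: %d', hash_length=128)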
Protobuf tokenization library#
The pw_tokenizer.proto Python module defines functions that may be used to
detokenize protobuf objects in Python. The function
pw_tokenizer.proto.detokenize_fields() detokenizes all fields annotated as
tokenized, replacing them with their detokenized version. For example:
my_detokenizer = pw_tokenizer.Detokenizer(some_database)
my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)
assert my_message.tokenized_field == b'The detokenized string! Cool!'
pw_tokenizer.proto#
Utilities for working with tokenized fields in protobufs.
- pw_tokenizer.proto.decode_optionally_tokenized(detokenizer: Detokenizer, data: bytes, prefix: str = '$') → str#
Decodes data that may be plain text or binary / Base64 tokenized text.
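A minimal usage sketch, assuming some_database is a token database loaded elsewhere and the Base64 payload is illustrative:

detokenizer = pw_tokenizer.Detokenizer(some_database)
text = pw_tokenizer.proto.decode_optionally_tokenized(
    detokenizer, b'$YS1EMQ==')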
- pw_tokenizer.proto.detokenize_fields(detokenizer: Detokenizer, proto: Message, prefix: str = '$') → None#
Detokenizes fields annotated as tokenized in the given proto.
The fields are replaced with their detokenized version in the proto. Tokenized fields are bytes fields, so the detokenized string is stored as bytes. Call .decode() to convert the detokenized string from bytes to str.