Size Optimizations#

This page contains recommendations for optimizing the size of embedded software including its memory and code footprints.

These recommendations are subject to change as the C++ standard and compilers evolve, and as the authors continue to gain more knowledge and experience in this area. If you disagree with recommendations, please discuss them with the Pigweed team, as we’re always looking to improve the guide or correct any inaccuracies.

Compile Time Constant Expressions#

Using constexpr, and consteval as of C++20, lets you evaluate functions and variables at compile time rather than only at run time. This often results in both smaller binaries and faster execution.

We highly encourage using this aspect of C++. However, there is one caveat: be careful when marking functions constexpr in APIs which cannot easily be changed in the future, unless you can prove that the computation can be done at compile time for all time and on all platforms. This is because there is no “mutable” escape hatch for constexpr.

See the Embedded C++ Guide for more detail.
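As a minimal sketch of the idea, the following computes a checksum of a string at compile time. The Checksum helper and the kBanner constant are hypothetical names chosen for illustration; they are not part of Pigweed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sums the bytes of a buffer modulo 256.
constexpr std::uint8_t Checksum(const char* data, std::size_t size) {
  std::uint8_t sum = 0;
  for (std::size_t i = 0; i < size; ++i) {
    sum = static_cast<std::uint8_t>(sum + static_cast<std::uint8_t>(data[i]));
  }
  return sum;
}

inline constexpr char kBanner[] = "boot";

// Declaring the result constexpr requires the compiler to evaluate the call
// at compile time, so only the one-byte result needs to be stored.
inline constexpr std::uint8_t kBannerChecksum =
    Checksum(kBanner, sizeof(kBanner) - 1);

// 'b' + 'o' + 'o' + 't' = 436, and 436 % 256 = 180.
static_assert(kBannerChecksum == 180, "evaluated at compile time");
```

The same function can still be called at run time with non-constant arguments, which is why a constexpr signature is a promise that is hard to retract later.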

Templates#

The compiler implements templates by generating a separate version of the function for each set of types it is instantiated with. This can increase code size significantly.

Be careful when instantiating non-trivial template functions with multiple types.

Consider splitting templated interfaces into multiple layers so that more of the implementation can be shared between different instantiations. A more advanced form is to share common logic internally through a default sentinel template argument value and the generic instantiation it selects, such as pw::Vector’s size_t kMaxSize = vector_impl::kGeneric or pw::span’s size_t Extent = dynamic_extent.
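The layering idea can be sketched as follows, assuming a hypothetical fixed-capacity byte stack. GenericByteStack and ByteStack are illustrative names, not Pigweed APIs: the non-templated base holds all of the logic and is compiled once, while the template only supplies storage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Non-templated layer: the logic lives here and is shared by every capacity.
class GenericByteStack {
 public:
  bool Push(std::uint8_t value) {
    if (size_ == capacity_) {
      return false;
    }
    buffer_[size_++] = value;
    return true;
  }

  bool Pop(std::uint8_t& value) {
    if (size_ == 0) {
      return false;
    }
    value = buffer_[--size_];
    return true;
  }

  std::size_t size() const { return size_; }

 protected:
  GenericByteStack(std::uint8_t* buffer, std::size_t capacity)
      : buffer_(buffer), capacity_(capacity) {}

 private:
  std::uint8_t* buffer_;
  std::size_t capacity_;
  std::size_t size_ = 0;
};

// Templated layer: a thin wrapper which only provides the storage, so each
// additional instantiation should add very little code.
template <std::size_t kCapacity>
class ByteStack : public GenericByteStack {
 public:
  ByteStack() : GenericByteStack(storage_, kCapacity) {}

 private:
  std::uint8_t storage_[kCapacity];
};
```

Functions that operate on the stack can take a GenericByteStack& so that they are not themselves instantiated once per capacity.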

Virtual Functions#

Virtual functions provide runtime polymorphism. Unless runtime polymorphism is required, virtual functions should be avoided. Virtual functions require a virtual table and a pointer to it in each instance, which increases RAM usage, and they require extra instructions at each call site. Virtual functions can also inhibit compiler optimizations, since the compiler may not be able to tell which functions will actually be invoked. This can prevent linker garbage collection, resulting in unused functions being linked into a binary.

When runtime polymorphism is required, virtual functions should be considered. C alternatives, such as a struct of function pointers, could be used instead, but these approaches may offer no performance advantage while sacrificing flexibility and ease of use.

Only use virtual functions when runtime polymorphism is needed. Lastly, try to avoid templated virtual interfaces, which can compound the cost by instantiating many virtual tables.
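The per-instance cost can be observed directly with sizeof. The sketch below uses two hypothetical counter types which differ only in whether Next() is virtual; the exact sizes depend on the target ABI.

```cpp
#include <cassert>
#include <cstddef>

// No virtual functions: an instance holds only its data members.
struct PlainCounter {
  int Next() { return ++count; }
  int count = 0;
};

// One virtual function: on common ABIs every instance also carries a pointer
// to the class's virtual table.
struct VirtualCounter {
  virtual int Next() { return ++count; }
  int count = 0;
};

// Extra bytes per instance; the value is ABI dependent.
constexpr std::size_t kPerInstanceOverhead =
    sizeof(VirtualCounter) - sizeof(PlainCounter);

static_assert(sizeof(VirtualCounter) > sizeof(PlainCounter),
              "the virtual table pointer is stored in each instance");
```

The virtual table itself is an additional one-time cost per class on top of this per-instance overhead.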

Devirtualization#

When you do use virtual functions, try to keep devirtualization in mind. You can improve the odds by declaring classes as final, which makes it easier for the compiler and linker to prove which implementation is called. This can help significantly depending on your toolchain.

If you’re interested in more details, this is an interesting deep dive.
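A minimal sketch of the final recommendation, using hypothetical Sensor and FixedSensor classes. Whether the calls are actually devirtualized depends on your toolchain and optimization settings.

```cpp
#include <cassert>

class Sensor {
 public:
  virtual int Read() = 0;

 protected:
  // Non-virtual and protected: instances are never deleted through Sensor*.
  ~Sensor() = default;
};

// `final` tells the compiler that no class can override Read() any further.
class FixedSensor final : public Sensor {
 public:
  int Read() override { return 42; }
};

// Because FixedSensor is final, the compiler knows the dynamic type here and
// is free to call (or inline) FixedSensor::Read() directly instead of going
// through the virtual table.
inline int ReadTwice(FixedSensor& sensor) {
  return sensor.Read() + sensor.Read();
}
```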

Initialization, Constructors, Finalizers, and Destructors#

Constructors#

Where possible, consider making your constructors constexpr to reduce their cost. This also makes global instances eligible for placement in the .data section, or in the .bss section if they are all zeros.
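For illustration, here is a sketch with a hypothetical UartConfig class. The section placement described in the comments is what a typical toolchain is expected to do; it is worth confirming in your own map file.

```cpp
#include <cassert>
#include <cstdint>

class UartConfig {
 public:
  constexpr UartConfig(std::uint32_t baud_rate, bool parity)
      : baud_rate_(baud_rate), parity_(parity) {}

  constexpr std::uint32_t baud_rate() const { return baud_rate_; }
  void set_baud_rate(std::uint32_t baud_rate) { baud_rate_ = baud_rate; }

 private:
  std::uint32_t baud_rate_;
  bool parity_;
};

// Mutable global with a constexpr constructor and constant arguments: it can
// be constant-initialized, so no constructor call is needed at startup and
// the object is eligible for .data.
UartConfig g_uart_config(115200, false);

// All members are zero, which makes this instance a candidate for .bss.
UartConfig g_spare_config(0, false);

static_assert(UartConfig(9600, true).baud_rate() == 9600);
```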

Static Destructors And Finalizers#

For many embedded projects, cleaning up after the program is not a requirement, meaning the exit functions including any finalizers registered through atexit, at_quick_exit, and static destructors can all be removed to reduce the size.

The exact mechanics for disabling static destructors depend on your toolchain.

See the Ignored Finalizer and Destructor Registration section below for further details regarding disabling registration of functions to be run at exit via atexit and at_quick_exit.

Clang#

With modern versions of Clang you can simply use -fno-c++-static-destructors and you are done.

GCC with newlib-nano#

With GCC this is more complicated. For example with GCC for ARM Cortex M devices using newlib-nano you are forced to tackle the problem in two stages.

First, there are the destructors for the static global objects. These can be placed in the .fini_array and .fini input sections through the use of the -fno-use-cxa-atexit GCC flag, assuming newlib-nano was configured with HAVE_INITFINI_ARRAY_SUPPORT. The two input sections can then be explicitly discarded in the linker script through the use of the special /DISCARD/ output section:

/DISCARD/ : {
  /* The finalizers are never invoked when the target shuts down and ergo
   * can be discarded. These include C++ global static destructors and C
   * designated finalizers. */
  *(.fini_array);
  *(.fini);
}

Second, there are the destructors for the scoped static objects, frequently referred to as Meyers Singletons. With the Itanium ABI these use __cxa_atexit to register destruction on the fly. However, if -fno-use-cxa-atexit is used with GCC and newlib-nano, these will appear as __tcf_ prefixed symbols, for example __tcf_0.

There’s an interesting proposal (P1247R0) to add [[no_destroy]] attributes to C++ which would be tempting to use here. Alas, this is not an option yet. As mentioned in the proposal, one way to remove the destructors from these scoped statics is to wrap them in a templated wrapper which uses placement new.

#include <new>          // placement new
#include <type_traits>  // std::aligned_storage_t
#include <utility>      // std::forward

template <class T>
class NoDestroy {
 public:
  template <class... Ts>
  NoDestroy(Ts&&... ts) {
    new (&static_) T(std::forward<Ts>(ts)...);
  }

  T& get() { return reinterpret_cast<T&>(static_); }

 private:
  std::aligned_storage_t<sizeof(T), alignof(T)> static_;
};

This can then be used as follows to instantiate scoped statics where the destructor will never be invoked and ergo will not be linked in.

Foo& GetFoo() {
  static NoDestroy<Foo> foo(foo_args);
  return foo.get();
}

Strings#

Tokenization#

Instead of directly using strings and printf, consider using pw_tokenizer to replace strings and printf-style formatted strings with binary tokens during compilation. This can reduce the code size, memory usage, I/O traffic, and even CPU utilization by replacing snprintf calls with simple tokenization code.

Be careful when using string arguments with tokenization as these still result in a string in your binary which is appended to your token at run time.
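To illustrate the general idea only, the sketch below replaces a format string with a 32-bit value computed at compile time. It uses a plain FNV-1a hash and the hypothetical HashToken and SendToken names; it is not how pw_tokenizer is implemented and it does not use the pw_tokenizer API.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Illustrative 32-bit FNV-1a hash. This is NOT the hash used by pw_tokenizer.
constexpr std::uint32_t HashToken(std::string_view text) {
  std::uint32_t hash = 2166136261u;
  for (char c : text) {
    hash ^= static_cast<std::uint8_t>(c);
    hash *= 16777619u;
  }
  return hash;
}

// constexpr forces the hash to be computed during compilation, so typically
// only the 4-byte token, and not the string, has to be kept in the binary.
constexpr std::uint32_t kBatteryLowToken = HashToken("Battery low: %d%%");

// Hypothetical transport hook; here it just records the last token sent.
inline std::uint32_t g_last_token = 0;
inline void SendToken(std::uint32_t token) { g_last_token = token; }
```

A host-side tool that knows the same strings can then map received tokens back to readable text.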

String Formatting#

The formatted output family of printf functions in <cstdio> is quite expensive from a code size point of view, and it often relies on malloc. Instead, where tokenization cannot be used, consider using pw_string’s utilities.

Removing all printf functions often saves more than 5KiB of code size on ARM Cortex M devices using newlib-nano.

Logging & Asserting#

Using tokenized backends for logging and asserting, such as pw_log_tokenized coupled with pw_assert_log, can drastically reduce the costs. However, even with this approach there remains a call site cost, due to the arguments and the included metadata, which can add up.

Try to avoid string arguments and reduce unnecessary extra arguments where possible. Also consider adjusting log levels to compile out debug or even info logs as code stabilizes and matures.
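The effect of compiling out lower log levels can be sketched with a hypothetical macro. APP_LOG, APP_LOG_MIN_LEVEL, and EmitLog are illustrative names and not the pw_log API; the point is only that the level check happens at compile time, so disabled call sites generate no code.

```cpp
#include <cassert>

#define APP_LOG_LEVEL_DEBUG 1
#define APP_LOG_LEVEL_INFO 2
#define APP_LOG_LEVEL_WARN 3

// Hypothetical build-time setting; raise it as the code matures.
#ifndef APP_LOG_MIN_LEVEL
#define APP_LOG_MIN_LEVEL APP_LOG_LEVEL_WARN
#endif

// Stand-in for a real log backend; it only counts emitted messages.
inline int g_emitted_logs = 0;
inline void EmitLog(int /*level*/, const char* /*message*/) {
  ++g_emitted_logs;
}

// The comparison is a constant expression, so call sites below the minimum
// level are discarded by the compiler together with their arguments.
#define APP_LOG(level, message)                   \
  do {                                            \
    if constexpr ((level) >= APP_LOG_MIN_LEVEL) { \
      EmitLog((level), (message));                \
    }                                             \
  } while (0)
```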

Future Plans#

Going forward, Pigweed is evaluating extra configuration options, such as dropping log arguments for certain log levels and modules, to give users finer-grained control over the trade-off between diagnostic value and size cost.

Threading and Synchronization Cost#

Lighter-weight Signaling Primitives#

Consider using pw::sync::ThreadNotification instead of semaphores, as thread notifications can be implemented using more efficient RTOS-specific signaling primitives. For example, on FreeRTOS they can be backed by direct task notifications, which are more than 10x smaller than semaphores while also being faster.

Threads and their stack sizes#

Although synchronous APIs are incredibly portable and often easier to reason about, it is often easy to forget the large stack cost this design paradigm comes with. We highly recommend watermarking your stacks to reduce wasted memory.

Our snapshot integration for RTOSes such as pw_thread_freertos and pw_thread_embos comes with built-in support to report stack watermarks for threads if enabled in the kernel.
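If your kernel does not provide watermarking, the technique itself is simple. The sketch below assumes a stack that grows downward from the end of the buffer, and uses the hypothetical PaintStack and UnusedStackWords helpers; it is not how any particular RTOS or Pigweed backend implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kStackFillPattern = 0xA5A5A5A5u;

// Fills the whole stack region with a known pattern before the thread runs.
inline void PaintStack(std::uint32_t* stack, std::size_t words) {
  for (std::size_t i = 0; i < words; ++i) {
    stack[i] = kStackFillPattern;
  }
}

// Counts the words which still hold the pattern, scanning from the lowest
// address. Assumes a descending stack, so untouched words are at the bottom.
inline std::size_t UnusedStackWords(const std::uint32_t* stack,
                                    std::size_t words) {
  std::size_t unused = 0;
  while (unused < words && stack[unused] == kStackFillPattern) {
    ++unused;
  }
  return unused;
}
```

Sampling the unused word count after the system has been exercised gives an estimate of how much each stack can be trimmed, keeping some margin for paths that were not hit.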

In addition, consider using asynchronous design patterns such as Active Objects which can use pw_work_queue or similar asynchronous dispatch work queues to effectively permit the sharing of stack allocations.

Buffer Sizing#

We’d be remiss not to mention the sizing of the various buffers that may exist in your application. You could consider watermarking them with pw_metric. You may also be able to adjust their servicing interval and priority, but do not forget to take the ingress burst sizes and scheduling jitter into account.

Standard C and C++ libraries#

Toolchains are typically distributed with their preferred standard C library and standard C++ library of choice for the target platform.

Although you do not always have a choice in what standard C library and what standard C++ library is used or even how it’s compiled, we recommend always keeping an eye out for common sources of bloat.

Assert#

The standard C library provides the assert function or macro, which may be used internally even if your application does not invoke it directly. Although this can be disabled through NDEBUG, there typically is not a portable way of replacing the assert(condition) implementation without configuring and recompiling your standard C library.

However, you can consider replacing the implementation at link time with a cheaper implementation. For example newlib-nano, which comes with the GNU Arm Embedded Toolchain, often has an expensive __assert_func implementation which uses fiprintf to print to stderr before invoking abort(). This can be replaced with a simple PW_CRASH invocation which can save several kilobytes in case fiprintf isn’t used elsewhere.

One option to remove this bloat is to use --wrap at link time to replace these implementations. For example, in GN you could do this with the following BUILD.gn file:

import("//build_overrides/pigweed.gni")

import("$dir_pw_build/target_types.gni")

# Wraps the function called by newlib's implementation of assert from assert.h.
#
# When using this, we suggest injecting :wrapped_newlib_assert via
# pw_build_LINK_DEPS.
config("wrap_newlib_assert") {
  ldflags = [ "-Wl,--wrap=__assert_func" ]
}

# Implements the function called by newlib's implementation of assert from
# assert.h which invokes __assert_func unless NDEBUG is defined.
pw_source_set("wrapped_newlib_assert") {
  sources = [ "wrapped_newlib_assert.cc" ]
  deps = [
    "$dir_pw_assert:check",
    "$dir_pw_preprocessor",
  ]
}

And a wrapped_newlib_assert.cc source file implementing the wrapped assert function:

#include "pw_assert/check.h"
#include "pw_preprocessor/compiler.h"

// This is defined by <cassert>
extern "C" PW_NO_RETURN void __wrap___assert_func(const char*,
                                                  int,
                                                  const char*,
                                                  const char*) {
  PW_CRASH("libc assert() failure");
}

Ignored Finalizer and Destructor Registration#

Even if no cleanup is done during shutdown for your target, shutdown functions such as atexit, at_quick_exit, and __cxa_atexit sometimes cannot be linked out. This may be due to vendor code or the use of scoped statics, also known as Meyers Singletons.

The registration of these destructors and finalizers may include locks, malloc, and more depending on your standard C library and its configuration.

One option to remove this bloat is to use --wrap at link time to replace these implementations with ones which do nothing. For example, in GN you could do this with the following BUILD.gn file:

import("//build_overrides/pigweed.gni")

import("$dir_pw_build/target_types.gni")

config("wrap_atexit") {
  ldflags = [
    "-Wl,--wrap=atexit",
    "-Wl,--wrap=at_quick_exit",
    "-Wl,--wrap=__cxa_atexit",
  ]
}

# Implements atexit, at_quick_exit, and __cxa_atexit from stdlib.h with noop
# versions for targets which do not clean up during exit and quick_exit.
#
# This removes any dependencies which may exist in your existing libc.
# Although this removes the ability for things such as Meyers Singletons,
# i.e. non-global statics, to register destruction functions, it does not
# permit them to be garbage collected by the linker.
pw_source_set("wrapped_noop_atexit") {
  sources = [ "wrapped_noop_atexit.cc" ]
}

And a wrapped_noop_atexit.cc source file implementing the noop functions:

// These two are defined by <cstdlib>.
extern "C" int __wrap_atexit(void (*)(void)) { return 0; }
extern "C" int __wrap_at_quick_exit(void (*)(void)) { return 0; }

// This function is part of the Itanium C++ ABI, there is no header which
// provides this.
extern "C" int __wrap___cxa_atexit(void (*)(void*), void*, void*) { return 0; }

Unexpected Bloat in Disabled STL Exceptions#

The GCC manual recommends using -fno-exceptions along with -fno-unwind-tables to disable exceptions and any associated overhead. This should replace all throw statements with calls to abort().

However, what we’ve noticed with GCC and libstdc++ is that there is a risk that the STL will still throw exceptions when the application is compiled with -fno-exceptions, and there is no way for you to catch them. This can occur because libraries such as libstdc++.a may not have been compiled with -fno-exceptions even though your application is linked against them. In theory, this is not unsafe because the unhandled exception will invoke abort() via std::terminate().

See this for more information.

Unfortunately there can be significant overhead surrounding these throw call sites in the std::__throw_* helper functions. These implementations such as std::__throw_out_of_range_fmt(const char*, ...) and their snprintf and ergo malloc dependencies can very quickly add up to many kilobytes of unnecessary overhead.

One option to remove this bloat while also making sure that the exceptions will actually result in an effective abort() is to use --wrap at link time to replace these implementations with ones which simply call PW_CRASH.

For example, in GN you could do this with the following BUILD.gn file. Note that the mangled names must be used:

import("//build_overrides/pigweed.gni")

import("$dir_pw_build/target_types.gni")

# Wraps the std::__throw_* functions called by GNU ISO C++ Library regardless
# of whether "-fno-exceptions" is specified.
#
# When using this, we suggest injecting :wrapped_libstdc++_functexcept via
# pw_build_LINK_DEPS.
config("wrap_libstdc++_functexcept") {
  ldflags = [
    "-Wl,--wrap=_ZSt21__throw_bad_exceptionv",
    "-Wl,--wrap=_ZSt17__throw_bad_allocv",
    "-Wl,--wrap=_ZSt16__throw_bad_castv",
    "-Wl,--wrap=_ZSt18__throw_bad_typeidv",
    "-Wl,--wrap=_ZSt19__throw_logic_errorPKc",
    "-Wl,--wrap=_ZSt20__throw_domain_errorPKc",
    "-Wl,--wrap=_ZSt24__throw_invalid_argumentPKc",
    "-Wl,--wrap=_ZSt20__throw_length_errorPKc",
    "-Wl,--wrap=_ZSt20__throw_out_of_rangePKc",
    "-Wl,--wrap=_ZSt24__throw_out_of_range_fmtPKcz",
    "-Wl,--wrap=_ZSt21__throw_runtime_errorPKc",
    "-Wl,--wrap=_ZSt19__throw_range_errorPKc",
    "-Wl,--wrap=_ZSt22__throw_overflow_errorPKc",
    "-Wl,--wrap=_ZSt23__throw_underflow_errorPKc",
    "-Wl,--wrap=_ZSt19__throw_ios_failurePKc",
    "-Wl,--wrap=_ZSt19__throw_ios_failurePKci",
    "-Wl,--wrap=_ZSt20__throw_system_errori",
    "-Wl,--wrap=_ZSt20__throw_future_errori",
    "-Wl,--wrap=_ZSt25__throw_bad_function_callv",
  ]
}

# Implements the std::__throw_* functions called by GNU ISO C++ Library
# regardless of whether "-fno-exceptions" is specified with PW_CRASH.
pw_source_set("wrapped_libstdc++_functexcept") {
  sources = [ "wrapped_libstdc++_functexcept.cc" ]
  deps = [
    "$dir_pw_assert:check",
    "$dir_pw_preprocessor",
  ]
}

And a wrapped_libstdc++_functexcept.cc source file implementing each wrapped and mangled std::__throw_* function:

#include "pw_assert/check.h"
#include "pw_preprocessor/compiler.h"

// These are all wrapped implementations of the throw functions provided by
// libstdc++'s bits/functexcept.h which are not needed when "-fno-exceptions"
// is used.

// std::__throw_bad_exception(void)
extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_bad_exceptionv() {
  PW_CRASH("std::throw_bad_exception");
}

// std::__throw_bad_alloc(void)
extern "C" PW_NO_RETURN void __wrap__ZSt17__throw_bad_allocv() {
  PW_CRASH("std::throw_bad_alloc");
}

// std::__throw_bad_cast(void)
extern "C" PW_NO_RETURN void __wrap__ZSt16__throw_bad_castv() {
  PW_CRASH("std::throw_bad_cast");
}

// std::__throw_bad_typeid(void)
extern "C" PW_NO_RETURN void __wrap__ZSt18__throw_bad_typeidv() {
  PW_CRASH("std::throw_bad_typeid");
}

// std::__throw_logic_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_logic_errorPKc(const char*) {
  PW_CRASH("std::throw_logic_error");
}

// std::__throw_domain_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_domain_errorPKc(const char*) {
  PW_CRASH("std::throw_domain_error");
}

// std::__throw_invalid_argument(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_invalid_argumentPKc(
    const char*) {
  PW_CRASH("std::throw_invalid_argument");
}

// std::__throw_length_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_length_errorPKc(const char*) {
  PW_CRASH("std::throw_length_error");
}

// std::__throw_out_of_range(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_out_of_rangePKc(const char*) {
  PW_CRASH("std::throw_out_of_range");
}

// std::__throw_out_of_range_fmt(const char*, ...)
extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_out_of_range_fmtPKcz(
    const char*, ...) {
  PW_CRASH("std::throw_out_of_range");
}

// std::__throw_runtime_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_runtime_errorPKc(
    const char*) {
  PW_CRASH("std::throw_runtime_error");
}

// std::__throw_range_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_range_errorPKc(const char*) {
  PW_CRASH("std::throw_range_error");
}

// std::__throw_overflow_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt22__throw_overflow_errorPKc(
    const char*) {
  PW_CRASH("std::throw_overflow_error");
}

// std::__throw_underflow_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt23__throw_underflow_errorPKc(
    const char*) {
  PW_CRASH("std::throw_underflow_error");
}

// std::__throw_ios_failure(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKc(const char*) {
  PW_CRASH("std::throw_ios_failure");
}

// std::__throw_ios_failure(const char*, int)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKci(const char*,
                                                                  int) {
  PW_CRASH("std::throw_ios_failure");
}

// std::__throw_system_error(int)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_system_errori(int) {
  PW_CRASH("std::throw_system_error");
}

// std::__throw_future_error(int)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_future_errori(int) {
  PW_CRASH("std::throw_future_error");
}

// std::__throw_bad_function_call(void)
extern "C" PW_NO_RETURN void __wrap__ZSt25__throw_bad_function_callv() {
  PW_CRASH("std::throw_bad_function_call");
}

Compiler and Linker Optimizations#

Compiler Optimization Options#

Don’t forget to configure your compiler to optimize for size if needed. With Clang this is -Oz and with GCC this can be done via -Os. The GN toolchains provided through pw_toolchain which are optimized for size are suffixed with *_size_optimized.

Garbage collect function and data sections#

By default, all functions in an object file are placed within the same “section” (e.g. .text). With Clang and GCC you can use -ffunction-sections and -fdata-sections to give each function and data item its own “section” (e.g. .text.do_foo_function). This permits you to pass --gc-sections to the linker to cull any sections which are not referenced.

To see what sections were garbage collected you can pass --print-gc-sections to the linker so it prints out what was removed.

The GN toolchains provided through pw_toolchain are configured to do this by default.

Function Inlining#

Don’t forget to expose trivial functions such as member accessors as inline definitions in the header. The compiler and linker can then make the trade-off on whether the function should actually be inlined based on your optimization settings; defining it in the header at least gives them the option. Note that LTO can inline functions which are not defined in headers.

We stand by the Google style guide to recommend considering this for simple functions which are 10 lines or less.
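As a brief sketch, a header for a hypothetical RingCounter class could expose its trivial members in both of the usual ways:

```cpp
#include <cassert>
#include <cstdint>

// Contents of a hypothetical ring_counter.h header.
class RingCounter {
 public:
  // Defined inside the class body, so it is implicitly inline.
  std::uint32_t value() const { return value_; }

  void Increment();

 private:
  std::uint32_t value_ = 0;
};

// Defined in the header but outside the class body: the explicit `inline`
// is needed so that including the header from several translation units
// does not produce multiple definition errors at link time.
inline void RingCounter::Increment() { ++value_; }
```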

Link Time Optimization (LTO)#

Pigweed AI summary: Link Time Optimization (LTO) can reduce binary size and improve performance for embedded projects, but it comes with trade-offs. Benefits include eliminating function call overhead and reducing the number of instructions. Costs include interacting poorly with linker scripts, making debugging harder, producing misleading crash reports, and significantly increasing build times.

Summary: LTO can decrease your binary size, at a cost: LTO makes debugging harder, interacts poorly with linker scripts, and makes crash reports less informative. We advise only enabling LTO when absolutely necessary.

Link time optimization (LTO) moves some optimizations from the individual compile steps to the final link step, to enable optimizing across translation unit boundaries.

LTO can both increase performance and reduce binary size for embedded projects. This appears to be a strict improvement, and one might think that enabling LTO at all times is the best approach. In practice, however, LTO is a trade-off.

LTO benefits

Reduces binary size - When compiling with size-shrinking flags like -Oz, some function call overhead can be eliminated, and code paths might be eliminated by the optimizer after inlining. This can include critical abstraction removal like devirtualization.
Improves performance - When code is inlined, the optimizer can better reduce the number of instructions. When code is smaller, the instruction cache has better hit ratio leading to better performance. In some cases, entire function calls are eliminated.

LTO costs

LTO interacts poorly with linker scripts - Production embedded projects often have complicated linker scripts to control the physical layout of code and data on the device. For example, you may want to put performance-critical audio codec functions into the fast tightly coupled memory (TCM) region. However, LTO can interact with linker script requirements in strange ways, such as inlining manually placed code into other functions that live in the wrong region, leading to hard-to-understand bugs.
Debugging LTO binaries is harder - LTO increases the differences between the machine code and the source code. This makes stepping through source code in a debugger confusing, since the instruction pointer can hop around unpredictably.
Crash reports for LTO binaries can be misleading - Just as with debugging, LTO’d binaries can produce confusing stacks in crash reports.
LTO significantly increases build times - The compilation model is different when LTO is enabled: compiling an individual translation unit (.cc -> .o) now produces LLVM or GCC IR instead of native machine code, and machine code is only generated at the link phase. This makes the final link step take significantly longer. Since any source change results in a link step, developer velocity is reduced by the slow link.
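To make the linker script concern more concrete, the sketch below places a function in a dedicated section and marks it `noinline` so that the optimizer does not copy its body into callers located elsewhere. This is one possible mitigation and not something this guide prescribes. The section name `.tcm_text`, the `TCM_CODE` macro, and both functions are hypothetical; a real project would use a section that its linker script maps into TCM.

```cpp
#include <cstdint>

// Hypothetical macro: place code in a named section and forbid inlining.
// The section attribute is only applied on ELF targets in this sketch.
#if defined(__ELF__)
#define TCM_CODE __attribute__((section(".tcm_text"), noinline))
#else
#define TCM_CODE __attribute__((noinline))
#endif

// Intended to run from fast memory. Without noinline, LTO could inline this
// body into a caller that lives in a slower region.
TCM_CODE std::int32_t MixSamples(std::int32_t a, std::int32_t b) {
  return (a + b) / 2;
}

// Ordinary code that calls into the manually placed function.
std::int32_t SmoothSamples(const std::int32_t* samples, int count) {
  std::int32_t acc = 0;
  for (int i = 0; i < count; ++i) {
    acc = MixSamples(acc, samples[i]);
  }
  return acc;
}
```

Note that `noinline` only addresses inlining of the marked function into its callers. Inspecting the linker map file after enabling LTO is still the most direct way to check where code actually ended up.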

How to enable LTO#

Pigweed AI summary: This section explains how to enable LTO (Link Time Optimization) on GCC and Clang by passing the flag "-flto" to both the compiler and linker. Additionally, on GCC, the flag "-fdevirtualize-at-ltrans" can be used to enable more aggressive devirtualization.

On GCC and Clang, LTO is enabled by passing -flto to both the compiler and the linker. On GCC, -fdevirtualize-at-ltrans enables more aggressive devirtualization.

Our recommendation#

Pigweed AI summary: The recommendation is to disable LTO unless it is absolutely necessary due to lack of space. If enabling LTO, it is important to thoroughly test the resulting binary and ensure that crash reports are still useful for the product.

Disable LTO unless absolutely necessary, e.g. due to lack of space.
When enabling LTO, carefully and thoroughly test the resulting binary.
Check that crash reports are still useful under LTO for your product.

Disabling Scoped Static Initialization Locks#

Pigweed AI summary: The article discusses the issue of thread-safe initialization of scoped static objects in C++11, which is required but often not implemented properly on embedded targets. The use of guard variables to detect reentrant initialization can cause crashes, and the global lock provided by GCC and Clang may not work for embedded targets. The option of disabling the global lock with "-fno-threadsafe-statics" is available, but caution must be taken as it may affect the thread safety of other statics.

C++11 requires that scoped static objects are initialized in a thread-safe manner. This means that the initialization of scoped statics, as used by Meyers singletons, must be thread-safe. Unfortunately, this is rarely the case on embedded targets. For example, with GCC on an ARM Cortex-M device, if you test for this you will discover that the program instead crashes, as reentrant initialization is detected through the use of guard variables.

With GCC and Clang, -fno-threadsafe-statics can be used to remove the global lock, which often does not work for embedded targets. Note that this leaves the guard variables in place which ensure that reentrant initialization continues to crash.

Be careful when using this option in case you are relying on threadsafe initialization of statics and the global locks were functional for your target.
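For reference, this is the kind of code the option affects. The `Logger` class is a hypothetical example, and the comments describe typical GCC and Clang behavior under the Itanium C++ ABI rather than a guarantee for every toolchain.

```cpp
#include <cstdint>

class Logger {
 public:
  // Meyers singleton: a function-local ("scoped") static that is
  // initialized the first time control passes through its declaration.
  static Logger& Instance() {
    // When this initialization cannot be done at compile time, the compiler
    // emits a guard variable check here. By default it also calls
    // __cxa_guard_acquire / __cxa_guard_release around the constructor;
    // -fno-threadsafe-statics drops those calls and keeps the guard.
    static Logger instance;
    return instance;
  }

  std::uint32_t NextSequence() { return ++sequence_; }

 private:
  Logger() : sequence_(0) {}

  std::uint32_t sequence_;
};
```

Comparing the size report or disassembly of `Logger::Instance` with and without the flag is a quick way to see how much the locking calls cost on your target.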

Triaging Unexpectedly Linked In Functions#

Pigweed AI summary: This article provides tips for triaging unexpectedly linked in functions. If you are unable to determine why a function is being linked in, you can use the linker with the "--wrap" option to remove the implementation and identify the calling function that can no longer be linked. Additionally, with GCC, you can use "-fcallgraph-info" to inspect the callgraph and determine who is calling what. Symbolizing the address can also help identify the associated object for destructor functions.

Lastly, if you cannot figure out why a function is being linked in, consider:

Using --wrap with the linker to remove the implementation, resulting in a link failure which typically calls out which calling function can no longer be linked.
With GCC, you can use -fcallgraph-info to visualize or otherwise inspect the callgraph to figure out who is calling what.
Sometimes symbolizing the address can resolve what a function is for. For example, if you are using newlib-nano along with -fno-use-cxa-atexit, scoped static destructors are prefixed with __tcf_*. To figure out which object these destructor functions are associated with, you can use llvm-symbolizer or addr2line, which will often print out the related object’s name.