Memory Profiling and Optimization

Introduction

Cortex-M Program Image and Memory

In this codelab, we first examine the structure of an Mbed OS Cortex-M program image and the Mbed OS memory model. We then investigate different memory profiling techniques for Mbed OS programs and illustrate some classic issues that arise with static or dynamic memory usage.

What you’ll build

  • You will make modifications in your BikeComputer program for a better understanding of its program image and memory map.
  • You will instrument your BikeComputer program for performing dynamic memory analysis.
  • You will modify the BikeComputer program for creating memory issues on purpose.

What you’ll learn

  • You will understand how a Cortex-M program image is made and how it is used for starting your program on a Cortex-M device.
  • You will understand the boot sequence of an Mbed OS program and how the program memory is configured and initialized.
  • You will understand how the memory of an Mbed OS program is organized in RAM and how to trace dynamic memory allocations.

What you’ll need

  • Mbed Studio for developing and debugging your program in C++.
  • All the BikeComputer and multi-tasking codelabs are prerequisites for this codelab.

The Program Image

A Cortex-M program image or executable file (e.g. the .elf file on your computer) refers to a piece of code that is ready to execute. The image can occupy up to 512 MiB of memory space, ranging from address 0x00000000 to address 0x1FFFFFFF, as shown in Figure 1 for the Cortex-M7 architecture. The code memory map of the STM32H747 MCU is shown in Figure 2.

Cortex-M7 system memory map

Figure 1: Cortex-M7 System Address Map

STM32H747 code memory map

Figure 2: STM32H747 Code Memory Map

The program image is usually stored in non-volatile memory such as on-chip Flash memory. It is normally kept separate from the program data, which is allocated in SRAM (the data region of the memory map).

To build the program image, the linker uses a scatter file that defines the memory regions of the image. The scatter file of your target device (the “stm32h747xI_CM7.sct” file) is shown below. The file is short and readable: you can recognize, for instance, the definitions of the ROM and RAM regions:

STM32H747I scatter file
targets/TARGET_STM/TARGET_STM32H7/TARGET_STM32H747xI/TARGET_STM32H747xI_CM7/TOOLCHAIN_ARM/stm32h747xI_CM7.sct
#! armclang -E --target=arm-arm-none-eabi -x c -mcpu=cortex-m7
; Scatter-Loading Description File
;
; SPDX-License-Identifier: BSD-3-Clause
;******************************************************************************
;* @attention
;*
;* Copyright (c) 2016-2020 STMicroelectronics.
;* All rights reserved.
;*
;* This software component is licensed by ST under BSD 3-Clause license,
;* the "License"; You may not use this file except in compliance with the
;* License. You may obtain a copy of the License at:
;*                        opensource.org/licenses/BSD-3-Clause
;*
;******************************************************************************

#include "../cmsis_nvic.h"

#if !defined(MBED_APP_START)
  #define MBED_APP_START  MBED_ROM_START
#endif

#if !defined(MBED_APP_SIZE)
  #define MBED_APP_SIZE  MBED_ROM_SIZE
#endif

#if !defined(MBED_CONF_TARGET_BOOT_STACK_SIZE)
/* This value is normally defined by the tools to 0x1000 for bare metal and 0x400 for RTOS */
#if defined(MBED_BOOT_STACK_SIZE)
#define MBED_CONF_TARGET_BOOT_STACK_SIZE MBED_BOOT_STACK_SIZE
#else
#define MBED_CONF_TARGET_BOOT_STACK_SIZE 0x400
#endif
#endif

/* Round up VECTORS_SIZE to 8 bytes */
#define VECTORS_SIZE  (((NVIC_NUM_VECTORS * 4) + 7) AND ~7)

LR_IROM1  MBED_APP_START  MBED_APP_SIZE  {

  ER_IROM1  MBED_APP_START  MBED_APP_SIZE  {
    *.o (RESET, +First)
    *(InRoot$$Sections)
    .ANY (+RO)
  }

  RW_IRAM1  (MBED_RAM_START)  {  ; RW data
    .ANY (+RW +ZI)
  }

  ARM_LIB_HEAP  AlignExpr(+0, 16)  EMPTY  (MBED_RAM_START + MBED_RAM_SIZE - MBED_CONF_TARGET_BOOT_STACK_SIZE - AlignExpr(ImageLimit(RW_IRAM1), 16))  { ; Heap growing up
  }

  ARM_LIB_STACK  (MBED_RAM_START + MBED_RAM_SIZE)  EMPTY  -MBED_CONF_TARGET_BOOT_STACK_SIZE  { ; Stack region growing down
  }

  RW_DMARxDscrTab 0x30040000 0x60 {
    *(.RxDecripSection)
  }
  RW_DMATxDscrTab 0x30040100 0x140 {
    *(.TxDecripSection)
  }
  RW_Rx_Buffb 0x30040400 0x1800 {
    *(.RxArraySection)
  }
  RW_Eth_Ram 0x30044000 0x4000 {
    *(.ethusbram)
  }

}

Based on this information, the linker produces a program image (CODE region) that corresponds to the map depicted in the figure above. This program image can be better understood by analyzing the elf file produced by the linker. A number of tools exist for analyzing elf files, and using them is beyond the scope of this codelab. It is however useful to detail the structure of the program image using the output of one of these tools, namely the Keil fromelf program (documented on Keil FromElf). The full file showing an example of program image analysis for the BikeComputer program is available here.

From this file, we can observe that:

  • The Flash memory section containing the program code (denoted “ER_IROM1”) starts at address 0x0800_0000 and has a size of 338400 bytes. The “SHT_EXECINSTR” attribute means that this section contains executable machine instructions. We can check that this corresponds to the Flash memory bank 1 section (part of the Code section) of the target device as shown in Figure 2.
** Section #1 'ER_IROM1' (SHT_PROGBITS) [SHF_ALLOC + SHF_EXECINSTR]
    Size   : 338400 bytes (alignment 8)
    Address: 0x08000000
  • The code region starts with the vector table, and the first entry in the vector table is the initial value of the main stack pointer. As explained in detail below, the first thing the ARM processor does upon starting is to fetch the word at address 0x0000_0000 (or 0x0800_0000 for this target) and use it as the Stack Pointer value. In our case, this value is 0x2408_0000, which is the end (highest address) of the stack region (recall that the stack grows downwards, with a size of 1024 bytes in this case):
** Section #1 'ER_IROM1' (SHT_PROGBITS) [SHF_ALLOC + SHF_EXECINSTR]
    Size   : 338400 bytes (alignment 8)
    Address: 0x08000000

    $d.realdata
    RESET
    __Vectors
        0x08000000:    24080000    ...$    DCD    604504064
...

** Section #5 'ARM_LIB_STACK' (SHT_NOBITS) [SHF_ALLOC + SHF_WRITE]
    Size   : 1024 bytes (alignment 4)
    Address: 0x2407fc00

...

    3091  Image$$ARM_LIB_STACK$$ZI$$Base
                                    0x2407fc00   Gb  Abs   --   Hi 
    3092  Image$$ARM_LIB_STACK$$ZI$$Limit
                                    0x24080000   Gb  Abs   --   Hi 
  • The next entry in the vector table is the address of the Reset_Handler, which is where execution starts upon reset. For our BikeComputer program, the Reset_Handler resides at 0x0800_04a0. Note however that the word stored at 0x0800_0004 is 0x0800_04a1 rather than 0x0800_04a0. The least significant bit of a vector entry is not part of the address: a value of 1 indicates that the target code uses the Thumb instruction set. The processor therefore treats this bit as 0 when computing the address, so 0x0800_04a1 causes a jump to 0x0800_04a0 (the address of Reset_Handler).
$d.realdata
    RESET
    __Vectors
        0x08000000:    24080000    ...$    DCD    604504064
        0x08000004:    080004a1    ....    DCD    134218913

...

.text
    $v0
    Reset_Handler
        0x080004a0:    4806        .H      LDR      r0,[pc,#24] ; [0x80004bc] = 0x8008841
        0x080004a2:    4780        .G      BLX      r0
        0x080004a4:    4806        .H      LDR      r0,[pc,#24] ; [0x80004c0] = 0x8000299
        0x080004a6:    4700        .G      BX       r0
  • As you can see from the Elf file analysis, there is a lot of code which is surplus to your ‘main’ code. This surplus information includes the startup code and it is required to put the binary elf file into a format which the ARM architecture will be able to execute.
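The first two vector table entries discussed in the list above can be decoded with a few lines of C++. This is a minimal host-side sketch, not part of the BikeComputer code: the names VectorTableHead, handlerAddress, isThumbTarget, stackBase and kBikeComputerVectors are hypothetical, and the values are the ones that appear in the fromelf output above.

```cpp
#include <cstdint>

// First two words of a Cortex-M vector table (hypothetical helper type).
struct VectorTableHead {
    uint32_t initialSp;    // word at offset 0x0: initial main stack pointer
    uint32_t resetVector;  // word at offset 0x4: Reset_Handler address, bit 0 set
};

// Bit 0 of a vector entry flags Thumb code; it is not part of the address.
constexpr uint32_t handlerAddress(uint32_t vectorEntry) {
    return vectorEntry & ~UINT32_C(1);
}

constexpr bool isThumbTarget(uint32_t vectorEntry) {
    return (vectorEntry & UINT32_C(1)) != 0;
}

// The stack grows downwards, so its lowest address is the initial SP minus its size.
constexpr uint32_t stackBase(uint32_t initialSp, uint32_t stackSize) {
    return initialSp - stackSize;
}

// Values copied from the fromelf output of the BikeComputer image.
constexpr VectorTableHead kBikeComputerVectors = {0x24080000, 0x080004a1};
```

With these values, handlerAddress(kBikeComputerVectors.resetVector) evaluates to 0x080004a0 and stackBase(kBikeComputerVectors.initialSp, 1024) to 0x2407fc00, which match the Reset_Handler address and the Image$$ARM_LIB_STACK$$ZI$$Base symbol in the listing.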

The startup code and boot sequence are explained in more detail in the next section.

Exercise Memory Profiling and Optimization/1

For this exercise, you need to:

  • Understand the memory map for the target device available in the reference manual, on pages 136-137.

  • Open the image analysis available in the analysis document.

  • Map the AXI SRAM region described in the reference manual with sections described in the analysis document.

Solution
  • Region “AXI SRAM” is used by different sections.
    ** Section #2 'RW_IRAM1' (SHT_PROGBITS) [SHF_ALLOC + SHF_WRITE]
    Size   : 92 bytes (alignment 4)
    Address: 0x24000000
    ...
    ** Section #3 'RW_IRAM1' (SHT_NOBITS) [SHF_ALLOC + SHF_WRITE]
    Size   : 16860 bytes (alignment 8)
    Address: 0x24000168
    ...
    ** Section #4 'ARM_LIB_HEAP' (SHT_NOBITS) [SHF_ALLOC + SHF_WRITE]
    Size   : 506032 bytes (alignment 4)
    Address: 0x24004350
    ...
    ** Section #5 'ARM_LIB_STACK' (SHT_NOBITS) [SHF_ALLOC + SHF_WRITE]
    Size   : 1024 bytes (alignment 4)
    Address: 0x2407fc00
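The heap address and size in this solution follow directly from the ARM_LIB_HEAP expression of the scatter file. The sketch below redoes that arithmetic on the host; the helper names (alignUp, Region, heapRegion and the k... constants) are hypothetical, and the RAM size is inferred from the stack limit in the listing rather than quoted from a header file.

```cpp
#include <cstdint>

// Round value up to the next multiple of alignment (alignment must be a power of two).
constexpr uint32_t alignUp(uint32_t value, uint32_t alignment) {
    return (value + alignment - 1u) & ~(alignment - 1u);
}

struct Region {
    uint32_t base;
    uint32_t size;
};

// Mirrors the ARM_LIB_HEAP line of the scatter file:
//   base = end of RW_IRAM1 rounded up to 16 bytes
//   size = MBED_RAM_START + MBED_RAM_SIZE - MBED_CONF_TARGET_BOOT_STACK_SIZE - base
constexpr Region heapRegion(uint32_t ramStart, uint32_t ramSize,
                            uint32_t bootStackSize, uint32_t rwImageLimit) {
    return Region{alignUp(rwImageLimit, 16),
                  ramStart + ramSize - bootStackSize - alignUp(rwImageLimit, 16)};
}

// Inputs read from the section listing above (the RAM size is an inference).
constexpr uint32_t kRamStart      = 0x24000000;              // address of RW_IRAM1
constexpr uint32_t kRamSize       = 0x24080000 - kRamStart;  // up to the stack limit
constexpr uint32_t kBootStackSize = 1024;                    // size of ARM_LIB_STACK
constexpr uint32_t kRwImageLimit  = 0x24000168 + 16860;      // end of RW_IRAM1 (ZI part)

constexpr Region kHeap = heapRegion(kRamStart, kRamSize, kBootStackSize, kRwImageLimit);
```

Under these assumptions, kHeap.base is 0x24004350 and kHeap.size is 506032 bytes, the values reported for section #4, and the heap ends exactly where ARM_LIB_STACK begins (0x2407fc00).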
    

The Boot Sequence and Memory Initialization

Upon reset, the Cortex-M processor executes startup code. This code is specific to each platform and toolchain, but it usually consists of

  1. setting the initial SP,
  2. setting the initial PC to the Reset_Handler value,
  3. setting the vector table entries with the exceptions ISR addresses, and
  4. branching to __main in the C library, which eventually calls the main() function of your program.

Note that after reset the Cortex-M processor is in “Thread” mode, its privilege level is “Privileged”, and the active stack is the Main stack.

Before the user main() function is executed, the __main startup function is executed at the start of the binary executable. This function calls other functions and is the real entry point of the user’s program. This __main function is pre-defined (though the programmers can write their own __main) and it is different from the main() function in the user’s C-program.

The __main startup function calls the __rt_entry function, which is defined in the “mbed_boot_arm_std.c” file (located in the cmsis/device/rtos/TOOLCHAIN_ARM_STD folder). This function initializes the stack and heap addresses, initializes and starts Mbed OS - which ultimately calls your main() function.

Load Address vs Execution Address

The BikeComputer program contains application code and data constants. When the compiled code and data are placed into the memory of a microcontroller, we can distinguish between regions whose load address is the same as their execution address and regions for which the two addresses differ. The regions for which the addresses differ require relocation.

In a typical embedded system, all code and data are stored in some non-volatile memory when the system is powered off. However, when the system is powered on, some of the data or code may be moved into system SRAM (volatile memory) before it is executed (if code) or before it is used (if data).

As explained above, at link time, an image of the program is produced. This is the binary executable file which the system can execute. The binary image is typically divided into different segments that are either read-only (containing code and read-only data) or read-write regions (containing data, which can be initialized or zero initialized or uninitialized).

Usually the read-only segment is placed in non-volatile memory and does not have a requirement to be moved from where it is in the memory. We may say that it is executed from where it is, i.e. it is executed in place. To the contrary, the read-write segments must often be moved into the system’s fast read-write memory (e.g. SRAM) before the execution begins.

Hence, some parts of the image reside at the same memory location whether the system is powered off or running. Other parts reside at one location when the system is powered off and are moved to a different location when the system is powered on. These parts must be relocated at startup, and in this case we say that the load address and the execution address are different.

The linker adds to the program the code that the processor executes to move those parts into the system’s SRAM at power-up. This relocation code is executed at startup.

In summary, the full sequence of the execution of the program may depend on the specific platform and toolchain, but it is always something like this:

  1. The Stack Pointer (SP) is loaded with the contents of memory at 0x0000_0000 (0x0800_0000 for our target).
  2. The Program Counter is loaded with the address of Reset_Handler, which is read from memory location 0x0000_0004 (0x0800_0004 for our target).
  3. Reset_Handler is platform specific but it is mainly a jump to __main. On our target, the Reset_Handler function calls the SystemInit function (at address 0x08008840) and then the __main function (at address 0x08000298).
  4. __main first calls __scatterload. The role of the __scatterload function is the initialization of memory (__scatterload_null), the initialization of ZI (Zero Initialization) regions to 0 (__scatterload_zero_init) and the load of regions requiring relocation to execution addresses.
  5. __main then calls __rt_entry, as explained above.
  6. The main() function from the user application is then called. On Mbed OS, the main() function is executed in a thread called the main thread.
  7. __rt_lib_shutdown is called when the main() function exits - which usually never happens.
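The zero-initialization and relocation performed in step 4 can be pictured with a simplified host-side model. This is only a sketch of the idea: RegionEntry and initRegions are hypothetical names, the real region table generated by the ARM linker has a different layout, and decompression of regions is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical region descriptor, for illustration only.
struct RegionEntry {
    const uint8_t* loadAddress;  // where the initial values are stored (nullptr for ZI)
    uint8_t* execAddress;        // where the program expects the region at run time
    size_t size;                 // region size in bytes
};

// Walk the table once before main(): zero the ZI regions, copy the regions whose
// load address differs from their execution address, leave the others in place.
void initRegions(const RegionEntry* table, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        const RegionEntry& entry = table[i];
        assert(entry.execAddress != nullptr);
        if (entry.loadAddress == nullptr) {
            std::memset(entry.execAddress, 0, entry.size);
        } else if (entry.loadAddress != entry.execAddress) {
            std::memcpy(entry.execAddress, entry.loadAddress, entry.size);
        }
    }
}
```

In this model, a region such as the initialized data of the program would have a load address in Flash and an execution address in SRAM, while code executed in place would have identical load and execution addresses and would be skipped.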

The startup and initialization steps are explained in full detail in the following document. Note that these steps apply to Cortex-M processors using the ARM toolchain and may be different when using another MCU or another toolchain.

Exercise Memory Profiling and Optimization/2

For this exercise, you need to:

  • Read the document explaining the C library startup and understand where the __main function is called in the boot sequence found in the elf file.

  • Find the definition of the __rt_entry() function in the Mbed OS library.

  • From the __rt_entry() code, understand the subsequent initialization steps, for instance how the heap is initialized.

  • Understand where and how the call to your main() program function is made and how the stack for the main thread is set.

Solution
  • The __rt_entry() is defined in the “mbed-os\cmsis\device\rtos\TOOLCHAIN_ARM_STD\mbed_boot_arm_std.c” file.
  • The __rt_entry() function initializes the stack and heap start pointers and sizes. It then calls mbed_init() and ultimately mbed_rtos_start.
  • The mbed_rtos_start() function creates a thread named "main" that is launched by executing the mbed_start function. In the mbed_start function, the user main() function is ultimately called.
  • The mbed_rtos_start() function ultimately calls the osKernelStart function that launches the scheduler.

Static Memory Analysis Using memap

For understanding how the program image is structured and how the memory space is used, Mbed OS provides a simple utility tool called memap that displays static memory information required by any Mbed OS application. This information is produced by analyzing the memory map file previously generated by your toolchain. Memap is automatically run at the end of each build operation and you can read the result in the Output window as shown below:

Memap output

| Module                                 |       .text |   .data |      .bss |
|----------------------------------------|-------------|---------|-----------|
...
| advdembsof_library\display             |  165494(+0) |   0(+0) |   528(+0) |
| advdembsof_library\sensors             |     735(+0) |   0(+0) |     0(+0) |
| advdembsof_library\utils               |    1077(+0) |  24(+0) |     0(+0) |
...
| common\sensor_device.o                 |      86(+0) |   0(+0) |     0(+0) |
| common\speedometer.o                   |     698(+0) |   0(+0) |     0(+0) |
| disco_h747i\CM7                        |      52(+0) |   0(+0) |     0(+0) |
| disco_h747i\Drivers                    |    4801(+0) |   0(+0) |   585(+0) |
| disco_h747i\Wrappers                   |   77320(+0) |   0(+0) |   517(+0) |
| main.o                                 |     111(+0) |   0(+0) |     0(+0) |
...
| multi_tasking\bike_system.o            |    2371(+0) |   0(+0) |     0(+0) |
| multi_tasking\gear_device.o            |     774(+0) |   0(+0) |     0(+0) |
| multi_tasking\pedal_device.o           |     776(+0) |   0(+0) |     0(+0) |
| multi_tasking\reset_device.o           |      46(+0) |   0(+0) |     0(+0) |
| Subtotals                              | 334292(-54) | 357(+0) | 17836(+0) |

Total Static RAM memory (data + bss): 18193(+0) bytes
Total Flash memory (text + data): 334649(-54) bytes

In the output above, the meaning of the .text, .data and .bss sections is the following:

  • .text: application code and constants, located in Flash.
  • .data: nonzero initialized variables; allocated in both RAM and Flash memory (the initial values are copied from Flash to RAM at startup).
  • .bss: uninitialized variables and variables initialized to zero; allocated in RAM.

Note that in this view, the numbers in parentheses (e.g. (-54)) indicate the changes in sizes (number of bytes) since the last build. This is a very useful tool for understanding the changes introduced in the code since the last build.
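The two totals printed by memap can be recomputed from the three section sizes. The sketch below does so for the "Subtotals" row of the report above; SectionSizes, staticRamBytes, flashBytes and kSubtotals are hypothetical names used for illustration only.

```cpp
#include <cstdint>

// Section sizes in bytes, as reported by memap (hypothetical helper type).
struct SectionSizes {
    uint32_t text;
    uint32_t data;
    uint32_t bss;
};

// .data and .bss both occupy RAM at run time.
constexpr uint32_t staticRamBytes(const SectionSizes& s) {
    return s.data + s.bss;
}

// .data also occupies Flash, because its initial values are stored in the image.
constexpr uint32_t flashBytes(const SectionSizes& s) {
    return s.text + s.data;
}

// "Subtotals" row of the memap report shown above.
constexpr SectionSizes kSubtotals = {334292, 357, 17836};
```

Here staticRamBytes(kSubtotals) gives 18193 and flashBytes(kSubtotals) gives 334649, the two totals printed at the end of the report. The .data section is counted twice because it needs room both in Flash (initial values) and in RAM (live variables).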

For a better understanding of what memory goes where, add the following code in your “main.cpp” file

main_modified.cpp
...
const char szMsg[] = "This is a test message";
static constexpr uint8_t size = 10;
uint32_t randomArray[size] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
uint32_t randomNumber = 0;
...

int main() {
  ...

  tr_info(szMsg);
  for (uint8_t i = 0; i < size; i++) {
    randomArray[i] = rand();
    tr_info("This is a random number %d", randomArray[i]);
  }
  randomNumber = rand();
  tr_info("This is a random number %d", randomNumber);

  ...
}

Exercise Memory Profiling and Optimization/3

Observe the change in the memory map for each individual change documented above, for the “main.o” object file. Look at how each .text, .data and .bss section is modified for each change and give an explanation.

Solution
  • szMsg: the size of the .text section grows by 44 bytes. Both the additional call to tr_info and the szMsg are allocated in the code section.
  • randomArray: the size of the .text/.data sections grows by 81/40 bytes. The additional code goes into the .text region, while randomArray goes into the .data section (10 x uint32_t = 40 bytes) (nonzero initialized variables).
  • randomNumber: the size of the .text/.bss sections grows by 61/4 bytes. The additional code goes into the .text region, while randomNumber goes into the .bss section (1 x uint32_t = 4 bytes) (zero initialized variables).

For observing the changes in the program image for the code, we may perform an analysis of the “bike_computer.elf” file. If we do so, what we can observe is the following:

  • The constant string szMsg is stored in the constant data section of the program image

    address     size       variable name                            type
    0x0803fee4  0x17       szMsg                                    array[23] of const char
    
    _ZL5szMsg
        0x0803fee4:    73696854    This    DCD    1936287828
        0x0803fee8:    20736920     is     DCD    544434464
        0x0803feec:    65742061    a te    DCD    1702109281
        0x0803fef0:    6d207473    st m    DCD    1830843507
        0x0803fef4:    61737365    essa    DCD    1634956133
        0x0803fef8:    6567        ge      DCW    25959
        0x0803fefa:    00          .       DCB    0
    

  • The random integer array randomArray is allocated in the initialized static data section (RAM section).

    address     size       variable name                            type
    0x24000160  0x28       randomArray                              array[10] of uint32_t
    

The variable randomArray is initialized with the values defined in the “main.cpp” file at startup. This is done in the __scatterload function that goes through the region table and initializes the various execution-time regions. As already mentioned, this function initializes the Zero Initialized (ZI) regions to zero and copies or decompresses the non-root code and data region from their load-time locations to the execute-time regions.

  • The global variable randomNumber is allocated in the .bss section.
    address     size       variable name                            type
    0x24003d88  0x4        randomNumber                             uint32_t
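To check quickly which memory area an address from such a listing belongs to, a small range test is enough. The sketch below is a host-side illustration with hypothetical names (MemoryArea, inRange, areaOf). The Flash range is the ER_IROM1 extent from the earlier fromelf output, so it is only approximate for a rebuilt image, and the AXI SRAM range is inferred from the RW_IRAM1 start address and the stack limit.

```cpp
#include <cstdint>

enum class MemoryArea { FlashImage, AxiSram, Other };

// Half-open address range [start, start + size).
constexpr bool inRange(uint32_t address, uint32_t start, uint32_t size) {
    return address >= start && address - start < size;
}

// Ranges are assumptions taken from the listings of this codelab:
//   ER_IROM1 : 0x08000000, 338400 bytes
//   AXI SRAM : 0x24000000 up to the stack limit 0x24080000
constexpr MemoryArea areaOf(uint32_t address) {
    return inRange(address, 0x08000000, 338400)    ? MemoryArea::FlashImage
           : inRange(address, 0x24000000, 0x80000) ? MemoryArea::AxiSram
                                                   : MemoryArea::Other;
}
```

With these ranges, the address of szMsg (0x0803fee4) falls in the Flash image, while the addresses of randomArray (0x24000160) and randomNumber (0x24003d88) both fall in AXI SRAM.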
    

Run the memory map analyzer

You can also run the memory map analyzer at any time by running the “memap.py” script with Python and the appropriate set of parameters. The script is located at “MbedStudioInstallDir\library-pipeline\mbed-os\tools\memap.py”, where “MbedStudioInstallDir” stands for your Mbed Studio installation directory. When running the analyzer separately, you may choose the directory depth level used in the memory analysis report (2 by default).

Another more interactive way of displaying the memory map information is available through Linker-Report. This tool allows you to display the memory map information in a visual and interactive way as demonstrated on interactive memory map. Install the utility as documented on Linker-Report and build an interactive memory map of your BikeComputer program. It should look like the BikeComputer Interactive Map.

Reducing Memory Usage by Tuning the Mbed OS Configuration

Both flash memory and RAM sizes are limited on most microcontrollers. Reducing the memory footprint of an application can help you squeeze in more features or reduce cost. One way to do this is to replace standard I/O calls with a smaller implementation.

For the printf function, and in particular if you are using a tracing library with precompiler options, the easiest way of reducing the size of the binary is to exclude all printf calls from a release build. While debugging an application, however, logging is essential. In this case, switching to a version of the stdio library with a reduced footprint is a good alternative. You can do this by changing the printf library used by your application in the “mbed_app.json” file:

mbed_app.json
"target_overrides": {
  "*": {
    "target.printf_lib": "std"
  }
}
vs.

mbed_app.json
"target_overrides": {
  "*": {
      "target.printf_lib": "minimal-printf"
  }
}
You can even further optimize the footprint by enabling floating point in printf only when necessary:

mbed_app.json
"target_overrides": {
  "*": {
      "platform.minimal-printf-enable-floating-point": true,
      "platform.minimal-printf-set-floating-point-max-decimals": 6
  }
}

The minimal-printf library supports both printf and sprintf in 1252 bytes of flash. An interesting comparison of the size of the blinky program compiled with different options is available on minimal-printf.

The memory usage of an application can be further optimized by tuning the Mbed OS configuration to the application’s needs. If an application doesn’t need all the features of Mbed OS, memory usage can be reduced by reducing the number of tasks, by decreasing the thread stack sizes or by disabling user timers. The Mbed OS configuration parameters can be modified in the “mbed_app.json” file. The parameters available for configuration can be listed with the command “mbed compile --config -t ARMC6”. Note that you can also run the command for the “GCC_ARM” toolchain. If you run this command, you should get an output similar to

Available configuration parameters

rtos-api.present = 1 (macro name: "MBED_CONF_RTOS_API_PRESENT")
rtos.evflags-num = 0 (macro name: "MBED_CONF_RTOS_EVFLAGS_NUM")
rtos.idle-thread-stack-size = 512 (macro name: "MBED_CONF_RTOS_IDLE_THREAD_STACK_SIZE")
rtos.idle-thread-stack-size-debug-extra = 128 (macro name: "MBED_CONF_RTOS_IDLE_THREAD_STACK_SIZE_DEBUG_EXTRA")
rtos.idle-thread-stack-size-tickless-extra = 256 (macro name: "MBED_CONF_RTOS_IDLE_THREAD_STACK_SIZE_TICKLESS_EXTRA")
rtos.main-thread-stack-size = 4096 (macro name: "MBED_CONF_RTOS_MAIN_THREAD_STACK_SIZE")
rtos.msgqueue-data-size = 0 (macro name: "MBED_CONF_RTOS_MSGQUEUE_DATA_SIZE")
rtos.msgqueue-num = 0 (macro name: "MBED_CONF_RTOS_MSGQUEUE_NUM")
rtos.mutex-num = 0 (macro name: "MBED_CONF_RTOS_MUTEX_NUM")
rtos.present = 1 (macro name: "MBED_CONF_RTOS_PRESENT")
rtos.semaphore-num = 0 (macro name: "MBED_CONF_RTOS_SEMAPHORE_NUM")
rtos.thread-num = 0 (macro name: "MBED_CONF_RTOS_THREAD_NUM")
rtos.thread-stack-size = 4096 (macro name: "MBED_CONF_RTOS_THREAD_STACK_SIZE")
rtos.thread-user-stack-size = 0 (macro name: "MBED_CONF_RTOS_THREAD_USER_STACK_SIZE")
rtos.timer-num = 0 (macro name: "MBED_CONF_RTOS_TIMER_NUM")
rtos.timer-thread-stack-size = 768 (macro name: "MBED_CONF_RTOS_TIMER_THREAD_STACK_SIZE")

All macros displayed above can be modified in the “mbed_app.json” for an optimized use of the Mbed OS configuration for a specific application. One may for instance reduce the user or main stack size.
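In the listing above, each macro name appears to be derived mechanically from its parameter name: the prefix MBED_CONF_ is added, letters are upper-cased, and the characters '.' and '-' become '_'. The sketch below reproduces that rule on the host. The function configMacroName is hypothetical, and the rule itself is inferred from the listing only, not taken from the Mbed tools.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Builds the macro name that corresponds to a configuration parameter name,
// following the pattern visible in the "mbed compile --config" output.
std::string configMacroName(const std::string& parameter) {
    assert(!parameter.empty());
    std::string macro = "MBED_CONF_";
    for (unsigned char c : parameter) {
        if (c == '.' || c == '-') {
            macro += '_';
        } else {
            macro += static_cast<char>(std::toupper(c));
        }
    }
    return macro;
}
```

For instance, configMacroName("rtos.main-thread-stack-size") returns "MBED_CONF_RTOS_MAIN_THREAD_STACK_SIZE", which is the macro listed for that parameter. This can help when searching the Mbed OS sources for the place where a given parameter is used.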

Exercise Memory Profiling and Optimization/4

Compile your application with the standard printf library and with the minimal-printf library, and compare the sizes of the resulting applications.

Solution

You should observe a reduction of the .text region of approximately 3000-4000 bytes when compiling with minimal-printf rather than std. Other sections are not impacted.

Runtime Memory Tracing

Static memory analysis is a necessary and powerful way to understand how the program memory is organized at compile time. However, it is also very useful to analyze how embedded software handles dynamic memory, both on the heap and on the stack. A program that behaves poorly in terms of dynamic memory allocation may become unstable and eventually crash.

With Mbed OS, the developer can use memory statistics functions to capture heap use, cumulative stack use or stack use for each thread at runtime. To enable memory use monitoring, you must enable the following Mbed OS configuration options:

mbed_app.json
{
  "target_overrides": {
      "*": {
          "platform.heap-stats-enabled": true,
          "platform.stack-stats-enabled": true
      }
  }
}

Alternatively, you may also enable all Mbed OS stats at once:

mbed_app.json
{
  "target_overrides": {
      "*": {
          "platform.all-stats-enabled": true
      }
  }
}
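Before instrumenting the program, it helps to be clear about what the heap counters mean. The host-side model below mimics three of them; the field names follow mbed_stats_heap_t, but HeapCounters, afterAlloc, afterFree and growth are hypothetical helpers, not Mbed OS functions.

```cpp
#include <cstdint>

// Host-side model of three heap counters (illustration only).
struct HeapCounters {
    uint32_t current_size;  // bytes allocated right now
    uint32_t max_size;      // highest value current_size has reached
    uint32_t total_size;    // cumulative sum of bytes ever allocated
};

constexpr HeapCounters afterAlloc(HeapCounters c, uint32_t bytes) {
    c.current_size += bytes;
    c.total_size += bytes;
    if (c.current_size > c.max_size) {
        c.max_size = c.current_size;
    }
    return c;
}

constexpr HeapCounters afterFree(HeapCounters c, uint32_t bytes) {
    c.current_size -= bytes;
    return c;
}

// Growth between two snapshots, clamped to 0 so that a decrease does not wrap
// around in unsigned arithmetic.
constexpr uint32_t growth(uint32_t previous, uint32_t current) {
    return current > previous ? current - previous : 0;
}
```

In this model, allocating 100 bytes, then 50 bytes, then freeing the first 100 leaves current_size at 50 while max_size and total_size both stay at 150. The clamped growth helper matters when comparing snapshots, because current_size can decrease between two checks whereas max_size and total_size cannot.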

Once memory statistics are enabled, you may instrument the code and perform memory checks at regular intervals or on request. This can be implemented with the help of the MemoryLogger class provided with the “advembsof” library.

MemoryLogger declaration
utils/memory_logger.hpp
// Copyright 2022 Haute école d'ingénierie et d'architecture de Fribourg
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

/****************************************************************************
 * @file memory_logger.hpp
 * @author Serge Ayer <serge.ayer@hefr.ch>
 *
 * @brief Memory logger header file
 *
 * @date 2023-08-20
 * @version 1.0.0
 ***************************************************************************/

#pragma once

#include "mbed.h"

namespace advembsof {

#if defined(MBED_ALL_STATS_ENABLED)

class MemoryLogger {
   public:
    // methods used by owners
    void getAndPrintStatistics();
    void printDiffs();
    void printRuntimeMemoryMap();

    void getAndPrintHeapStatistics();
    void getAndPrintStackStatistics();
    void getAndPrintThreadStatistics();

   private:
    // data members
    static constexpr uint8_t kMaxThreadInfo         = 10;
    mbed_stats_heap_t _heapInfo                     = {0};
    mbed_stats_stack_t _stackInfo[kMaxThreadInfo]   = {0};
    mbed_stats_stack_t _globalStackInfo             = {0};
    mbed_stats_thread_t _threadInfo[kMaxThreadInfo] = {0};
};

#endif  // MBED_ALL_STATS_ENABLED

}  // namespace advembsof
MemoryLogger implementation
utils/memory_logger.cpp
// Copyright 2022 Haute école d'ingénierie et d'architecture de Fribourg
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

/****************************************************************************
 * @file memory_logger.cpp
 * @author Serge Ayer <serge.ayer@hefr.ch>
 *
 * @brief Memory logger implementation
 *
 * @date 2023-08-20
 * @version 1.0.0
 ***************************************************************************/

#include "memory_logger.hpp"

#include "mbed_trace.h"
#if MBED_CONF_MBED_TRACE_ENABLE
#define TRACE_GROUP "MemoryLogger"
#endif  // MBED_CONF_MBED_TRACE_ENABLE

#if defined(MBED_ALL_STATS_ENABLED)
extern unsigned char* mbed_stack_isr_start;
extern uint32_t mbed_stack_isr_size;
extern unsigned char* mbed_heap_start;
extern uint32_t mbed_heap_size;
#endif  // MBED_ALL_STATS_ENABLED

namespace advembsof {

#if defined(MBED_ALL_STATS_ENABLED)

void MemoryLogger::printDiffs() {
    {
        tr_debug("MemoryStats (Heap):");
        mbed_stats_heap_t heapInfo = {0};
        mbed_stats_heap_get(&heapInfo);
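        // Compute the difference only when usage grew: an unsigned subtraction
        // would wrap around if the heap usage decreased since the last call.
        uint32_t currentSizeDiff = heapInfo.current_size > _heapInfo.current_size
                                       ? heapInfo.current_size - _heapInfo.current_size
                                       : 0;
        if (currentSizeDiff > 0) {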
            tr_debug("\tBytes allocated increased by %" PRIu32 " to %" PRIu32 " bytes",
                     currentSizeDiff,
                     heapInfo.current_size);
        }
        uint32_t maxSizeDiff = heapInfo.max_size - _heapInfo.max_size;
        if (maxSizeDiff > 0) {
            tr_debug("\tMax bytes allocated at a given time increased by %" PRIu32
                     " to %" PRIu32 " bytes (max heap size is %" PRIu32 " bytes)",
                     maxSizeDiff,
                     heapInfo.max_size,
                     heapInfo.reserved_size);
        }
        _heapInfo = heapInfo;
    }
    {
        mbed_stats_stack_t globalStackInfo = {0};
        mbed_stats_stack_get(&globalStackInfo);
        tr_debug("Cumulative Stack Info:");
        uint32_t maxSizeDiff = globalStackInfo.max_size - _globalStackInfo.max_size;
        if (maxSizeDiff > 0) {
            tr_debug("\tMaximum number of bytes used on the stack increased by %" PRIu32
                     " to %" PRIu32 " bytes (stack size is %" PRIu32 " bytes)",
                     maxSizeDiff,
                     globalStackInfo.max_size,
                     globalStackInfo.reserved_size);
        }
        uint32_t stackCntDiff = globalStackInfo.stack_cnt - _globalStackInfo.stack_cnt;
        if (stackCntDiff > 0) {
            tr_debug("\tNumber of stacks stats accumulated increased by %" PRIu32
                     " to %" PRIu32 "",
                     stackCntDiff,
                     globalStackInfo.stack_cnt);
        }
        _globalStackInfo = globalStackInfo;

        mbed_stats_stack_t stackInfo[kMaxThreadInfo] = {0};
        mbed_stats_stack_get_each(stackInfo, kMaxThreadInfo);
        tr_debug("Thread Stack Info:");
        for (uint32_t i = 0; i < kMaxThreadInfo; i++) {
            if (stackInfo[i].thread_id != 0) {
                for (uint32_t j = 0; j < kMaxThreadInfo; j++) {
                    if (stackInfo[i].thread_id == _stackInfo[j].thread_id) {
                        maxSizeDiff = stackInfo[i].max_size - _stackInfo[j].max_size;
                        if (maxSizeDiff > 0) {
                            tr_debug("\tThread: %" PRIu32 "", j);
                            tr_debug(
                                "\t\tThread Id: 0x%08" PRIx32 " with name %s",
                                _stackInfo[j].thread_id,
                                osThreadGetName((osThreadId_t)_stackInfo[j].thread_id));
                            tr_debug(
                                "\t\tMaximum number of bytes used on the stack increased "
                                "by %" PRIu32 " to %" PRIu32
                                " bytes (stack size is %" PRIu32 " bytes)",
                                maxSizeDiff,
                                stackInfo[i].max_size,
                                stackInfo[i].reserved_size);
                        }
                        _stackInfo[j] = stackInfo[i];
                    }
                }
            }
        }
    }
}

void MemoryLogger::getAndPrintHeapStatistics() {
    tr_debug("MemoryStats (Heap):");
    mbed_stats_heap_get(&_heapInfo);
    tr_debug("\tBytes allocated currently: %" PRIu32 "", _heapInfo.current_size);
    tr_debug("\tMax bytes allocated at a given time: %" PRIu32 "", _heapInfo.max_size);
    tr_debug("\tCumulative sum of bytes ever allocated: %" PRIu32 "",
             _heapInfo.total_size);
    tr_debug("\tCurrent number of bytes allocated for the heap: %" PRIu32 "",
             _heapInfo.reserved_size);
    tr_debug("\tCurrent number of allocations: %" PRIu32 "", _heapInfo.alloc_cnt);
    tr_debug("\tNumber of failed allocations: %" PRIu32 "", _heapInfo.alloc_fail_cnt);
}

void MemoryLogger::getAndPrintStackStatistics() {
    mbed_stats_stack_get(&_globalStackInfo);
    tr_debug("Cumulative Stack Info:");
    tr_debug("\tMaximum number of bytes used on the stack: %" PRIu32 "",
             _globalStackInfo.max_size);
    tr_debug("\tCurrent number of bytes allocated for the stack: %" PRIu32 "",
             _globalStackInfo.reserved_size);
    tr_debug("\tNumber of stacks stats accumulated in the structure: %" PRIu32 "",
             _globalStackInfo.stack_cnt);

    mbed_stats_stack_get_each(_stackInfo, kMaxThreadInfo);
    tr_debug("Thread Stack Info:");
    for (uint32_t i = 0; i < kMaxThreadInfo; i++) {
        if (_stackInfo[i].thread_id != 0) {
            tr_debug("\tThread: %" PRIu32 "", i);
            tr_debug("\t\tThread Id: 0x%08" PRIx32 " with name %s",
                     _stackInfo[i].thread_id,
                     osThreadGetName((osThreadId_t)_stackInfo[i].thread_id));
            tr_debug("\t\tMaximum number of bytes used on the stack: %" PRIu32 "",
                     _stackInfo[i].max_size);
            tr_debug("\t\tCurrent number of bytes allocated for the stack: %" PRIu32 "",
                     _stackInfo[i].reserved_size);
            tr_debug("\t\tNumber of stacks stats accumulated in the structure: %" PRIu32
                     "",
                     _stackInfo[i].stack_cnt);
        }
    }
}

void MemoryLogger::getAndPrintThreadStatistics() {
    static const char* state[] = {"Ready", "Running", "Waiting"};
    mbed_stats_thread_get_each(_threadInfo, kMaxThreadInfo);
    tr_debug("Thread Info:");
    for (uint32_t i = 0; i < kMaxThreadInfo; i++) {
        if (_threadInfo[i].id != 0) {
            tr_debug("\tThread: %" PRIu32 "", i);
            tr_debug("\t\tThread Id: 0x%08" PRIx32 " with name %s, state %s, priority %" PRIu32 "",
                     _threadInfo[i].id,
                     _threadInfo[i].name,
                     state[_threadInfo[i].state - 1],
                     _threadInfo[i].priority);
            tr_debug("\t\tStack size %" PRIu32 " (free bytes remaining %" PRIu32 ")",
                     _threadInfo[i].stack_size,
                     _threadInfo[i].stack_space);
        }
    }
}

void MemoryLogger::getAndPrintStatistics() {
    getAndPrintHeapStatistics();
    getAndPrintStackStatistics();
    getAndPrintThreadStatistics();
}

void MemoryLogger::printRuntimeMemoryMap() {
    // defined in rtx_thread.c
    // uint32_t osThreadEnumerate (osThreadId_t *thread_array, uint32_t array_items)
    tr_debug("Runtime Memory Map:");
    osThreadId_t threadIdArray[kMaxThreadInfo] = {0};
    uint32_t nbrOfThreads = osThreadEnumerate(threadIdArray, kMaxThreadInfo);
    for (uint32_t threadIndex = 0; threadIndex < nbrOfThreads; threadIndex++) {
        osRtxThread_t* pThreadCB =
            // cppcheck-suppress cstyleCast
            (osRtxThread_t*)threadIdArray[threadIndex];  // NOLINT(readability/casting)
        uint8_t state             = pThreadCB->state & osRtxThreadStateMask;
        // thread states are enumerated values, not bit flags: compare with ==
        const char* szThreadState = (state == osThreadInactive)     ? "Inactive"
                                    : (state == osThreadReady)      ? "Ready"
                                    : (state == osThreadRunning)    ? "Running"
                                    : (state == osThreadBlocked)    ? "Blocked"
                                    : (state == osThreadTerminated) ? "Terminated"
                                                                    : "Unknown";
        tr_debug("\t thread with name %s, stack_start: %p, stack_end: %p, size: %" PRIu32
                 ", priority: %" PRIu8 ", state: %s",
                 pThreadCB->name,
                 pThreadCB->stack_mem,
                 // cppcheck-suppress cstyleCast
                 (char*)pThreadCB->stack_mem +  // NOLINT(readability/casting)
                     pThreadCB->stack_size,
                 pThreadCB->stack_size,
                 pThreadCB->priority,
                 szThreadState);
    }
    tr_debug("\t mbed_heap_start: %p, mbed_heap_end: %p, size: %" PRIu32 "",
             mbed_heap_start,
             (mbed_heap_start + mbed_heap_size),
             mbed_heap_size);
    tr_debug("\t mbed_stack_isr_start: %p, mbed_stack_isr_end: %p, size: %" PRIu32 "",
             mbed_stack_isr_start,
             (mbed_stack_isr_start + mbed_stack_isr_size),
             mbed_stack_isr_size);
}

#endif  // MBED_ALL_STATS_ENABLED

}  // namespace advembsof

The BikeSystem class can hold a MemoryLogger attribute for logging the memory state of the program. It can call the MemoryLogger::getAndPrintStatistics() method in the BikeSystem::start() method and then call the MemoryLogger::printDiffs() method at regular intervals. By doing so, one should get the following output from the memory logger at startup:

Memory logger: getAndPrintStatistics
[DBG ][MemoryLogger]: MemoryStats (Heap):
[DBG ][MemoryLogger]:   Bytes allocated currently: 8724
[DBG ][MemoryLogger]:   Max bytes allocated at a given time: 8724
[DBG ][MemoryLogger]:   Cumulative sum of bytes ever allocated: 8724
[DBG ][MemoryLogger]:   Current number of bytes allocated for the heap: 505772
[DBG ][MemoryLogger]:   Current number of allocations: 11
[DBG ][MemoryLogger]:   Number of failed allocations: 0
[DBG ][MemoryLogger]: Cumulative Stack Info:
[DBG ][MemoryLogger]:   Maximum number of bytes used on the stack: 3328
[DBG ][MemoryLogger]:   Current number of bytes allocated for the stack: 13952
[DBG ][MemoryLogger]:   Number of stacks stats accumulated in the structure: 4
[DBG ][MemoryLogger]: Thread Stack Info:
[DBG ][MemoryLogger]:   Thread: 0
[DBG ][MemoryLogger]:           Thread Id: 0x240036b8 with name main
[DBG ][MemoryLogger]:           Maximum number of bytes used on the stack: 2648
[DBG ][MemoryLogger]:           Current number of bytes allocated for the stack: 8192
[DBG ][MemoryLogger]:           Number of stacks stats accumulated in the structure: 1
[DBG ][MemoryLogger]:   Thread: 1
[DBG ][MemoryLogger]:           Thread Id: 0x24003630 with name rtx_idle
[DBG ][MemoryLogger]:           Maximum number of bytes used on the stack: 320
[DBG ][MemoryLogger]:           Current number of bytes allocated for the stack: 896
[DBG ][MemoryLogger]:           Number of stacks stats accumulated in the structure: 1
[DBG ][MemoryLogger]:   Thread: 2
[DBG ][MemoryLogger]:           Thread Id: 0x24003674 with name rtx_timer
[DBG ][MemoryLogger]:           Maximum number of bytes used on the stack: 96
[DBG ][MemoryLogger]:           Current number of bytes allocated for the stack: 768
[DBG ][MemoryLogger]:           Number of stacks stats accumulated in the structure: 1
[DBG ][MemoryLogger]:   Thread: 3
[DBG ][MemoryLogger]:           Thread Id: 0x240026c0 with name deferredISRThread
[DBG ][MemoryLogger]:           Maximum number of bytes used on the stack: 264
[DBG ][MemoryLogger]:           Current number of bytes allocated for the stack: 4096
[DBG ][MemoryLogger]:           Number of stacks stats accumulated in the structure: 1
...

and the following output when printing memory changes:

MemoryLogger: printDiffs
[DBG ][MemoryLogger]: MemoryStats (Heap):
[DBG ][MemoryLogger]:   Bytes allocated increased by 16 to 8740 bytes
[DBG ][MemoryLogger]:   Max bytes allocated at a given time increased by 40 to 8764 bytes (max heap size is 505772 bytes)
[DBG ][MemoryLogger]: Cumulative Stack Info:
[DBG ][MemoryLogger]:   Maximum number of bytes used on the stack increased by 280 to 3608 bytes (stack size is 13952 bytes)
[DBG ][MemoryLogger]: Thread Stack Info:
[DBG ][MemoryLogger]:   Thread: 0
[DBG ][MemoryLogger]:           Thread Id: 0x240036b8 with name main
[DBG ][MemoryLogger]:           Maximum number of bytes used on the stack increased by 280 to 2928 bytes (stack size is 8192 bytes)

By performing a detailed dynamic memory analysis, it is then possible to optimize some parameters, such as reducing the allocated stack size for a given thread or optimizing the use of the heap.
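
As an illustration of such an optimization, the sketch below derives a candidate stack size from the maximum stack usage reported by the logger. The helper names `stackUsagePercent` and `suggestStackSize`, the safety margin and the rounding to a multiple of 8 bytes are assumptions made for this example and are not Mbed OS recommendations.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Mbed OS): percentage of the reserved stack
// that has been used so far, rounded down.
constexpr uint32_t stackUsagePercent(uint32_t maxUsedBytes, uint32_t reservedBytes) {
    return (reservedBytes == 0) ? 0 : (maxUsedBytes * 100u) / reservedBytes;
}

// Hypothetical helper (not part of Mbed OS): propose a stack size made of the
// measured maximum usage plus a safety margin, rounded up to a multiple of 8.
constexpr uint32_t suggestStackSize(uint32_t maxUsedBytes, uint32_t marginPercent) {
    return (maxUsedBytes + (maxUsedBytes * marginPercent) / 100u + 7u) & ~7u;
}
```

With the values reported for the main thread in the log above (2928 bytes used out of 8192), `stackUsagePercent(2928, 8192)` yields 35 and `suggestStackSize(2928, 50)` yields 4392 bytes. Keep in mind that the measured maximum only covers the code paths executed so far, so the margin should be chosen with care.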

One further possibility for getting runtime memory information and logging the memory location of the heap and of the stack of each thread is to use the RTX API directly. The code in the MemoryLogger::printRuntimeMemoryMap() method shows how to log some additional runtime memory information.

This static method uses the Thread Control Block (or TCB) structure osRtxThread_t defined in “rtx_os.h”. The TCB structure stores all information about a thread that is used by the OS for switching the context from one thread to another. If you execute this method from your BikeComputer program, you should observe the following output on the console:

MemoryLogger: printRuntimeMemoryMap
[DBG ][MemoryLogger]: Runtime Memory Map:
[DBG ][MemoryLogger]:    thread with name main, stack_start: 0x24000D28, stack_end: 0x24002D28, size: 8192, priority: 24, state: Running
[DBG ][MemoryLogger]:    thread with name rtx_idle, stack_start: 0x24003700, stack_end: 0x24003A80, size: 896, priority: 1, state: Ready
[DBG ][MemoryLogger]:    thread with name rtx_timer, stack_start: 0x24003A80, stack_end: 0x24003D80, size: 768, priority: 40, state: Ready
[DBG ][MemoryLogger]:    thread with name deferredISRThread, stack_start: 0x24005730, stack_end: 0x24006730, size: 4096, priority: 24, state: Ready
[DBG ][MemoryLogger]:    mbed_heap_start: 0x24004454, mbed_heap_end: 0x2407FC00, size: 505772
[DBG ][MemoryLogger]:    mbed_stack_isr_start: 0x2407FC00, mbed_stack_isr_end: 0x24080000, size: 1024

Note that the method also prints the priority and state of each thread. You can observe that the logging is executed from the main thread, which is the running thread.

Exercise Memory Profiling and Optimization/5

Instrument the dynamic memory usage of your BikeComputer program with the use of the MemoryLogger class. Use both the MemoryLogger::getAndPrintStatistics() method at startup and the MemoryLogger::printDiffs() method at regular intervals.
After startup, you should observe that your program no longer allocates memory on the heap and that the stack usage no longer grows.

By observing the statistics on the console, you should understand how the displayed values match your BikeComputer implementation (including the Mbed OS configuration such as the stack size of the different threads).

Hunting For Memory Bugs

Detecting a Heap Allocation Error (Memory Leak)

To illustrate the analysis of the heap memory, one practical example is to introduce a memory leak in the code on purpose. A memory leak occurs when memory that is NO longer needed is NOT released. For this purpose, you may add code that allocates memory without releasing it in a method called at regular intervals. Be aware that allocating memory without using it is not enough, since the compiler may optimize your code and remove statements that have no observable effect (like allocating an array and only assigning values to the array elements).

If you create a memory leak by creating an instance of the class MemoryLeak below in one of the task methods of your BikeComputer program and let your program run, you should observe that the allocated memory on the heap grows constantly, and ultimately you should observe a crash as illustrated in the log below:

MemoryLeak class
multi-tasking/memory_leak.hpp
#pragma once

#include "mbed.h"

namespace multi_tasking {

class MemoryLeak {
   public:
    static constexpr uint16_t kArraySize = 1024;

    // create a memory leak in the constructor itself
    MemoryLeak() { _ptr = new int[kArraySize]; }

    void use() {
        for (uint16_t i = 0; i < kArraySize; i++) {
            _ptr[i] = i;
        }
    }

   private:
    int* _ptr;
};

}  // namespace multi_tasking
Console
++ MbedOS Error Info ++
Error Status: 0x8001011F Code: 287 Module: 1
Error Message: Operator new[] out of memory

Location: 0x800F025
File: mbed_retarget.cpp+1848
Error Value: 0x5000
Current Thread: main Id: 0x240035B0 Entry: 0x8013581 StackSize: 0x2000 StackMem: 0x24000C20 SP: 0x240022E4 
Next:
main  State: 0x2 Entry: 0x08013581 Stack Size: 0x00002000 Mem: 0x24000C20 SP: 0x240022C8
Ready:
rtx_idle  State: 0x1 Entry: 0x080143A9 Stack Size: 0x00000380 Mem: 0x240035F8 SP: 0x24003928
Wait:
rtx_timer  State: 0x83 Entry: 0x08015081 Stack Size: 0x00000300 Mem: 0x24003978 SP: 0x24003C18
Delay:
For more info, visit: https://mbed.com/s/error?error=0x8001011F&osver=61700&core=0x411FC271&comp=1&ver=6160001&tgt=DISCO_H747I

Note that for getting additional error information, you need to modify the Mbed OS configuration as illustrated below:

mbed_app.json
"target_overrides": {
  "*": {
    "platform.error-all-threads-info":  1,
    "platform.error-filename-capture-enabled": 1
  }
}

From the error log above, we can observe that the system cannot allocate memory with the operator new[] called from the main thread. We also know that the error is raised at line 1848 of the “mbed_retarget.cpp” file.
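
For comparison, a leak-free variant of the MemoryLeak class could tie the buffer to the lifetime of the object. The sketch below is one possible way of doing this in standard C++ with `std::unique_ptr`; the class name `NoMemoryLeak` and the `sum()` method are hypothetical and are not part of the codelab code.

```cpp
#include <cstdint>
#include <memory>

namespace multi_tasking {

// Hypothetical counterpart of MemoryLeak: the array is owned by a
// std::unique_ptr, so it is released automatically when the object is destroyed.
class NoMemoryLeak {
   public:
    static constexpr uint16_t kArraySize = 1024;

    NoMemoryLeak() : _ptr(new int[kArraySize]) {}

    void use() {
        for (uint16_t i = 0; i < kArraySize; i++) {
            _ptr[i] = i;
        }
    }

    // return the sum of all elements so that the writes have an observable effect
    int32_t sum() const {
        int32_t total = 0;
        for (uint16_t i = 0; i < kArraySize; i++) {
            total += _ptr[i];
        }
        return total;
    }

   private:
    std::unique_ptr<int[]> _ptr;
};

}  // namespace multi_tasking
```

If such an object is created in a method called at regular intervals, the memory allocated in the constructor is returned to the heap each time the object goes out of scope, so the number of bytes allocated on the heap should no longer grow.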

Heap Fragmentation

A problem that is even harder to detect is heap fragmentation. Heap fragmentation is a phenomenon that splits the free heap space into small fragments, so that the largest available block of memory becomes smaller and smaller compared to the total available memory. The fragmentation level can be computed from the ratio between the largest available block of memory and the total available memory:

\(fragmentation = 1 - \frac{largest\ available\ block}{total\ available\ memory}\)

If the fragmentation is \(50\%\) and the available memory is 1 KiB, then the largest available block is 512 bytes. Fragmentation tends to increase over the lifetime of a program, and on embedded systems running C++ programs there is no way of defragmenting the heap. Over time, heap fragmentation tends to

  • create unreliable programs: if your program needs a bigger block than the largest available one, it will not get it and will stop working
  • and to degrade program performance: a highly fragmented heap is slower because the memory allocator takes more time to deliver a new allocated block.

These are very good reasons for using heap memory with care on embedded systems.
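
The fragmentation formula given above can be written as a small helper function. This is only a sketch: the function name `fragmentation` is hypothetical, and returning 0 when no memory is available is an arbitrary convention chosen here to avoid a division by zero.

```cpp
#include <cstddef>

// Fragmentation level: 1 - largest available block / total available memory.
// Returns 0.0 when no memory is available (convention chosen for this sketch).
double fragmentation(std::size_t largestAvailableBlock,
                     std::size_t totalAvailableMemory) {
    if (totalAvailableMemory == 0) {
        return 0.0;
    }
    return 1.0 - static_cast<double>(largestAvailableBlock) /
                     static_cast<double>(totalAvailableMemory);
}
```

With the values from the example above, `fragmentation(512, 1024)` evaluates to 0.5. Note that the size of the largest available block is not among the heap statistics printed by the MemoryLogger class, so in practice it has to be obtained or estimated by other means.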

To illustrate the heap fragmentation phenomenon, you may use the following MemoryFragmenter class in your BikeComputer program:

MemoryFragmenter class
multi-tasking/memory_fragmenter.hpp
#pragma once

#include "mbed.h"
#include "memory_logger.hpp"

namespace multi_tasking {

class MemoryFragmenter {
   public:
    // nothing is allocated in the constructor: see fragmentMemory()
    MemoryFragmenter() {}

    void fragmentMemory() {
        // create a memory logger
        MemoryLogger memorLogger;

        // get heap info
        mbed_stats_heap_t heapInfo = {0};
        mbed_stats_heap_get(&heapInfo);
        uint32_t availableSize =
            heapInfo.reserved_size - heapInfo.current_size - heapInfo.overhead_size;
        tr_debug("Available heap size is %" PRIu32 " (reserved %" PRIu32 ")",
                 availableSize,
                 heapInfo.reserved_size);

        // divide the available size by 8 blocks that we allocate
        uint32_t blockSize = (availableSize - kMarginSpace) / kNbrOfBlocks;
        tr_debug("Allocating blocks of size %" PRIu32 "", blockSize);
        char* pBlockArray[kNbrOfBlocks] = {NULL};
        for (uint32_t blockIndex = 0; blockIndex < kNbrOfBlocks; blockIndex++) {
            pBlockArray[blockIndex] = new char[blockSize];
            if (pBlockArray[blockIndex] == NULL) {
                tr_error("Cannot allocate block memory for index %" PRIu32 "",
                         blockIndex);
            }
            tr_debug("Allocated block index  %" PRIu32 " of size  %" PRIu32
                     " at address 0x%08" PRIx32 "",
                     blockIndex,
                     blockSize,
                     (uint32_t)pBlockArray[blockIndex]);
            // copy to member variable to prevent them from being optimized away
            for (uint32_t index = 0; index < kArraySize; index++) {
                _doubleArray[index] += (double)pBlockArray[blockIndex][index];
            }
        }
        // the full heap (or almost) should be allocated
        tr_debug("Heap statistics after full allocation:");
        memorLogger.getAndPrintHeapStatistics();
        // delete only the even blocks
        for (uint32_t blockIndex = 0; blockIndex < kNbrOfBlocks; blockIndex += 2) {
            delete[] pBlockArray[blockIndex];
            pBlockArray[blockIndex] = NULL;
        }
        // we should have half of the heap space free
        tr_debug("Heap statistics after half deallocation:");
        memorLogger.getAndPrintHeapStatistics();

        // try to allocate one block that is slightly bigger
        // without fragmentation, this allocation should succeed
        heapInfo = {0};
        mbed_stats_heap_get(&heapInfo);
        availableSize =
            heapInfo.reserved_size - heapInfo.current_size - heapInfo.overhead_size;
        tr_debug("Available heap size is  %" PRIu32 " (reserved  %" PRIu32 ")",
                 availableSize,
                 heapInfo.reserved_size);
        blockSize += 8;
        // this allocation will fail
        tr_debug("Allocating 1 block of size %" PRIu32 " should succeed !", blockSize);
        pBlockArray[0] = new char[blockSize];
        // copy to member variable to prevent them from being optimized away
        for (uint32_t index = 0; index < kArraySize; index++) {
            _doubleArray[index] += (double)pBlockArray[0][index];
        }
    }

   private:
    static constexpr uint8_t kNbrOfBlocks  = 8;
    static constexpr uint16_t kMarginSpace = 1024;
    static constexpr uint8_t kArraySize    = 100;
    double _doubleArray[kArraySize]        = {0};
};

}  // namespace multi_tasking

If you create an instance of this class in your BikeComputer program and call the MemoryFragmenter::fragmentMemory() method, you will observe an error on the console similar to the one shown below:

Console
[DBG ][MemoryFragmenter]: Available heap size is 501308 (reserved 506044)
[DBG ][MemoryFragmenter]: Allocating blocks of size 62535
[DBG ][MemoryFragmenter]: Allocated block index  0 of size  62535 at address 0x240055f0
[DBG ][MemoryFragmenter]: Allocated block index  1 of size  62535 at address 0x24014a48
[DBG ][MemoryFragmenter]: Allocated block index  2 of size  62535 at address 0x24023ea0
[DBG ][MemoryFragmenter]: Allocated block index  3 of size  62535 at address 0x240332f8
[DBG ][MemoryFragmenter]: Allocated block index  4 of size  62535 at address 0x24042750
[DBG ][MemoryFragmenter]: Allocated block index  5 of size  62535 at address 0x24051ba8
[DBG ][MemoryFragmenter]: Allocated block index  6 of size  62535 at address 0x24061000
[DBG ][MemoryFragmenter]: Allocated block index  7 of size  62535 at address 0x24070458
[DBG ][MemoryFragmenter]: Heap statistics after full allocation:
[DBG ][MemoryLogger]: MemoryStats (Heap):
[DBG ][MemoryLogger]:   Bytes allocated currently: 504892
[DBG ][MemoryLogger]:   Max bytes allocated at a given time: 504892
[DBG ][MemoryLogger]:   Cumulative sum of bytes ever allocated: 504892
[DBG ][MemoryLogger]:   Current number of bytes allocated for the heap: 506044
[DBG ][MemoryLogger]:   Current number of allocations: 16
[DBG ][MemoryLogger]:   Number of failed allocations: 0
[DBG ][MemoryFragmenter]: Heap statistics after half deallocation:
[DBG ][MemoryLogger]: MemoryStats (Heap):
[DBG ][MemoryLogger]:   Bytes allocated currently: 254752
[DBG ][MemoryLogger]:   Max bytes allocated at a given time: 504892
[DBG ][MemoryLogger]:   Cumulative sum of bytes ever allocated: 504892
[DBG ][MemoryLogger]:   Current number of bytes allocated for the heap: 506044
[DBG ][MemoryLogger]:   Current number of allocations: 12
[DBG ][MemoryLogger]:   Number of failed allocations: 0
[DBG ][MemoryFragmenter]: Available heap size is  251100 (reserved  506044)
[DBG ][MemoryFragmenter]: Allocating 1 block of size 62543 should succeed !

++ MbedOS Error Info ++
Error Status: 0x8001011F Code: 287 Module: 1
Error Message: Operator new[] out of memory

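As you can observe, while the available heap size is 251100 bytes, an allocation of 62543 bytes fails with an out-of-memory error. The freed blocks are separated by the blocks that are still allocated, so no contiguous free region is large enough for the request.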
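
As a rough estimate based on the log above, and assuming that the largest available block is about the size of one freed block of 62535 bytes (this value is not reported by the heap statistics, so it is an assumption), the fragmentation level is:

\(fragmentation \approx 1 - \frac{62535}{251100} \approx 0.75\)

In other words, the largest request that can be served is only about one quarter of the free heap space.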

To minimize the kind of problems illustrated above, the following guidelines are often recommended on embedded systems:

  • Prefer static allocation over dynamic allocation whenever possible.
  • Prefer automatic allocation (on the stack) when feasible: allocation on the stack is almost free, but care must be taken to avoid stack overflow errors.
  • Use private, application-specific memory pools for providing buffers of fixed size to an application (see Mbed OS Memory Pool). This avoids repeated allocation of buffers from the heap. This mechanism is used for instance by the Mbed OS Mail API, which combines a queue for exchanging messages with a memory pool for allocating them.
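
To make the idea behind the last guideline concrete, here is a minimal sketch of a pool of fixed-size blocks in standard C++. It only illustrates the principle and is not the Mbed OS MemoryPool implementation: the class name `FixedBlockPool` and its methods are hypothetical, the pool hands out raw storage without calling constructors, and it is not thread-safe.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-size block pool: N blocks of storage for objects of type T
// are reserved statically and recycled through a free list.
template <typename T, std::size_t N>
class FixedBlockPool {
    static_assert(N > 0, "the pool must contain at least one block");

   public:
    FixedBlockPool() : _freeList(&_blocks[0]) {
        // chain all blocks into the free list
        for (std::size_t i = 0; i < N; i++) {
            _blocks[i].next = (i + 1 < N) ? &_blocks[i + 1] : nullptr;
        }
    }

    // return storage for one T, or nullptr when the pool is exhausted
    T* tryAlloc() {
        if (_freeList == nullptr) {
            return nullptr;
        }
        Block* block = _freeList;
        _freeList    = block->next;
        return reinterpret_cast<T*>(block->storage);
    }

    // give a block obtained from tryAlloc() back to the pool
    void release(T* ptr) {
        Block* block = reinterpret_cast<Block*>(ptr);
        block->next  = _freeList;
        _freeList    = block;
    }

   private:
    // a free block stores the link to the next free block in its own storage
    union Block {
        Block* next;
        alignas(T) unsigned char storage[sizeof(T)];
    };

    Block _blocks[N];
    Block* _freeList;
};
```

Since all blocks have the same size, any released block can serve any later request, which is why such a pool does not suffer from fragmentation. The price to pay is that the memory of the pool is reserved even when no block is in use.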

Detecting a Stack Overflow Error

By using the memory tracing functionalities demonstrated above, we may know which threads are running and the memory space that they are using. This is very useful information for optimizing memory usage for each thread. This is also useful for debugging stack overflow errors.

Stack overflow may happen in very different situations. For understanding how to detect such errors, it is of course easier to simulate one. For this purpose, you may add code that allocates more and more memory on the stack in a thread running a loop. An example of such code is given below:

MemoryStackOverflow class
multi-tasking/memory_stack_overflow.hpp
#pragma once

#include <cstdint>

#include "mbed.h"

namespace multi_tasking {

class MemoryStackOverflow {
   public:
    void allocateOnStack() {
        // allocate an array with growing size until it does not fit on the stack anymore
        size_t allocSize = kArraySize * _multiplier;
        // Create a variable-length array on the stack
        // (a compiler extension, not standard C++)
        double anotherArray[allocSize];
        for (size_t i = 0; i < allocSize; i++) {
            anotherArray[i] = i;
        }
        // copy to member variable to prevent them from being optimized away
        for (size_t i = 0; i < kArraySize; i++) {
            _doubleArray[i] += anotherArray[i];
        }
        _multiplier++;
    }

   private:
    static constexpr size_t kArraySize = 40;
    double _doubleArray[kArraySize]    = {0};
    size_t _multiplier                 = 1;
};

}  // namespace multi_tasking

If you call the MemoryLogger::printDiffs() method at regular intervals, you will observe that the maximum number of bytes used on the stack of the thread using the MemoryStackOverflow instance increases continuously. Once the stack overflow happens, you may experience different types of errors, including an application crash or erratic application behavior. The reason is that the stack gets corrupted and that no stack corruption protection is enabled in the application.

To improve the detection of stack corruption, you may modify the Mbed OS configuration in the “mbed_app.json” file as follows:

mbed_app.json
"macros": [
    ...
    "RTX_STACK_CHECK=1"
],

If you recompile your application and run with RTX_STACK_CHECK=1, then you should get the following error on the console:

Error log
++ MbedOS Error Info ++
Error Status: 0x80020125 Code: 293 Module: 2
Error Message: CMSIS-RTOS error: Stack overflow
Location: 0x8014291
File: mbed_rtx_handlers.c+60
Error Value: 0x1
Current Thread: rtx_idle Id: 0x24003670 Entry: 0x8014411 StackSize: 0x380 StackMem: 0x24003740 SP: 0x2407FF1C 
Next:
rtx_idle  State: 0x2 Entry: 0x08014411 Stack Size: 0x00000380 Mem: 0x24003740 SP: 0x24003A70
Ready:
Wait:
rtx_timer  State: 0x83 Entry: 0x080150E9 Stack Size: 0x00000300 Mem: 0x24003AC0 SP: 0x24003D60
Delay:
main  State: 0x43 Entry: 0x080135E9 Stack Size: 0x00002000 Mem: 0x24000D68 SP: 0x24002410
For more info, visit: https://mbed.com/s/error?error=0x80020125&osver=61700&core=0x411FC271&comp=1&ver=6160001&tgt=DISCO_H747I
-- MbedOS Error Info --

Unfortunately, the error log does not always indicate a stack overflow. There are situations where the RTX stack check mechanism is not able to detect stack corruption, in which case the application ultimately crashes with a generic fault exception.

Exercise Memory Profiling and Optimization/6

Try to figure out how and where the stack overflow detection is implemented in the RTX OS implementation.

Solution

The check is implemented in the osRtxThreadStackCheck() function of the “mbed-os/cmsis/CMSIS_5/RTOS2/RTX/Source/rtx_thread.c” file. The function checks whether the current stack pointer is beyond the stack memory or whether the location reserved for the stack magic word (initialized at thread creation) no longer contains this magic word. The function is called from the SVC_ContextSaveSP assembly function (when RTX_STACK_CHECK != 0).
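
The principle of this check can be sketched on a host machine as follows. The sketch is a simplified model and not the RTX source code: the struct `ThreadModel` and the functions `initStack` and `stackOverflowDetected` are hypothetical names, the stack is assumed to grow downwards with the magic word stored at its lowest address, and the magic word value is assumed to match the `osRtxStackMagicWord` constant of “rtx_os.h”.

```cpp
#include <cstddef>
#include <cstdint>

// assumed to match osRtxStackMagicWord in rtx_os.h
constexpr uint32_t kStackMagicWord = 0xE25A2EA5u;

// Simplified model of the stack related part of a thread control block
struct ThreadModel {
    uint32_t* stackMem;  // lowest address of the stack memory
    std::uintptr_t sp;   // saved stack pointer (the stack grows downwards)
};

// at thread creation: empty stack, magic word written at the stack limit
inline void initStack(ThreadModel& thread, uint32_t* mem, std::size_t nbrOfWords) {
    thread.stackMem = mem;
    thread.sp       = reinterpret_cast<std::uintptr_t>(mem + nbrOfWords);
    mem[0]          = kStackMagicWord;
}

// check performed when the context of the thread is saved
inline bool stackOverflowDetected(const ThreadModel& thread) {
    return (thread.sp <= reinterpret_cast<std::uintptr_t>(thread.stackMem)) ||
           (thread.stackMem[0] != kStackMagicWord);
}
```

This model also hints at why some overflows are missed: the check only runs when the context is saved, so an overflow that neither leaves the stack pointer beyond the limit at that moment nor overwrites the magic word remains undetected.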

Exercise Memory Profiling and Optimization/6

Find and implement another very common way of creating a stack overflow in your BikeComputer program.

Solution

It is as simple as creating an infinite recursive call on a given thread. If you do so for instance in the ProcessingThread thread (with a call to ThisThread::sleep_for() between recursive calls), then you will get a stack overflow error after a few seconds.