7.3. Boot Troubleshooting Guide¶
Especially during development or bring-up, very early failure situations can leave the system hanging before recovery is even possible.
This guide helps diagnose and debug such issues across barebox’ different boot stages.
7.3.1. Boot Flow Overview¶
A barebox binary consists of two main components:
PBL (Pre-Bootloader): This is a smaller barebones loader that does what’s necessary to download the full barebox binary. At the very least, this is decompressing barebox proper and jumping to it while passing it a device tree. Depending on platform, it may also need to setup DRAM, install a secure monitory like TF-A or a secure operating system like OP-TEE and chainload barebox from a boot medium.
barebox proper: The main bootloader logic. This is always loaded by a prebootloader passing a device tree and including drivers for device initialization, environment setup, and booting the OS.
Refer to the barebox architecture for more background information on the two components and how they map to different boot stages and images.
If barebox hangs, it’s essential to identify where at which boot stage, this failure occurs:
Does the hang happen in the first stage, i.e., while executing from on-chip SRAM?
Or does the hang happen while in the second stage, i.e., while executing from external SDRAM?
And also which component in barebox is affected:
Is the issue in the prebootloader?
Or is barebox proper already loaded and started?
7.3.2. Enable Earlier Console Output¶
Before delving deeper into debugging, make sure to enable following options:
Enable
CONFIG_DEBUG_LL
This enables very early low-level UART debugging. It bypasses console frameworks and writes directly to UART registers. Many boards in barebox, print a
>
character, whenCONFIG_DEBUG_LL
is enabled. If you see such a character after enablingDEBUG_LL
, it indicates that the barebox prebootloader has been found and control was successfully handed over to it. Note that on some SoCs,DEBUG_LL
requires co-operation from the board entry point, e.g., the pin muxing for the serial console needs to be done in software in some situations before the UART is accessible from the outside.Note
Make sure the correct UART index or address is selected under Kernel low-level debugging por in
menuconfig
. Configuring the wrong UART might hang your system, because barebox would be tricked into accessing hardware that’s not there or is powered off. The numbering/addresses of ports are described in the System-on-Chip datasheet or reference manual and may differ from labels on the hardware. Refer to the config symbol help text and/chosen/stdout-path
in the device tree if unsure.Enable
CONFIG_DEBUG_INITCALLS
whileCONFIG_DEBUG_LL
is enabledThis shows output for each initcall level, helping pinpoint where execution stops.
CONFIG_DEBUG_LL
is useful here, because it allows showing output, even before the first serial driver is probed.If you still don’t see any output besides an early
>
ora
character, enableCONFIG_PBL_CONSOLE
andCONFIG_DEBUG_PBL
For boards that don’t have an early
putc_ll('>');
, the first output being printed is often the debugging output from theuncompress.c
entry point (barebox_pbl_start()
). Enable these options to see if the CPU gets that far.Warning
CONFIG_DEBUG_PBL increases the size of the PBL, which can make it exceed a hard limit imposed by a previous stage bootloader. Best case, this will be caught by the build system, but might not if you are adding a new board and haven’t told it yet.
The following sections each start with a list of symptoms, common problems that cause them and what to try to debug them. Skip the sections that don’t align with your symptoms.
7.3.3. Completely Silent Console¶
Even the barebox prebootloader is most often loaded by another bootloader. This is commonly a mask BootROM hardwired into the System-on-Chip.
Symptoms:
Despite enabling the config options described in the previous section, the console is fully silent.
Common problems:
Wrong bootloader image or format
Bootloader installed to wrong location
System hang before serial driver probe
Enabled, but misconfigured CONFIG_DEBUG_LL
What to try:
Check for BootROM boot indicators:
Some BootROMs (e.g. AT91) write to a serial port when they start up or blink a GPIO (e.g. STM32MP) if they fail to boot the next stage bootloader.
Check that barebox is in the format and at the location that the previous stage bootloader expects.
Compare with a previously working bootloader image, refer to the barebox documentation and/or the vendor documentation or ask around.
Output a character from the entry point
If you don’t have any calls to
putc_ll
already, you can stick your ownputc_ll('>');
there and see if it makes it to the serial port. Compare with other boards to see what initialization is needed for a serial port (pinmux, clocks, baudrate, …etc.)Toggle a GPIO from the board entry point
A number of platforms (e.g. i.MX or STM32MP) have header-only GPIO helper functions that can be used to toggle a GPIO. These can be used for debugging early hangs by toggling an LED for example.
Trace BootROM activity
If you have no indication that the barebox prebootloader is being started, consider tracing what the BootROM is doing, e.g. via JTAG or a logic analyzer for the SD card.
If you managed to get some way to output debug info, move along to the next step.
7.3.4. Hang after First Stage PBL Console Output¶
The first stage prebootloader handles: - Basic initialization (e.g., clocks, SDRAM) - Installation of secure firmware if applicable - Invocation of the second stage
Symptoms:
You see some output from the prebootloader, but you don’t see any debug messages starting with
uncompress.c:
.
Common problems:
Issues in board entry point
Hang in firmware
What to try:
Check where hang occurs
If you get just some early output, you’ll need to pinpoint, where the issue occurs. If enabling
CONFIG_PBL_CONSOLE
along with a correctly configuredCONFIG_DEBUG_PBL
doesn’t help, try addingputc_ll('@')
(or any other character) to find out, where the startup is stuck.putc_ll
has the benefit of being usable everywhere, even beforesetup_c()
is orrelocate_to_current_adr()
is called. Once these are called, you may also useputs_ll()
or just normalprintf
ifCONFIG_PBL_CONSOLE=y
.Check if hang occurs in other loaded firmware
On platforms like i.MX8/9 and RK35xx, barebox will install ARM trusted firmware as secure monitor and possibly OP-TEE as secure OS. Hangs can happen if TF-A or OP-TEE is configured to access the wrong console (hang/abort on accessing peripheral with gated clock). If output ends with the banner of the firmware, jumping back to barebox may have failed. In that case, double check that the memory size configured for TF-A/OP-TEE is correct and that the entry addresses used in barebox and TF-A/OP-TEE are identical.
7.3.5. Hang During Chainloading¶
Once basic system initialization is done, barebox prebootloader will load the second stage.
Symptoms:
You see debug messages starting with
uncompress.c:
, but none that start withstart.c:
Common problems:
Wrong SDRAM setup
Corrupted barebox proper read from boot medium
What to try:
Check computed addresses
If your last output is
jumping to uncompressed image
, this suggests that the hang occurred while trying to execute barebox proper. barebox prints the regions it uses for its stack, barebox itself and the initial RAM as debug output. Verify these with the actual size of RAM installed and check if values are sane.Check that barebox was loaded correctly
You can enable
CONFIG_COMPILE_TEST
andCONFIG_PBL_VERIFY_PIGGY
to have the barebox build system compute a hash of barebox proper, which the prebootloader will compare against the hash it computes over the compressed data read from the boot medium.Check SDRAM setup
SDRAM setup differs according to the RAM chip being used, the System-on-Chip, the PCB traces between them as well as outside factors like temperature. When a System-on-Module is used, the hardware vendor will optimally provide a validated RAM setup to be used. If RAM layout is custom, the System-on-Chip vendor usually provides tools for calculating initial timings and tuning them at runtime.
Because writes can be posted, issues with wrongly set up SDRAM may only become apparent on first execution or read and not during mere writing.
Issues of writes silently misbehaving should be detectable by
CONFIG_PBL_VERIFY_PIGGY
, which reads back the data to hash it.If the prebootloader is already running from SDRAM, boot hangs due to completely wrong SDRAM setup are less likely, but running a memory test from within barebox proper is still recommended.
Check if an exception happened
barebox can print symbolized stack traces on exceptions, but support for that is only installed in barebox proper. Early exceptions are currently not enabled by default, but can be enabled manually with
CONFIG_ARM_EXCEPTIONS_PBL
.
7.3.6. Preinitcall Stage¶
The prebootloader barebox_pbl_start
ends up calling barebox_non_pbl_start
in barebox proper. This function does:
relocation and setting up the C environment
setting up the
malloc()
area and KASANcalling
start_barebox
, which runs the registered initcalls
Symptoms:
You see debug messages starting with
start.c:
, but none that start withinitcall->
Common problems:
None, this is quite straight-forward code
What to try:
Check if the code is executed. This can be done with
putc_ll
.printf
is not safe to use everywhere in this function, because the C environment may not be set up yet.
7.3.7. Initcall Stage¶
After decompression and jumping to barebox proper, barebox will walk through the compiled in initcalls.
Symptoms:
You see debug messages starting with
initcall->
, but system hangs before reaching a shell
Common problems:
Hangs during hardware initialization
What to try:
Enable
CONFIG_DEBUG_PROBES
Initcalls don’t necessarily correspond to driver probes as a driver may be registered before a device or the device probe is postponed until resources become available.
This option prints each driver probe attempt and can help isolate the problematic peripheral.
Check what was the last executed function was
Each
initcall->
log message is followed by a barebox function name. Eachprobe->
log message is followed by the name of the device about to be probed. This should make it possible to pinpoint where the hang occurred.Add extra debugging in the file of the hang
You can add
#define DEBUG
at the start of any barebox file (before the C headers!) to print out all debug messages for that file regardless of log level.Isolate where exactly the hang occurs
By spreading some
pr_notice("%s:%d\n", __func__, __LINE__);
around the driver, you should be able to pinpoint what causes the hang.Disable drivers selectively to see if a shell can be reached.
This allows you to see if the hang is a general problem or if it’s only caused by a single device driver.
7.3.8. Interactive Console¶
Symptoms:
You see output only with
CONFIG_DEBUG_LL
, but not otherwise
Common problems:
No consoles are enabled or the user is looking at the wrong console.
What to try:
Enable
CONFIG_CONSOLE_ACTIVATE_ALL
Useful for testing. Instructs barebox proper to print out logs on all console devices that it registers.
Enable
CONFIG_CONSOLE_ACTIVATE_ALL_FALLBACK
after figuring out correct consoleThis will fall back to activating all consoles, when no console was activated by normal means (e.g., via the environment or the device tree
/chosen/stdout
property).This should make it easier to debug similar issues in future should you run into them.
7.3.9. Kernel Hang¶
Symptoms:
Hang after a line like
Loaded kernel to 0x40000000, devicetree at 0x41730000
With kernel hangs, it’s important to find out, whether the hang happens in barebox still or already while executing the kernel. Without EFI loader support in barebox, there is no calling back from kernel to barebox, so a kernel hanging is usually indicative of an issue within the kernel itself.
It’s often useful to copy the kernel image into /tmp
instead of booting directly
to verify that the hang is not just a very slow network connection for example.
The -v
option to cp - copy files is useful for that.
The file size copied may differ from the original if the mean of transport rounds
up to a specific block size. In that case, round up the size on the host system
and run a digest function like md5sum - calculate MD5 checksum to check that the image
was transferred successfully.
If the image is transferred correctly, the boot - boot from script, device, … verbosity is increased
by each extra -v
option. At higher verbosity level, this will also print out
the device tree passed to the kernel. The of_diff - diff device trees command is useful
to visualize only the fixups that were applied by barebox to the device tree.
If you are sure that the kernel is indeed being loaded, the earlycon
kernel
feature can enable early debugging output before kernel serial drivers are loaded.
barebox can fixup an earlycon option if global.bootm.earlycon=1
is specified.
7.3.10. Spurious Aborts/Hangs¶
Symptoms:
Hangs/panics/aborts that happen in a non-deterministic fashion and whose probability is greatly influenced by enabling/disabling barebox options and corresponding shifts in the barebox binary
It’s generally advisable to run a memory test to verify basic operation and to check if the RAM size is sane. barebox provides two commands for this: memtest - extensive memory test and memtester - memory stress-testing. In addition, some silicon vendors like NXP provide their own memory test blobs, which barebox can load to SRAM via memcpy - memory copy and execute using go - start application at address or file. By having the memory test outside DRAM, a much more thorough memory test is possible.
With CONFIG_MMU=y
, the decompression of barebox proper in the prebootloader
and the runtime of barebox proper will execute with MMU enabled for improved performance.
This increase in performance is due to caches and speculative execution. barebox will mark memory mapped I/O devices and secure firmware as ineligible for being accessed speculatively, but it can only do so if the memory size it’s told is correct and if secure memory is marked reserved in the device tree.
The memory map as barebox sees it can be printed with the iomem - show IO memory usage
command. Everything outside ram
region is mapped non executable and uncacheable
by default. Everything inside ram
regions that doesn’t have a [R]
next
to it is cacheable by default. The mmuinfo - show MMU/cache information of an address command can be used
to show specific information about the MMU attributes for an address.
7.3.11. Memory Corruption Issues¶
Some hangs might be caused by heap corruption, stack overflows, or use-after-free bugs.
What to try:
Enable
CONFIG_KASAN
(Kernel Address Sanitizer)This provides runtime memory checking in barebox proper and can detect invalid memory accesses.
Warning
KASAN gratly increases memory usage and may itself cause hangs in constrained environments.
7.3.12. Summary of Debug Options¶
Option |
Description |
---|---|
CONFIG_DEBUG_LL |
Early low-level UART output |
CONFIG_PBL_CONSOLE |
Print statements from PBL |
CONFIG_DEBUG_PBL |
Enable all debug output in the PBL |
CONFIG_PBL_VERIFY_PIGGY |
Verify barebox proper in PBL before decompression |
CONFIG_ARM_EXCEPTIONS_PBL |
Enable exception handlers in PBL |
CONFIG_DEBUG_INITCALLS |
Logs each initcall |
CONFIG_DEBUG_PROBES |
Logs each driver probe |
CONFIG_KASAN |
Detects memory corruption |
7.3.13. Final Tips¶
Reach out to other barebox users
Search the mailing list, send a mail yourself or ask on IRC/Matrix.
If all else fails, a JTAG debugger to single-step through the code can be very useful. To help with this,
CONFIG_PBL_BREAK
triggers an exception at the start of execution of the individual barebox stages, whichscripts/gdb/helper.py
can use to correctly set the base address, so symbols are correctly located.