Embedded Linux Boot Time Optimization
Some applications have specific requirements for a system’s boot time. Often the system does not need to be immediately ready for all its tasks, but it should be ready for certain mission-critical tasks (e.g. accepting commands over Ethernet or displaying a user interface). This article provides a few methodologies and low-hanging fruit for improving boot time on Toradex System on Modules.
Note: A few tips mentioned in this article require recompiling the U-Boot, Kernel or rebuilding a root file system from scratch. Please refer to their respective articles on our developer website.
Before starting the optimization, we need an appropriate method to measure the boot time. If an exact end-to-end boot time is required, it might even be necessary to involve the hardware (e.g. GPIOs and an oscilloscope). In most cases, simply monitoring the serial port from a host system is accurate enough. A popular utility for monitoring the timing of serial output is Tim Bird's grabserial. It adds a timestamp to each line captured from the serial port, as shown below:
$ ./grabserial -d /dev/ttyUSB1 -t
[0.000002 0.000002]
[0.000171 0.000169]
[0.000216 0.000045] U-Boot 2015.04-00006-g6762920 (Oct 12 2015 - 15:35:50)
[0.005177 0.004961]
[0.005227 0.000050] CPU: Freescale Vybrid VF610 at 500 MHz
[0.008938 0.003711] Reset cause: POWER ON RESET
[0.011153 0.002215] DRAM: 256 MiB
[0.063692 0.052539] NAND: 512 MiB
[0.065568 0.001876] MMC: FSL_SDHC: 0
The first number represents the time stamp (since the first character was received) while the second number shows the delta between the timestamps of the current and the last line.
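For longer captures it can help to rank the lines by their delta instead of reading the log by eye. The following is a minimal sketch, assuming the capture uses the "[timestamp delta] message" prefix shown in this article; the function names are made up for illustration and are not part of grabserial. Note that a delta is the time that passed before the line appeared, so it covers the work done between the previous line and this one.

```python
import re

# Matches the "[absolute delta] message" prefix that grabserial -t prints.
LINE_RE = re.compile(r"^\[(\d+\.\d+) (\d+\.\d+)\]\s?(.*)$")


def parse_grabserial(lines):
    """Return (absolute_s, delta_s, message) tuples for timestamped lines."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(
                (float(match.group(1)), float(match.group(2)), match.group(3))
            )
    return entries


def slowest(entries, count=3):
    """Return the entries with the largest delta, largest first."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)[:count]


# Lines taken from the capture shown in this article.
SAMPLE = """\
[0.000216 0.000045] U-Boot 2015.04-00006-g6762920 (Oct 12 2015 - 15:35:50)
[0.005177 0.004961]
[0.005227 0.000050] CPU: Freescale Vybrid VF610 at 500 MHz
[0.008938 0.003711] Reset cause: POWER ON RESET
[0.011153 0.002215] DRAM: 256 MiB
[0.063692 0.052539] NAND: 512 MiB
[0.065568 0.001876] MMC: FSL_SDHC: 0
""".splitlines()

if __name__ == "__main__":
    for absolute, delta, message in slowest(parse_grabserial(SAMPLE)):
        print(f"{delta:.6f}s elapsed before {message!r} (at {absolute:.6f}s)")
```

In this sample, the largest gap (about 52ms) precedes the NAND line, which makes it the first place to look.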
This article is generally applicable to all of our modules. However, I do present some measurements and improvements specifically using our NXP®/Freescale Vybrid-based module - Colibri VF61.
There are roughly three phases of a Linux system boot, which are listed below and will be examined during the course of this blog.
- Boot loader
- Linux kernel
- User space (init system)
There are actually two more phases before the boot loader can run: Hardware initialization and boot ROM. The hardware initialization phase is needed to fulfil power sequencing requirements and bus or SoC reset timing requirements. This phase is usually fixed and in the range of 10-200ms. Arm SoCs boot from a firmware stored on an internal ROM. This firmware loads the boot loader from the boot media. The runtime is usually rather short and influenced by the boot loader’s size. Other than minimizing the boot loader’s size, optimizations are rather hard. Real optimization potential and flexibility are within the boot loader (U-Boot).
With the current release V2.5 Beta 1, the time from the first character to the Kernel start is ~1.85 seconds. This involves the following steps:
- U-Boot initialization (~110ms, measured from the first character received)
- Autoboot delay (1s)
- UBI initialization and UBIFS mount (~300ms, thanks to a feature called Fastmap. Without Fastmap it would take around 1.6s)
- Loading the kernel (375ms)
- Loading and patching the device tree (~35ms)
- And finally, the jump to the kernel's start address
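To see where the time goes, the step durations listed above can be added up and compared with the measured total. This is only a rough sketch: the numbers are the approximate values from the list, so the small unaccounted remainder is within measurement rounding.

```python
# Approximate step durations in milliseconds, as listed above.
STEPS_MS = {
    "U-Boot initialization": 110,
    "Autoboot delay": 1000,
    "UBI initialization and UBIFS mount": 300,
    "Loading the kernel": 375,
    "Loading and patching the device tree": 35,
}

# Measured time from the first character to the kernel start (~1.85 s).
MEASURED_TOTAL_MS = 1850


def breakdown(steps_ms, measured_total_ms):
    """Return rows of (name, ms, percent of total), largest first,
    plus the time not covered by any listed step."""
    rows = sorted(
        ((name, ms, 100.0 * ms / measured_total_ms) for name, ms in steps_ms.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    unaccounted_ms = measured_total_ms - sum(steps_ms.values())
    return rows, unaccounted_ms


if __name__ == "__main__":
    rows, unaccounted_ms = breakdown(STEPS_MS, MEASURED_TOTAL_MS)
    for name, ms, percent in rows:
        print(f"{name:40s} {ms:5d} ms  {percent:5.1f} %")
    print(f"{'Not covered by the list':40s} {unaccounted_ms:5d} ms")
```

The autoboot delay alone accounts for more than half of the time spent in the boot loader.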
The obvious optimization is reducing the Autoboot delay. This can be set to zero using:
setenv bootdelay 0
saveenv
This can also be configured as a default by using the CONFIG_BOOTDELAY config symbol. But in the current release, with a bootdelay of 0, there is no way to get into the boot loader’s console directly. U-Boot provides an option called CONFIG_ZERO_BOOTDELAY_CHECK which will check for one character even if the bootdelay is 0. We have added this option to our default configuration for the next release.
Serial output is sent synchronously. This means that the CPU waits until each character has been sent over the serial line. Therefore, each character that is printed slows down U-Boot. Since UBI in particular prints a lot of informational messages, there is potential for optimization. U-Boot provides a config symbol for this: CONFIG_UBI_SILENCE_MSG, which suppresses these messages.
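The cost of synchronous output can be estimated from the line speed. The sketch below assumes a console running at 115200 baud with 8N1 framing (10 bits on the wire per character); both values are assumptions and need to be adjusted to the actual console settings. The result is a rough lower bound, since it ignores any FIFO buffering in the UART.

```python
def serial_time_ms(num_chars, baud=115200, bits_per_char=10):
    """Estimate the wire time in milliseconds for num_chars characters.

    bits_per_char=10 assumes 8N1 framing: 1 start bit, 8 data bits and
    1 stop bit. The baud rate default is an assumption as well.
    """
    return 1000.0 * num_chars * bits_per_char / baud


if __name__ == "__main__":
    for chars in (100, 1000, 5000):
        print(f"{chars:5d} characters: about {serial_time_ms(chars):.1f} ms")
```

Under these assumptions, every 1000 characters printed cost roughly 87ms, which is why silencing verbose messages pays off.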
Ensuring that the hardware is used as efficiently as possible requires insight into what the hardware is capable of and what is currently implemented. A feature that was missing until now was the Level 2 cache (only on Colibri VF61). After implementing Level 2 cache support, the boot time improved by more than 40ms.
Removing certain features reduces both the relocation time and the time spent initializing those features. By removing display support (DCU), EXT3 and EXT4 support, as well as USB peripheral drivers such as DFU and mass storage, we decreased the size of U-Boot to 366kB and shaved off another 10ms.
According to the timestamps, most of the time is spent attaching UBI and mounting the UBIFS as well as loading the kernel (~380ms). The kernel size and the load time correlate linearly; hence, reducing the kernel size will improve the boot time further.
To measure only the kernel boot time, the “match” feature of grabserial can be used to reset the timestamp at the last message printed by the boot loader:
./grabserial -d /dev/ttyUSB1 -t -m "^Starting kernel.*"
The end of the kernel boot is somewhat hard to determine, since the kernel continues to initialize hardware even after the root file system has been mounted and the first user space process (init) starts running (delayed initialization). The string “Freeing unused kernel memory” is the last message emitted before the init process is started, and hence marks the end of the kernel's “linear” init procedure (see kernel_init in init/main.c). We will use the timestamp of that message to compare boot times. The shipped kernel has a compressed size of 4316kB and a boot time of 2.56 seconds.
Similar to U-Boot, the Linux kernel prints all messages synchronously to the serial console. The exact behaviour depends on the serial console used, but the LPUART (the driver for Vybrid’s console) waits synchronously until the character is sent over the serial port. This has the advantage that when the kernel crashes, all the messages up to that point are visible. If the messages were sent asynchronously, the last visible message would not indicate the location of a crash…
The kernel has an argument to minimize the number of kernel messages displayed: “quiet”. However, this also silences our anchor for the boot time measurement (“Freeing unused kernel memory”). The easiest way to get the message back on the screen is to elevate the log level for that particular print statement. It is located in ‘mm/page_alloc.c’ - search for “Freeing %s memory”. I elevated the message to ‘pr_alert’. The measurement showed an improvement of 1.55 seconds (from 2.56 seconds to about 1.01 seconds), which is a speed-up of more than a factor of 2!
The easiest way to achieve further improvements is by removing features. The Yocto project has a handy tool called ksize.py, which needs to be started from within a kernel build directory. The tool prints tables identifying the size of individual kernel parts. The first table shows a high-level overview (use make clean before building to get an accurate overview):
Linux Kernel              total |     text     data      bss
-------------------------------------------------------------------
vmlinux                 8305381 |  7882273   247732   175376
drivers/built-in.o      2010229 |  1881545   109796    18888
fs/built-in.o           1944926 |  1911100    19422    14404
net/built-in.o          1477404 |  1398316    44832    34256
kernel/built-in.o        628094 |   514935    17099    96060
sound/built-in.o         326322 |   316298     8248     1776
mm/built-in.o            288456 |   276492     8000     3964
lib/built-in.o           160209 |   157659      217     2333
block/built-in.o         137262 |   133614     2420     1228
crypto/built-in.o        104157 |   100063     4082       12
security/built-in.o       37391 |    36303      788      300
init/built-in.o           31064 |    16208    14772       84
ipc/built-in.o            29366 |    28640      722        4
usr/built-in.o              138 |      138        0        0
-------------------------------------------------------------------
sum                     7175018 |  6771311   230398   173309
delta                   1130363 |  1110962    17334     2067
Which features can be removed safely is obviously application specific. Going through the individual high-level directories helps to quickly identify the most promising candidates. For this article, I removed several file systems (cifs, nfs, ext4, ntfs), the audio subsystem, multimedia support, USB and wireless network adapter support. The kernel ended up at about 3356kB, roughly 1MB less than before. This also decreased the kernel loading time in the boot loader by about 85ms.
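The measured saving fits the linear relationship between kernel size and load time mentioned earlier. As a rough cross-check, the sketch below scales the load time of the shipped kernel (375ms for 4316kB) to the reduced size. It assumes a purely proportional model with no fixed overhead, which is a simplification.

```python
# Reference point from this article: the shipped kernel.
REF_SIZE_KB = 4316
REF_LOAD_MS = 375


def estimate_load_ms(size_kb, ref_size_kb=REF_SIZE_KB, ref_load_ms=REF_LOAD_MS):
    """Estimate the load time for a kernel of size_kb, assuming the load
    time is proportional to the image size (no fixed overhead)."""
    return ref_load_ms * size_kb / ref_size_kb


def estimated_saving_ms(old_size_kb, new_size_kb):
    """Estimated reduction in load time when the kernel shrinks."""
    return estimate_load_ms(old_size_kb) - estimate_load_ms(new_size_kb)


if __name__ == "__main__":
    saving = estimated_saving_ms(4316, 3356)
    print(f"Estimated load time at 3356kB: {estimate_load_ms(3356):.0f} ms")
    print(f"Estimated saving: {saving:.0f} ms")
```

The estimate comes out at a little over 80ms, close to the measured value.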
Another idea is to evaluate different compression algorithms, although the current default in our kernel configuration, LZO, is already quite a good choice.
In Linux user space, initialization is done by the init system. The Toradex BSP images use systemd, the standard init system of Ångström. Systemd, the de facto standard init system on the Linux desktop nowadays, is very feature-rich and is designed especially with dynamic systems in mind. Systemd also addresses boot time: multiple daemons are started simultaneously (leveraging today's multi-core systems), socket activation allows delayed loading of services at a later point in time, and device activation allows starting services on demand. Furthermore, the integrated logging daemon journald saves space due to binary-packed log files and sophisticated log file management.
Depending on the application, an embedded system might be rather static, so the dynamic features of systemd are not really needed. Unfortunately, systemd is not very modular, and its individual modules have interlocked dependencies. This makes it hard to strip systemd down to a bare minimum. This section is separated into two parts: the first part shows systemd boot optimization techniques, whereas the second part looks at System V and other alternatives.
In both parts we use the “Freeing unused kernel memory” message as the base time for time measurement:
./grabserial -d /dev/ttyUSB1 -t -m "^\[ *[0-9.]*\] Freeing unused kernel memory.*"
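Since grabserial is a Python script, a match pattern can be tried out against a few console lines with Python's re module before a measurement run. The sketch below assumes that the kernel prefixes each message with a printk timestamp such as "[    1.012345]", and it spells that prefix as a bracketed character class. The sample lines are made up for illustration and are not taken from a real capture.

```python
import re

# Pattern for the anchor message: an optional-width printk timestamp in
# square brackets, followed by the message text.
MATCH_PATTERN = r"^\[ *[0-9.]*\] Freeing unused kernel memory.*"


def is_anchor(line, pattern=MATCH_PATTERN):
    """Return True if the console line matches the anchor pattern."""
    return re.search(pattern, line) is not None


# Hypothetical console lines, for illustration only.
SAMPLES = [
    "[    0.000000] Booting Linux on physical CPU 0x0",
    "[    1.012345] Freeing unused kernel memory: 244K",
]

if __name__ == "__main__":
    for line in SAMPLES:
        print(f"{'match   ' if is_anchor(line) else 'no match'}  {line}")
```

If the kernel is built without printk timestamps, the bracketed prefix is absent and the pattern needs to be adapted accordingly.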
For this blog post, we define the login shell on the serial console as the critical task. The login shell service is defined as “Type=idle”, which means that, by definition, it starts only after all other services have been started.
To start a headless or framebuffer-based application, one would typically create a new service. Systemd allows defining the requirements a service needs before it can be started (e.g. network with “Wants=network-online.target”) and then automatically ensures that the service gets started as soon as the requirements are met. However, since services are started in parallel, the CPU resources are shared amongst them. Still, the application is likely to be up and running before the serial console becomes available, hence the following numbers may appear to be on the higher side.
The quiet argument in the kernel arguments is also picked up by systemd. This change already has a positive effect on the systemd boot time, shaving off about 1.6s in the process.
Systemd provides a utility called systemd-analyze, which prints a list of services and their startup times when invoked with the “blame” argument. This makes it quite easy to find boot time offenders; however, the values might be misleading, since the time is measured as wall clock time. A listed service might just be sleeping while the CPU is processing other work. So the service at the top of the list may not be the biggest boot time offender, especially on a single-core system.
Services can be disabled using the disable command. Some services (especially the services provided by systemd itself) might need the mask command to disable them. Some might still be required for the system to operate; hence, disabling services should be done carefully and only one at a time. For this article, the following services have been disabled:
systemctl disable usbg
systemctl disable connman.service # replaced with networkd
systemctl mask alsa-restore.service
Systemd comes with its own system logging daemon called journald. It is one of those components that should not be disabled entirely. During boot, the logging daemon needs to manage and delete old log files on the disk as well as write new log entries to the disk. Disabling logging to disk already improves boot time, at the cost of having no log files stored, of course. Use Storage=none in /etc/systemd/journald.conf to disable the log storage part.
System V init and other alternatives
For many years, SysV init was also the standard init system on Linux. Because it is based on init scripts, it is very modular and relatively easy to strip down to a bare minimum. Especially for relatively static systems, where systemd's device activation or socket activation are not needed, SysV is a good alternative.
The Yocto project’s reference distribution “poky”, which I blogged about in my last article (The Yocto Project's Reference Distribution “Poky” on Toradex Hardware), uses SysV by default. Using the ‘console-image-minimal’ image and a static IP address configuration, the measured user space boot time on Colibri VF61 is ~2.3s.
The meta-yocto layer also provides ‘poky-tiny’, which uses just a shell script as the init system. Just replace the distribution with “poky-tiny” and build the usual Yocto image, such as ‘console-image-minimal’. The distribution is meant to be used as an initramfs; however, by removing MACHINE_ESSENTIAL_EXTRA_RDEPENDS, IMAGE_FSTYPES and PREFERRED_PROVIDER_virtual/kernel from the conf/distro/poky-tiny.conf file, I was able to build a working UBIFS image. To properly “reconfigure” the distribution for a flashable root file system, one should create a new distribution layer and copy the distribution configuration file. The “boot time” to the shell is obviously very short (220ms), allowing execution of a simple command with an overall boot time of just below 2 seconds. However, it provides almost no features other than mounting the root file system, some basic virtual file system support and a shell. Still, depending on the number of features needed in a project, this could be a good starting point.
Jordon Wu - 3 years 1 month
Where to get the Boot Time Optimization patch or config? thanks
Toradex - 3 years
Hi Jordon, there is an OpenEmbedded layer which contains the changes as patch files: https://github.com/toradex/meta-toradex-extra/tree/toradex-fast-boot
sukesh - 3 years 9 months
This article is nice. You can also remove these 3 lines:
[0.005227 0.000050] CPU: Freescale Vybrid VF610 at 500 MHz
[0.011153 0.002215] DRAM: 256 MiB
[0.063692 0.052539] NAND: 512 MiB
At production level they are not needed.
Stefan Agner - 3 years 9 months
Thank you for your feedback, Sukesh. Since all output is sent synchronously over the serial port, every omitted character saves time... at the cost of having less debug information. I agree the mentioned lines are not very important in a product and hence are good candidates to omit. You can also use CONFIG_SILENT_CONSOLE along with the environment variable "silent" to get rid of all console output.
Sergey - 2 years 8 months
Hello Stefan, I cannot build "toradex-tiny" with the new "krogoth" branch.
ERROR: Unbuildable tasks were found.
These are usually caused by circular dependencies ...
Could you help me with this?
Stefan Agner - 2 years 8 months
Hi Sergey, so far we have not ported the demo to krogoth, but I don't see any practical reason why it should not work. Note that I did not use a special image; I just used the standard image "console-image-minimal" along with the poky-tiny distribution. If you still experience problems, use our official support channels https://www.toradex.com/support (preferably Community) and include detailed information about what exactly you did. Thanks, Stefan.