Running Qt without a GPU

Tuesday, March 23, 2021
KDAB

Is it possible to get complex graphical software like Qt running smoothly on a small, economical device like the Toradex Colibri iMX6ULL? Yes – it sure is.

At KDAB, we first experimented with this on an NXP® i.MX6 ULL development board together with The Qt Company, then by ourselves on Toradex's Colibri iMX6ULL. Read on to learn about our journey and what we discovered.

Image: The Toradex Colibri iMX6ULL is powerful enough to run Qt Quick apps and H.264 video playback.

Qt and the i.MX6ULL

If you’re not familiar with the i.MX6 ULL, it’s an option in the i.MX6 applications processor family that won’t break the bank and is energy efficient. This part’s tradeoff, especially if you’re used to GHz-level quad core chips, is in performance. It comes with a single Cortex-A7 core that was clocked at 528 MHz on our NXP development board. It also doesn’t have a graphics accelerator. (Well, it mostly doesn’t – it does have a unit with some basic pixel processing assistance that we’ll be discussing later on.) The display that came bundled with the development board from NXP has a resolution of 480×272 pixels.

Back in the old days, Qt Quick relied on OpenGL, and running it on systems without a GPU wasn’t feasible. However, Qt Quick now has a software rendering option, making it possible to use Qt Quick on devices without a GPU. As you can imagine, rendering all graphics on the CPU through a non-accelerated Linux frame buffer wasn’t terribly fast at first.

However, after some fixes developed by KDAB and The Qt Company, the performance on the NXP i.MX6 ULL is now very competitive. Video playback on the Toradex Colibri iMX6ULL is smooth too – with assistance from the i.MX6 ULL’s PXP engine and within some constraints.

Starting with the software renderer

At the beginning, our primary focus was on analyzing and optimizing text rendering and full screen Qt Quick animations. KDAB did performance measurements on the device using perf and Hotspot and found some significant bottlenecks. Together with The Qt Company, we developed some noteworthy patches.

  • In the Qt Quick software renderer, we removed two unneeded alpha blending operations. Images without an alpha channel were always being blended; now they are simply copied. Also, layers were formerly blended unconditionally; now the renderer checks whether a layer covers its area completely and skips the blend when it is not needed. Removing unneeded alpha blending eliminates a handful of expensive per-pixel reads and multiplies.
  • Font drawing was improved. This included removing unnecessary temporary allocations, as well as simplifying the computation for glyph drawing when no gamma correction is used.
  • Finally, we created a new plug-in for the Linux frame buffer, linuxfb. This plug-in implements window compositing which, in the rather common special case of having only one visible window, skips the temporary compositing buffer and copies the window directly to the frame buffer.
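
To illustrate why skipping the blend matters, here is a minimal C++ sketch that contrasts a source-over blend with a plain copy. It is not Qt's actual implementation: the function names, the div255 helper, and the choice of premultiplied 32-bit ARGB pixels are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: x / 255 with rounding, valid for x in [0, 255 * 255].
static inline uint32_t div255(uint32_t x) {
    const uint32_t t = x + 128;
    return (t + (t >> 8)) >> 8;
}

// Source-over blend of one premultiplied ARGB32 pixel:
// result = src + dst * (255 - srcAlpha) / 255, per channel.
static inline uint32_t blendPixel(uint32_t src, uint32_t dst) {
    const uint32_t inv = 255 - (src >> 24);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const uint32_t s = (src >> shift) & 0xff;
        const uint32_t d = (dst >> shift) & 0xff;
        uint32_t c = s + div255(d * inv);
        if (c > 255)
            c = 255;  // guard against input that is not properly premultiplied
        out |= c << shift;
    }
    return out;
}

// Slow path: every pixel costs a destination read and four multiplies.
void blendRow(const std::vector<uint32_t>& src, std::vector<uint32_t>& dst) {
    assert(src.size() == dst.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = blendPixel(src[i], dst[i]);
}

// Fast path: an opaque source simply overwrites the destination.
void copyRow(const std::vector<uint32_t>& src, std::vector<uint32_t>& dst) {
    assert(src.size() == dst.size());
    if (!src.empty())
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(uint32_t));
}

// Choose the cheap path when the caller knows the source has no alpha channel.
void drawRow(const std::vector<uint32_t>& src, std::vector<uint32_t>& dst,
             bool srcIsOpaque) {
    if (srcIsOpaque)
        copyRow(src, dst);
    else
        blendRow(src, dst);
}
```

For fully opaque pixels both paths produce the same result, so the only difference is the work done per pixel, which adds up quickly on a single 528 MHz core.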

Because patches are typically merged into Qt, you’ll get the benefit of these patches in a more recent version (5.12.5 or later). Notably, the prototype plug-in for linuxfb is not integrated yet – ask us for it if you need it.

Image: Optimizing Qt Quick’s software rendering made UIs run more smoothly in our tests.

Minimizing Qt library size

After these improvements, our focus was on reducing the Qt library size. Qt library size optimizations are not mandatory: after all, the Toradex Colibri iMX6ULL module comes with 512 MB of flash memory. But it was an important criterion for us, as smaller images can help improve the application load time, leave more space for user data, and reduce network transfer times for software updates.

Qt offers a number of configuration script options that will include or exclude modules based on your needs; this can significantly shrink the Qt library size. Most features default to either always enabled or enabled when the required third-party library is available. KDAB’s goal was to find a sensible collection of options for a minimal Qt build that supported real world embedded Qt Quick applications yet wouldn’t change based on library availability.

In addition to the relatively coarse-grained options in the configuration script, Qt also has a tool called Qt Lite that can achieve even more fine-grained control over library size. To simplify the task as well as make the results more generally applicable, we opted not to use Qt Lite; it’s more of a special purpose tool, needed for when every byte really counts.

The main third-party library dependencies of Qt 5.9.1 (the version used when we investigated this) are zlib and pcre. But for a reasonable base for UI development, at least text and image rendering should be supported, so freetype, harfbuzz, jpeg, and png support also needed to be enabled.

Here is our list of options for a minimal Qt build:
-system-zlib -qt-pcre -qt-freetype -qt-libjpeg -qt-libpng -qt-harfbuzz \
-no-cups -no-iconv -no-sm -no-feature-vnc -no-widgets -no-ico -no-gif -no-glib -no-gtk \
-no-sql-mysql -no-sql-psql -no-sql-sqlite -no-sql-sqlite2 \
-no-icu -no-openssl -no-fontconfig -no-dbus -no-qml-debug

If you’re building your own minimal Qt build, consider this a starting point: in many cases you will need to re-enable some options for features your application uses. Please note that we also omitted project-specific options regarding hardware integration.

Finally, we used options to shrink the compiler’s generated code:
-optimize-size -ltcg

A build of all installed Qt shared libraries (including the QML and Quick modules) using all these options came to about 15 MiB in total. Compiling Qt statically and linking it into a basic Hello-World Qt Quick application produced a binary of about 8.3 MiB, after stripping. One interesting observation is that link-time optimization (LTO, enabled by -ltcg) does indeed help noticeably. Without it, the dynamic libraries were around 800 KiB larger and the static binary 1.5 MiB larger.

(Checking how things look today with Qt 5.15.2 and the same options, Qt has grown a bit. A dynamic library build now comes in at around 25.7 MiB without LTO and 23.8 MiB with LTO.)
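
To put those numbers in proportion, the following C++ sketch computes the relative saving from LTO using the sizes quoted above. The function and constant names are made up for this example, and since the Qt 5.9.1 figures are themselves approximate ("about 15 MiB", "around 800 KiB"), the results should be read as rough estimates.

```cpp
#include <cassert>

// Fraction of the non-LTO size that LTO removes: (without - with) / without.
double ltoSaving(double withoutLtoMiB, double withLtoMiB) {
    assert(withoutLtoMiB > 0.0 && withLtoMiB <= withoutLtoMiB);
    return (withoutLtoMiB - withLtoMiB) / withoutLtoMiB;
}

// Sizes quoted in the text, in MiB. The Qt 5.9.1 "without LTO" values are
// derived by adding the quoted differences (800 KiB and 1.5 MiB).
const double kShared591With = 15.0;
const double kShared591Without = 15.0 + 800.0 / 1024.0;
const double kStatic591With = 8.3;
const double kStatic591Without = 8.3 + 1.5;
const double kShared5152With = 23.8;
const double kShared5152Without = 25.7;
```

With these inputs, ltoSaving gives roughly 5% for the Qt 5.9.1 shared libraries, roughly 15% for the static Hello-World binary, and roughly 7% for the Qt 5.15.2 shared libraries. The static build benefits most, which fits the expectation that whole-program optimization can discard more unused code when everything is linked into one binary.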

PXP: pixel pipeline (and video helper)

Our next task was to investigate the possibility of playing back video. That’s a much bigger load on a processor that doesn’t have any hardware video decoding or graphics acceleration. And that’s where the i.MX6 ULL’s Pixel Pipeline (PXP) comes in. Although the PXP isn’t a GPU, it can do some basic graphical operations, like color space conversion, blitting, scaling and blending. Since most popular video formats require a frame-by-frame YCbCr to RGB color conversion, a pretty CPU-intensive task, it is well worth using the PXP for that alone. Luckily, since the imx GStreamer plugin supports the PXP, coding up a proof of concept was quite easy.
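
To give a feel for the work the PXP takes off the CPU, here is a minimal C++ sketch of a per-pixel YCbCr to RGB conversion. It assumes BT.601 "video range" input and the widely used integer approximation of that matrix. The post does not say which matrix or sample layout the video or the PXP uses, so treat the coefficients, the type and function names, and the simplified 4:4:4 plane layout as assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgb {
    uint8_t r, g, b;
};

static inline uint8_t clampToByte(int v) {
    return static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Assumes BT.601 video range: Y in [16, 235], Cb and Cr in [16, 240].
// Fixed-point coefficients are scaled by 256; +128 rounds before dividing.
Rgb ycbcrToRgb(uint8_t y, uint8_t cb, uint8_t cr) {
    const int c = y - 16;
    const int d = cb - 128;
    const int e = cr - 128;
    return Rgb{clampToByte((298 * c + 409 * e + 128) / 256),
               clampToByte((298 * c - 100 * d - 208 * e + 128) / 256),
               clampToByte((298 * c + 516 * d + 128) / 256)};
}

// Converts a frame stored as three equally sized planes. Real video is
// usually chroma-subsampled (for example 4:2:0); 4:4:4 keeps the sketch short.
std::vector<Rgb> convertFrame(const std::vector<uint8_t>& y,
                              const std::vector<uint8_t>& cb,
                              const std::vector<uint8_t>& cr) {
    assert(y.size() == cb.size() && y.size() == cr.size());
    std::vector<Rgb> out;
    out.reserve(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
        out.push_back(ycbcrToRgb(y[i], cb[i], cr[i]));
    return out;
}
```

Every output pixel costs several multiplies, additions, and clamps, and this has to be repeated for every pixel of every frame, which is why handing the conversion to dedicated hardware pays off.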

We integrated the GStreamer video decoding into our Qt Quick application by creating a simple video decoding pipeline. At the end, the imx PXP plugin converted the video frames to raw RGB images. Then we extracted the images out of the pipeline with an appsink element and drew them using a custom QQuickPaintedItem.

With this setup, surprisingly smooth 25 fps full-screen H.264 playback was possible, at around 40% CPU usage. Of course, full screen on this device is slightly less than half of VGA resolution. We haven’t tried video playback at larger resolutions, but we were able to verify that a regular Qt Quick UI works well on a larger 800×480 screen.
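
The resolution comparison can be checked with a little arithmetic. The C++ sketch below uses only the figures mentioned in this post (480×272, 800×480, 25 fps) plus the standard 640×480 VGA size; the constant names are made up for this example.

```cpp
// Pixel counts for the resolutions discussed in the text.
constexpr long pixelCount(long width, long height) {
    return width * height;
}

constexpr long kDevicePixels = pixelCount(480, 272);  // 130,560
constexpr long kVgaPixels = pixelCount(640, 480);     // 307,200
constexpr long kLargePixels = pixelCount(800, 480);   // 384,000

// Pixels that must be color-converted and drawn each second at 25 fps.
constexpr long kDevicePixelsPerSecond = kDevicePixels * 25;  // 3,264,000

// The bundled display has a bit less than half as many pixels as VGA.
static_assert(2 * kDevicePixels < kVgaPixels, "device is under half of VGA");
```

The 800×480 screen has almost three times as many pixels as the 480×272 display, so full-screen video at that size would mean correspondingly more per-frame work.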

Future improvements: Squeezing more from the PXP

Aside from using the PXP for video playback, we also think the PXP could help in standard Qt Quick rendering. For regular application UI rendering, significant CPU time is spent copying the rendering result to the frame buffer. Although we started to experiment with hacking up the Qt Linux frame buffer plugin to use the PXP bit blitter to do this, it turned out the performance was good enough not to need it. However, it’s something we would definitely dust off and polish if we needed to squeeze a bit more performance out of the system. This idea could be taken even further when combining it with video playback, by letting the PXP also do the compositing of the video frame with the application UI.
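
The copy in question is conceptually simple, as the following C++ sketch shows. It is not the code of the Qt linuxfb plugin: the function name, the 32-bit pixel format, and the padded destination rows are assumptions made for this example. A real frame buffer would be a memory-mapped device; a plain byte buffer stands in for it here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies a tightly packed 32-bit image into a destination buffer whose rows
// may be padded, so rows are fbStrideBytes apart rather than width * 4.
void blitToFramebuffer(const std::vector<uint32_t>& image, std::size_t width,
                       std::size_t height, uint8_t* fb,
                       std::size_t fbStrideBytes) {
    const std::size_t rowBytes = width * sizeof(uint32_t);
    assert(image.size() == width * height);
    assert(fbStrideBytes >= rowBytes);
    for (std::size_t row = 0; row < height; ++row) {
        // One memcpy per row; the CPU touches every byte of every frame.
        std::memcpy(fb + row * fbStrideBytes, image.data() + row * width,
                    rowBytes);
    }
}
```

At 480×272 and an assumed 4 bytes per pixel, that is 522,240 bytes moved by the CPU for each full-screen update, which is the work a hardware blitter such as the PXP could take over.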

Conclusion

When it came time to build a demo on the Toradex Colibri iMX6ULL, we were easily able to put one together with our experience on the underlying NXP i.MX6 ULL chip. With both video and Qt support, the demo looks like it’s running on a heftier processor – although it’s on the budget-friendly and energy-efficient Toradex Colibri. It wasn’t too hard – and especially since many of the fixes are now in mainline Qt (or explained in this blog), it should be a snap for you to bring your apps over too. Reach out to KDAB if you'd like some help.

Author: Zeno Endemann, Senior Software Engineer, KDAB
Bio: Zeno is a KDAB software engineer who specializes in C++ and Qt development, embedded Linux applications, performance analysis, hardware integration, and application optimization. Zeno has a degree in Computer Science and has worked for KDAB since 2013.
