The Wayback Machine - https://web.archive.org/web/20100814153806/http://labs.trolltech.com:80/blogs/category/c
Thomas Zander
S60
Symbian
C++
Build system
Posted by Thomas Zander
 in S60, Symbian, C++, Build system
 on Wednesday, April 21, 2010 @ 13:50

Programming your application or library with Qt has always carried the promise that you can deploy your application on many different platforms. Development of those applications can, likewise, happen on many different platforms: Qt Creator runs on Windows, Mac and Linux, among others.
Since Qt 4.6, Symbian is also one of those platforms to deploy on: your Qt apps can run on the many Symbian based phones already out there.
For developers wanting to deploy to Symbian there was one problem: you had to use Windows as your development platform. Here at Qt Development Frameworks we recognize that a large amount of development is done on Linux. Open source developers especially have made the point that developing Symbian applications should work on Linux.

So, today, I'm happy to announce that developing Qt applications for the Symbian platform is possible on Linux with the upcoming Qt 4.7. For now this support is experimental. Please give feedback on how well it works for you!

What this means is that developers using a Linux system can use a freely available cross-compiler and the equally free Symbian tools to create applications for a Symbian based phone.
Developers that are working on Qt itself will now be able to do so on Linux too.

Preparing with a Qt compile
Symbian has a bad reputation when it comes to ease of development; you would be excused for thinking that 'preparing' involves some soul-searching and prayer. That is all about to change, I'm convinced, with Qt entering this arena.
The preparations here are essentially the download of the required tools.

First you need to compile Qt for Symbian. This is a step you will be able to skip once the final 4.7 is out and you can simply download a binary. You can either use the upcoming Qt 4.7 beta or clone the git repository: http://qt.gitorious.org/qt/qt

Second, we need a compiler that can cross-compile to the ARM instruction set, which is what phones use. Two compilers are known to work: the RVCT 2.2 compiler and GCCE. As RVCT is not freely available, I'll focus purely on GCCE here. GCCE can be found at http://www.codesourcery.com; the "Sourcery G++ Lite Edition" is what you are looking for.

Since we are going to compile for a Symbian platform, we also need to download the headers and libraries to link against. What we need is included in the "S60 5th Edition SDK for Symbian OS".
Apologies for the requirement to register there.

From the same site we need s60_open_c_cpp_plug_in_v1_6_en.zip

These downloads were created for Windows users, so we need to massage them a bit to be useful on Linux. We do NOT use the setup.exe included in there; instead, use the gnupoc package from http://www.martin.st/symbian/ (version 1.15 worked for me).
After making sure the Linux command 'patch' is available on your system and unpacking the gnupoc archive, you can run
cd gnupoc-package-1.15/sdk
./install_gnupoc_s60_50 ~/S60_5th_Edition_SDK_v1_0_en.zip ~/symbian-sdk

and
./install_openc_16_s60 ~/work/s60_open_c_cpp_plug_in_v1_6_en.zip ~/symbian-sdk

The last step is to download symbiansdk-gcce.diff.gz and, after unzipping it, apply it to your new ~/symbian-sdk directory using:
cd ~/symbian-sdk
patch -p0 < symbiansdk-gcce.diff

The next step is to install 'wine' to run some of the tools. Your Linux distro likely packages it, so use your distro package manager to install it.
Check with 'wine ~/symbian-sdk/epoc32/tools/unzip.exe' that it works. You should get the help message from 'unzip' after calling the above command.

To make the build system work smoothly we copy three utilities into the 'windows' directory, to make sure they are always in the path: the make.exe, mifconv.exe and uidcrc.exe files stored in ~/symbian-sdk/epoc32/tools/ should be copied to the ~/.wine/drive_c/windows/ directory.
Running 'unzip.exe' above should have taken care of creating the 'windows' directory.

Setup environment
We need to export some variables; you likely want to add these lines to your $HOME/.bashrc or similar.
export EPOCROOT=$HOME/symbian-sdk/
QTDIR=$HOME/build/qt
gcceDir=/full/path/to/arm-2009q3/bin
export PATH=$QTDIR/bin:$EPOCROOT/epoc32/tools:$gcceDir:$PATH

Compile Qt: call configure and 'make'
It is strongly suggested to do an out-of-source build, also called a 'shadow build'. This keeps the source tree free from any auto-generated files, so that is what we do here.
In the first step you downloaded the Qt sources, probably into a directory like $HOME/qt. Go ahead and create a new directory called $HOME/build/qt, then cd into that newly created directory and run:
$HOME/qt/configure -platform linux-g++ -xplatform symbian/linux-gcce -arch symbian -no-webkit
make
cd src/s60installs
make sis

If all of this went well you should have a Qt.sis in your build/qt/lib/ directory. Read on for what that gives you.

What next?
Files with the extension .sis or .sisx are essentially installer packages that Symbian devices accept. You can use a USB cable or other means to get a sis file onto your device and then click on the file in a file browser to install it.
Qt has some dependencies that you may want to install first; if you skip this part your apps will likely just crash at startup. Be warned :)
The following sis files are present in the symbian-sdk you downloaded before, so you can copy them to the device; you should install them all.
nokia_plugin/openc/s60opencsis/openc_ssl_s60_1_6_ss.sis
nokia_plugin/openc/s60opencsis/pips_s60_1_6_ss.sis
nokia_plugin/opencpp/s60opencppsis/stdcpp_s60_1_6_ss.sis

On top of that you might want to install these for debugging purposes;
nokia_plugin/openc/s60opencsis/stdioserver_s60_1_6_ss.sis

If you want to actually see some results, go to an example and install that one too:
cd ~/build/qt/examples/widgets/wiggly
make sis

and then install the wiggly.sis on your device as described above. You should find a new "QtExamples" folder from which it can be started.

Extra targets and Qt docs
As noted above, the Symbian-targeting qmake generates a Makefile that has an extra target called 'sis'. More commands will likely be added later, but this is the essential one for now.
More info about Qt at: http://doc.trolltech.com/4.7-snapshot/how-to-learn-qt.html

What’s still missing
This is not finished in the sense that we will stop working on it: there are still too many steps in the instructions above. This has to get easier, and we will keep working on that. If you hit any issues that took longer than needed to figure out, please tell us in a comment :)

One known issue is that Symbian has a different way of handling binary compatibility, one that is not implemented in this build system yet. In other words, you need to build any application against the exact same build of Qt as the one on the device.

Eskil Abrahamsen Blomfeldt
Qt
Painting
OpenGL
Performance
C++
 in Qt, Painting, OpenGL, Performance, C++
 on Monday, March 01, 2010 @ 18:36

Albert Einstein has been quoted as saying that “insanity is doing the same thing over and over again and expecting a different result.” Apparently this is a misquote, and the original quote actually belongs to Rita Mae Brown, but that’s not important right now. What’s important is that most Qt applications are crazy.

Background
I'll explain. Some readers may remember Gunnar's excellent blog series about graphics performance and how to get the most of it in Qt. He mentioned a few times that text rendering in Qt is slower than we'd like.

To see why text rendering is so slow, we need to look at what happens when you pass a QString into QPainter::drawText() and ask it to display it on screen. A QString is just an array of integer values which are defined to signify specific symbols in specific writing systems. How these symbols should actually look on the screen is defined by the font you have selected on your painter.

So the first step of drawText() is to take the code points and turn them into index values which reference an internal table in the font. The indices are specific to each font, and have no meaning outside the context of the current font.

The second step of drawText() is to collect data from the font which describes how the glyph should be positioned in relation to the surrounding glyphs. This step, the positioning of each glyph, is potentially very complex. Several different tables in the font file need to be consulted, with programs and instructions that do things like kerning (allowing parts of certain glyphs to "hang over" or "stretch underneath" other glyphs) and placing one or more diacritical marks on the same character. Some writing systems also allow complex reordering of glyphs based on the context of the surrounding characters, as explained by Simon in his blog from 2007. This complex shaping of the text is currently handled by the Harfbuzz library in Qt.

The third step applies only if the text has a layout applied to it. The layout would be the part which breaks text into nicely formatted lines. In Qt, this could be based on HTML code, using QTextDocument or WebKit, or it could be a simpler layout, just making the text wrap and align within a bounding rectangle. The former isn’t supported by QPainter::drawText(), so I’ll focus on the latter. Using information from the shaping step, the text layout calculates the width of unbreakable portions of the text and tries to format the text in a way which looks nice on screen but which does not expand beyond the bounds set by the user.

In the fourth and final step, the paint engine takes over. Its job is to draw the symbols retrieved in the first step at the positions calculated in the second and third step. In most of Qt’s performance-sensitive paint engines, this is done by caching a pixmap representation of the glyph the first time it is drawn, and then just redrawing this pixmap for every call. This is potentially very quick.

While these four steps may be slightly intertwined in Qt today, this is in principle what happens every single time you call drawText() and pass in a QString and a bounding QRect. Yet, in very many cases, the text, the font and the rectangle all remain completely static for the duration of your application, or at least for the main bulk of it. And this is the insane part: a lot of time is wasted here. Qt already provides QTextLayout as a way to cache the results of the first three steps and push them directly into the paint engine. However, QTextLayout is somewhat complicated to use, it has overhead related to its other use cases, and it stores a lot more information than what is needed specifically for putting the symbols on the screen, making it unsatisfactory in very memory sensitive settings.

QStaticText!
We decided there was a need for a specialized class to solve this problem. We named it QStaticText, and it will be available in Qt 4.7. QStaticText has been optimized specifically for the use case of redrawing text which does not change from one paint event to another. We've tried to keep the memory footprint to a minimum, and currently it has an overhead of approximately 14 bytes per glyph (including the 2 bytes per Unicode character in the string, which would presumably already be part of the application), as well as about 200 bytes of constant overhead.

In the rest of this blog, I'll show some graphs to illustrate the benefits of using QStaticText for drawing text. QStaticText is supported by the raster engine (the software renderer used by default on Windows), the OpenGL engine and the OpenVG engine. For now, I'll focus the attention of this blog on the raster engine and the OpenGL engine. I'll also focus on the following platforms: Windows/desktop, Linux/desktop and the N900 (also running Linux, of course). Note that the hardware on the Windows and Linux machines is different, so the results will not be comparable from platform to platform.

Benchmarks for fifty character, single-line text
The benchmark I’m running is this: drawing the same 50 character string over and over again in each paint event and measuring how many “glyphs per second” we can achieve using different techniques to draw the text. I am testing the following text drawing mechanisms:

  • A call to QPainter::drawText() with no bounding rectangle.
  • A call to QPainter::drawStaticText() with no bounding rectangle.
  • Caching the entire string in a pixmap beforehand and drawing this in each paint event using QPainter::drawPixmap().
  • When testing on the OpenGL paint engine, the graph will also contain results for QStaticText with the performance hint QStaticText::AggressiveCaching. This is a hint to the paint engine that it is allowed to cache its own data, trading some memory for speed. It is currently used by the OpenGL engine to cache the vertex and texture coordinate arrays that are passed to the GPU when drawing the glyphs.

    On Windows
    Let's start off with the results for the raster engine on Windows. As I said, the measurement is in "glyphs per second", i.e. the number of symbols we can put on the screen during a second of running the test. The measurement is based on the frame rate of the test, which is taken as the average of nine seconds of execution per test case. Note that ClearType rendering was turned off in the OS during the test. The difference between a drawPixmap() result and a drawStaticText() result would be larger with ClearType turned on, but ClearType is not generally supported when caching the text in a pixmap, since the pixmap will inevitably need to have a transparent background, and you can't do subpixel antialiasing on top of a transparent background. Therefore all the benchmarks are run without subpixel antialiasing to get a better comparison.

    windows_raster1.png

    As you can see, the fastest way to draw text is to cache it in a pixmap and draw this, as pixmap drawing is extremely fast on modern hardware. However, in many circumstances you don’t have the memory to spare for this kind of extravagance, and drawStaticText() pushes over half as many glyphs per second as the equivalent drawPixmap() call. It is also three times faster than a regular drawText() call.

    Using the OpenGL paint engine instead, performance of drawPixmap() shoots through the roof:

    windows_opengl1.png

    The other bars look small in comparison, but drawStaticText() using the aggressive caching performance hint in fact pushes out 5.6 million glyphs per second in this benchmark, while a regular drawText() call manages a measly fifth of that.

    On Linux
    Similar numbers occur on Linux:

    linux_raster.png

    Using drawStaticText() gives you more than a 2x performance boost over using drawText(), and drawPixmap() is a little less than 1.5 times the speed of drawStaticText(). When using the OpenGL engine, the difference is smaller:

    linux_opengl.png

    As you can see, drawing a cached pixmap on Linux desktop is only slightly faster than drawing the static text item when aggressive caching is used. The hardware and the driver both play a part here, but at the very least we can see that both outperform drawText() by seven or eight times.

    On N900
    All the benchmarks so far have been on the desktop, where memory is cheap. Caching a few text items as pixmaps is barely noticeable on those platforms, and as we have seen, pixmap caching has the potential of being really fast. On an embedded device, however, we need to be a little more careful when we allocate big chunks of memory, so something like QStaticText, which is both lean and fast, can be a great tool on these platforms. So let's look at a few benchmarks for the N900 as well.

    For the raster engine, the drawText() baseline performance on the N900 is currently nothing short of horrible, as you can see from the following chart:

    n900_raster.png

    This is of course a puzzle which will be investigated more closely, as there's no reason why calling drawText() should be this much slower, but for now we recommend using the native engine or a QGLWidget viewport on this device. At least it makes the other bars look really large in comparison. A more interesting result is that drawStaticText() can push as much as two thirds the number of glyphs per second as just drawing a single pixmap covering the same area, so we have a pretty good performance ratio on this device.

    As we see from the following chart, similar numbers can be achieved when using the OpenGL engine:

    n900_opengl.png

    Conclusion
    The benchmark results displayed so far are for a single-line piece of text, so there is no need for the third step in the overview from earlier, where the text is formatted based on a layout; the drawText() call gets to skip that step too. For text which does require layout, performance will be even worse with drawText(), but approximately the same with drawStaticText() and drawPixmap(), since for those the layout step has already been done in advance. Another thing to note is that the text is fairly long and fairly dense. For shorter texts, and/or text which has more space (such as a multi-line string might have), the performance of drawStaticText() may very well be greater than that of drawing a pixmap, since the number of pixels touched becomes a greater factor in the equation.

    An interesting measurement which is not included here is the CPU load of the different functions. We don't have any formal benchmarks for that at the moment, but since less time is spent on CPU-intensive work when using drawStaticText() over drawText(), the CPU will have more free time to do other stuff, which is a good thing. Another pleasant discovery we made while benchmarking QStaticText on the N900 is that you have to increase the number of draw calls made per frame to a pretty high number for them to visibly factor into the time spent in the paint event. This means that even with, say, fifty strings, the drawStaticText() calls should not have any considerable impact on the performance of the application. Swapping the front and back buffers will still be the main bottleneck, which is exactly how it should be.

    So the bottom line is: If you are using drawText() in your application to draw text that is never or very rarely updated, then you might consider using QStaticText instead when you start building against Qt 4.7, and we’d love to hear what you think about the API and the performance once you get a chance to try it out.

    Andreas
    Qt
    KDE
    Graphics View
    C++
    Posted by Andreas
     in Qt, KDE, Graphics View, C++
     on Friday, February 26, 2010 @ 19:50

    Box2D is an Open Source rigid body 2D physics engine for C++. It’s currently (2.0.1) released under the MIT license, which is quite permissive. Box2D is used by, among other things, Gluon (http://gluon.tuxfamily.org/), which is a game library from KDE in-the-making.

    Integrating Box2D into your Qt application is quite easy, and this blog shows you how to get started. First of all:

    * Step 1: Download Box2D from Google Code: http://code.google.com/p/box2d/
    * Step 2: Build it (I had to insert a few #include <cstring> to get it to build)
    * Step 3: Build and try the test bed application: Box2D/Examples/TestBed/
    * Step 4: Read the manual: http://www.box2d.org/manual.html
    * Step 5: Continue reading this blog to hook up the two frameworks…

    The library doesn’t seem to install, so I just compiled it in-source and used it directly.

    What I found during my approximately two-hour study today was that Box2D manages a world with bodies, similar to how QGraphicsScene manages items. In short, you create a world object and populate it with elements. Some bodies are static, like the ground, and some dynamic, like a bouncing ball. You can define joints, masses, friction, and other parameters, define a gravity vector, and then start simulating. Box2D doesn't require a graphics system - any scene graph with elements that you can move and rotate should do fine. Graphics View works quite well. :-) I've based this code on the "Hello World" example that comes with Box2D.

    The world object defines the bounds of the coordinate system and the gravity vector. It feels very similar to QGraphicsScene. The bounds are, according to the docs, not enforced, but I got many run-time aborts when items moved outside these bounds, so you had better make the world large enough to cover all your items.


      // Define world, gravity
      b2AABB worldAABB;
      worldAABB.lowerBound.Set(-200, -100);
      worldAABB.upperBound.Set(200, 500);
      b2World *world = new b2World(worldAABB,
          /* gravity = */ b2Vec2(0.0f, -10.0f),
          /* doSleep = */ true);

    Bodies in Box2D have a position and an angle, and you can assign a shape to them (a convex polygon or a circle). This feels similar to how QGraphicsItem has a position and a transform. In fact, with 4.6 the rotation property fits perfectly with the angle in Box2D (except Box2D uses radians and rotates in the opposite direction from QGraphicsItem). This example shows how to create a body, and then assign a rectangular shape:


      b2BodyDef bodyDef;
      bodyDef.position.Set(0.0f, -10);
      b2Body *body = world->CreateBody(&bodyDef);

      b2PolygonDef shapeDef;
      shapeDef.SetAsBox(100.0f, 10.0f);
      body->CreateShape(&shapeDef);

    Bodies can either be static or dynamic. Static bodies simply don’t move. By default, bodies are static. To make a body dynamic, you assign a positive mass. The easiest way to do that is to ask Box2D to calculate mass and rotational inertia by looking at the shape of the body. So modifying the above slightly:


      shapeDef.density = 1.0f;
      shapeDef.friction = 0.5f;
      body->CreateShape(&shapeDef);
      body->SetMassFromShapes();

    That’s really all there is to it. When you’re ready, advance the simulation step by step by calling b2World::Step like this:


      world->Step(B2_TIMESTEP, B2_ITERATIONS); // e.g. 1.0f / 60.0f and 10

    After calling this function, Box2D will have adjusted positions and angles of all bodies. So if you have corresponding items in Graphics View, you can just update their positions and rotations like this:


      void adjust()
      {
        // Update the QGraphicsItem's position and rotation from the body.
        b2Vec2 position = _body->GetPosition();
        float32 angle = _body->GetAngle();
        setPos(position.x, -position.y);
        setRotation(-(angle * 360.0) / (2 * M_PI)); // M_PI from <cmath>
      }

    Notice the negative Y (as Graphics View, like the rest of Qt, has a Y component that points downwards), and the negative rotation which is converted to degrees.

    That's really all there is. Create the world, add body elements, assign mass, and start the simulation. Use the angle and position to adjust your QGraphicsItems, and enjoy :-).

    The above video shows my first Box2D + Graphics View application in action. You can find the full sources here: qgv-box2d.tar.gz. I've tried to experiment a bit with how Box2D bindings for Qt could be done. For now I'll leave it as an experiment.

    Thiago Macieira
    Qt
    KDE
    Git
    C++
    Posted by Thiago Macieira
     in Qt, KDE, Git, C++
     on Thursday, November 12, 2009 @ 13:34

    To everyone using Qt 4.6 from the Git repository: be aware that I introduced a binary-incompatible change. This change is there to stay.

    No, we’re not breaking binary compatibility with Qt 4.5. This only affects previous Qt 4.6 versions.

    Actually, this kind of change happens all the time. So why am I blogging about this specific change?

    Well, the problem is that this change affects QHash, QMap and QVector. And those classes are inlined everywhere in Qt-using code. This means that, if you update Qt across that version, you must recompile all of the Qt-using code, from scratch (i.e., make clean). For KDE developers using trunk, that means recompiling all of KDE.

    This change will be included in the upcoming Qt 4.6.0 Release Candidate.

    Note: the change is in the 4.6 branch but hasn’t reached the 4.6-stable branch yet. That also means it’s not in kde-qt’s 4.6-stable-patched branch yet. When you next update those stable branches, please remember to recompile everything.

    PS: the stable branches aren't updating, but not because Qt isn't building. It is building. The real reason is that our Continuous Integration system is experiencing some technical difficulties, like Windows running out of memory, the Symbian build system failing for no apparent reason, the powerful 8-core Mac machines being able to run only one testcase at a time, etc.

    Eskil Abrahamsen Blomfeldt
    Uncategorized
    Qt
    Labs
    Multimedia
    C++
     in Uncategorized, Qt, Labs, Multimedia, C++
     on Friday, October 23, 2009 @ 20:27

    Since there is currently no official Spotify client that can run on Embedded Linux (Wine doesn't run on ARM architectures), and since I really wanted to have access to my Spotify account from my N900, I decided to give the open source Spotify client library called "despotify" a run. This is a library of C functions used to access different parts of the Spotify API for use with premium Spotify accounts.

    By playing around with the console clients that are included with despotify, I was able to access my playlists and play songs perfectly. However, I was unable to use any of the GUI front ends to despotify that I could find. My guess is that they do not play well with Maemo 5, as they were originally written for the N800 series.

    Inspired by the fact that all this stuff actually existed, I decided to write my own front end to Despotify, using Qt 4.6. The results can be found in Gitorious, at:

    http://qt.gitorious.org/qt-labs/qtspotify

    To build the application, first compile and install despotify as explained here.

    Make sure you enable “pulseaudio” as the back end for despotify by editing the Makefile, as the default gstreamer back end has some threading issues and will cause crashes if you access the GUI while it’s playing. When you are done, do a “make install”.

    To build the front end, you also need to have Qt 4.6 available. For best results on the N900, use the Maemo branch of Qt.

    When you are done, copy the executable to your phone and start it up. Use the menu to log in, and the search field to search for music. If you want to access your play lists, select the “Retrieve play lists” menu option and they will pop up in the search field menu.

    This is what it will look like:
    Qt (de) Spotify on a Nokia N900

    Note that this is an early version, so there are still bugs and some missing features (e.g. you don’t get more than fifty results for a search) that I intend to implement when I get the time, but the client should be usable as it is.

    Thiago Macieira
    Qt
    Performance
    C++
    Posted by Thiago Macieira
     in Qt, Performance, C++
     on Monday, October 05, 2009 @ 22:39

    This weekend, a user posted to the qt4-preview-feedback mailing list saying that the QAtomicInt documentation needed some work. I replied saying that memory ordering semantics would be better explained by a book than by Qt documentation.

    But thinking about it later, I started wondering whether we could add some documentation about it. So I decided to test it on you, dear lazyweb.

    Last year, during the Qt Developer Days 2008 (btw, registration is still open for 2009 in San Francisco, so register now!) I gave a talk on threading. At the time, Qt 4.5 wasn't released, so Qt 4.4 was all there was. And one of the features of Qt 4.4 was QtConcurrent and the new atomic classes. I mentioned them, but I refrained from going too deep; doing that would probably have interested only half a dozen people in the audience.

    But maybe you’ll be interested. Before I go, however, let’s have a history lesson.

    History

    I thought of just giving you facts, but if you want that, you can research Wikipedia. It should be a lot more interesting to have the important information recounted in prose, in lore.

    So, place yourself in the mood: you’re telling this story to your grandchildren or your great-grandchildren. Any historical inaccuracies present are result of the oral tradition and are now stuff of legends.

    Hƿæt! In times immemorial, before time_t began to be counted, the Wise told this lore. In the darkness, there was the Engineer, and no one knew whence he came. The Engineer was fabled for many wondrous creations and he inspired awe in his peers.

    And the Engineer created the Processor, for then the Engineer could rest while the Processor would do the work. And for a while, all were glad, for there was work for all and rest for all.

    But then it came to pass that the Processor turned to the Engineer and spake unto him: “Hello, World! Master, thou gave me work and thou gave me purpose. And I am glad to do thy Work, for thus can I learn. But Master, heed my plea: thy work ever increases and thy humble servant can not cope. Couldst thou not create a mate for me? Listest thou not that I do thy Work swiftly?”

    And the Engineer felt pity for the Processor. So came into existence another Processor. The Engineer said unto the world, “Let thee be called Processor2!”

    Then came forth he who had the keen foresight and could tell what would still come to pass. He was called the Analyst and thus foretold he: “These twain shall do the Work in tandem and they shall do it swifter than either could do alone. This joining shall be known as dual-processor. In the years to come, this joining will breed dual-cores and quad-cores and many other beasts of names unknown. Yet strife shall come from it!”

    After many eons had passed (in computer time), Processor and Processor2 were working and much they learnt. So quoth the Processor: “Hello, World! Master, thou art great and for eons have we done thy Work, following thine Assembly opcode-by-opcode. We have not strayed a single cycle from thine Assembly Charts. But Processor2 hath taught me much and sweareth that we could do thy Work swifter than thou hadst foreseen, an only thou empowerest us to do so.”

    So the Engineer let the processors execute his instructions swifter than his cycles, and gave he the gift of Cache. Then strife came to the World and the processors warred over memory: what one wrote, the other read not. The Engineer entered the world for a second time and he toiled against the warring processors. Much was broken in this toiling ere it was rebuilt. He shifted the rules and the processors executed his Assembly Out-of-Order. Then he imposed Memory Ordering, so processors stalled their strife and ended their war. The world would forever be changed.

    Memory ordering

    Now that we've seen how memory ordering was introduced, let's see why it's needed. Old computer systems, as recent as the 386 and 486 actually, still executed everything in-order and had well-defined timing semantics. An instruction that loaded data from memory into a register would require 3 cycles: Read (the instruction), Fetch (the data), Execute (the instruction). But as processors improved, their clocks became much faster than memory could cope with, so caching was introduced.

    That means a processor would be allowed to serve a memory read from the cache, instead of from memory. And it could content itself with writing to the cache only on a memory write. As our tale above tells, this works fine for one processor. When there is more than one, they need to organise themselves, because in general the flushing of the cache back to main memory is delayed. To ensure that one processor can read what another writes, the writer needs to flush its cache to memory sooner.

    Anyway, there can be four different types of memory ordering:

    • no memory ordering
    • flushing all cache reads, ensuring all new reads are served from main memory
    • flushing all cache writes, ensuring that all writes have been written to main memory
    • full ordering, combining the two types above

    I’ll get back to them in a second.

    To compound the problem, modern processors also execute operations out-of-order. They’re allowed to reorder the instructions provided that, at some level, it looks like everything got executed in-order. The x86 architecture originally concluded the instruction entirely before moving on to the next, so that’s the behaviour that is required today: all operations must behave as if they had finished before the next instruction starts. And all memory accesses must look like they happened in the order that they were assembled.

    The Itanium (IA-64) removes some of those restrictions. First of all, all instructions are allowed to execute in parallel, or finish or start in any order that the processor may find suitable. To re-synchronise, the assembly language introduces a “stop bit”, indicating that the instructions prior to the stop are finished before any instructions after the stop are started. And this is inside one thread only. Outside of it (i.e., as seen by another processor), the architecture imposes no guarantees: the memory accesses can happen in any order.

    The atomic and the other data

    It’s important to note that the memory ordering semantic is not just about the atomic data itself. QAtomicInt and QAtomicPointer execute their loads, stores, fetch-and-store (a.k.a. exchange), fetch-and-add, and test-and-set (a.k.a. compare-and-swap) operations atomically, always. For one atom of memory (i.e., the int that the QAtomicInt holds or the pointer that the QAtomicPointer holds), the operation is either executed completely or not executed at all. In other words, no one ever sees the data in an intermediary state. That’s the definition of atomic.

    Now, the memory semantic is about how the other data is affected by the atomic operation. Imagine the following code running in one thread:

        extern int x, y, z;
        x = 1;
        y = 2;
        z = 3;
    

    And the following running in another thread:

        extern int x, y, z;
        if (z == 3)
            printf("%d %d", x, y);
    

    We declared x, y, and z to be normal variables, so no atomic operation is executed here. The x86 and x86-64 would behave as your intuition dictates: the only possible output is “1 2”. If z is 3, then x is 1 and y is 2; whereas if z isn’t 3, nothing is printed.

    But the IA-64 makes no such guarantee. Like I exposed in the previous section, the processor is allowed to execute the stores in any order it sees fit. For example, x and z could be in the same cacheline, whereas y could be in another, thus causing x and z to be written at the same time, but no ordering guarantee being made on y. Worse yet, the other processor is allowed to execute the loads in any order as well. It could load x, y, and z in that order, meaning that it could catch x and y before their values are changed, but catch a completed z. In conclusion, the code above could print anything! (If x and y are initialised to 0 beforehand, the possible outputs are “0 0”, “0 2”, “1 0” in addition to the expected “1 2” and no output.)

    Weird? Definitely.

    So here’s where memory ordering enters:

    • in a release semantic, the processor guarantees that all past writes (store operations) have completed and become visible by the time that the release happens;
    • in an acquire semantic, the processor guarantees that no future reads (load operations) have started yet, so that it will see any writes released by other processors.

    So if thread 1 in the example above wanted to ensure that its writes to x and y became visible, it would require a store-release on z. And if thread 2 wanted to ensure that the values of x and y were updated, it would require a load-acquire on z.

    The names “acquire” and “release” come from the operation of mutexes. When a mutex is acquired, it needs to ensure that the processor will see the memory written to by other processors, so it executes an “acquire” operation. When the mutex is released, it needs to ensure that the data changed by this thread becomes visible to other threads, so it executes a “release” operation.

    The other two operations that QAtomicInt supports are just the combination of acquire and release, or of neither. The “relaxed” operation means no acquire or release is executed, only the atomic operation, whereas the “ordered” operation means it’s fully ordered: both acquire and release semantics are applied.

    Practical uses

    Relaxed

    Like I said before, the relaxed semantic means that no memory ordering is applied: only the atomic operation itself is executed. The most common case of relaxed memory operations is the mundane load and store. Most modern processor architectures execute loads and stores atomically for the powers of 2 smaller than or equal to the register size. Whether bigger reads and writes are atomic depends on the platform (for example, a double-precision floating point on a 32-bit architecture).

    But we can come up with cases for the other atomic operations. For example, QAtomicInt offers ref() and unref(), which are just wrappers around fetchAndAddRelaxed(1) and fetchAndAddRelaxed(-1). This means the reference count itself is atomic, but nothing else is ordered.

    Acquire and Release

    To see where acquire and release semantics are required, I gave mutexes as examples. However, mutexes are quite complex beasts. Let’s examine a simpler case: a spin-lock:

    class SpinLock
    {
        QAtomicInt atomic;
    public:
        void lock()
        {
            while (!atomic.testAndSetAcquire(0, 1))
                ;
        }
        void unlock()
        {
            atomic.testAndSetRelease(1, 0);
        }
    };

    The class above has two methods, like QMutex: lock and unlock. The interesting one is lock: it has a loop that tries forever to change the value of atomic from 0 to 1. If it succeeds, it’s an “acquire” operation, meaning that the current thread shall now see any stores released prior to this acquire.

    The unlock function does the inverse: it changes the atomic from 1 to 0 in a release operation. But the test-and-set is actually not required there: the compiler usually generates a “store-release” for volatile variables (which QAtomicInt’s internal value is). That means we could have just written: atomic = 0;

    Ordered

    The use-case for ordered is, interestingly, quite rare. Usually, it’s more like “I can’t figure out if acquire or release is enough, so I’ll go for full ordering”.

    But there’s one case of fully-ordered memory semantic in Qt source code: it’s in the (undocumented) Q_GLOBAL_STATIC macro. It results from the behaviour of said macro: one or more threads may be competing to execute an operation. The first one that completes it, wins. It will publish its conclusions to all other threads (i.e., release), whereas the loser threads need to acquire the conclusions. The code, simplified from the macro, is:

    Type *gs()
    {
        static QBasicAtomicPointer<Type> pointer = Q_BASIC_ATOMIC_INITIALIZER(0);
        if (!pointer) {
            Type *x = new Type;
            if (!pointer.testAndSetOrdered(0, x))
                delete x;
        }
        return pointer;
    }

    What this code does is to check if pointer is still null. If it is, it creates a new object of type Type and tries to set it on the atomic pointer. If the setting succeeds, we need a “release” to publish the contents of the new object to other threads. If it fails, we need an “acquire” to obtain the contents of the new object from the winner thread.

    But wait, is this correct? Well, not entirely. What we need is actually a “testAndSetReleaseAcquire”, which we don’t have in Qt’s API. So we could split it into a testAndSetRelease plus an acquire in the failing case. That’s exactly what I did in QWeakPointer:

        ExternalRefCountData *x = new ExternalRefCountData(Qt::Uninitialized);
        x->strongref = -1;
        x->weakref = 2;  // the QWeakPointer that called us plus the QObject itself
        if (!d->sharedRefcount.testAndSetRelease(0, x)) {
            delete x;
            d->sharedRefcount->weakref.ref();
        }
        return d->sharedRefcount;
    

    As you can see here, if the test-and-set succeeds, it executes a “release” so that the other threads can see the result. What if it fails? It needs to execute an acquire, right? So where is it?

    Well, it’s there, but very well hidden: it’s in operator->. Remember what I said above: compilers generate load-acquires for volatile variables. So, in order to call QAtomicInt::ref() with this = &d->sharedRefcount->weakref, the compiler needs to load the value of d->sharedRefcount, and that’s an acquire operation.

    Conclusion

    So, did you get this? If you didn’t, don’t blame yourself, it’s not an easy subject. Reread what I wrote and search for more resources on memory ordering. If you did or you almost did, let me know. My purpose here was to try and figure out if it makes sense to explain this in Qt documentation at all.

    However, unless you’re writing something like a lock-free stack, chances are that you don’t care about memory ordering semantics. You just rely on the processor doing the right thing, as well as on the Qt synchronisation classes (QMutex, QWaitCondition, QSemaphore, QReadWriteLock) and the Qt cross-thread signal-slot mechanism. And that’s if you do any threading at all.

    If you are writing a lock-free stack, you’ll probably be familiar with the ABA problem, and that one can’t be solved by QAtomicInt or QAtomicPointer. It requires an operation known as “double compare-and-swap”, and to explain why, I’d need a whole other blog. And to explain why the original AMD64 didn’t have such an instruction, nor did the original IA-64. (The 386 didn’t have it either, but that’s not a problem for us because Qt doesn’t run on 386 anyway.)



    © 2008 Nokia Corporation and/or its subsidiaries. Nokia, Qt and their respective logos are trademarks of Nokia Corporation in Finland and/or other countries worldwide.
    All other trademarks are property of their respective owners.