
Entries by tag: kernel

It was almost 10 years ago that I organized a kerneltrap.org interview with our then kernel team leader Andrey Savochkin, which was published on April 18, 2006. Years went by: kerneltrap.org is no more, Andrey moved on, got a PhD in Economics, and is now an Assistant Professor, while OpenVZ is still here. Read on for this great piece of memorabilia.


Andrey Savochkin leads the development of the kernel portion of OpenVZ, an operating system-level server virtualization solution. In this interview, Andrey offers a thorough explanation of what virtualization is and how it works. He also discusses the differences between hardware-level and operating system-level virtualization, going on to compare OpenVZ to VServer, Xen and User Mode Linux.

Andrey is now working to get OpenVZ merged into the mainline Linux kernel, explaining: "virtualization makes the next step in the direction of better utilization of hardware and better management, the step that is comparable with the step between single-user and multi-user systems." The complete OpenVZ patchset weighs in at around 70,000 lines, approximately 2MB, but has been broken into smaller logical pieces to aid in discussion and to help with merging.

Jeremy Andrews: Please share a little about yourself and your background...

Andrey Savochkin: I live in Moscow, Russia, and work for SWsoft. My two major interests in life are mathematics and computers, and I was unable to decide for a long time which one I preferred.

I studied at Moscow State University, which has quite a strong mathematical school, and got my M.Sc. degree in 1995 and my Ph.D. degree in 1999. The final decision between mathematics and computers came at the time of my postgraduate study, and my Ph.D. thesis was completely in the computer science area, exploring some security aspects of operating systems and software intended to be used on computers with Internet access.

Jeremy Andrews: What is your involvement with the OpenVZ project?

Recently we released a few kernel updates with security fixes:



  • A critical security issue was fixed in OpenVZ kernel 2.6.32-042stab108.7.
    The OpenVZ kernel team discovered an issue that allows a privileged user inside
    a container to get access to files on the host. All kinds of containers are
    affected: simfs, ploop, and vzfs. All kernels since 2.6.32-042stab105.x are affected.

    Note: RHEL5-based 2.6.18 kernels, as well as Red Hat and mainline kernels, are not affected.

  • 8 security issues were fixed in OpenVZ kernel 2.6.32-042stab108.8:



    • CVE-2014-3184 HID: off by one error in various _report_fixup routines

    • CVE-2014-3940 missing check during hugepage migration

    • CVE-2014-4652 ALSA: control: protect user controls against races & memory disclosure

    • CVE-2014-8133 x86: espfix(64) bypass via set_thread_area and CLONE_SETTLS

    • CVE-2014-8709 net: mac80211: plain text information leak

    • CVE-2014-9683 buffer overflow in eCryptfs

    • CVE-2015-0239 kvm: insufficient sysenter emulation when invoked from 16-bit code

    • CVE-2015-3339 kernel: race condition between chown() and execve()



    Note: RHEL5-based 2.6.18 kernels are not affected.

    It is critical to install the latest OpenVZ kernel to protect your systems.
    Please reboot your nodes into the fixed kernels, or install live patches from KernelCare.

OpenVZ past and future

Looking forward to 2015, we have very exciting news to share on the future of OpenVZ. But first, let's take a quick look into OpenVZ history.

Linux Containers is an ancient technology, going back to last century. Indeed it was 1999 when our engineers started adding bits and pieces of containers technology to Linux kernel 2.2. Well, not exactly "containers", but rather "virtual environments" at that time -- as it often happens with new technologies, the terminology was different (the term "container" was coined by Sun only five years later, in 2004).

Anyway, in 2000 we ported our experimental code to kernel 2.4.0test1, and in January 2002 we released Virtuozzo 2.0. From there it went on and on, with more releases, newer kernels, an improved feature set (like the addition of live migration) and so on.

It was 2005 when we finally realized we had made a mistake by not employing the open source development model for the whole project from the very beginning. This is when OpenVZ was born as a separate entity, to complement commercial Virtuozzo (which was later renamed to Parallels Cloud Server, or PCS for short).

Now it's time to admit -- over the course of the years OpenVZ became just a little bit too separate, essentially becoming a fork (perhaps even a stepchild) of Parallels Cloud Server. While the kernel is the same between the two, the userspace tools (notably vzctl) differ. This results in slight incompatibilities between the configuration files, command line options, etc. On top of that, userspace development efforts have to be duplicated.

Better late than never; we are going to fix it now! We are going to merge OpenVZ and Parallels Cloud Server into a single common open source code base. The obvious benefit for OpenVZ users is, of course, more features and better tested code. There will be other much anticipated changes, rolled out in a few stages.

As a first step, we will open the git repository of the RHEL7-based Virtuozzo kernel early next year (2015, that is). This has become possible as we changed the internal development process to be more git-friendly (before that we relied on lists of patches a la quilt, managed by a home-grown set of scripts). We have worked on this kernel for quite some time already, initially porting our patchset to kernel 3.6, then rebasing it to the RHEL7 beta, then to the final RHEL7. While it is still in development, we will publish it so anyone can follow the development process.

Our kernel development mailing list will also be made public. The big advantage of this change for those who want to participate in the development process is that you'll see our proposed changes discussed on this mailing list before the maintainer adds them to the repository, not months later when the code is published. We will also consider any patch sent to the mailing list. This should allow the community to become full participants in development rather than mere bystanders, as they were previously.

Bug tracking systems have also diverged over time. Internally, we use JIRA (this is where all those PCLIN-xxxx and PSBM-xxxx codes come from), while OpenVZ relies on Bugzilla. For the new unified product, we are going to open up JIRA, which we find to be more usable than Bugzilla. Similar to what Red Hat and other major Linux vendors do, we will limit access to security-sensitive issues in order to not compromise our user base.

Last but not least, the name. We had a lot of discussions about naming, had a few good candidates, and finally unanimously agreed on this one:

Virtuozzo Core


Please stay tuned for more news (including a more formal press release from Parallels). Feel free to ask any questions, as we don't even have a FAQ yet.

Merry Christmas and a Happy New Year!

On kernel branching

This is a topic I always wanted to write about but was afraid my explanation would end up very cumbersome. This is no longer the case, as we now have a picture that is worth a thousand words!

The picture describes how we develop kernel releases. It's a bit more complicated than the linear version 1 -> version 2 -> version 3. The reason is that we are balancing between adding new features, fixing bugs, and rebasing to newer kernels, all while trying to maintain stability for our users. This is our convoluted way of achieving all this:

[Figure: branching diagram of the 2.6.32-x OpenVZ kernel trees]

As you can see, we create a new branch when rebasing to a newer upstream (i.e. RHEL6) kernel, as regressions are quite common during a rebase. At the same time, we keep maintaining the older branch, to which we add stability and security fixes. Sometimes we create a new branch to add some bold feature that takes a longer time to stabilize. Stability patches are then forward-ported to the new branch, which eventually either becomes stable or is obsoleted by yet another new branch.

Of course there is a lot of work behind these curtains, including rigorous internal testing of new releases. In addition to that, we usually provide those kernels to our users (in the rhel6-testing repo) so they can test new stuff before it hits production servers, and we can fix more bugs earlier (more on that here). If you are not taking part in this testing, well, it's never too late to start!

An interview with ANK

This is a rare interview with the legendary Alexey Kuznetsov (a.k.a. ANK), who happens to work for Parallels. Alan Cox once said he had thought for a long time that "Kuznetsov" was a collective name for a secret group of Russian programmers -- because no single man can write so much code at once.

The interview was conducted by lifehacker.ru as part of its "workplaces" series. I did my best to translate it into English, but the translation is far from perfect. I believe it still makes for a very interesting read.




Q: Who are you and what do you do?

From the mid-90s I was one of the Linux maintainers. Back then all the communication was done via conferences and Linux mailing lists. Pretty often I was aggressively arguing with someone there, I don't remember over what. Now it's fun to recall. Being a maintainer, I wasn't just making something on my own, but had to control others: kicking out those who were making rubbish (from my point of view), and supporting those who were making something non-rubbish. All these conflicts were exhausting me. At some point I started noticing I was becoming "bronzed" [Alexey is referring to a superiority complex -- Kir]. You say or do some crap, and then learn that it is now becoming the right way, since ANK said so.

I started to doubt: maybe I was just using my authority to maintain the status quo. Every single morning started with a fight with myself, then with the world. In 2003 I got fed up with it, so I withdrew from the public eye, and later switched to a different field of knowledge. At that time I started my first project at Parallels. The task was to implement live migration of containers, and it was very complicated.

Now at Parallels we work on the Parallels Cloud Storage project, developing cluster file systems for storing virtual machine images. The technology itself is already a few years old; we did a release recently, and are now working on improving it.

Q: What does your workplace look like?

My workplace is a bunch of computers. But I only work on a notebook, currently a Lenovo T530. The other computers here are used for various purposes. This display standing here, I never use it, nor this keyboard -- well, only if something breaks. Here we have different computers, including a Power Mac, an Intel, and an AMD. I used those in different years for different experiments. Once I needed to create a cluster of 3 machines right here at my workplace. One machine here is really old, and its sole purpose is to manage a power switch, so I can reboot all the others when working remotely from home. Here I also have two Mac Minis and a Power Mac. They are always on, but I use them rarely, only when I need to see something in Parallels Desktop.

Q: What software do you use?

I don't use anything except for Google Chrome. Well, an editor and a compiler, if they qualify as software. I also store some personal data and notes in Evernote.

I only use a text console. For everything. In fact, on newer notebooks, as the screens get better, the console mode works worse and worse. So I am working in a graphical environment now, running a few full-screen terminals on top of it. It looks like a good old Unix console. So this is how I live, read email, work.

I do have a GMail account, and I use it to read email from my phone. Sometimes it is needed. Or, when I see someone sent me a PDF, I have no choice but to forward that email to somewhere I can open the PDF. Same for PPT. But this is purely theoretical; in practice I never work with PPT.

I use Linux. Currently it is Fedora 13 -- not the newest one, to say the least. I always use the version that was the base for the corresponding RHEL release. Every few years a new Red Hat [Enterprise Linux] is released, so I install a new system. Then I do not change anything for a few years; say, 5 years. I can't think of any new OS feature that would force me to update. I am just using the system as an editor, the same as I used it 20 years ago.

I have a phone, a Motorola RAZR Maxx, running Android. I don't like iOS. You can't change anything in there. Not that I customize much; I like having the possibility to customize. I got a Motorola because I hate Samsung. This hatred is absolutely irrational. I have had no happiness with any Samsung product; they either didn't work for me or they broke. I need a phone to make calls and check emails, that is all I need. Everything else is disabled -- to save the battery.

I also read news over RSS every day, like a morning paper. Now it's Feedly; before that it was Google Reader, until they closed it. I have a list of bloggers I read, I won't mention their names. I also read Russian and foreign media. Lenta.ru, for example. There's a nice English-language service, News360. It fits what I like and gives me relevant news. I can't tell whether it works or not, but the fact is, what it shows me really is interesting to me. It was showing a lot of sports news at first, but then that disappeared.

I don't use instant messengers like Skype or ICQ; it's just meaningless. If you need something, write an email. If you need it urgently, call. Email and phone cover everything.

Speaking of social networks, I do have a Facebook account with two friends -- my wife and my sister. I only look at it when they post a picture; I don't wander there without a reason.

Q: Is there a use for paper in your work?

It's a mess. I don't have a pen, so whenever I need one I cannot find it. If I am close to the notebook and I need to write something -- I write to a file. If I don't have the notebook around, I write to my phone. For these situations I recently started to use Google Keep, a service for storing small notes. It is convenient so far. Otherwise I use Evernote. Well, I don't have a system for that. But I do have a database of everything on my notebook: all my email ever, all the files and notes. All this stuff is indexed. The total size is about 10 gigabytes, since I don't have any graphics. Well, if we threw away all the junk from there, almost nothing would remain.

Q: Is there a dream configuration?

What I have here is more than enough for me. This last notebook is probably the best notebook I ever had.

It took me a long time to get used to it, and I swore a lot. I have been using only ThinkPads for a long time. They are similar from version to version, but each next one gets bigger and heavier, physically. This is annoying. On this model they changed the keyboard. I had to get used to it, but now I realize this is the best keyboard I have ever had. In general, I am pretty satisfied with ThinkPads. Well, if it had a Retina screen and weighed just 1 kilogram less -- that would be ideal.

Yay to I/O limits!

Today we are releasing a somewhat small but very important OpenVZ feature: per-container disk I/O bandwidth and IOPS limiting.

OpenVZ has had an I/O priority feature for a while, which lets one set a per-container I/O priority -- a number from 0 to 7. It works like this: if two containers with similar I/O patterns but different I/O priorities run on the same system, a container with a priority of 0 (lowest) will get about 2-3 times less I/O speed than a container with a priority of 7 (highest). This works for some scenarios, but not all.
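
For example, to favor one container over another (a minimal sketch; container IDs 777 and 778 are made up for illustration):

root@host# vzctl set 777 --ioprio 7 --save   # highest I/O priority
root@host# vzctl set 778 --ioprio 0 --save   # lowest I/O priority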

So, I/O bandwidth limiting was introduced in Parallels Cloud Server, and as of today it is available in OpenVZ as well. Using the feature is very easy: you set a limit for a container (in megabytes per second), and watch it obey the limit. For example, here I try doing I/O without any limit set first:

root@host# vzctl enter 777
root@CT:/# cat /dev/urandom | pv -c - >/bigfile
 88MB 0:00:10 [8.26MB/s] [         <=>      ]
^C

Now let's set the I/O limit to 3 MB/s:

root@host# vzctl set 777 --iolimit 3M --save
UB limits were set successfully
Setting iolimit: 3145728 bytes/sec
CT configuration saved to /etc/vz/conf/777.conf
root@host# vzctl enter 777
root@CT:/# cat /dev/urandom | pv -c - >/bigfile3
39.1MB 0:00:10 [   3MB/s] [         <=>     ]
^C

If you run it yourself, you'll notice a spike of speed at the beginning before it goes down to the limit. This is the so-called burstable limit at work; it allows a container to over-use its limit (up to 3x) for a short time.

In the above example we tested writes. Reads work the same way, except when the data being read are in fact coming from the page cache (such as when you are reading the file you just wrote). In that case no actual I/O is performed, and therefore no limiting takes place.
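
So to observe the read limit, the reads have to actually hit the disk. One way to check (a sketch, reusing /bigfile3 from the example above and assuming that dropping the host page cache evicts the container's cached data too):

root@host# sync; echo 3 > /proc/sys/vm/drop_caches
root@host# vzctl enter 777
root@CT:/# pv /bigfile3 > /dev/null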

The second feature is a limit on I/O operations per second, or just an IOPS limit. For more info on what IOPS is, go read the linked Wikipedia article -- all I can say here is that for traditional rotating disks the hardware capabilities are pretty limited (75 to 150 IOPS is a good guess, or 200 if you have high-end server-class HDDs), while for SSDs this is much less of a problem. An IOPS limit is set in the same way as iolimit (vzctl set $CTID --iopslimit NN --save), although measuring its impact is more tricky.
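
For example, one could set a limit and watch the host-side request rate with iostat from the sysstat package (a sketch; 300 IOPS is an arbitrary number chosen for illustration):

root@host# vzctl set 777 --iopslimit 300 --save
root@host# iostat -x 1    # watch the r/s and w/s columns while the container does random I/O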

Finally, to play with this stuff, you need:

  • vzctl 4.6 (or higher)
  • Kernel 042stab084.3 (or higher)
Note that the kernel with this feature is currently still in testing -- so if you haven't done so, it's time to read about testing kernels.

Is OpenVZ obsoleted?

Oh, such a provocative subject! Not really. Many people do believe that OpenVZ is obsoleted, and when I ask why, the three most popular answers are:

1. OpenVZ kernel is old and obsoleted, because it is based on 2.6.32, while everyone in 2013 runs 3.x.
2. LXC is the future, OpenVZ is the past.
3. OpenVZ is no longer developed, it was even removed from Debian Wheezy.

Let me try to address all these misconceptions, one by one.

1. "OpenVZ kernel is old". Current OpenVZ kernels are based on kernels from Red Hat Enterprise Linux 6 (RHEL6 for short). This is the latest and greatest version of enterprise Linux distribution from Red Hat, a company who is always almost at the top of the list of top companies contributing to the Linux kernel development (see 1, 2, 3, 4 for a few random examples). While no kernel being ideal and bug free, RHEL6 one is a good real world approximation of these qualities.

What the people at Red Hat do for their enterprise Linux is take an upstream kernel and basically fork it, ironing out the bugs and cherry-picking security fixes, driver updates, and sometimes new features from upstream. They do so for about half a year or more before a release, so the released kernel already seems "old and obsoleted" if one looks only at the kernel version number. Well, don't judge a book by its cover, and don't judge a kernel by its number. Of course it is neither old nor obsoleted -- it's just more stable and secure. And then, after a release, it is very well maintained, with modern hardware support, regular releases, and prompt security fixes. This makes it a great base for the OpenVZ kernel. In a sense, we are standing on the shoulders of a red-hatted giant (and since this is open source, they are standing just a little bit on our shoulders, too).

RHEL7 is being worked on right now, and it will be based on some 3.x kernel (possibly 3.10). We will port the OpenVZ kernel to RHEL7 once it becomes available. In the meantime, the RHEL6-based OpenVZ kernel is the latest and greatest, and please don't be fooled by the fact that uname shows 2.6.32.

2. OpenVZ vs LXC. The OpenVZ kernel was historically developed separately, i.e. aside from the upstream Linux kernel. This mistake was recognized in 2005, and since then we have kept working on merging OpenVZ bits and pieces into the upstream kernel. It is taking way longer than expected; we are still in the middle of the process, with some great stuff (like net namespace and CRIU, more than 2000 changesets in total) merged, while some other features are still on our TODO list. In the future (another eight years? who knows...) the OpenVZ kernel functionality will probably be fully upstream, so OpenVZ will just be a set of tools. We are happy to see that Parallels is not the only company interested in containers for Linux, so it might happen a bit earlier. For now, though, we still rely on our organic, non-GMO, home-grown kernel (although it is already optional).

Now what is LXC? In fact, it is just another user-space tool (not unlike vzctl) that works on top of a recent upstream kernel (again, not unlike vzctl). As we merge our stuff upstream, the LXC tools will start using the new features and therefore benefit from this work. So far at least half of the kernel functionality used by LXC was developed by our engineers, and while we don't work on the LXC tools themselves, it would not be an overstatement to say that Parallels is the biggest LXC contributor.

So, both OpenVZ and LXC are actively developed, and both have a future. We might even merge our tools at some point; the idea was briefly discussed during the last containers mini-conference at Linux Plumbers. LXC is not a successor to OpenVZ, though; they are two different projects, although not entirely separate ones (since the OpenVZ team contributes to the kernel a lot, and both tools use the same kernel functionality). OpenVZ is essentially LXC++, because it adds some more stuff that is not (yet) available in the upstream kernel (such as stronger isolation and better resource accounting, plus some auxiliary features like ploop).

3. OpenVZ is no longer developed, removed from Debian. The Debian kernel team decided to drop the OpenVZ kernel flavor (as well as a few others) from Debian 7, a.k.a. Wheezy. This is completely understandable: kernel maintenance takes time and other resources, and they probably don't have enough. That doesn't mean, though, that OpenVZ is not being developed. It feels strange to even have to argue this, but please check our software updates page (or the announce@ mailing list archives). We have made about 80 software releases so far this year. That amounts to 2 releases every week. Most of those are new kernels. So no, in no way is it abandoned.

As for Debian Wheezy, we are providing our own repository with the OpenVZ kernel and tools, as announced just yesterday.

Debian kernel packages

"Good news, everyone!"
        -- Prof. Farnsworth


Many people use OpenVZ on Debian. In fact, Debian was one of the distributions that came with the OpenVZ kernel and tools. Unfortunately, it's not that way anymore, since Debian 7 "Wheezy" dropped the OpenVZ kernel. A workaround was to take an RPM-packaged OpenVZ kernel and convert it to .deb using the alien tool, but the process is manual and somewhat unnatural.

Finally, we now have a working build system for Debian kernel packages, and a repository for Debian Wheezy with the latest and greatest OpenVZ kernels, as well as tools. In fact, we have two: one for stable, and one for testing kernels and tools. Kernel debs are built and released at the same time as the rpms. Currently we have vzctl/vzquota/ploop in the 'wheezy-test' repository only -- once we are sure they work as expected, we will move them into the stable 'wheezy' repo.

To enable these repos:

cat << EOF > /etc/apt/sources.list.d/openvz.list
deb http://download.openvz.org/debian wheezy main
deb http://download.openvz.org/debian wheezy-test main
EOF
apt-get update


To install the kernel:
apt-get install linux-image-openvz-amd64

More info is available from https://wiki.openvz.org/Installation_on_Debian and http://download.openvz.org/debian/

on testing kernels

Currently, our best kernel line is the one based on Red Hat Enterprise Linux 6 kernels (RHEL6 for short). This is our most feature-rich, up-to-date, yet stable kernel -- i.e. the best. The second-best option is the RHEL5-based kernel -- a few years older, so there is neither vSwap nor ploop, but it is still good.

There is a dilemma of either releasing a new kernel version earlier, or delaying it for more internal testing. We figured we can do both! Each kernel branch (RHEL6 and RHEL5) comes via two channels -- testing and stable. In terms of yum, we have four kernel repositories defined in the openvz.repo file; their names should be self-explanatory:

* openvz-kernel-rhel6
* openvz-kernel-rhel6-testing
* openvz-kernel-rhel5
* openvz-kernel-rhel5-testing

The process of releasing kernels is the following: right after building a kernel, we push it out to the appropriate -testing repository, so it is available as soon as possible. We then do some internal QA on it (which can be either basic or thorough, depending on the amount of our changes, and on whether we rebased to a newer RHEL6 kernel). Based on the QA report, sometimes we do another build with a few more patches, and repeat the process. Once the kernel looks good to our QA, we move it from testing to stable. In some rare cases (such as when we make one simple but quite important fix), new kernels go right into stable.

So, our users can enjoy being stable, or being up-to-the-moment, or both. In fact, if you have more than a few servers running OpenVZ, we strongly suggest dedicating one or two boxes to running -testing kernels, and reporting any bugs found to the OpenVZ bugzilla. This is good for you, because you will be able to catch bugs early and let us fix them before they hit your production systems. This is good for us, too, because no QA department is big enough to catch all possible bugs in a myriad of hardware and software configurations and use cases.

Enabling a -testing repo is easy: just edit openvz.repo, setting enabled=1 under the appropriate [openvz-kernel-...-testing] section.
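
Alternatively, if you just want to try a testing kernel once without editing the file, something like this should work (a sketch, assuming the vzkernel package name used by the RHEL6-based kernels):

yum install vzkernel --enablerepo=openvz-kernel-rhel6-testing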

ploop snapshots and backups

OpenVZ ploop is a wonderful technology, and I want to share more of its wonderfulness with you. We have previously covered ploop in general, and its write tracker feature that helps speed up container migration in particular. This time, I'd like to talk about snapshots and backups.

But let's start with yet another ploop feature -- its expandable format. When you create a ploop container with, say, 10G of disk space, the ploop image is just slightly larger than the size of the actual container files. I just created a centos-6-x86 container -- the ploop image size is 747M, and inside the CT df shows that 737M is used. Of course, for an empty ploop image (with a fresh filesystem and zero files) the ratio will be worse. Now, when the CT writes data, the ploop image auto-grows to accommodate the data size.
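
To see it for yourself, a rough sketch (777 is a made-up container ID, and /vz/private is assumed to be the default private area location):

root@host# vzctl create 777 --ostemplate centos-6-x86 --layout ploop --diskspace 10G
root@host# du -sh /vz/private/777/root.hdd    # on-disk image size, grows on demand
root@host# vzctl start 777
root@host# vzctl exec 777 df -h /             # the full 10G filesystem as seen from inside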

Now, these images can be layered, or stacked. Imagine having a single ploop image, consisting of blocks. We can add another image on top of the first one, so that new reads fall through to the lower image (because the upper one is still empty), while new writes end up being written to the upper (top) image. Perhaps this picture will save some more words here:

[Figure: two stacked ploop images -- reads fall through to the bottom image, writes go to the top one]

The new (top) image now accumulates all the changes, while the old (bottom) one is in fact a read-only snapshot of the container filesystem. Such a snapshot is cheap and instant, because there is no need to copy a lot of data or do other costly operations. Of course, ploop is not limited to only two levels -- you can create many more (up to 255, if I remember correctly, which is way above any practical limit).

What can be done with such a snapshot? We can mount it and copy all the data to a backup (update: see openvz.org/Ploop/backup). Note that such a backup is very fast, online, and consistent. There's more to it, though. A ploop snapshot, combined with a snapshot of the running container in memory (also known as a checkpoint) and the container configuration file(s), can serve as a real checkpoint to which you can roll back.

Consider the following scenario: you need to upgrade the web site backend inside your container. First, you take a container snapshot (I mean a complete snapshot, including the in-memory image of the running container). Then you upgrade, and realize your web site is all messed up and broken. A horror story, isn't it? No. You just switch to the before-upgrade snapshot and keep working as if nothing happened (see the worked example after the command reference below). It's like moving back in time, and all of this is done on a running container, i.e. you don't have to shut it down.

Finally, when you don't need a snapshot anymore, you can merge it back. Merging is the process in which changes from an upper level are written to the level below it, and then the upper level is removed. Such merging is of course not as instant as creating a snapshot, but it is online, so you can just keep working while ploop performs the merge.

All this can be performed from the command line using vzctl. For details, see the vzctl(8) man page, section Snapshotting. Here's a quick howto:

Create a snapshot:
vzctl snapshot $CTID [--id $UUID] [--name name] [--description desc] [--skip-suspend] [--skip-config]

Mount a snapshot (say to copy the data to a backup):
vzctl snapshot-mount CTID --id uuid --target directory

Rollback to a snapshot:
vzctl snapshot-switch CTID --id uuid

Delete a snapshot (merging its data to a lower level image):
vzctl snapshot-delete CTID --id uuid
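
Putting it all together, the web site upgrade scenario described earlier might look like this (a sketch; "mywebapp" is a hypothetical package, and UUID stands for whatever snapshot ID vzctl snapshot printed):

root@host# vzctl snapshot 777 --name before-upgrade
root@host# vzctl exec 777 yum -y update mywebapp
# if the upgrade went wrong, roll back:
root@host# vzctl snapshot-switch 777 --id UUID
# once everything is fine, merge the snapshot away:
root@host# vzctl snapshot-delete 777 --id UUID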
