Showing posts with label Linux. Show all posts
Showing posts with label Linux. Show all posts

Wednesday, 25 June 2014

Test drive of CoreOS from a Mac

Yes, I am still the LinuxJedi, and whilst I have Linux machines in my office for day-to-day work the main desktop I use is Mac OS.

I have recently been giving CoreOS a spin for HP's Advanced Technology Group and so for this blog post I'm going to run through how to setup a basic CoreOS cluster on a Mac.  These instructions should actually be very similar to how you would do it in Linux too.

Prerequisites


There are a few things you need to install before you can get things going:
  1. VirtualBox for the CoreOS virtual machines to sit on
  2. Vagrant to spin-up and configure the virtual machines
  3. Git to grab the vagrant files for CoreOS
  4. Homebrew to install 'fleetctl' (also generally really useful on a developer machine)
You then need to install 'fleetctl' which will be used to spin up services on the cluster:

$ brew update
$ brew install fleetctl

Configuring


Once everything is installed the git repository for the vagrant bootstrap is needed:

$ git clone git://github.com/coreos/coreos-vagrant/
$ cd coreos-vagrant/
$ cp config.rb.sample config.rb
$ cp user-data.sample user-data

The last two commands here copy the sample configuration files to configuration files we will use for Vagrant.

Edit the config.rb file and look for the line:

#$num_instances=1

Uncomment this and set the number to '3'.  When I was testing I also enabled the 'beta' channel in this file instead of the default 'alpha' channel.  This is entirely up to personal preference and both will work with these instructions at time of writing.

With your web browser go to https://discovery.etcd.io/new.  This will give you a token URL.  This URL will be used to help synchronise Etcd which is a clustered key/value store in CoreOS.  In the user-data YAML file uncomment the 'discovery' line and set the URL to the one the link above gave you.  It is important that you generate a new URL for every cluster and every time you burn down / spin up  a cluster on your machine.  Otherwise Etcd will not behave as expected.

Spinning Up


At this point everything is configured to spin-up the CoreOS cluster.  From the 'coreos-vagrant' directory run the following:

$ vagrant up

This may take a few minutes as it downloads the CoreOS base image and creates the virtual machines required.  It will also configure Etcd and setup SSH keys so that you can communicate with the virtual machines.

You can SSH into any of these virtual machines using the following and substituting '01' with the CoreOS instance you wish to connect to:

$ vagrant ssh core-01 -- -A

Using Fleet


Once you have everything up and running you can use 'fleet' to send systemd configurations to the cluster.  Fleet will schedule which CoreOS machine the configuration file will run on.  Fleet internally uses Etcd to discover all the nodes of your CoreOS cluster.

First of all we need to tell 'fleetctl' on the Mac how to talk to the CoreOS cluster.  This requires two things, the IP/port and the SSH key:

$ export FLEETCTL_TUNNEL=127.0.0.1:2222
$ ssh-add ~/.vagrant.d/insecure_private_key

For this example I'm going to spin up three memcached docker instances inside CoreOS, I'll let Fleet figure out where to put them only telling it that two memcached instances cannot run on the same CoreOS machine.  To do this create three identical files, 'memcached.1.service', 'memcached.2.service' and 'memcached.3.service' as follows:

[Unit]
Description=Memcached server
After=docker.service

[Service]
ExecStart=/usr/bin/docker run -p 11211:11211 fedora/memcached

[X-Fleet]
X-Conflicts=memcached*

This is a very simple systemd based script which will tell Docker in CoreOS to get a container of a Fedora based memcached and run it.  The 'X-Conflicts' option tells Fleet that two memcached systemd configurations cannot run on the same CoreOS instance.

To run these we need to do:

$ fleetctl start memcached.1.service
$ fleetctl start memcached.2.service
$ fleetctl start memcached.3.service

This will take a while and show as started whilst it downloads the docker container and executes it.  Progress can be seen with the following commands:

$ fleetctl list-units
$ fleetctl journal memcached.1.service

Once complete your three CoreOS machines should all be listening on port 11211 to a Docker installed memcached.

Thursday, 5 June 2014

Live Kernel Patching Video Demo

Earlier today I posted my summary of testing the live kernel patching solutions so far as part of my work for HP's Advanced Technology Group.  I have created a video demo of one of these technologies, kpatch.


This demo goes over the basics of how easy kpatch is to use.  The toolset for now will only work with Fedora 20  for now but judging by the documentation Ubuntu support is coming very soon.

Live Kernel Patching Testing

At the end of March I gave a summary of some of the work I was doing for HP's Advanced Technology Group on evaluating kernel patching solutions.  Since then I have been trying out each one whilst also working on other projects and this blog post is to show my progress so far.

After my first blog post there were a few comments as to why such a thing would be required.  I'll be the first to admit that when Ksplice was still around I felt it better to have a redundant setup which you could perform rolling reboots on.  That is still true in many cases.  But with OpenStack based public clouds a reboot means that customers sitting on top of that hardware will have their virtual machines shut down or suspended.  This is not ideal when with big iron machine you can have 20 or more customers to one server.  A good live kernel patching solution means that in such a scenario the worst case would be a kernel pause for a few seconds.

In all cases my initial testing was to create a patch which would modify the output of /proc/meminfo. This is very simple and doesn't go as far as testing what would happen patching a busy function under load but at least gets the basics out of the way.

Ksplice


As previously discussed, Ksplice was really the only toolkit to do live kernel patching in Linux for a long time.  Unfortunately since the Oracle acquisition there have been no Open Source updates to the Ksplice codebase.  In my research I found a Git repository for kSplice which had recent updates to it. I tried using the with Fedora 20 but my guess is either Fedora does something unique with the kernels or it is just too outdated for modern 3.x Linux kernels.  My attempts to compile the patches hit several problems inside the toolset which could not easily be resolved.

kGraft


After Ksplice I moved on to kGraft.  This takes on a similar approach to Ksplice but uses much newer techniques.  Of all the solutions I believe the technology in kGraft is the most advanced.  It is the only solution that is capable of patching a kernel without pausing it first.  To use this solution there is a requirement for now to use SUSE's kGraft kernel tree.  I tested this on an OpenSUSE installation so I could be close to the systems it was designed on.  Unfortunately whilst the internals are quite well documented, the toolset is not.  After several attempts I could not find a way to get this solution to work end-to-end.  I do think this solution has massive potential, especially if it gets added to the mainline kernel.  I hope in the future the documentation is improved and I can revisit this.

kpatch


Finally I tried Red Hat's kpatch.  This solution was released around the same time as kGraft and likewise is in a development stage with warnings that it probably shouldn't be used in production yet. The technology again is similar to Ksplice with more modern approaches but what really impressed me is the ease-of-use.  The kpatch source does not require compiling as part of the kernel, it can be compiled separately and is just a case of 'make && make install'.  The toolset will take a standard unified diff file and using this will compile the required parts of the kernel to create the patch module.  Whilst this sounds complex it is a single command to do this.  Applying the patch to the kernel is equally as easy.  Running a single command will do all the required steps and apply the patch.

Conclusion


If public clouds with shared hardware and going to continue to grow this should be a technology that is invested in.  I was very quickly drawn to Red Hat's solution because they have put a high priority on how the user would interact with the toolset.  I'm sure many Devops engineers will appreciate that.  What I would love to see is kGraft's technology combined with kpatch's toolset.  This will certainly be an interesting field to watch as both projects mature.

Monday, 31 March 2014

Live Kernel Patching Solutions

A large public cloud is quite a complex thing to manage, even with toolsets such as Openstack to deploy and manage it.  You often have many customers with compute instances on a single server so security is a high priority.  This also introduces an interesting problem, when you need to deploy a Linux Kernel security patch to the host (sometimes thousands of hosts) how do you do it with the least disruption?

One of the currently most accepted methods is to suspend the instances on a box, deploy the patch, reboot the host and resume the instances.  Whilst this works in many cases you have customers who have had their instances down for X minutes (even if it is planned and the customer is notified it can be inconvenient) and instance kernels that suddenly think "my clock just jumped, WTF just happened?".  This in-turn can cause problems for long running clients talking to servers as well as any time-sensitive applications.  Long running clients will often think they are still connected to the server when they really are not (there are ways around this with TCP Keepalive).

There are a couple of old solutions to this and a couple of new ones and as part of my work for HP's Advanced Technology Group I will be taking a deep dive into them in the coming weeks.  For now here is a quick summary of what is around:

kexec


This is probably the oldest technology on the list.  It doesn't quite fall under the "Live Kernel Patching" umbrella but is close enough that it warrants a mention.  It works by basically ejecting the current kernel and userspace and starting a new kernel.  It effectively reboots the machine without a POST and BIOS/EFI initialisation.  Today this only really shaves a few seconds off the boot time and can leave hardware in an inconsistent state.  Booting a machine using the BIOS/EFI sets the hardware up in an initialised state, with kexec the hardware could be in the middle of reading/writing data at the point the new kernel is loaded causing all sorts of issues.

Whilst this solution is very interesting, I personally would not recommend using it as during a mass deployment you are likely to see failures.  More information on kexec can be found on the Wikipedia entry.

Ksplice


Ksplice was really the first toolset to implement Live Kernel Patching.  It was created by several MIT students who subsequently spun-off a company from it which supplied patches on a subscription model.  In 2011 this company was acquired by Oracle and since then there have been no more official Open Source releases of the technology.  Github trees which are updated to work with current kernels still exist.

The toolset works by taking a kernel patch and converting it into a module which will apply the changes to functions to the kernel without requiring a reboot.  It also supports changes to the data structures with some additional developer code.  It does temporarily pause the kernel whilst the patch is being applied (a very quick process), but this is far better than rebooting and should mean that the instances do not need suspending.

kpatch


Both Red Hat and SUSE realised that a more modern Open Source solution is needed for the problem, and whilst SUSE announce their solution (kGraft) first, Red Hat's kpatch solution was the first to show actual code to the solution.

Red Hat's kpatch solution gives you a toolset which creates a binary diff of a kernel object file before and after a patch has been applied.  It then turns this into a kernel module which can be applied to any machine with the same kernel (along with kpatch's loaded module).  Like Ksplice, it does need to pause the kernel whilst patching the functions.  It also as-yet does not support changes to data structures.

It is still very early days for this solution but development is has been progressing rapidly.  I believe the intention is to create a toolset that will take a unified diff file and turn that into a kpatch module automatically.

For more information on this solution take a look at their blog post and Github repository.

kGraft


SUSE Labs announced kGraft earlier this year but only very recently produced code to show their solution.

From the documentation I've seen so far their solution appears to work in a similar way to Red Hat's but they have the unique feature which gives the ability for the patch to be applied to the kernel without pausing it.  Both the old and the replacement functions can exist at the same time, old executions will finish using the old function and new executions will use the new function.

This solution seems to have gone down the route of bundling their code on top of a Linux kernel git tree which means it took an entire night for me to download the git history.  I'm looking forward to digging through the code of this solution to see how it works.

The git tree for this can be found in the kgraft kernel repository (make sure to checkout the origin/kgraft branch after you have cloned it) and SUSE's site on the technology can be found here.

Summary


All three of the above solutions are very interesting.  Combined with a deployment technology such as Salt or Ansible it could mean the end to maintenance downtime for cloud compute instances.  As soon as I have done more research on the technologies I will be writing more details and hopefully even contributing where possible.

Tuesday, 11 February 2014

Why use double-fork to daemonize?

Something that I have been asked many times is why do Linux / Unix daemons double-fork to start a daemon?  Now that I am blogging for HP's Advanced Technology Group I can try to explain this as simply as I can.

Daemons double-fork

For those who were not aware (and hopefully are now by the title) almost every Linux / Unix service that daemonizes does so by using a double-fork technique.  This means that the application performs the following:

  1. The application (parent) forks a child process
  2. The parent process terminates
  3. The child forks a grandchild process
  4. The child process terminates
  5. The grandchild process is now the daemon

The technical reason


First I'll give the technical documented reason for this and then I'll break this down into something a bit more consumable.  POSIX.1-2008 Section 11.1.3, "The Controlling Terminal" states the following:


The controlling terminal for a session is allocated by the session leader in an implementation-defined manner. If a session leader has no controlling terminal, and opens a terminal device file that is not already associated with a session without using the O_NOCTTY option (see open()), it is implementation-defined whether the terminal becomes the controlling terminal of the session leader. If a process which is not a session leader opens a terminal file, or the O_NOCTTY option is used on open(), then that terminal shall not become the controlling terminal of the calling process.

The breakdown

When you fork and kill the parent of the fork the new child will become a child of "init" (the main process of the system, given a PID of 1).  You may find others state on the interweb that the double-fork is needed for this to happen, but that isn't true, a single fork will do this.

What we are actually doing is a kind of safety thing to make sure that the daemon is completely detached from the terminal.  The real steps behind the double-fork are as follows:
  1. The parent forks the child
  2. The parent exits
  3. The child calls setsid() to start a new session with no controlling terminals
  4. The child forks a grandchild
  5. The child exits
  6. The grandchild is now the daemon
The reason we do step 4 & 5 is it is possible for the child to regain control of the terminal, but once it has lost control the forked grandchild cannot do this.

Put simply it is a roundabout way of completely detaching itself from the terminal that started the daemon.  It isn't a strict requirement to do this at all, many modern init systems can daemonize a process that will stay in the foreground quite easily.  But it is useful for systems that can't do this and for anything that at some point in time is expected to be run without an init script.

Why VLAIS is bad

At the beginning of the month I was at FOSDEM interfacing and watching talks on behalf of HP's Advanced Technology Group.  Since my core passion is working on and debugging C code I went to several talks on Clang, Valgrind and other similar technologies.  Unfortunately there I several talks I couldn't get into due to the sheer popularity of them but I hope to catch up with video in the future.

One talk I went to was on getting the Linux kernel to compile in Clang.  It appears that there are many changes which are down to Clang being a minimum of C99 compliant and GCC supporting some non-standard language extensions.

The one language extension which stood out for me was called VLAIS which stands for Variable Length Arrays In Structs.  Now, VLAs (Variable Length Arrays) are nothing new in C they have been around a long time.  What we are talking about here are variable length arrays at any point in the struct.  For example:

void foo(int n) {
    struct {
        int x;
        char y[n];
        int z;
    } bar;
}

The char in this struct is what we are talking about here.  This kind of code is used in several places in the kernel, for the most part it is used in encryption algorithms.

How did it get added to GCC?


It came about around 2004 in what appears to be a conversion of a standard from ADA to C.  There is a mailing list post on it here.  It has since been used in the kernel and I can understand the argument that the kernel was never intended to be compiled with anything other than GCC.  But the side of me that likes openness and portability is not so keen.  I suspect the problems that currently plague the kernel for Clang also affect compilers such as Intel's ICC and other native CPU manufacturer's compilers.

So why is it bad?


Well, to start with there is the portability issue.  Any code that uses this will not compile in other compilers.  If you turn on the pedantic C99 flags in GCC it won't compile either (if you aren't doing this then you really should, it shakes out lots of bugs).  Once Linus Torvalds found out about the usage of it in the kernel he called it an abomination and asked for it to be dropped.

Next there is debugging.  I'm not even sure if debuggers understand this and if they do I can well imagine it being difficult to work with, especially if you need to track Z in the example above.

There is the possibility of alignment issues.  Some architectures work much better when the structs are byte aligned by a certain width.  This will be difficult to do with a VLAIS in the middle of the struct.

In general I just don't think it is clean code and if it were me I would be using pointers and allocated memory instead.

Of course this is all my opinion and I'm sure people have other views that I haven't thought of.  Please use the comments box to let me know what you think about VLAIS and its use in the kernel.  You can find out more in the Linux Plumbers Conference 2013 slides.