Thursday, 26 June 2014

CoreOS Review

I have spent a few days now playing with CoreOS and helping other members of HP's Advanced Technology Group get it running on their setups.

Today I thought I would write about the good and the bad of CoreOS so far.  Many components are in an alpha or beta state, so things may change over the coming months.  Also, as a disclaimer, views in this post are my own and not necessarily those of HP.

Installation


As stated in my blog post yesterday, I have been using CoreOS on my MacBook Pro using Vagrant and VirtualBox.  This made it extremely easy to set up the CoreOS cluster on my Mac.  I made a minor mistake to start with, which was not configuring the unique discovery URL required for etcd correctly.  A colleague of mine made the same mistake on his first try, so it is likely to be a common one.

I initially had VirtualBox configured to use a Mac-formatted USB drive I have hooked up.  Vagrant tried to create my CoreOS cluster there, and during the setup the Ruby process inside Vagrant kept spinning on some disk-reading routine and never completed.  Debug output didn't help find the cause, so I switched to using the internal SSD instead.

CoreOS


CoreOS itself appears to be derived from Chrome OS, which is itself a fork of Gentoo.  It is incredibly minimal; it doesn't even come with a package manager.  But that is the whole point.  It is designed so that Docker containers run on top of it to provide the application support, almost completely isolating the underlying OS from the applications running on it.  This also provides excellent isolation between, say, MySQL and Apache in a LAMP stack.
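
So rather than installing packages on the host, each piece of the stack is started as its own container.  A minimal sketch of what that looks like (the image names below are placeholders, not real images I am recommending):

    # MySQL and Apache/PHP as separate containers on the same CoreOS host
    # (image names are placeholders for whatever you build or pull)
    docker run -d --name mysql example/mysql
    docker run -d --name web -p 80:80 --link mysql:db example/apache-php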

It is a clean, fast OS using many modern concepts such as systemd and journald.  Some of these are only in the bleeding-edge distributions at the moment, so many people may not be familiar with using them.  Luckily one of my machines is running Fedora 20, so I've had a play with these technologies before.
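
For anyone who hasn't touched them yet, the day-to-day commands are straightforward enough.  For example, checking on a service and tailing its logs (the unit name here is just an example):

    # show the state of the Docker daemon and follow its journal
    systemctl status docker.service
    journalctl -u docker.service -f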

Etcd


CoreOS provides a clustered key/value store called 'etcd'.  The name confused many people I spoke to before we tried it: we all assumed it was a clustered file store for the /etc/ path on CoreOS.  We were wrong, although that is maybe the direction it will eventually take.  It is actually accessed via a REST-based interface.
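
For example, setting a key and reading it back is just a couple of HTTP calls.  This is a quick sketch assuming the default client port 4001 and the v2 keys API of the current releases, both of which may change:

    # write a key, then read it back over etcd's HTTP interface
    curl -L http://127.0.0.1:4001/v2/keys/message -XPUT -d value="hello"
    curl -L http://127.0.0.1:4001/v2/keys/message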

Etcd has been created pretty much from the ground up as a new project by the CoreOS team.  It is written in Go and can be found on GitHub.  Creating a reliable clustered key/value store is hard, really hard.  There are so many edge cases that can cause horrific problems.  I cannot understand why the CoreOS team decided to roll their own instead of using one of the many well-tested alternatives.

Under the hood the nodes communicate with each other using what appears to be JSON (REST) for internal admin commands and Google Protobufs over HTTP for the Raft consensus algorithm library they use.  Whilst I commend them for using Protobufs in a few places, HTTP and JSON are both bad ideas for what they are trying to achieve.  JSON will cause massive headaches for protocol upgrades/downgrades, and HTTP really wasn't designed for this purpose.

At the moment this appears to be designed more for very small installations than for hundreds to thousands of instances.  Hopefully at some point it will gain its own protocol based on Protobufs or similar, and code to cope with the many edge cases of weird and wonderful machine and network configurations.

Fleet


Fleet is another service written in Go and created by the CoreOS team.  It is still a very new project aimed at being a scheduler for a CoreOS cluster.

To use Fleet you basically create a systemd unit file with an optional extra section telling Fleet which CoreOS instances the service can run on and which other services it conflicts with.  Fleet communicates with etcd and, via some handshaking, figures out a CoreOS instance to run the service on; a daemon on that instance handles the rest.  The general idea is that you use this systemd unit to manage a Docker container, and there is also a kind-of hack, using a separate systemd unit per service, so that it will notify/re-schedule when something has failed.
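
As a rough sketch of what one of these unit files looks like (the unit and image names are made up, and the [X-Fleet] options reflect the current Fleet release so may well change):

    # mysql.service -- illustrative only
    [Unit]
    Description=MySQL running in a Docker container

    [Service]
    ExecStart=/usr/bin/docker run --rm --name mysql example/mysql
    ExecStop=/usr/bin/docker stop mysql

    [X-Fleet]
    # never schedule two of these on the same CoreOS instance
    X-Conflicts=mysql*.service

You then push it to the cluster with 'fleetctl submit mysql.service' followed by 'fleetctl start mysql.service', and 'fleetctl list-units' shows which instance it landed on.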

Whilst it is quite simple in design, it has many flaws, and for me it was the most disappointing part of CoreOS so far.  Fleet breaks, a lot.  I must have found half a dozen bugs in it in the last few days, mainly around it getting completely confused as to which service is running on which instance.

Also, the way that configurations are expressed to Fleet is totally wrong in my opinion.  Say, for example, you want ten MySQL Docker containers across your CoreOS cluster.  To express this in Fleet you need to create ten separate systemd unit files and send them all up, even though those files are likely identical.
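
In practice that means something like the following (just a sketch; the file names are arbitrary):

    # ten copies of an essentially identical unit file
    for i in $(seq 1 10); do cp mysql.service mysql.$i.service; done
    fleetctl submit mysql.*.service
    fleetctl start mysql.*.service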

This is how it should work in my opinion: you create a YAML file which specifies what a MySQL Docker container is and what an Apache/PHP container is.  In this YAML you group these and call them a LAMP stack.  Then in the same file you specify that your CoreOS cluster needs five LAMP stacks, and maybe two load balancers.
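
Something along these lines (purely hypothetical syntax; nothing like this exists in Fleet today, it is only here to illustrate the idea):

    # hypothetical cluster description -- not real Fleet syntax
    containers:
      mysql:
        image: example/mysql
      apache-php:
        image: example/apache-php
      haproxy:
        image: example/haproxy

    groups:
      lamp-stack: [mysql, apache-php]
      load-balancer: [haproxy]

    cluster:
      lamp-stack: 5
      load-balancer: 2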

Not only would my method scale a lot better, but you could then start to build a front-end interface that would be able to accept customers.

Conclusion


CoreOS is a very ambitious project that in some ways becomes the "hypervisor"/scheduler for a private Docker cloud.  It can easily sit on top of a cloud installation or on top of bare metal.  It requires a totally different way of thinking, and I really think the ideas behind it are the way forward.  Unfortunately it is a little too early to be using it for anything more than a few machines in production, and even then it is more work to manage than it should be.

9 comments:

  1. I agree with your review. The configuration file oddity especially. I scratched my head. It also doesn't gripe when you have invalid syntax, which is bad. In Ansible, I can do it this way:

    - hosts: docker
      vars:
        containers_count: 5
      roles:
        - common
        - web

    - hosts: docker
      vars:
        containers_count: 3
      roles:
        - common
        - db

    - hosts: docker
      vars:
        containers_count: 2
      roles:
        - common
        - haproxy

    Done. Clean. It works. And if I type syntax errors, you better believe Ansible won't let me get away with it.

  2. Hi Kelsey,

    Many thanks for your comment.

    Although the docs aren't perfect, they are much better than many I have seen for various projects I have worked with lately. You guys should be commended for that. I wouldn't have been able to get everything up and running so far without them.

    Etcd actually works pretty well considering how young the project is. I'm not a big fan of JSON due to versioning and several other problems that can bite you with it. I think the docs even make a point somewhere that the protocol has no stability guarantees yet. In short, I do think it is coming together. I highly recommend moving it all over to Protobuf (as we did with the Drizzle database project).

    I may have been a little harsh on Fleet; you can kinda see the direction it is heading in, and it is incredibly young as far as projects go. I am also a fan of release early/often. It was just a little frustrating, when you can see things working together so well, that there are still a few kinks to knock out :)

    All of my team are currently playing with CoreOS as it is something we have great interest in. In fact one member of my team is looking to see if he can get it running on our Moonshot box. My personal opinion is Moonshot would be a fantastic fit for CoreOS.

    Thanks again, we will be watching CoreOS with great interest :)

  3. A question if I may. How do you make Fleet put MySQL instances on the server with the attached storage for the database? The MySQL instance cannot just move servers upon reboot or after failure; the right attached storage has to be mounted with the right MySQL instance.

    Related: if a server has a hardware failure, I want to bring another node up and attach the storage to that node instead. How do you achieve this? Mount the storage as a precursor to the MySQL start-up script?

    Thanks!

    Replies
    1. A very good question. It is something I thought about myself before writing this blog post.

      The problem with the attached storage is that you have to make sure the old instance is completely down before starting the new one; otherwise you will get into a big mess. Something like a network blip could break it badly.

      I would imagine it would be done using Galera Cluster. You would need some scripts to join a new node to the cluster, but this is doable. My previous team in HP did this for OpenStack Nova instances using a combination of Salt and Jenkins.

    2. Create another service for the mountpoint. Then make the mysql service depend upon it.

    3. This could be dangerous for a couple of reasons:
      1. You need to make 100% sure the mount point is mounted before MySQL is started on the new instance
      2. You need to make 100% sure that MySQL isn't accidentally started on two servers at the same time for the same data. This is harder than it sounds because a split-brain in etcd could conceivably cause this.

      I still recommend using a clustered MySQL setup designed for this, such as Galera or NDB. I would say manual intervention is the alternative, but if you have to manually intervene in a production system you have already failed.

  4. Nice write-up, Andrew, and nice follow-up by Kelsey Hightower.
    Kelsey, I was at the Denver GopherCon and loved your CoreOS demo! I also liked Alex Polvi's Fleet demo at the CoreOS stall.

    I concur with Andrew that the concepts are very interesting and moving in the right direction; it's just that some of the tools need to be ironed out before they can be used with confidence at cloud scale in production.

    I used etcd several months ago when I was working on putting together a Go-based solution. I was trying to replace a Vert.x (vertx) cluster that was using Hazelcast as the distributed key/value store with a Go-based solution (just for the fun of moving away from Java ;) ). Another well-tested Java-based key/value store is ZooKeeper. Before I went with etcd I tried doozer and Redis. doozer implemented Paxos and its documentation recommended it only for smaller key/value sizes. Redis is awesome except that its distributed solution is in an alpha state (even today!). So, I went ahead with coreos/etcd as its Raft-based consensus algorithm was interesting. Also, I liked their framework for modules as I can avoid unnecessary middleware processes. However, like Andrew mentioned, the HTTP/JSON overhead was overwhelming when the writes were much higher than the reads (I was trying to use etcd for distributed rate limiting of API requests). I was comparing a 3-node Vert.x/Hazelcast cluster with a 3-node etcd cluster (with Gorilla for REST). The former was stable, especially when a node went down and was added back to the cluster later. Agreed, my tests were over 3 months ago and it might be very different today (considering Google uses it now!).
    However, this week I wanted to give coreos/fleet a shot, so I set it up on my HP Mobile Workstation (SSD drive with 24GB RAM) using Vagrant/VirtualBox. I followed the step-by-step instructions from https://coreos.com/docs/running-coreos/platforms/vagrant/ (great job with the docs!) and set up a 3-node cluster in minutes (using the discovery service and choosing the stable channel for updates). Following the example from the site, I was able to quickly write a unit file and start a busybox-based container across the cluster using fleetctl. The problem started when I wanted to continue with the example on the site to set up an announcer sidekick unit. Fleet quickly got messed up and would put the unit in a failed state. Also, it started throwing random "SSH_AUTH_SOCK environment variable is not set" errors. I am still trying to get this resolved and move ahead as I like the CoreOS concept and love golang :)

    I also briefly tried fig and liked its YAML syntax (similar to what Andrew suggested) for linking related containers - interesting to note that fig has now been acquired by Docker (dotCloud). There is even a fig2coreos project!

    -TSV
    (Disclaimer: this is my own opinion and not that of HP)

  5. You can have N instances of a service by using templates, like fleetctl start mysql@1.service. Write a mysql@.service and use %i to identify the name.

    Replies
    1. You can indeed, but you will need your application to manage the sharding around that (assuming you are using it for sharding).

