
Monday, 10 February 2014

HAProxy logs byte counts incorrectly

Continuing my LBaaS look-back series of blog posts for HP's Advanced Technology Group, today I am looking into an issue that tripped us up with HAProxy.

Whilst we were working with HAProxy we naturally had many automated tests running through a Jenkins server.  One such test checked that the byte count in the logs tallied with the bytes received, since this count would be used for billing purposes.

Unfortunately we always found our byte counts a little off.  At first this was due to dropped log messages, but even after that problem was solved we were still not getting an exact tally.
After some research and reading of the code I found that, despite what the manual says, the outgoing byte count is measured from the backend server to HAProxy, not as the bytes leaving HAProxy.  This means that injected headers are not included in the byte count, and if HAProxy is doing HTTP compression for you the count will be way off.

My findings were backed up by this post from the HAProxy developer.

On average every log entry for us was off by around 30 bytes due to injected headers and cookies.
Given the link above this appears to be something the developer is looking into, but I doubt it will be a trivial fix.
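
In the meantime, if you need the logged figures to line up more closely with what clients actually receive, a rough correction can be applied after the fact.  The sketch below is only illustrative: the 30-byte overhead is simply the average gap we happened to see, so treat it as an assumption to measure for your own setup.

# Rough sketch: correct HAProxy's logged outgoing byte counts for billing.
# HAProxy records the bytes it received from the backend, so the headers and
# cookies it injects afterwards are missing from the count.  The 30-byte
# figure is the average gap we observed and is purely illustrative.
HEADER_OVERHEAD = 30  # assumed average bytes injected per response

def billable_bytes(logged_bytes, response_count):
    """Estimate the bytes actually sent to clients from the logged totals."""
    return logged_bytes + response_count * HEADER_OVERHEAD

if __name__ == '__main__':
    # e.g. 1,000,000 logged bytes over 5,000 responses
    print(billable_bytes(1000000, 5000))  # 1150000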

Sunday, 9 February 2014

Working with Syslog

As part of my transition to HP's Advanced Technology Group I am winding down my contributions to HP Cloud's Load Balancer as a Service project (where I was technical lead).  I thought I would write a few blog posts on things we experienced on this project.

Whilst we were working on the Stackforge Libra project we added a feature that uploads a load balancer's log file to Swift.  To do this we use syslog to write the HAProxy log to a separate file; that syslog-generated file is what Libra uploads.

Our installation of Libra is on an Ubuntu LTS setup, but the instructions should be very similar for other Linux distributions.

Logging to a New File


Syslog has the facility to capture certain log types and split them into separate files.  You can see it doing this for various files in /var/log/.  Having syslog handle this makes the logs far easier to manage, rotate and so on than having daemons write files in their own unique ways.

To do this for HAProxy we create the file /etc/rsyslog.d/10-haproxy.conf with the following contents:

$template Haproxy,"%TIMESTAMP% %msg%\n"
local0.* -/mnt/log/haproxy.log;Haproxy
# don't log anywhere else
local0.* ~

In this example we are using /mnt/log/haproxy.log as the log file; our servers have an extra partition mounted there to hold the log files.  The log will be written in a format similar to this:

Dec 10 21:40:20 74.205.152.216:54256 [10/Dec/2013:21:40:19.743] tcp-in tcp-servers/server20 0/0/1075 4563 CD 17/17/17/2/0 0/0
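
If you later need to pull numbers out of this file (for example the byte counts used for billing), the fields are straightforward to split apart.  The sketch below is keyed to the TCP-mode line shown above; HTTP-mode lines carry extra fields, so treat the positions as an assumption to check against your own output.

# Minimal sketch: split apart a TCP-mode log line in the format shown above.
# HTTP-mode lines contain additional fields, so these positions are only
# valid for lines like the example.
def parse_tcp_log_line(line):
    fields = line.split()
    return {
        'client': fields[3],             # 74.205.152.216:54256
        'frontend': fields[5],           # tcp-in
        'backend_server': fields[6],     # tcp-servers/server20
        'bytes_read': int(fields[8]),    # 4563
        'termination_state': fields[9],  # CD
    }

example = ('Dec 10 21:40:20 74.205.152.216:54256 [10/Dec/2013:21:40:19.743] '
           'tcp-in tcp-servers/server20 0/0/1075 4563 CD 17/17/17/2/0 0/0')
print(parse_tcp_log_line(example))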

From here you can add a logrotate script called /etc/logrotate.d/haproxy as follows:

/mnt/log/haproxy.log {
    weekly
    missingok
    rotate 7
    compress
    delaycompress
    notifempty
    create 640 syslog adm
    sharedscripts
    postrotate
        # rsyslog owns this file, so it (not haproxy) must reopen it after rotation
        reload rsyslog > /dev/null 2>&1 || true
    endscript
}

This will rotate weekly, compressing old logs and retaining up to 7 log files.

Syslog Flooding


We soon found a problem: the generated log file was not recording all of the entries on a load balancer that was being hammered.  When looking through the main syslog file for clues we discovered that the flood protection had kicked in and we were seeing log entries such as the following:

Jul 3 08:50:16 localhost rsyslogd-2177: imuxsock lost 838 messages from pid 4713 due to rate-limiting

Thankfully this flood protection can be tuned relatively simply by editing /etc/rsyslog.conf and adding the following to the end of the file:

$SystemLogRateLimitInterval 2
$SystemLogRateLimitBurst 25000

Then syslog needs to be restarted (a reload doesn't seem to apply the change):

sudo service rsyslog restart

After this we found all log entries were being recorded.
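
To close the loop on the Swift upload mentioned at the start, pushing the resulting file up can be done with python-swiftclient.  The sketch below is illustrative only: the auth endpoint, credentials and container name are placeholders, and Libra's actual upload code differs.

# Illustrative sketch of pushing the syslog-generated file up to Swift using
# python-swiftclient.  The auth URL, credentials and container name are all
# placeholders.
from swiftclient import client

def upload_log(path):
    conn = client.Connection(
        authurl='https://identity.example.com/v2.0',  # placeholder endpoint
        user='username',
        key='password',
        tenant_name='tenant',
        auth_version='2.0',
    )
    conn.put_container('lbaas-logs')
    with open(path, 'rb') as f:
        conn.put_object('lbaas-logs', 'haproxy.log', contents=f)

upload_log('/mnt/log/haproxy.log')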

Monday, 20 January 2014

The importance of backup verification

I have recently moved to HP's Advanced Technology Group, a new group within HP, and as part of that I will be blogging a lot more about the Open Source things I and others in HP work on day to day.  I thought I would kick this off by talking about work that a colleague of mine, Patrick Crews, did several months ago.

For those who don't know Patrick, he is a great Devops Engineer and QA.  He will find new automated ways of breaking things that will torture applications (and the Engineers who write them). I don't know if I am proud or ashamed to say he has found many bugs in code that I have written by doing the software equivalent of beating it with a sledgehammer.

Every Devops Engineer worth his salt knows that backups are important, but one thing that is regularly forgotten is to check whether the backups are any good.  A colleague of mine from several years back, Ian Anderson, once told me about his hunt for a good tape archive vendor.  He tested them by getting them to pick a randomly selected tape from the archives and read it back, timing how long it took to do so.  You would be surprised how many vendors couldn't perform this task; I'd hate to see what would happen in a real emergency.

There was also the case of CouchSurfing, which was crippled back in 2006 when, after a massive failure, they found their backups to be bad.  The site was eventually rebuilt and is doing great today, but this kind of damage can cost even a small company many thousands of dollars.

The main thing I am trying to stress here is that it is important not just to make backups but also to make sure the backups are recoverable.  This should be done with verification and even fire drills.  There may come a time when you really need that backup in an emergency, and if it isn't there, well, you just burnt your house down.

Before I was a member of the Advanced Technology Group I was the Technical Lead for the Load Balancer as a Service project for HP Cloud.  We had a very small team and needed a backup and verification solution that was mostly automated and reliable.  This thing would be hooked up to our paging system and none of my team like being woken at 2AM :)

This is a very crude diagram of the solution Patrick developed:
The solution works as follows:
  1. Every X minutes Jenkins will tell the database servers to make a backup
  2. The MySQL database servers encrypt that backup and push it up to the cloud file storage (Openstack Swift)
  3. Jenkins would then trigger a compute instance (Openstack Nova) build and install MySQL on it
  4. The new virtual machine would grab the backup, decrypt it, restore it and run a bunch of tests on the data to see if it is valid
  5. If any of the above steps fail, a page is sent out to people
Most of the above uses Salt to communicate across the machines.  Have we ever been paged by this system?  Yes, but so far only because step 3 failed, either due to a Nova build failure or, once, due to a Salt version incompatibility.  We have since added some resilience into the system to stop this happening again.
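
To give a flavour of step 4, here is a rough sketch of what a verification run might look like.  It is purely illustrative: the container name, object name, database name and sanity query are all placeholders, and the real pipeline is driven by Jenkins and Salt rather than a standalone script.

# Illustrative sketch of the restore-and-verify step (step 4 above).  All the
# names below (container, object, database, query) are placeholders.
import subprocess
import sys

def run(cmd):
    print('+ ' + ' '.join(cmd))
    subprocess.check_call(cmd)

def verify_backup():
    # Fetch the encrypted backup from Swift (credentials come from the usual
    # OS_* environment variables)
    run(['swift', 'download', 'db-backups', 'latest.sql.gpg'])
    # Decrypt it
    run(['gpg', '--batch', '--output', 'latest.sql', '--decrypt', 'latest.sql.gpg'])
    # Restore it into the freshly built MySQL instance
    run(['mysql', '-e', 'CREATE DATABASE restore_test'])
    with open('latest.sql') as dump:
        subprocess.check_call(['mysql', 'restore_test'], stdin=dump)
    # Run a sanity check on the restored data
    run(['mysql', 'restore_test', '-e', 'SELECT COUNT(*) FROM loadbalancers'])

if __name__ == '__main__':
    try:
        verify_backup()
    except subprocess.CalledProcessError:
        sys.exit(1)  # a non-zero exit is what triggers the page in step 5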

As well as the above there are monthly fire drills to manually test that we can restore from these backups.  We also regularly review the testing procedures to see if we can improve them.

This is going to sound a little strange, but sometimes the best Devops Engineers are lazy.  By that I don't mean they don't do any work (they are some of the hardest working people I know), but that they automate everything so they don't have to do a lot of boring manual labour, because they hate being woken by pagers at silly hours of the morning.  Some of the best Devops Engineers I know think in exactly these terms :)

Friday, 18 October 2013

Stackforge Libra - Balance your life!

I have been pretty quiet on the blogging front for quite a long time now.  The main reason for this is that I have been working very hard on leading a small team which is developing a Stackforge project called 'Libra'.  As you can probably guess from the name, Libra is a Load Balancer as a Service system.  Many of you may not have heard of it, but according to Stackalytics it was the 27th biggest project in terms of code contributed during Havana and the 2nd biggest in HP (something I am especially proud of because it has had one of the smallest teams in HP Cloud).

It is based on the Atlas API specifications, creates software-based load balancers, is implemented in Python and sits on top of Nova instances rather than working under the cloud.  It also has several unique features which could be converted to run any service on top of a cloud.

Components

Libra consists of several components; the service components are designed to be installed on multiple instances to create a highly available setup:

  • API server - A Pecan-based API server whose API is based on the Atlas API spec.
  • Admin API server - A Pecan-based administrative API server (work in progress) and a whole bunch of modules which automatically maintain the health of the Libra system.
  • Pool Manager - A Gearman service which will provide the rest of the system with load balancer devices and floating IPs.
  • Worker - A Gearman service sitting on each load balancer to configure it.
There is also an easy-to-use command line client called python-libraclient.
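
As a rough illustration of what driving the Atlas-style API looks like, here is a sketch of creating a load balancer with a plain HTTP request.  The endpoint, token handling and exact body fields here are assumptions for illustration only; check the developer docs, or just use python-libraclient, for the real interface.

# Rough illustration of creating a load balancer through an Atlas-style API.
# The endpoint URL, token and request fields are placeholders; consult the
# LBaaS developer docs (or use python-libraclient) for the real interface.
import json
import requests

ENDPOINT = 'https://lbaas.example.com/v1.1/loadbalancers'  # placeholder
TOKEN = 'auth-token-from-keystone'                         # placeholder

payload = {
    'name': 'my-load-balancer',
    'protocol': 'HTTP',
    'port': 80,
    'nodes': [
        {'address': '10.0.0.4', 'port': 80},
        {'address': '10.0.0.5', 'port': 80},
    ],
}

resp = requests.post(
    ENDPOINT,
    headers={'X-Auth-Token': TOKEN, 'Content-Type': 'application/json'},
    data=json.dumps(payload),
)
print(resp.status_code)
print(resp.text)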

The service requires Gearman to communicate between the various components and MySQL as a load balancer configuration data store. Our team has been developing SSL support in Gearman to increase the security of the communications and Libra is probably the first open source project to support this.  Security is a high priority for HP and the Libra team take it very seriously. Libra also supports Galera clusters for MySQL by having built-in auto-failover and recovery as well as commit deadlock detection and transaction retry.  The codebase can support any load balancing software (and in theory hardware) through plugins to the Worker but currently HAProxy is the only plugin being developed.
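
The deadlock handling mentioned above follows a common pattern when writing to a Galera cluster: catch MySQL's deadlock error and retry the transaction a few times.  The sketch below shows the general idea using MySQLdb; it is not Libra's actual implementation, and the error codes and retry count are just the usual conventions rather than anything Libra-specific.

# General sketch of the retry-on-deadlock pattern used when writing to a
# Galera cluster.  Not Libra's actual code; the error codes (1213 deadlock,
# 1205 lock wait timeout) and retry count are simply the usual values.
import MySQLdb

RETRYABLE_ERRORS = (1205, 1213)
MAX_RETRIES = 3

def run_transaction(conn, statements):
    """Execute a list of (sql, params) tuples, retrying on deadlock."""
    for attempt in range(MAX_RETRIES):
        cursor = conn.cursor()
        try:
            for sql, params in statements:
                cursor.execute(sql, params)
            conn.commit()
            return
        except MySQLdb.OperationalError as exc:
            conn.rollback()
            if exc.args[0] not in RETRYABLE_ERRORS or attempt == MAX_RETRIES - 1:
                raise
        finally:
            cursor.close()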

Auto-Recovery

Libra also has an intelligent auto-recovery system built in. In HP we are currently testing the 4.x release in a few racks that have a flaky networking setup and it has been consistently repairing devices just like a T-1000 repairs bullet holes.  Somehow we have even been able to create usable load balancers from this setup too!
The system is designed to keep a constant pool of hot spare load balancers on standby so that one can be provisioned very quickly.  The health of this pool is constantly checked; bad devices in the pool are destroyed and replaced automatically with new ones.

With version 3.x (HP Cloud Private Beta) we are seeing provision times of roughly 300ms.  This has increased slightly for Libra 4.x because it introduces floating IP support (called Virtual IPs in the Atlas API).  The floating IP support means that the Libra system can automatically detect when a load balancer has failed, rebuild it on a new hot spare within seconds and move the IP address accordingly.

Other Uses

The architectural design of Libra is such that it could be used to create any "Platform-as-a-Service" on top of Nova (or, with minor modifications, any other cloud).  The system can be adapted to work with anything by changing the API and giving it a Worker plugin.  This gives the Libra codebase the potential to become a general framework in the future.

Libra and HP

Libra 3.x is currently installed as our LBaaS Private Beta and an installation of 4.x will be coming in the next few weeks.  We have learnt a lot from customer feedback and have a lot of interesting ideas in the pipeline or in testing (such as Galera cluster load balancers, which are handy for DBaaS!).  If you want to find out more about Libra you can see the HP Cloud LBaaS page, look at the developer docs, come and chat to us on the #stackforge-libra Freenode IRC channel, and if you want to give it a spin take a look at our PPA.

We want to be more open with the development process of Libra and are taking steps to ensure that happens by engaging more with the wider community; resource constraints have made this difficult up until now.  We are happy to help anyone who wants to get started playing with or hacking on any part of it.

Sunday, 24 February 2013

First version of Drizzle Tools for MySQL servers released

Today marks the first release of Drizzle Tools for MySQL servers.  Drizzle Tools aims to be a collection of useful utilities to use with MySQL servers based around the work on the Libdrizzle Redux project.

In this first version there is one utility in the tree, called 'drizzle-binlogs'.  If you've seen me talk about this tool before, it is because it used to be included in the Libdrizzle 5.1 source but has now been moved here to be developed independently.  For those who haven't, 'drizzle-binlogs' is a tool which connects to a MySQL server as a slave, retrieves the binary log files and stores them locally.  This could be used as part of a backup solution or as a rapid way to help create a new MySQL master server.

Due to the API changes made before the Libdrizzle API became stable, Drizzle Tools requires a minimum of Libdrizzle 5.1.3 to be installed.

I wanted to release this sooner but unfortunately most of my time has been taken up with the first release of the project I manage and develop for my day job (HP Cloud's Load Balancer as a Service, more about this in a future blog post).

In the not too distant future there will be more tools included in the Drizzle Tools releases; I already have the next one 50% developed.  In the meantime you can download the first version here.