The Paradox Of The Mail Server On The Cloud

Providing your web application with a mail service that works flawlessly is probably essential for your business. You need to send activation emails to users, password reset emails, newsletters, and probably a whole bunch of other emails that have to do with interactions with your application.

When there were only physical servers and static IP addresses, everything worked perfectly. But now that your application is in the cloud, setting up a working mail server next to it is close to impossible. If your application is successful and you would like to send emails to your millions of satisfied users, your options come down to:

  1. Use a physical hosted server.
  2. Use a 3rd party email service.
  3. Set up a mail server in the cloud and compromise on some/most being marked as spam.

For us cloud-oriented developers, option 1 is as useful as somebody suggesting you use a cassette tape recorder for your favorite songs. It’s old, unreliable, and can’t scale. Option 2 is very costly if your business is successful, and most of these services can’t handle the volume of mail you need to send if you have a large user base. Option 3 will make your email communication with your users almost non-existent, which means you can’t afford it either. So your only option is to compromise somewhere.

Why is sending email from the cloud so difficult?

In order for your mail server to operate successfully and be trusted by mail services around the world, you need to abide by the following rules:

  1. Don’t be an open relay.
  2. Implement (and follow) SPF policy (and DKIM if possible).
  3. Have a PTR record that resolves back exactly to your mail server hostname.
  4. Don’t let your public IP address be listed in any RBLs.

Rule #1 is easily implemented in any mail server configuration, and there are also a number of online tools to test whether you’re an open relay. Rule #2 is also pretty easy to implement, assuming you control your DNS zone files and know your way around them.
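
To see whether an SPF policy is actually published for a domain, you can look at its TXT records. The helper below is only a sketch: the function name is made up for illustration, it assumes the `dig` utility is installed, and it assumes the policy fits in a single quoted TXT string.

```shell
# extract_spf: read TXT record strings on stdin, one per line, in the quoted
# form printed by `dig +short TXT`, and print only the SPF policy with the
# surrounding quotes removed. (Hypothetical helper name.)
extract_spf() {
    sed -n 's/^"\(v=spf1[^"]*\)"$/\1/p'
}

# Usage (example.com is a placeholder domain):
#   dig +short TXT example.com | extract_spf
```

If nothing is printed, the domain has no single-string SPF record and receiving servers have no policy to check your mail against.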

The problem of mail on the cloud begins with rules #3 and #4. A PTR record, which is a reverse DNS entry, must be present and correct for your mail server not to be considered spammy. If your mail server sits at a given IP address and goes by a given hostname, the PTR query for that address (well, for its reversed form under in-addr.arpa) must return exactly that hostname. The PTR record can only be changed by the owner of the IP address, or by someone to whom the owner has delegated that authority. Amazon Web Services does not let you control PTR records, so there goes the option of a mail server on EC2.
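
Here is a rough sketch of how rule #3 can be checked from the command line. The function names are hypothetical, the IP address in the usage comment comes from a documentation-only range, and the lookup assumes `dig` is installed.

```shell
# ptr_name IPV4: build the name that a PTR query for an IPv4 address
# really asks for (octets reversed, under in-addr.arpa).
ptr_name() {
    echo "$1" | awk -F. '{ print $4 "." $3 "." $2 "." $1 ".in-addr.arpa" }'
}

# ptr_matches IPV4 HOSTNAME: succeed only if the PTR answer for IPV4 is
# exactly HOSTNAME. `dig +short -x` prints the name with a trailing dot,
# which is stripped before comparing.
ptr_matches() {
    answer=$(dig +short -x "$1" | sed 's/\.$//')
    [ "$answer" = "$2" ]
}

# Usage (placeholder values):
#   ptr_name 192.0.2.25
#   ptr_matches 192.0.2.25 mail.example.com && echo "PTR is fine"
```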

Other clouds let you control the PTR records for the IP addresses they assigned to you. But they fail on Rule #4. While your specific IP address might not be blacklisted in RBLs, the entire block that it belongs to might be blacklisted, because these IP addresses are assigned dynamically and therefore are always suspected as spammy by these lists. This is the case with Rackspace Cloud for example, and is the only thing left to be solved before you can run a mail server there. And although they’re trying to get their address block de-listed, this problem still persists.
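
RBLs are queried over DNS: you reverse the octets of the IP address, append the zone of the list, and ask for an A record. An answer means the address is listed. The sketch below assumes `dig` is installed; the function names and the `dnsbl.example.org` zone are placeholders, not a real list.

```shell
# dnsbl_query_name IPV4 ZONE: build the DNS name to look up for an
# address in a given blacklist zone.
dnsbl_query_name() {
    echo "$1" | awk -F. -v zone="$2" '{ print $4 "." $3 "." $2 "." $1 "." zone }'
}

# is_listed IPV4 ZONE: succeed if the list returns any A record for the address.
is_listed() {
    [ -n "$(dig +short "$(dnsbl_query_name "$1" "$2")" A)" ]
}

# Usage (placeholder address and zone):
#   is_listed 192.0.2.25 dnsbl.example.org && echo "listed"
```

A list that blacklists a whole dynamic block will return an answer for every address in it, which is exactly the problem described above.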

Other clouds I’ve examined in this space are GoGrid and Joyent. GoGrid wants you to fill out a questionnaire, and only then do they open up port 25 for you. This sounds absurd, and goes against the whole on-demand nature of the cloud (and I also personally don’t trust ServePath, the company that operates GoGrid). Joyent’s offering seems to disregard the option of hosting a mail server with them, and I couldn’t get their response on this matter.

So unless Rackspace Cloud solve their IP block blacklisting problem, or AWS offer a PTR setting option (plus no blacklisting as well), we’re left with the need to compromise.

The only feasible solution right now seems to be going back to physical hosting.

Cron Script To Snapshot Any Attached EBS Volume

If you would like to cron snapshots of every volume attached to an instance, you can use the following script. It uses the EC2 command line tools to see which volumes are currently attached to this instance, and takes a snapshot of each. Make sure to replace all the variables at the top of the script to match your own.


export JAVA_HOME=/usr/java/default
export EC2_HOME=/vol/snap/ec2-api-tools-1.3-26369

INSTANCE_ID=`curl -s`
echo "Instance ID is $INSTANCE_ID"
VOLUMES=`$EC2_HOME/bin/ec2-describe-volumes | grep "ATTACHMENT" | grep "$INSTANCE_ID" | awk '{print $2}'`
echo "Volumes are: $VOLUMES"

for VOLUME in $VOLUMES; do
        echo "Snapping Volume $VOLUME"
        DEVICE=`$EC2_HOME/bin/ec2-describe-volumes $VOLUME | grep "ATTACHMENT" | grep "$INSTANCE_ID" | awk '{print $4}'`
        echo "Device is $DEVICE"
        MOUNTPOINT=`df | grep "$DEVICE" | awk '{print $6}'`
        echo "Mountpoint is $MOUNTPOINT"

        # Snapshot
        SNAPSHOT_ID=`$EC2_HOME/bin/ec2-create-snapshot $VOLUME`

        echo "Snapshotted: $SNAPSHOT_ID"
done


If you’re wondering why $MOUNTPOINT is important (it’s not used here after all), it’s because you might want to freeze your filesystem if it’s XFS, so you could safely take a snapshot of a MySQL database for example. So you could easily wrap the snapshot create command with this:

        # freeze
        xfs_freeze -f $MOUNTPOINT

        # Snapshot
        SNAPSHOT_ID=`$EC2_HOME/bin/ec2-create-snapshot $VOLUME`

        # unfreeze
        xfs_freeze -u $MOUNTPOINT
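
One thing to watch out for with the wrapper above: if the snapshot call fails, you still want the filesystem unfrozen. Here is a minimal sketch of that idea. `FREEZE_CMD` and `SNAPSHOT_CMD` are hypothetical variables I added so the function can be dry-run with stand-in commands; by default they point at `xfs_freeze` and `ec2-create-snapshot`.

```shell
# Hypothetical overridable command names, so the sketch can be tried
# without XFS or the EC2 tools installed.
FREEZE_CMD=${FREEZE_CMD:-xfs_freeze}
SNAPSHOT_CMD=${SNAPSHOT_CMD:-ec2-create-snapshot}

# snapshot_frozen MOUNTPOINT VOLUME
# Freeze the filesystem, take the snapshot, and unfreeze no matter what.
# Returns the exit status of the snapshot command.
snapshot_frozen() {
    mountpoint=$1
    volume=$2

    # Give up early if the freeze itself fails; nothing to undo yet.
    "$FREEZE_CMD" -f "$mountpoint" || return 1

    "$SNAPSHOT_CMD" "$volume"
    status=$?

    # Unfreeze even when the snapshot call failed.
    "$FREEZE_CMD" -u "$mountpoint"

    return "$status"
}
```

Inside the loop you would call it as `snapshot_frozen "$MOUNTPOINT" "$VOLUME"`, with `SNAPSHOT_CMD` set to the full path under `$EC2_HOME/bin`.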

And if you are indeed using this script to snapshot a volume with MySQL on it, you also need to flush tables with read lock, and gather information on master and slave positions. For this task you can use Eric Hammond’s script, and incorporate it into the cron script. (You can read more about MySQL and XFS on EC2 on the AWS site).

Google Chrome 2, CSS Content-Type and Amazon S3

It seems that ever since Google Chrome 2 was released, some of the CSS files I was serving from S3 were not being treated as valid by it, and the page layouts would break because of it. Firefox and IE were fine with it, and Chrome 1 was ok with it too. It was just Chrome 2.

A little inspection showed that the CSS files I stored on S3 were not being served with a Content-Type header, while from a standard Apache web server they were. This, combined with the new strictness of Chrome 2 (actually resulting from a new strictness in WebKit), made Chrome not treat these files as actual CSS, and break the page.

So the obvious solution was to make the CSS files be delivered from S3 with the correct “Content-Type: text/css” header. Fortunately, this is very easy to do with the S3 API. Just pass the “Content-Type: text/css” header when you’re uploading the file to S3, and it will be present in the response headers whenever someone requests the file.
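
As an illustration, here is a small sketch of an upload that always sets the header. It assumes the present-day AWS CLI (`aws s3 cp` with its `--content-type` option), which is not the tooling I used at the time; the function names and the short extension table are my own and only cover a few common types.

```shell
# content_type_for FILE: pick a Content-Type from the file extension.
content_type_for() {
    case "$1" in
        *.css)        echo "text/css" ;;
        *.js)         echo "application/javascript" ;;
        *.html|*.htm) echo "text/html" ;;
        *.png)        echo "image/png" ;;
        *)            echo "application/octet-stream" ;;
    esac
}

# upload_with_type FILE BUCKET: upload FILE to the bucket root with an
# explicit Content-Type. Assumes the AWS CLI is installed and configured.
upload_with_type() {
    aws s3 cp "$1" "s3://$2/$(basename "$1")" \
        --content-type "$(content_type_for "$1")"
}

# Usage (placeholder bucket name):
#   upload_with_type style.css my-bucket
```

Afterwards, a HEAD request against the object (for example with `curl -sI`) should show the Content-Type header in the response.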

Here’s to the browser wars, that never end and got more complicated with the new player in town, Google Chrome.

Detaching Infrastructure From Physical Hosts: Fantasy vs. Reality


Cloud computing has brought along the promise of easy-to-scale-and-yet-affordable computer clusters. There are various clouds out there that provide Infrastructure as a Service, such as Amazon EC2, Google App Engine, Mosso, and the newcomer Sites to name a few. I personally have experience as a developer only with Amazon EC2, and I am a devoted fan and user of the entire AWS stack. Nonetheless, I believe that what I have to say here is relevant to all other platforms.

While the cloud and IaaS model do have many significant advantages over traditional physical hosting, there is one major annoyance still to overcome in this space, and that is: your virtual host is still connected to a physical machine. And that machine is non-redundant, it doesn’t have any hot backup, and there’s no transparent, hassle-free way to fail over from it once it’s malfunctioning. And this is why, from time to time, I get this email from Amazon:


We have noticed that one or more of your instances are running on a host degraded due to hardware failure.


The host needs to undergo maintenance and will be taken down at XX:XX GMT on XXXX-XX-XX. Your instances will be terminated at this point.

The risk of your instances failing is increased at this point. We cannot determine the health of any applications running on the instances. We recommend that you launch replacement instances and start migrating to them.

Feel free to terminate the instances with the ec2-terminate-instance API when you are done with them.

Let us know if you have any questions.


The Amazon EC2 Team

At this stage, this is one of the greatest shortcomings of EC2 from my point of view. As a customer of EC2, I don’t want to care if a host has a hardware failure. Why can’t my instance just be mirrored somewhere else, consistent hot-backup style, and upon failure of the host hardware be transparently switched to the backup host? I don’t mind paying the extra buck for this service.

In my vision, in a true IaaS cloud there is no connection between the virtual machine and the physical host. The virtual machine is truly floating in the cloud, unbound to the physical realm by means of some consistent mirroring across physical hosts.

And you might be thinking “you can implement this on your own on the existing infrastructure that EC2 offers”, and “you should be prepared for any instance going poof”. And you are correct: with the current offering of EC2, this is the case. You always have to be prepared for an instance failure (in the last month, I had 2 physical host failures out of about 20, that’s about 10% a month (!!)), and you always have to build your architecture so that a single host failure can fail over gracefully. But were my vision a reality, I wouldn’t have to worry about these things, and wouldn’t have to spend time and money on the overhead that they incur.

I am not certain that this is the situation in the other clouds, but if it is not, it might come at the price of less flexibility, which is a major part of EC2 that I am not willing to give up. If that flexibility can be maintained, I would love to see my vision become a reality on EC2.

Network Latency Inside And Across Amazon EC2 Availability Zones

I couldn’t find any info out there comparing network latency across EC2 Availability Zones and inside any single Availability Zone. So I took 6 instances (2 in each US zone), ran some tests using a simple ping, and measured 10 Round Trip Times (RTT). Here are the results.

Single Availability Zone Latency

Availability Zone   Minimum RTT   Maximum RTT   Average RTT
us-east-1a          0.215ms       0.348ms       0.263ms
us-east-1b          0.200ms       0.327ms       0.259ms
us-east-1c          0.342ms       0.556ms       0.410ms

It seems that at the time of my testing, zone us-east-1c had the worst RTT between 2 instances in it: its average was roughly 1.6 times that of the other 2 zones.

Cross Availability Zone Latency

Availability Zones                  Minimum RTT   Maximum RTT   Average RTT
Between us-east-1a and us-east-1b   0.885ms       1.110ms       0.937ms
Between us-east-1a and us-east-1c   0.937ms       1.080ms       1.031ms
Between us-east-1b and us-east-1c   1.060ms       1.250ms       1.126ms

It’s worth noting that in cross availability zone traffic, the first ping was usually off the chart, so I disregarded it. For example, it could be anywhere between 300ms and 400ms, and the rest would fall down to ~0.300. Probably some lazy routing techniques by Amazon’s routers.
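
If you want to repeat the measurement, the bookkeeping can be sketched like this. The function name is hypothetical, and the usage line assumes Linux `ping` output, where each reply contains `time=... ms`.

```shell
# rtt_stats: read one RTT value (in ms) per line on stdin, drop the first
# sample, and print "min max avg" for the remaining ones.
rtt_stats() {
    awk 'NR == 1 { next }
         NR == 2 { min = max = $1 }
         {
             if ($1 < min) min = $1
             if ($1 > max) max = $1
             sum += $1
             n++
         }
         END { if (n > 0) printf "%.3f %.3f %.3f\n", min, max, sum / n }'
}

# Usage: 11 pings, so that 10 remain after the first one is dropped
# (other-instance is a placeholder host name):
#   ping -c 11 other-instance | sed -n 's/.*time=\([0-9.]*\) ms.*/\1/p' | rtt_stats
```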


Conclusions:

  1. Zones are not created equal! At least at the time of the testing, the average RTT between machines in us-east-1c was about 1.6 times that of a cluster on us-east-1b (0.410ms vs. 0.259ms).
  2. Cross Availability Zone latency was on average roughly 2 to 4 times higher than inner zone latency, and up to about 6 times higher when you compare the worst cross zone RTT (1.250ms) to the best inner zone RTT (0.200ms). For a network intensive application, better keep your instances crowded in the same zone.

I should probably also make a throughput comparison between and across Availability Zones. I promise to share if I get to test it.

Hardware Failure Apocalypse

I might know a thing or two about handling servers, configs, deployments and cloud architecture. But when it comes to hardware failure on my own workstation, I become a complete layman.

It’s the first time my Lenovo R61 has failed me. It’s running a mighty Ubuntu 8.04, with all the components a hacker needs (from a complete LAMP stack, through PDT and a customized version of svn 1.5.1, to Inkscape and xvidcap…), and it’s the first time that after the system froze and I rebooted, I just gazed at the terminal at startup and shrieked:

Kernel panic - not syncing: Attempted to kill init!

And a whole bunch of other error messages, every time at a different stage in the boot sequence. This behavior, combined with the fact that the system just froze and I didn’t make any dramatic changes, makes me think it’s bad RAM or another hardware component (the disk is of course a candidate too), but sometimes it seems like people get over it by re-installing a kernel.

I don’t know what I prefer, hardware or software failure. I guess that RAM failure is the best, just swap it with new RAM. Disk failure might mean data loss, which I am sure I don’t want to handle, and recompiling the kernel can be a tedious task, but it is preferable to losing data and having to re-install the whole system again.

And what I asked myself, when I rode my bike back home today, is “why can’t I just instantiate a new instance in the cloud with the newest working snapshot of my system? Why is hardware failure in the cloud so easy to deal with, and hardware failure in the office isn’t?” And I had a vision of all the people working on machines similar to mainframe terminals, running only the basic things and having the OS and all the data just sit in the cloud.

This day isn’t far. But tomorrow it’s back to the lab to (hopefully) have my RAM replaced.

How to delete those old EC2 EBS snapshots

EBS snapshots are a very powerful feature of Amazon EC2. An EBS volume is a readily available, elastic block storage device that can be attached, detached and re-attached to any instance in its availability zone. There are numerous advantages to using EBS over the local block storage devices of an instance, and one of the most important of them is the ability to take a snapshot of the data on the volume.

Since snapshots are incremental by nature, after an initial snapshot of a volume, the following snapshots are quick and easy. Moreover, snapshots are always processed by Amazon’s processing power and not by the CPU of your instance, and are stored redundantly on S3. This is why using these snapshots in your backup methodology is a great idea (provided that you freeze/unfreeze your filesystem during the snapshot call, using LVM or XFS for example).

But, and this is a really annoying but: snapshots are “easy come, hard to go”. They are so convenient to use and so reliable, that it’s natural to use a cronned script to make a daily, or hell, even hourly backup of your volume. But then those snapshots keep piling up, and the only way to delete a snapshot is to make a single API call for a specific snapshot. If you have 5 volumes that you back up hourly, you pile up 120 snapshots a day and reach the 500 snapshots limit in just over 4 days. Not very reliable now, huh?
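
The arithmetic is easy to redo for your own setup: divide the limit by the number of snapshots you create per day. A tiny sketch (the function name is made up for illustration):

```shell
# days_until_limit VOLUMES SNAPSHOTS_PER_VOLUME_PER_DAY LIMIT
# Print how many days it takes to hit the snapshot limit.
days_until_limit() {
    awk -v v="$1" -v r="$2" -v limit="$3" \
        'BEGIN { printf "%.2f\n", limit / (v * r) }'
}

# 5 volumes, hourly snapshots, limit of 500:
#   days_until_limit 5 24 500
```

With 5 volumes snapshotted hourly this comes out a bit over 4 days, and with daily snapshots you get 100 days of breathing room.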

I have been searching for a while for an option to bulk delete snapshots. The EC2 API is missing this feature, and the excellent ElasticFox add-on is not compensating. You just can’t bulk delete snapshots.

That is, until now :). I asked in the AWS Forum if there is anything that can be done about this problem. They replied it’s a good idea, but if I really wanted it to be implemented quickly, I should build my own solution using the API. So I took the offer, and came up with a PHP command line tool that tries to emulate an “ec2-delete-old-snapshots” command, until one is added to the API.

The tool is available on Google Code for checkout. It uses the PHP EC2 library, which I bundled in (hope I didn’t break any licensing terms, please alert me if I did).

Usage is easy:

php ec2-delete-old-snapshots.php -v vol-id [-v vol-id ...] -o days

If you wanted to delete ec2 snapshots older than 7 days for 2 volumes you have, you would use:

php ec2-delete-old-snapshots.php -v vol-aabbccdd -v vol-bbccddee -o 7
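
The selection rule behind those options is simple, and can be sketched in a few lines of shell. This is not the code of the PHP tool, just an illustration of the idea: the function name and the three-column input format (snapshot ID, volume ID, start time as a Unix timestamp) are assumptions made for the example.

```shell
# old_snapshots NOW_EPOCH MAX_AGE_DAYS VOLUME_ID...
# Read "SNAPSHOT_ID VOLUME_ID START_EPOCH" lines on stdin (a hypothetical,
# simplified listing format) and print the IDs of snapshots that belong to
# one of the given volumes and started more than MAX_AGE_DAYS days before
# NOW_EPOCH.
old_snapshots() {
    now=$1
    days=$2
    shift 2
    awk -v now="$now" -v days="$days" -v vols="$*" '
        BEGIN {
            n = split(vols, list, " ")
            for (i = 1; i <= n; i++) wanted[list[i]] = 1
            cutoff = now - days * 86400
        }
        ($2 in wanted) && ($3 < cutoff) { print $1 }'
}

# Usage (snapshots.txt is a placeholder file in the format above):
#   old_snapshots "$(date +%s)" 7 vol-aabbccdd vol-bbccddee < snapshots.txt
```

Every ID it prints is a candidate for one delete call, which is the loop the tool runs for you.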

Hope this helps all you people out there who need such a thing. I will be happy to receive feedback (and bug fixes) if you start using this.