Dr. SpotInstance

I recently moved all my compute resources over to Spot Instances in AWS. I love it. If you’re not using Spot in your applications, I hope this will convince you to look into it.

A Demanding Infrastructure

Your standard instance is known as an On Demand instance, meaning you launch and terminate it as needed. For this, you pay a set rate for as long as it runs (billed per second for most Linux instances, with a one-minute minimum). Once launched, the instance will continue to operate until something terminates it.

Reasons for termination:

Termination requested through the console or API.
An autoscaling group selects it to scale in.
An ELB marks it unhealthy and an ASG terminates it.
The hardware underneath it is deprecated and Amazon schedules it for retirement. Amazon notifies you in these situations.
The hardware underneath has a catastrophic failure. These are extremely rare at this point.

When you request a new instance, AWS provides it on a best-effort basis. I have encountered situations where a specific instance type in a specific Availability Zone isn’t available. This happens infrequently, but is always a possibility. When it does, the only options are to wait for more capacity to become available, or to move the workload to a different instance class and/or a different AZ.
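That fallback can be automated. Here’s a minimal sketch of the idea, not a real launcher: `launch` is a hypothetical function you would supply (for example, a wrapper around your own provisioning code), and `CapacityUnavailable` is a made-up exception standing in for whatever your launcher raises when a type/AZ combination has no capacity (the EC2 API reports this as an `InsufficientInstanceCapacity` error).

```python
from itertools import product


class CapacityUnavailable(Exception):
    """Hypothetical error: the requested type/AZ has no capacity right now."""


def launch_with_fallback(launch, instance_types, zones):
    """Try each (instance type, AZ) pair in order until one launches.

    `launch` is a caller-supplied function (an assumption of this sketch)
    that starts an instance and raises CapacityUnavailable on failure.
    """
    for instance_type, zone in product(instance_types, zones):
        try:
            return launch(instance_type, zone)
        except CapacityUnavailable:
            continue  # No capacity here; try the next combination.
    raise CapacityUnavailable("no capacity in any requested type/AZ combination")
```

List your preferred instance types and AZs first. The function walks every zone for the first type before moving on to the next type.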

Spotty Reception

Spot Instances are just like On Demand instances, except for a few key points:

You can set the maximum price you’re willing to pay for the instance. This can result in huge savings over On Demand.
AWS can terminate your spot instance at any time, with a 2-minute warning.

As long as the price specified is higher than the current spot market price, and there is capacity, the request will be fulfilled.
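To make that rule concrete, here’s a small sketch. The function names and all of the prices are hypothetical, picked only for illustration, and the savings figure assumes you are charged the current market price rather than your maximum.

```python
def request_fulfilled(max_price, market_price, capacity_available):
    """Mirror the rule above: max price above market, and capacity exists."""
    return capacity_available and max_price > market_price


def savings_percent(on_demand_price, spot_price):
    """Percentage saved by paying the spot price instead of On Demand."""
    return (on_demand_price - spot_price) / on_demand_price * 100


# Hypothetical hourly prices in USD, chosen only for illustration.
on_demand = 0.10
market = 0.03
my_max = 0.05

print(request_fulfilled(my_max, market, capacity_available=True))   # True
print(request_fulfilled(my_max, market, capacity_available=False))  # False
print(round(savings_percent(on_demand, market)))                    # 70
```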

They’re Not Bids

Formerly, the Spot Market operated on bids. That meant another customer launching an instance with a higher bid price than yours could cause you to lose your spot instance. In November 2017, AWS changed the way Spot works: it now terminates spot instances only when there isn’t enough capacity to fulfill an On Demand request or when the market price rises above your specified maximum. The market price was previously re-evaluated every few minutes; now that happens fewer than a handful of times per day, if at all. This means the spot price is far more stable than in the past. There may be other customers willing to pay more than you, but if AWS doesn’t re-evaluate the market, the price won’t change and your instance won’t be terminated.

What about Reserved Instances?!

Reserved Instances tell Amazon that you want to run a given On Demand instance for 1 or 3 years, and in exchange for reserving it, they give you a discount. Depending on the options, you get on average 30% to 50% off the On Demand price. If you choose to buy that reservation in a specific Availability Zone, it also provides guaranteed capacity. It’ll always be there for you to use. That reserved capacity comes with a trade-off: you have to use that specific AZ. If you’re experiencing an outage related to an AZ, you can’t move that reserved capacity to another one. You’re stuck either buying reservations in multiple AZs or running On Demand in a different AZ. If your footprint is small (like a few instances), running RIs might make sense.

A False Sense of Security

Switching to spot increases your operational integrity:

The volatility of the instances forces you to automate their instantiation and configuration. Launch configs and configuration management tools ensure your instances come back the way you expect. Combined with immutable deployments, you have a rock-solid infrastructure.
The 2-minute warning forces you to think about your workload in pieces no larger than 2 minutes. The longer a job runs, the more risk it’s exposed to. Handling smaller blocks of work results in less cleanup if something goes wrong. You may not be able to break the work down into chunks smaller than 2 minutes, but you still have to handle the case where a job doesn’t complete before the instance goes away. If you automate this process, an instance termination requires less human intervention. By thinking about spot, you’re also preparing for the loss of instances for other reasons.
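One way to picture "small pieces of work" is a loop that checks for a stop signal between chunks and reports where it left off. This is only a sketch under assumptions: `handle` and `should_stop` are hypothetical callables you would provide, and persisting the returned position (to a queue, database, or file) is left out.

```python
def process_in_chunks(items, handle, should_stop, start=0, chunk_size=10):
    """Process items[start:] in small chunks and return the resume index.

    `should_stop` is checked between chunks, so at most one chunk of work
    is in flight when a termination warning arrives.
    """
    position = start
    while position < len(items):
        if should_stop():
            break  # Warning seen: stop before starting another chunk.
        for item in items[position:position + chunk_size]:
            handle(item)
        # Checkpoint: everything before `position` is done.
        position = min(position + chunk_size, len(items))
    return position
```

A replacement instance can pass the returned index back in as `start` and pick up where the old one stopped. Size the chunks so one chunk comfortably finishes inside the 2-minute window.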

A Diversified Portfolio

To gain a high level of confidence in available capacity, I use Spot Fleets. Spot Fleets allow you to define which instance types you want to use and the most you’d like to pay per instance hour. Fleets also function like Auto Scaling Groups, with options for scaling actions and maintaining instance counts. In a given spot fleet, you have the concept of an Instance Pool, which is a combination of an instance type and an AZ. If you have a spot fleet with 4 instance types in 3 AZs, you have 4 x 3 = 12 Instance Pools. You can use the AvailableInstancePoolsCount metric in CloudWatch to see how diverse your fleets are. Specifying a max price in your fleet configuration that is lower than the current market price reduces this metric, so it’s a good one to monitor for the overall health of your fleets.
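The pool arithmetic is easy to check locally. The sketch below enumerates pools as (instance type, AZ) pairs and then filters them by a max price. The instance types, zones, and prices are example values only, and `pools_within_price` is a rough local illustration of the idea, not how CloudWatch computes its metric.

```python
from itertools import product


def instance_pools(instance_types, zones):
    """Every (instance type, AZ) pair is one instance pool."""
    return list(product(instance_types, zones))


def pools_within_price(market_prices, max_price):
    """Pools whose current market price does not exceed the max price."""
    return [pool for pool, price in market_prices.items() if price <= max_price]


# Example values only; substitute your own types and zones.
types = ["m5.large", "m5.xlarge", "c5.large", "c5.xlarge"]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
pools = instance_pools(types, zones)
print(len(pools))  # 12

# Hypothetical market prices: xlarge pools cost more than large ones.
market_prices = {
    pool: (0.08 if pool[0].endswith(".xlarge") else 0.04) for pool in pools
}
print(len(pools_within_price(market_prices, max_price=0.05)))  # 6
```

With these made-up numbers, a max price of 0.05 rules out half of the pools, which is the kind of drop in diversity worth watching for.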

How Do I Get Started?

The first thing you need to properly leverage Spot Instances is automated instance deployment, using either configuration management software or machine images to ensure your instances are configured the way you need when they boot up. I’ve used both methods and heavily recommend a software-based approach.

While working on that, you should also look at your applications and make sure they can gracefully shut down within 2 minutes of notification. If they can’t, you’ll want an automated process for recovering the state of that application when you lose an instance. As a side benefit, this work improves the health of your apps in general.
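The notification itself shows up in the instance metadata: the `spot/instance-action` path returns a 404 until an interruption is scheduled, then a small JSON document with the action and time. Here’s a minimal sketch of checking it with only the standard library. The function names are mine, and the sketch assumes plain metadata requests are allowed; instances that require IMDSv2 also need a session token header, which is omitted here.

```python
import json
import urllib.error
import urllib.request

INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def fetch_instance_action(url=INSTANCE_ACTION_URL, timeout=2):
    """Return the raw notice body, or None when no interruption is scheduled."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None  # 404 means no interruption notice has been issued.
        raise


def interruption_notice(fetch=fetch_instance_action):
    """Return the parsed notice as a dict, or None if there is none.

    `fetch` is injectable so the logic can be exercised off-instance.
    """
    body = fetch()
    if body is None:
        return None
    return json.loads(body)
```

On an instance, you would call `interruption_notice()` every few seconds from a background loop and start your graceful shutdown as soon as it returns something other than `None`.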

These two steps will get you most of the way there. It’s just a matter of reading the docs and putting it all together. Or, find someone who’s already done it.

If you make the leap to Spot Instances, I’d love to hear about your experiences. Send me a message here, or by email: levi@levimccormick.com