Who Stole My CPU?
One of the most important features of the cloud is the sharing of resources among multiple tenants. Without sharing, and without the ability to optimize resource utilization, the cloud operator can’t provide scalability or achieve the “economies of scale” its business depends on. Behind its “cloud magic”, a public IaaS cloud runs on real hardware: compute, storage and network devices. The utilization of these resources has to be optimized to meet demand as it changes over time, so they must be shared between the cloud consumers.
What is Steal Time?
The basic metric for how a server utilizes its CPU is the idle capacity – the amount of CPU that is free. CPU time is made up of the following allocations:
- User – the running application
- System – the operating system
- Interrupt – hardware interrupts
- Wait – waiting for I/O jobs to end
- Steal – cycles the virtual machine was ready to use but that the hypervisor gave to something else
- Idle – no work is being done
Steal time (ST), also referred to as “stolen CPU”, exists only in virtualized computing environments. It is the time during which your virtual machine had work ready to run but its virtual CPU had to wait, because the hypervisor was allocating the physical CPU cycles to other “external tasks”, probably caused by one of your noisy neighbors.
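To see how the breakdown above looks from inside a Linux guest, here is a minimal sketch that turns two samples of the aggregate `cpu` line of `/proc/stat` into percentages, including steal. The function names are my own, and it assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal). Treat the result as the guest's own view of its CPU, not as an authoritative measurement.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat on Linux.
# Trailing guest counters are ignored because they are already part of user/nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return cumulative tick counters from a 'cpu ...' line as a dict."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels expose fewer columns; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Share of each CPU state, in percent, between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample_steal(interval=1.0, path="/proc/stat"):
    """Measure the steal percentage over `interval` seconds (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.readline())
    return cpu_percentages(before, after)["steal"]
```

Calling `sample_steal(5)` on a Linux instance returns the steal share over five seconds. The counters are cumulative since boot, which is why the sketch works on the difference between two samples rather than on a single reading.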
ST on Amazon Cloud
I researched this subject on the AWS forums and found that when CPU utilization stays high for a certain period (configured by the cloud operator), the system automatically throttles the instance back to a few percent of CPU usage while “stealing” the rest of your CPU. This makes sense, as the cloud must protect itself from overload and the risk of a crash.
You can find more information about the Micro instance type in the Amazon AWS FAQs: “Micro instances provide a small amount of consistent CPU resources and allow you to burst CPU capacity up to 2 ECUs when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically but very little CPU at other times for background processes, daemons”
On the Amazon developers forums you can find the following:
“For example, when the occasion comes where I might need to do a “yum update” the system becomes unresponsive within one minute. I would have expected it to do this at three or five minutes, as it has always done, but today this throttling happens at about thirty seconds to one minute.”
Amazon doesn’t detail the actual Xen configuration, though it does say: “The instance is designed to operate with its CPU usage at essentially only two levels: the normal low background level, and then at brief spiked levels much higher than the background level.” From what I have learned, monitoring CPU with a standard monitoring tool can mislead the cloud user. For example, Linux instances will not report the proper values for CPU usage due to the virtualization layer on the underlying infrastructure. For accurate CPU usage values on EC2 instances, the cloud user should rely only on the CloudWatch metrics.
Another important aspect of CPU utilization is the workload model. I learned that you should differentiate between two workload models: batch and real-time. The former is more tolerant of shortages and can wait for available capacity. The batch model describes a task that generates steady utilization, or an aggregate amount of CPU usage, so a period of heavy utilization can be compensated for later on. A real-time workload can never be compensated for later, and its overloads will be restrained by the cloud operator. Moreover, cloud operators such as Amazon AWS tend to lean toward the batch workload model in order to control the load on their physical layer.
To make good use of AWS micro instances, you need to be able to control how your online resources behave. You could also try playing with your web server configuration settings, for example limiting the number of clients. You should use S3 for hosting static files such as images, video and audio. Using other AWS services to support your application’s performance needs moves some of the load to other cloud resources, thereby lowering the overall CPU consumption of your EC2 instances. In any case, it is important to leverage the elastic environment and deploy horizontal (or vertical) scaling methods to protect the environment.
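To make the scaling advice a little more concrete, here is a minimal sketch of a scale-out check that uses a different CPU threshold per instance type and only fires on a sustained breach. Everything in it is an assumption for illustration: the threshold values are hypothetical, not AWS recommendations, and `should_scale_out` is my own helper, not part of any AWS API.

```python
# Hypothetical CPU thresholds in percent, keyed by instance type.
# A throttled micro instance gets a lower trigger point than a larger type.
SCALE_OUT_THRESHOLDS = {
    "t1.micro": 40.0,
    "m1.small": 70.0,
}
DEFAULT_THRESHOLD = 75.0  # assumed fallback for types not listed above


def should_scale_out(instance_type, cpu_samples, min_breaches=3):
    """Return True when the last `min_breaches` samples all exceed the threshold.

    `cpu_samples` is a list of CPU utilization percentages, oldest first,
    for example one value per monitoring period.
    """
    threshold = SCALE_OUT_THRESHOLDS.get(instance_type, DEFAULT_THRESHOLD)
    recent = cpu_samples[-min_breaches:]
    # Require a full window so one short spike does not trigger scaling.
    return len(recent) == min_breaches and all(s > threshold for s in recent)
```

With the samples `[10, 45, 50, 60]`, this check would scale out a `t1.micro` group but leave an `m1.small` group alone. Requiring several consecutive breaches is a design choice meant to ignore the brief spikes that micro instances are designed to absorb.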
Join the discussion
Given the different Xen configurations for different instance types, such as the rules for micro instance CPU, I wonder: do you need different auto-scaling CPU thresholds for different instance types?