Using CPU Performance Benchmarking to Determine Application Sizing

Brent Segner
7 min read · Mar 17, 2024


With the recent shift in the public cloud landscape toward emerging technologies such as GPU instances and edge computing, one area receiving less attention is how differentiated the traditional CPU landscape has become. Although it is no longer the leading discussion topic, the latest generations of chips continue to be a marvel of innovation, improving power consumption and performance across a diverse set of workloads. With the number of processor options and virtual machine sizes available, it is becoming increasingly difficult to determine what type and size of virtual instance best suits an application’s needs.

For this reason, a well-defined approach toward CPU performance benchmarking is becoming even more crucial for success in this changing computing landscape. At the highest level, effective benchmarking is a yardstick for comparing the capabilities of different processors and instance sizes to determine their ability to support various tasks and workloads. With this level of insight, enterprises can make informed decisions when selecting hardware for their needs, whether for machine learning, scientific simulations, or enterprise-level computing. There are few absolutes within the CPU performance benchmarking domain, so this article will avoid making any direct observations (e.g., instance type “X” is better than instance type “Z”). Instead, it will explore how three typical approaches to CPU performance benchmarking can be used alongside one another.

Linear Extrapolation, Synthetic & Real-World Benchmarking

The discipline of CPU performance benchmarking encompasses various approaches, each tailored to particular use cases. Regardless of the approach, the overarching aim is to determine the correct size and type of infrastructure needed to meet the performance needs of an application. This article focuses on three common approaches: linear extrapolation, synthetic testing, and real-world application benchmarking. While any one of these could be used independently, each carries unique pros and cons, which is why combining multiple methods yields the greatest accuracy in the performance engineering process.

Linear Extrapolation

The first approach, linear extrapolation, is a relatively coarse benchmarking method. Organizations typically employ it in activities such as moving resources from an on-premises environment to the public cloud, where the focus is mainly on getting a rough order of magnitude for the number and type of CPU resources needed to meet a workload’s demand. Linear extrapolation assumes that an individual CPU can deliver a certain amount of performance and that, as vCPUs are added across different sizes of virtual machines (VMs), the work capacity of the instance grows at a proportional rate. The example below shows how the theoretical work capacity of a VM would increase as additional vCPUs are added to the instance.

Example: Using an AWS c5 instance model, if the assumption is that each vCPU could support 63.5k units of work, then a c5.large (2 vCPU) VM could sustain 127k units of work. When scaled to the maximum size of a c5.24xlarge (96 vCPU) VM, the instance could support up to 6.1M units of work at peak load.
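A minimal sketch of this arithmetic, in Python, is shown below. The 63.5k units-per-vCPU figure is the assumption carried over from the example above, and the vCPU counts reflect a handful of published c5 instance sizes.

```python
# Linear extrapolation: theoretical work capacity scales directly with vCPU count.
# The 63.5k units-per-vCPU baseline is the assumption from the example above.
UNITS_PER_VCPU = 63_500

C5_SIZES = {"c5.large": 2, "c5.xlarge": 4, "c5.4xlarge": 16, "c5.24xlarge": 96}

def theoretical_capacity(vcpus: int, units_per_vcpu: int = UNITS_PER_VCPU) -> int:
    """Return the theoretical peak work capacity for a given vCPU count."""
    return vcpus * units_per_vcpu

for size, vcpus in C5_SIZES.items():
    print(f"{size:>12} ({vcpus:>2} vCPU): ~{theoretical_capacity(vcpus):,} units of work")
```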

One of the most significant benefits of this approach is that it allows for a quick assessment of a VM’s theoretical work capacity based on sizing. Once an initial baseline has been established for a specific processor (or class of VMs), others can be compared relatively easily by adjusting the capacity for the difference in processor frequency. Example: a 3 GHz Intel Cascade Lake processor should deliver roughly a 20% performance improvement over a 2.5 GHz Intel Skylake processor on clock speed alone.
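The same baseline can be adjusted for clock speed, as the quick sketch below illustrates; the processor frequencies are those quoted above, and the calculation deliberately ignores any IPC or architectural differences between the two generations.

```python
# Adjust the per-vCPU baseline by the ratio of processor frequencies.
# Assumes performance scales linearly with clock speed (IPC differences ignored).
skylake_ghz, cascade_lake_ghz = 2.5, 3.0

scaling_factor = cascade_lake_ghz / skylake_ghz         # 1.2 -> ~20% uplift
adjusted_units_per_vcpu = int(63_500 * scaling_factor)  # ~76,200 units of work

print(f"Frequency uplift: ~{(scaling_factor - 1) * 100:.0f}%")
print(f"Adjusted per-vCPU capacity: ~{adjusted_units_per_vcpu:,} units of work")
```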

The trade-off is that this ease of benchmarking comes at the expense of accuracy. Several factors undermine performance predictions made through linear extrapolation; among the most significant is a per-CPU degradation in performance as vCPUs are added to a VM. These impacts occur when activity crosses the NUMA boundary or when application inefficiencies described by Amdahl’s Law come into play, resulting in a larger impairment on a per-vCPU basis as the VM scales in size.
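Amdahl’s Law makes this degradation concrete: if only part of a workload can run in parallel, the speedup from additional vCPUs flattens out. The sketch below assumes a hypothetical 95% parallel fraction purely for illustration.

```python
# Amdahl's Law: speedup(N) = 1 / ((1 - p) + p / N),
# where p is the fraction of the workload that can run in parallel.
def amdahl_speedup(parallel_fraction: float, vcpus: int) -> float:
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / vcpus)

P = 0.95  # hypothetical: 95% of the workload parallelizes

for vcpus in (2, 4, 16, 96):
    speedup = amdahl_speedup(P, vcpus)
    efficiency = speedup / vcpus  # effective work delivered per vCPU
    print(f"{vcpus:>3} vCPU: {speedup:5.1f}x speedup, {efficiency:.0%} per-vCPU efficiency")
```

Even with 95% of the work parallelizable, a 96-vCPU instance delivers roughly a 17x speedup rather than 96x, which is why linear extrapolation tends to overestimate the capacity of larger instances.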

Synthetic Benchmarking

Another approach to CPU performance benchmarking is known as synthetic benchmarking. This process involves running a series of artificial tasks to stress specific CPU components, typically arithmetic operations, memory access, or floating-point calculations. Because synthetic benchmarks consist of repeatable tests, they provide standardized metrics that facilitate direct comparisons between different processors, making them valuable tools for assessing raw computational power and efficiency. Common examples include suites such as Dhrystone, Whetstone, and CoreMark.
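To make the idea concrete, the toy sketch below times fixed integer, floating-point, and memory-access workloads; established suites are far more rigorous, and the iteration counts here are arbitrary.

```python
# A toy synthetic benchmark: time fixed workloads that stress integer math,
# floating-point math, and memory access. Iteration counts are arbitrary.
import time

def integer_work(n=5_000_000):
    total = 0
    for i in range(n):
        total += i * 3 + 1
    return total

def float_work(n=5_000_000):
    total = 0.0
    for i in range(1, n):
        total += (i * 0.5) / (i + 1.0)
    return total

def memory_work(n=5_000_000):
    data = list(range(n))
    return sum(data[::7])  # strided reads exercise memory access

for name, task in [("integer", integer_work), ("float", float_work), ("memory", memory_work)]:
    start = time.perf_counter()
    task()
    print(f"{name:>8} workload: {time.perf_counter() - start:.2f} s")
```

Because the workload is identical on every machine, the elapsed times can be compared directly across processors and instance types, which is the core idea behind the standardized suites.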

While synthetic tests improve accuracy over linear extrapolation, they may not accurately reflect real-world performance. Several variables need to be addressed within the test setup process to ensure a standardized and consistent benchmarking environment across all tested CPUs. These test variables include CPU clock speed, memory configuration, compiler settings, and operating system optimizations. Any parameter variations can significantly affect benchmarking results, leading to inaccurate comparisons between different processors.

Typically, synthetic benchmarks measure the performance of central processing units in embedded systems by testing basic operations like integer arithmetic, control flow, and memory access. It is essential to recognize that these tests primarily assess low-level computational capabilities rather than simulating real-world workloads. As a result, synthetic tests can provide valuable insight into raw processing power and efficiency but may only partially represent performance in diverse usage scenarios.

Real-World Benchmarking

The third type of CPU performance assessment is real-world benchmarking. As the name implies, it involves assessing the processor’s capabilities across a range of practical tasks and applications representative of typical user workflows. Unlike synthetic benchmarks that focus on isolated aspects of CPU performance, real-world benchmarking aims to provide insight into how processors handle the diverse workloads encountered in everyday usage. This approach centers on running the actual applications and tasks commonly exercised by end users. Typically, load generators facilitate this type of testing, executing it in a non-production environment with scripting customized to meet specific requirements.
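A minimal load-generation sketch in Python is shown below; the endpoint URL, concurrency, and request count are hypothetical placeholders, and purpose-built load-testing tools would normally drive this kind of test at scale.

```python
# A minimal load generator: issue concurrent requests against a test endpoint
# and report latency percentiles. The URL, concurrency, and request count are
# hypothetical placeholders standing in for a real test plan.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

TARGET_URL = "http://test-env.example.com/api/health"  # hypothetical endpoint
CONCURRENCY = 16
REQUESTS = 500

def timed_request(_):
    start = time.perf_counter()
    with urlopen(TARGET_URL, timeout=10) as response:
        response.read()
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_request, range(REQUESTS)))

print(f"p50 latency: {statistics.median(latencies) * 1000:.1f} ms")
print(f"p95 latency: {latencies[int(len(latencies) * 0.95)] * 1000:.1f} ms")
```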

When performed correctly, real-world benchmarking offers a more holistic perspective on CPU performance, including responsiveness, latency, and overall capabilities. These tests must be conducted under realistic conditions, accounting for factors such as system configuration, background processes, and environmental variables. Capturing performance metrics across a spectrum of real-world scenarios allows teams to make more informed decisions when selecting CPUs based on their specific needs and usage patterns.

The drawback to real-world benchmarking is the amount of knowledge about the workloads required up front, along with the time and effort needed to perform the testing. While this method provides the most accurate view of how an application will perform on specific resource types, much of the information needed to establish the tests and select the instance types may not yet be available, depending on where the application is in its migration or development process.

Conclusion

Ultimately, there are no right or wrong answers regarding approaches toward CPU performance benchmarking, only times when one approach is more appropriate than another. As discussed throughout the article, linear extrapolation is most useful at the early stages of a migration, when broad strokes are used to understand how many CPUs of a particular type might be needed to meet workload demands. Once an initial sizing hypothesis is reached and a general understanding of the application architecture (monolithic vs. distributed) is established, synthetic benchmarking across various instance types and sizes provides a mechanism to reground the initial sizing assumptions. Real-world benchmarking then becomes an important milestone to prove out the performance effects and ensure that the CPU type and instance size remain appropriate as the application matures and moves closer to supporting production traffic. If the resource is too large for the requirements, there is a high probability that unnecessary costs will be incurred; if the resource types are insufficient, the application may experience a performance impairment.

Since the most successful CPU performance benchmarking is iterative and often treads the line between art and science, it is essential to continually weigh the results observed during real-world benchmarking against the observations formed during the synthetic benchmarking phase. This assessment helps guide whether adjustments can be made to strike the best balance between cost and performance. Ultimately, this process has no finish line: applications and their workloads change, so application sizing and resource assignments should be revisited continually to ensure the underlying infrastructure remains well suited to their needs.


Brent Segner

Distinguished Engineer @ Capital One | Focused on all things FinOps | Passionate about Data Science, Machine Learning, AI, Python & Open Source