Understanding Utilization on GPU Instances

Brent Segner
6 min read · Feb 11, 2024


Although Natural Language Processing (NLP) and Large Language Models (LLMs), as we currently recognize them, have been in development for over forty years, the public release of ChatGPT in 2022 renewed interest in AI. That breakthrough moment was only possible because of giant leaps in GPU processing capabilities and the ability to harness those capabilities for training machine learning models.

This first glimpse behind the curtain of what is possible by leveraging these advanced models has driven an incredible appetite among companies to incorporate GPU instances into their respective tech stacks. While these GPU instances are much more capable than their legacy CPU-only counterparts, they carry a disproportionate expense compared to legacy infrastructure. As the bills for running these instances, whether in a public cloud or an on-premises data center, start to come due, the inevitable question of how efficiently they are being used quickly follows.


Three prominent questions start to emerge in this space:

  • Is the CPU utilization on these instances still an important metric?
  • How efficiently are we using the GPUs on these instances?
  • Are there alternate instances that would better meet our performance needs?

Since the third question would send us down a rabbit hole that is too deep for this initial blog, we will stay focused on the first two questions. The first question opens the door to a new paradigm for compute performance. Historically, the CPU did the majority of the heavy lifting for processing, with increasingly powerful processors able to handle sequential instructions faster and faster. While that improved performance allowed for several tremendous advances in computational power, it did not offer the breakthrough required to train models efficiently. When GPUs started to become mainstream in the late 1990s to unlock better graphics performance, researchers quickly realized the tremendous potential they had for workloads that could be highly parallelized.

Looking strictly through the lens of model training and serving performance, the CPU primarily moves into the role of a controller. Its primary function is handing off tasks that the GPU can execute in parallel. While the GPU processes those requests, the CPU remains largely idle until it receives confirmation that the job has completed and the next task can be assigned. Since the CPU no longer directly performs the work associated with model training, CPU utilization becomes a much less reliable barometer of instance performance. That said, the CPU utilization metric cannot be discarded entirely, as abnormally high or abnormally low levels of utilization may indicate other issues, such as an imbalance in the ratio of CPU to GPU.

Example of CPU interaction with GPU offloading tasks for parallel execution
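To make the controller relationship concrete, here is a minimal sketch (assuming PyTorch is installed and a CUDA-capable GPU is available; the tensor sizes are arbitrary) of the CPU dispatching work that the GPU executes in parallel:

```python
import torch

# A minimal sketch of the CPU acting as a controller: it stages the work,
# hands it to the GPU, and then mostly waits for the result.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# The matrix multiply is dispatched asynchronously to the GPU; the CPU thread
# does little more than queue the kernel and wait for completion.
c = a @ b
if device.type == "cuda":
    torch.cuda.synchronize()  # CPU blocks here until the GPU signals the job is done

print(f"Result computed on {device}: {tuple(c.shape)}")
```

While the multiply runs, CPU utilization on the instance sits close to idle even though the GPU is fully busy, which is exactly why the CPU metric loses its value as a performance barometer.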

Limitations of GPU Utilization Metrics

With CPU utilization no longer a reliable guide to how heavily an instance is being used, the focus turns to the second question: "How efficiently are the GPUs being utilized?" Unfortunately, it is not quite as simple as turning to the readily available GPU utilization metrics. The reasons are laid out eloquently within this blog but can be summarized from the NVIDIA Management Library (NVML) definitions as follows:

  • GPU utilization: This represents the percentage of time during which one or more kernels were executing on the GPU.
  • Memory utilization: This represents the percentage of time during which global (device) memory was being read from or written to.

The critical thing to note is that these metrics report whether ANY of the resource was being consumed during the sample period, rather than how much of the available capacity was used. The unfortunate net result is that a GPU can report 100% utilization while performing very little work. This departure from the historical interpretation of utilization requires a switch in perspective from measuring utilization alone to measuring saturation. In this context, saturation is the degree to which the resource has extra work it can't service.
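For reference, these counters can be read directly; below is a minimal sketch using the nvidia-ml-py (pynvml) bindings, assuming an NVIDIA driver is present on the instance:

```python
import pynvml

# Read the NVML utilization counters described above. Note that util.gpu only
# reports the percentage of time ANY kernel was executing on the GPU, not how
# much of the GPU's compute capacity that kernel actually consumed.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU utilization:    {util.gpu}%")     # time-based, not capacity-based
print(f"Memory utilization: {util.memory}%")  # time spent reading/writing device memory
pynvml.nvmlShutdown()
```

A single small kernel occupying only one streaming multiprocessor can keep util.gpu pinned at 100%, which is precisely the blind spot described above.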

Deriving Saturation

The Utilization, Saturation, Error (USE) method was initially developed to identify and resolve performance issues within a system. Since then, it has become an objective methodology for evaluating performance across various infrastructures. Given the complex architecture of GPU instances and the different impacts that various models can have on the infrastructure, the USE method lends itself well to evaluating GPU performance.

Limiting the focus of this exercise strictly to GPU instances currently hosted on AWS, the following telemetry is available for evaluation. A quick examination of the telemetry against the Linux Performance Rosetta shows that no metric allows the degree of saturation occurring on the GPU to be measured directly, meaning it must be inferred indirectly. The two measurements that give us the most significant degree of visibility into how much of a GPU's work capacity has been consumed are its power draw and its operating temperature.

GPU Utilization Metrics

Although there should be a strong positive linear correlation between the power consumed by a GPU and its temperature, it is preferable to use both when calculating saturation. Blending the power and thermal data reduces the chance that a particular set of instructions biases the results.

Saturation Calculation Example:

In the example of an NVIDIA A10 GPU, the specifications show that the model has a maximum power draw of 150W and a maximum operating temperature of 50 degrees Celsius. Using these wattage and thermal specifications within the context of a g5.12xlarge instance, each GPU is measured individually to determine its relative saturation level, as sketched below. The results are then aggregated across all GPUs to represent the work capacity consumed by the entire instance.
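A hedged sketch of that calculation follows, again using pynvml. The 150W and 50°C limits are the figures quoted above rather than values queried from the device, and the even weighting of the two signals is an assumption:

```python
import pynvml

# Sketch of the saturation calculation described above, using the A10 figures
# quoted in the text (150 W max power draw, 50 C max operating temperature).
# These limits are assumptions taken from the blog, not read from the device.
MAX_POWER_WATTS = 150.0
MAX_TEMP_C = 50.0

pynvml.nvmlInit()
per_gpu_saturation = []
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
    temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    # Blend the two signals so a single unusual instruction mix cannot bias the result.
    saturation = ((power_w / MAX_POWER_WATTS) + (temp_c / MAX_TEMP_C)) / 2.0
    per_gpu_saturation.append(saturation)

# Aggregate across all GPUs (four on a g5.12xlarge) to represent the instance.
instance_saturation = sum(per_gpu_saturation) / len(per_gpu_saturation)
print(f"Instance saturation: {instance_saturation:.0%}")
pynvml.nvmlShutdown()
```

The same readings are exposed by `nvidia-smi`, so the calculation can just as easily be driven from existing telemetry pipelines rather than an on-instance script.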

Layering in Error Data

The last element in understanding how effectively the instance is being utilized is error data, which, in the case of AWS GPU-based instances, comes through Xid events; the goal is to push saturation toward its maximum level while minimizing error events. These events contain error codes that could indicate problems with the application, the GPU driver, or the hardware, or may be purely informational. Since not every error is associated with an increasing saturation level, the code attached to each event must be interpreted to determine any basis for correlation or causation. Although not a complete list, the following error codes may be associated with a GPU becoming fully saturated [24: GPU semaphore timeout; 26: framebuffer timeout]. The ideal state of GPU saturation would be one in which the derived saturation levels are high (>80%), with only a nominal number of capacity-related errors being produced.
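As an illustration, the sketch below counts Xid occurrences and flags the two codes called out above. It assumes shell access to the instance and that the driver logs Xid events to the kernel ring buffer in the usual "NVRM: Xid" format:

```python
import re
import subprocess
from collections import Counter

# Count Xid events in the kernel log and flag the codes the text associates
# with a fully saturated GPU (24: GPU semaphore timeout, 26: framebuffer timeout).
SATURATION_RELATED = {24, 26}
XID_PATTERN = re.compile(r"NVRM: Xid \(.*?\): (\d+)")

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
codes = Counter(int(m.group(1)) for m in XID_PATTERN.finditer(log))

for code, count in sorted(codes.items()):
    flag = " <- possibly capacity-related" if code in SATURATION_RELATED else ""
    print(f"Xid {code}: {count} event(s){flag}")
```

Trending these counts alongside the derived saturation values makes it easier to judge whether rising saturation is starting to produce capacity-related errors.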

Conclusion

The current generation of GPU-based instances offers users a powerful tool to help train and serve machine learning models that will deliver on the promise of artificial intelligence. With the inevitable shift towards GPU instances for model training and serving, understanding how to measure their performance is critical. As discussed throughout this blog, understanding the workload is an essential part of pushing these instances to their maximum level of performance. With the limited visibility into GPU performance data currently available through CloudWatch, leveraging the USE method, with a specific emphasis on evaluating saturation using wattage and thermal data, looks extremely promising.

In the spirit of continually pursuing new ideas and best practices, and recognizing that there are other ways to reach the same objectives, I would love to hear any performance engineering thoughts or approaches that others have used in this space.

Note: The information and perspectives held within this blog represent my personal opinion and not that of my employer or foundations that I am affiliated with.

