Introduction
A few months ago we conducted a technical deep dive into AI model optimisations, focusing on General Matrix Multiply (GEMM) tuning. Our previous analysis showed that tuning these operations improved throughput and latency by up to 7.2x.
Building on this foundation, we have conducted updated benchmarks to further explore the potential of AMD’s MI300X GPUs. The new benchmarks show how these optimisations affect throughput and latency under realistic inference workloads.
Overview of benchmark process
The majority of benchmarks you’ll find today are synthetic. They rely heavily on randomness, with each test varying request and response variables such as output length. Although this method can be useful for certain comparisons, its main drawback is that it does not reflect real-world scenarios where conditions are consistent, for example a fixed context window size and specific inference requests.
To bridge this gap and provide you with benchmark results that reflect real-world scenarios, we have taken a different approach.
Updated benchmark methodology
We modified the benchmarking process to create an environment that mirrors actual inference workloads. This involved adjusting the throughput script from the vLLM repository to accept a fixed prompt, reproducing a realistic inference scenario rather than a synthetic one.
This adjustment is crucial in showcasing the potential of AMD MI300X GPUs and how they handle real-world tasks, including the ability to manage concurrent requests effectively. The modified script can be found here.
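To make the fixed-prompt idea concrete, here is a minimal sketch of how such a measurement could be structured. It is not the modified script itself: `FIXED_PROMPT`, `build_requests`, `measure_throughput` and `dummy_generate` are hypothetical names introduced for illustration, and the placeholder backend simply sleeps so that the example runs without a GPU.

```python
import time
from typing import Callable, List

# Hypothetical fixed prompt; the prompt used in the actual benchmarks is not shown here.
FIXED_PROMPT = "Summarise the benefits of GEMM tuning in two sentences."


def build_requests(prompt: str, batch_size: int) -> List[str]:
    """Return `batch_size` copies of the same prompt (no random sampling)."""
    return [prompt] * batch_size


def measure_throughput(
    generate: Callable[[List[str]], List[int]],
    prompt: str,
    batch_size: int,
) -> float:
    """Run one batch and return generated tokens per second.

    `generate` stands in for the inference backend: it takes a list of
    prompts and returns the number of tokens generated for each one.
    """
    requests = build_requests(prompt, batch_size)
    start = time.perf_counter()
    tokens_per_request = generate(requests)
    elapsed = time.perf_counter() - start
    return sum(tokens_per_request) / elapsed


def dummy_generate(prompts: List[str]) -> List[int]:
    """Placeholder backend (an assumption) so the sketch runs anywhere."""
    time.sleep(0.01)  # pretend the batch took 10 ms
    return [128] * len(prompts)  # pretend each request produced 128 tokens


if __name__ == "__main__":
    tps = measure_throughput(dummy_generate, FIXED_PROMPT, batch_size=8)
    print(f"{tps:.0f} tokens/s")
```

Because every request in the batch is identical, differences between runs come from the configuration under test rather than from randomly sampled prompts or output lengths. In a real run, `dummy_generate` would be replaced by a call into the inference engine.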
Benchmark set-up
For our benchmarking tests, we used a pre-configured Docker image and ensured the following software and libraries were installed:
- Docker Image
- ROCm 6.2
- Python 3.10.12
- PyTorch 2.5.0
- Composable Kernel
- Flash Attention 2.9
- Gradlib
- vLLM 0.4.3
Our modified benchmark_throughput.py script from the vLLM repository accepts a fixed prompt to initiate real-world scenario throughput benchmarks. This method provides consistent and reliable results to better represent real-world application performance.
Testing
During the testing phase, we focused on scenarios that involved handling concurrent requests while maintaining performance. Both factors are important for real-world applications.
GEMM tuning techniques
By leveraging Gradlib for GEMM tuning, we were able to compare its performance against both PyTorch's TunableOp and out-of-the-box (OOTB) setups, and to fine-tune our setup to achieve significant latency and throughput improvements.
GEMM tuning with Gradlib takes significantly longer than using the built-in TunableOp in PyTorch, but the results are noticeably better, making the extra effort worthwhile.
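For readers who want to try the built-in PyTorch alternative mentioned above, TunableOp is switched on through environment variables. The sketch below only prepares such an environment and does not cover Gradlib. The helper name `tunableop_env` and the results file name are assumptions made for illustration, and this is not necessarily how our own runs were configured.

```python
import os
from typing import Dict


def tunableop_env(results_file: str, tuning: bool = True) -> Dict[str, str]:
    """Return a copy of the current environment with TunableOp switched on.

    results_file: where PyTorch should read/write the tuned GEMM selections
                  (the file name is chosen by the caller).
    tuning:       True to search for new solutions, False to only reuse
                  results already stored in `results_file`.
    """
    env = dict(os.environ)
    env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    env["PYTORCH_TUNABLEOP_TUNING"] = "1" if tuning else "0"
    env["PYTORCH_TUNABLEOP_FILENAME"] = results_file
    return env


if __name__ == "__main__":
    env = tunableop_env("tunableop_results.csv")  # hypothetical file name
    for key in sorted(k for k in env if k.startswith("PYTORCH_TUNABLEOP_")):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching a benchmark script. A common pattern is one run with tuning enabled to populate the results file, followed by measurement runs with `tuning=False` so that the search time is not counted in the results.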
Key observations
Throughput
Our updated benchmarks show a significant improvement in throughput on AMD’s MI300X GPUs after GEMM tuning with Gradlib. As our goal was to measure real-world performance, maintaining a consistent workload allowed us to assess the GPU’s ability to handle concurrent requests effectively. We observed that higher batch sizes led to better overall throughput on AMD MI300X GPUs, allowing requests to be processed simultaneously without compromising performance.
We saw a substantial increase in the number of operations processed per second compared to the out-of-the-box configuration. This improvement underscores the importance of realistic benchmarking environments in capturing the true performance potential of hardware under typical operating conditions.

Latency
In addition to throughput gains, we also observed that the latency was reduced by approximately 33% compared to the out-of-the-box run, showcasing faster response times and enhanced efficiency.
This decrease in latency emphasises the effectiveness of GEMM tuning using Gradlib, providing clear evidence that targeted optimisation can enhance performance significantly.
It is also interesting to note that in our previous post, using the built-in TunableOp in PyTorch yielded slightly slower results in comparison to Gradlib.
These findings are crucial for applications where rapid processing and minimal delay are critical, further demonstrating the benefits of real-world benchmarking approaches.
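To put a percentage latency reduction into perspective, it can be converted into an equivalent speedup factor. The short sketch below shows the arithmetic. The function names are hypothetical and the sample latencies are illustrative values, not measurements from our benchmarks.

```python
def latency_reduction(baseline_s: float, tuned_s: float) -> float:
    """Fractional latency reduction of the tuned run relative to the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_s - tuned_s) / baseline_s


def speedup_from_reduction(reduction: float) -> float:
    """Equivalent speedup factor for a fractional latency reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Illustrative numbers only: a 3.0 s baseline and a 2.0 s tuned run.
    print(f"{latency_reduction(3.0, 2.0):.1%}")    # prints 33.3%
    print(f"{speedup_from_reduction(0.33):.2f}x")  # prints 1.49x
```

In other words, a reduction of roughly 33% means each request completes in about two thirds of the original time, which corresponds to a speedup of about 1.5x.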
Conclusion
By modifying our benchmarking approach to reflect real-world scenarios, we have been able to provide relevant performance metrics for AMD MI300X GPUs. This showcases the capabilities of MI300X GPUs and the importance of benchmarking in realistic environments.
Although the process does take a while, the insights gained are invaluable for those looking to optimise performance in high-compute environments which reflect real-world tasks and performance.
Discover the complete and updated benchmark results by accessing our full table of results here.
Since completing these benchmarks, there have been further optimisations and updates to the vLLM library. These new improvements may lead to better results in future benchmarks, so stay tuned as Nscale continue to explore the full potential of these optimisations.




