GCC Guide for Ampere Processors

This article was originally published by Ampere Computing.

This paper describes how to effectively use GNU Compiler Collection (GCC) options to help optimize application performance on Ampere Processors.

When optimizing an application, it is essential to measure whether a potential optimization actually improves performance, and this includes compiler options. Advanced compiler options may yield better runtime performance, potentially at the cost of longer compile times, harder debugging, and often a larger binary. Why compiler options affect performance is beyond the scope of this paper; the short answer is that code generation, modern processor architectures, and the way they interact are very complicated. Different processors may also benefit from different compiler options because of differences in architecture and in the specific microarchitecture. Repeated experimentation with optimizations is key to performance success.

How to measure an application’s performance to determine its limiting factors, and which optimization strategies to apply, have already been covered in previously published articles. The paper The First 10 Questions to Answer While Running on Ampere Altra-Based Instances describes what performance data to collect to understand the entire system’s performance. A Performance Analysis Methodology for Optimizing Ampere Altra Family Processors explains how to optimize effectively and efficiently using a data-driven approach.

This paper first summarizes the most common GCC options and describes how they affect applications. It then presents case studies that use GCC options to improve the performance of VP9 video encoding software and the MySQL database on Ampere Processors. Similar strategies have been used effectively to optimize other software running on Ampere Processors.

GCC Recommendations

The GCC compiler provides many options that can improve application performance. See the GCC website for details. To generate code that takes advantage of all the performance features available in Ampere Processors, use the gcc -mcpu option.

To use the gcc -mcpu option, either set the CPU model explicitly or tell GCC to use the CPU model of the machine that GCC is running on via -mcpu=native. Note that on legacy x86-based systems, gcc -mcpu is a deprecated synonym for -mtune, whereas gcc -mcpu is fully supported on Arm-based systems. See Arm’s guide to Compiler flags across architectures: -march, -mtune, and -mcpu for details.

In summary, whenever possible, use only -mcpu and avoid -march and -mtune when compiling for Arm. Below is a case study highlighting performance gains by setting the gcc -mcpu option with VP9 video encoding software.

Setting the -mcpu option:

-mcpu=ampere1: Generate code that will run on AmpereOne Processors. AmpereOne is the next generation of Cloud Native Processors from Ampere, extending the family of high-performance processors to new industry leading core counts. Note that this can generate code that will not run on Ampere Altra and Altra Max Processors. This option first became available in GCC 12.1 and was later backported to GCC 10.5 and GCC 11.3.
-mcpu=neoverse-n1: Generate code that will run on Ampere Altra, Ampere Altra Max as well as Ampere AmpereOne. While using this option for code that will run on Ampere AmpereOne is supported, it will potentially not take advantage of all the new performance features available. Note, GCC version 9.1 or higher is required to enable CPU specific tunings for Ampere Altra and Ampere Altra Max processors.
-mcpu=native: Generate code setting the CPU model based on the CPU GCC is running on. Note, GCC version 9.1 or higher is required to enable CPU specific tunings for Ampere Altra and Ampere Altra Max processors.

Using -mcpu=native is potentially easier, but it can cause problems if the resulting executable, shared library, or object file is used on a different system. If the build was done on an Ampere AmpereOne Processor, the code may not run on an Ampere Altra or Altra Max Processor because the generated code may include Armv8.6+ instructions supported on Ampere AmpereOne Processors. If the build was done on an Ampere Altra or Altra Max processor, GCC will not take advantage of the latest performance improvements available on Ampere AmpereOne Processors. This is a general issue when building code to take advantage of performance features for any architecture.
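Because -mcpu=native depends on the build machine, it can help to record what it actually resolved to. The sketch below is one way to do that, assuming a GCC driver that prints the command line of its cc1 compiler when run with -v. The function names resolved_target_flags and show_native_flags are illustrative and not part of GCC, and the exact flags printed will vary with the GCC version and the CPU.

```shell
# resolved_target_flags: read `gcc -v` output on stdin and print the
# -mcpu/-march/-mtune values passed to the cc1 compiler, one per line.
resolved_target_flags() {
    grep 'cc1' | grep -oE -- "-m(cpu|arch|tune)=[^ ']+" | sort -u
}

# show_native_flags: ask the local GCC what -mcpu=native expands to here.
# Preprocessing an empty input is enough to make the driver print the flags.
show_native_flags() {
    gcc -mcpu=native -v -E -x c /dev/null -o /dev/null 2>&1 | resolved_target_flags
}
```

Running show_native_flags on the build machine and saving the output alongside the build artifacts makes it easier to tell later whether a binary may rely on instructions that another Ampere processor lacks.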

The following table lists the GCC versions that support each Ampere Processor -mcpu value.

Processor	-mcpu Value	GCC 9	GCC 10	GCC 11	GCC 12	GCC 13
Ampere Altra	neoverse-n1	≥ 9.1	ALL	ALL	ALL	ALL
Ampere Altra Max	neoverse-n1	≥ 9.1	ALL	ALL	ALL	ALL
AmpereOne	ampere1	N/A	≥ 10.5	≥ 11.3	≥ 12.1	ALL
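The table can also be expressed as a small check in a build script. This is a minimal sketch, assuming GNU sort (for its -V version ordering). The helper names version_ge and mcpu_supported are hypothetical, and treating GCC releases newer than those listed in the table as supported is an assumption rather than something the table states.

```shell
# version_ge A B: succeed when dotted version A is greater than or equal to B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# mcpu_supported MCPU GCC_VERSION: succeed when the table lists GCC_VERSION
# as supporting -mcpu=MCPU. Returns 2 for an -mcpu value not in the table.
mcpu_supported() {
    local mcpu=$1 ver=$2 min
    local major=${ver%%.*}
    case "$mcpu" in
        neoverse-n1) min=9.1 ;;
        ampere1)
            case "$major" in
                10) min=10.5 ;;   # backport
                11) min=11.3 ;;   # backport
                *)  min=12.1 ;;   # GCC 9 and older fail this test
            esac ;;
        *) return 2 ;;
    esac
    version_ge "$ver" "$min"
}

# Example: check the installed compiler before choosing -mcpu=ampere1.
#   mcpu_supported ampere1 "$(gcc -dumpfullversion)" && echo "ampere1 is available"
```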

Our recommendation is to use the gcc -mcpu option with the appropriate value described above (-mcpu=ampere1, -mcpu=neoverse-n1 or -mcpu=native) together with -O2 to establish a performance baseline, then explore additional optimization options and measure whether each one improves performance compared to the baseline.
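Comparing a candidate build against the baseline only needs a repeatable timing. The following is a rough sketch of one way to do that, assuming GNU date (whose %N format gives nanoseconds). The helpers time_run and best_of, and the binary names in the example, are hypothetical.

```shell
# time_run CMD [ARGS...]: print the wall-clock seconds taken by one run of CMD.
time_run() {
    local start end
    start=$(date +%s%N)
    "$@" > /dev/null 2>&1
    end=$(date +%s%N)
    awk -v ns="$((end - start))" 'BEGIN { printf "%.3f\n", ns / 1e9 }'
}

# best_of N CMD [ARGS...]: run CMD N times and print the fastest time,
# which reduces the influence of one-off system noise.
best_of() {
    local n=$1 i=0
    shift
    while [ "$i" -lt "$n" ]; do
        time_run "$@"
        i=$((i + 1))
    done | sort -n | head -n 1
}

# Example (binaries built from the same source with different flags):
#   best_of 5 ./app_baseline    # built with -O2 -mcpu=...
#   best_of 5 ./app_candidate   # built with -O3 -mcpu=...
```

Wall-clock time is only a first check. The perf-based measurements used in the case studies give a more detailed picture.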

Summary of common GCC options:

-mcpu Recommended when building on Ampere Processors to enable processor specific tuning and optimizations. (See the “Setting the -mcpu option” section above for details.)
-Os Optimize to reduce code size, which may help if your application is limited by fetching instructions.
-O2 Considered the standard GCC optimization level and a good baseline to compare with other GCC options.
-O3 Adds optimizations that generate more efficient code for loops, useful to try if your application performance is dominated by time spent in loops.
Profile Guided Optimization (PGO): -fprofile-generate & -fprofile-use. Generate profile data that the compiler will use to potentially make better decisions on optimizations such as inlining, loop optimizations and default branches. This is considered an advanced optimization as it requires changes to the build system, see below.
Link-Time Optimization (LTO): -flto. Enable link-time optimizations, allowing the compiler to optimize across individual source files. This enables functions to be inlined across source files among other compiler optimizations. This is also considered an advanced optimization and potentially requires changes to the build system. This option increases overall build time, which can be dramatic for large applications. It is possible to use LTO just on performance critical source files to potentially decrease build times.

VP9 Video Encoding Case Study with gcc -mcpu

VP9 is a video coding format developed by Google. libvpx is the open-source reference software implementation for the VP8 and VP9 video codecs from Google and the Alliance for Open Media (AOMedia). libvpx provides significant improvement in video compression over x264 at the expense of additional computation time. Additional information on VP9 and libvpx is available on Wikipedia.

In this case study, the VP9 build is configured to use the gcc -mcpu=native option to improve performance. As mentioned above, use the -mcpu option when compiling on Ampere Processors to enable CPU specific tuning and optimizations. Initially libvpx was built using the default configuration and then rebuilt using -mcpu=native. To evaluate VP9 performance, a 1080P input video file, original_videos_Sports_1080P_Sports_1080P-0063.mkv, from YouTube’s User Generated Content Dataset was used. See Ampere’s ffmpeg tuning and build guide for details on how to build ffmpeg and various codecs including VP9 for Ampere Processors.

Default libvpx Build:

$ git clone https://chromium.googlesource.com/webm/libvpx
$ cd libvpx/
$ export CFLAGS="-DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wuninitialized -Wunused -Wextra -Wundef -Wframe-larger-than=52000 -std=gnu89"
$ export CXXFLAGS="-DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdisabled-optimization -Wextra-semi -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wmissing-declarations -Wuninitialized -Wunused -Wextra -Wno-psabi -Wc++14-extensions -Wc++17-extensions -Wc++20-extensions -std=gnu++11"
$ ./configure
$ make verbose=1 
$ ./vpxenc --codec=vp9 --profile=0 --height=1080 --width=1920 --fps=25/1 --limit=100 -o output.mkv /home/joneill/Videos/original_videos_Sports_1080P_Sports_1080P-0063.mkv --target-bitrate=2073600 --good --passes=1 --threads=1 --debug

How to Optimize libvpx Build with -mcpu=native

$ # rebuild with -mcpu=native
$ make clean
$ export CFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdeclaration-after-statement -Wdisabled-optimization -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wimplicit-function-declaration -Wmissing-declarations -Wmissing-prototypes -Wuninitialized -Wunused -Wextra -Wundef -Wframe-larger-than=52000 -std=gnu89"
$ export CXXFLAGS="-mcpu=native -DNDEBUG -O3 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=0 -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -Wall -Wdisabled-optimization -Wextra-semi -Wfloat-conversion -Wformat=2 -Wpointer-arith -Wtype-limits -Wcast-qual -Wvla -Wmissing-declarations -Wuninitialized -Wunused -Wextra -Wno-psabi -Wc++14-extensions -Wc++17-extensions -Wc++20-extensions -std=gnu++11"
$ ./configure 
$ make verbose=1 
# verify the build uses the sdot dot product instruction:
$ objdump -d vpxenc | grep sdot | wc -l
128
$ ./vpxenc --codec=vp9 --profile=0 --height=1080 --width=1920 --fps=25/1 --limit=100 -o output.mkv /home/joneill/Videos/original_videos_Sports_1080P_Sports_1080P-0063.mkv --target-bitrate=2073600 --good --passes=1 --threads=1 --debug

An investigation using Linux perf to measure CPU cycles showed that the functions taking the most time include vpx_convolve8_horiz_neon and vpx_convolve8_vert_neon. The libvpx git repository shows that Arm optimized these functions to use the Armv8.6-A USDOT (mixed-sign dot-product) instruction, which is supported by Ampere Processors.

The CPU cycles spent in vpx_convolve8_horiz_neon were reduced from 6.07E+11 to 2.52E+11 by using gcc -mcpu=native to enable the dot product optimization on an Ampere Altra processor, a reduction by a factor of 2.4.

For vpx_convolve8_vert_neon, the CPU cycles were reduced from 2.46E+11 to 2.07E+11, for a 16% reduction.

Overall, using -mcpu=native to enable the dot product instruction sped up transcoding the file original_videos_Sports_1080P_Sports_1080P-0063.mkv by 7% on an Ampere Altra processor by improving the application throughput. The following table shows data collected using the perf record and perf report utilities to measure CPU cycles and instructions retired.

Build Config	Symbol	Cycles (%)	Cycles	Instructions (%)	Instructions
Default Build	vpx_convolve8_horiz_neon	8.72	6.07E+11	7.52	1.13E+12
	vpx_convolve8_vert_neon	3.53	2.46E+11	2.51	3.78E+11
	Entire Application	100	6.97E+12	100	1.48E+13
-mcpu=native	vpx_convolve8_horiz_neon	3.89	2.52E+11	3.87	5.71E+11
	vpx_convolve8_vert_neon	3.19	2.07E+11	3.29	4.86E+11
	Entire Application	100	6.48E+12	100	1.48E+13
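The per-function improvements quoted above follow directly from the cycle counts. As a quick worked check, the sketch below recomputes them with awk; the helper names speedup_factor and percent_reduction are illustrative only.

```shell
# speedup_factor BEFORE AFTER: ratio of the cycle counts before and after.
speedup_factor() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (b + 0) / (a + 0) }'
}

# percent_reduction BEFORE AFTER: percentage of cycles saved.
percent_reduction() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", 100 * (b - a) / b }'
}

speedup_factor 6.07E+11 2.52E+11      # vpx_convolve8_horiz_neon: prints 2.4
percent_reduction 2.46E+11 2.07E+11   # vpx_convolve8_vert_neon: prints 15.9
```

The second value rounds to the 16% reduction reported for vpx_convolve8_vert_neon.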

GCC Profile Guided Optimization

This section provides an overview of GCC’s Profile Guided Optimization (PGO) and a case study of optimizing MySQL with PGO. Profile Guided Optimization enables GCC to make better optimization decisions, including optimizing branches, reordering code blocks, inlining functions, and optimizing loops via loop unrolling, loop peeling and vectorization. Using PGO requires modifying the build environment to do a 3-part build.

1. Build the application with profile instrumentation enabled, gcc -fprofile-generate.
2. Run the application on representative workloads to generate the profile data.
3. Rebuild the application using the profile data, gcc -fprofile-use.

A challenge of using PGO is the extremely high performance overhead in step 2 above. Because an application built with gcc -fprofile-generate runs slowly, it may not be practical to run it on systems operating in a production environment. For additional details, see the GCC manual’s Program Instrumentation Options section on building applications with run-time instrumentation and its Options That Control Optimization section on rebuilding with the generated profile information.

As described in the GCC manual, -fprofile-update=atomic is recommended for multi-threaded applications, and can improve performance by collecting improved profile data.
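The 3-part build described above can be sketched on a toy program as follows. This is a minimal illustration rather than a recipe for a real build system: the function name pgo_demo, the file names, and the C program are made up for the example, and a real application would run a representative workload in the middle step. For a multi-threaded program, -fprofile-update=atomic would be added to the instrumented build.

```shell
# pgo_demo DIR: run the three PGO steps on a tiny C program inside DIR.
pgo_demo() {
    local dir=$1
    (
        cd "$dir" || exit 1
        cat > hot.c <<'EOF'
#include <stdio.h>

int main(void) {
    long hits = 0;
    for (long i = 0; i < 1000000; i++) {
        if (i % 3 == 0)   /* branch whose bias the profile records */
            hits++;
    }
    printf("%ld\n", hits);
    return 0;
}
EOF
        # Part 1: build with instrumentation. Compiling and linking separately
        # keeps the profile file name (hot.gcda) tied to the object file name.
        gcc -O2 -fprofile-generate -c hot.c -o hot.o || exit 1
        gcc -fprofile-generate hot.o -o hot_instrumented || exit 1
        # Part 2: run the workload; the profile is written when the program exits.
        ./hot_instrumented > /dev/null || exit 1
        # Part 3: rebuild the same object using the collected profile.
        gcc -O2 -fprofile-use -c hot.c -o hot.o || exit 1
        gcc hot.o -o hot_pgo || exit 1
    )
}

# Example:
#   workdir=$(mktemp -d) && pgo_demo "$workdir" && "$workdir/hot_pgo"
```

Keeping the object file name identical between the instrumented and the final compile matters because GCC looks up the profile data by that name.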

When to Use PGO?

PGO lets GCC optimize applications better by supplying additional information, such as how often branches are taken versus not taken and measured loop trip counts. PGO is a useful optimization to try to see whether it improves performance. Performance signatures where PGO may help include applications with a significant percentage of branch mispredictions, which can be measured using the perf utility to read the CPU’s Performance Monitoring Unit (PMU) counter BR_MIS_PRED_RETIRED. Large numbers of branch mispredictions lead to a high percentage of front-end stalls, which can be measured by the STALL_FRONTEND PMU counter. Applications with a high L2 instruction cache miss rate may also benefit from PGO, possibly related to mispredicted branches. In summary, a large percentage of branch mispredictions, CPU front-end stalls, and L2 instruction cache misses are performance signatures where PGO can improve performance.
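One way to turn the counters named above into comparable numbers is sketched below. The perf event names are written as they commonly appear in perf list on Arm systems, but that is an assumption to verify on the target machine. The helper names are hypothetical, and the counts in the example are made up purely to show the arithmetic.

```shell
# collect_counters CMD [ARGS...]: count the events discussed above for one run.
# Check `perf list` for the exact event names available on your system.
collect_counters() {
    perf stat -e br_mis_pred_retired,stall_frontend,instructions,cycles -- "$@"
}

# per_kilo_instructions EVENTS INSTRUCTIONS: events per 1000 instructions,
# for example branch mispredictions per kilo-instruction.
per_kilo_instructions() {
    awk -v e="$1" -v i="$2" 'BEGIN { printf "%.2f\n", 1000 * e / i }'
}

# percent_of PART TOTAL: for example front-end stall cycles as a share of cycles.
percent_of() {
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * p / t }'
}

# Made-up counts, for illustration only:
per_kilo_instructions 5000000 1000000000   # prints 5.00
percent_of 300000000 1200000000            # prints 25.0
```

Comparing these ratios between the default build and the PGO build shows whether PGO actually reduced mispredictions and front-end stalls for a given workload.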

MySQL database GCC PGO Case Study

MySQL is the world’s most popular open-source database and, because of its very large binary, is an ideal candidate for GCC PGO. Without PGO information, it is impossible for GCC to correctly predict the many different code paths executed. Using PGO greatly reduces branch mispredictions, the L2 instruction cache miss rate, and CPU front-end stalls on the Ampere Altra Max Processor.

Summarizing how MySQL is optimized using GCC PGO:

sysbench was used to evaluate MySQL performance
GCC PGO was trained using MySQL MTR (mysql-test-run) test suite
Sysbench’s oltp_point_select and oltp_read_only tests were used to measure performance with PGO build compared to the default build
The number of threads used was then varied from 1 to 1024, giving an average speedup of 29% for the oltp_point_select test and 20% for the oltp_read_only test on an Ampere Altra Max M128-30 processor
With 64 threads, PGO improved performance by 32% by improving MySQL’s throughput

Additional details can be found on the Ampere Developer’s website in the MySQL Tuning Guide.

Summary

Optimizing applications requires experimenting with different strategies to determine what works best. This paper provides recommendations for different GCC compiler optimizations to generate high performing applications running on Ampere Processors. It highlights using the -mcpu option as the easiest way to generate code that takes advantage of all the features supported by Ampere Cloud Native Processors. Two case studies, for MySQL database and VP9 video encoder, show the use of GCC options to optimize these applications where performance is critical.

Built for sustainable cloud computing, Ampere’s first Cloud Native Processors deliver predictable high performance, platform scalability, and power efficiency unprecedented in the industry. We invite you to learn more about our developer efforts and find best practices at developer.amperecomputing.com and join the conversation at community.amperecomputing.com.