Metal optimization: ways to improve code written for Apple's graphics framework

There are many ways to optimize Metal graphics code for maximum performance. Here's how to start getting your code in better shape for the Metal platform.

Apple GPU Architecture

Apple GPUs are tile-based deferred renderers, which means that they work in two main phases: tiling and rendering. The general rendering pipeline is shown below.

You can think of these two phases as one in which the geometry is processed and another in which all the pixels are rendered.

In most modern Apple GPU software, geometry is calculated and divided into meshes and polygons, and then converted into a pixel image, one image per frame.

Modern Apple GPUs have specific subsections in each core that handle shaders, textures, pixel processing, and dedicated tile memory. Each core uses these four regions during rendering.

Rendering each frame uses multiple passes across multiple GPU cores, with each core handling multiple tasks. In general, the more cores, the better the performance.

GPU Counters

GPU counters are used to measure this performance.

GPU counters monitor the load on each GPU and show whether it is fully utilized or underutilized. They also help you find performance bottlenecks.

Finally, GPU counters help you identify the commands that take the longest, so that you can optimize them to improve performance.

There are over one hundred and fifty types of Apple GPU performance counters, and describing them all is beyond the scope of this article.

Making sense of all the performance counter data is a challenge. To help with this, you use the Metal System Trace and Metal Debugger tools built into Xcode and Instruments.

We covered Metal System Trace and Debugger in the previous Metal article.

Four groups of Metal GPU counters are especially important for optimizing Metal in your applications and games. These are:

  1. Performance Limiters
  2. Memory Bandwidth
  3. Occupancy
  4. Hidden Surface Removal

Performance Limiters

Performance limiters, or limiter counters, measure the activity of multiple GPU subsystems, identifying the work being done and detecting stalls that may be blocking or slowing down parallel execution.

Modern GPUs perform arithmetic, memory access, and rasterization work in parallel (at the same time). Performance limiters help you identify the bottlenecks that are slowing down your code.

You can view performance limiters in the Apple Instruments app and use them to optimize your code. Instruments has half a dozen different performance limiters.

Apple Instruments application.

Memory Bandwidth Counters


GPU memory bandwidth counters measure data transfer between the GPU and system memory. The GPU accesses system memory whenever buffers or textures are accessed.

Keep in mind that system-level caches also come into play, so you may sometimes notice short bursts of memory bandwidth that are higher than the actual DRAM transfer speed. This is expected.

If you see a high value for a memory bandwidth counter, it most likely means that memory transfers are slowing down rendering. To eliminate these bottlenecks, you can do several things.

One way to reduce memory bandwidth pressure is to reduce the size of your working data sets. This speeds things up because less data is transferred from system memory.
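To get a feel for how working-set size translates into memory traffic, here is a rough back-of-the-envelope sketch. The `Attachment` type, the resolution, the pixel sizes, and the 60 frames per second figure are all illustrative assumptions, not Metal API, and the estimate counts one full store per attachment per frame while ignoring caches and compression.

```swift
// Hypothetical description of a render target; this is not a Metal type.
struct Attachment {
    let width: Int
    let height: Int
    let bytesPerPixel: Int

    // Bytes written if the whole attachment is stored to memory once.
    var bytesPerStore: Int { return width * height * bytesPerPixel }
}

// Rough bytes per second if every attachment is stored once per frame.
func storeTraffic(_ attachments: [Attachment], framesPerSecond: Int) -> Int {
    let bytesPerFrame = attachments.reduce(0) { $0 + $1.bytesPerStore }
    return bytesPerFrame * framesPerSecond
}

// Assumed example: the same target at 8 and at 4 bytes per pixel.
let wide = Attachment(width: 2048, height: 1536, bytesPerPixel: 8)
let narrow = Attachment(width: 2048, height: 1536, bytesPerPixel: 4)

let before = storeTraffic([wide], framesPerSecond: 60)
let after = storeTraffic([narrow], framesPerSecond: 60)

// prints "before: 1440 MiB/s, after: 720 MiB/s"
print("before: \(before / 1_048_576) MiB/s, after: \(after / 1_048_576) MiB/s")
```

Halving the bytes per pixel halves the estimated traffic. Real numbers should always come from the memory bandwidth counters rather than from an estimate like this.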

Another way is to load only the data needed for the current render pass, and save only the data needed for future render passes. This also reduces the overall data size.
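In Metal, this choice is expressed through the load and store actions on a render pass descriptor. The sketch below is a minimal example that assumes a hypothetical renderer whose color output is read later but whose depth buffer is only needed while the pass runs. The helper name `configureLoadStore` is made up for illustration, and the caller is expected to assign the textures.

```swift
import Metal

// Sets load/store actions for a pass whose color result is consumed later
// but whose depth buffer is only needed during the pass itself.
// (Assumption: this matches how your renderer actually uses the attachments.)
func configureLoadStore(_ pass: MTLRenderPassDescriptor) {
    // Previous color contents are not needed, so clear instead of loading.
    pass.colorAttachments[0].loadAction = .clear
    pass.colorAttachments[0].clearColor = MTLClearColor(red: 0, green: 0, blue: 0, alpha: 1)
    // A later pass (or presentation) reads the color result, so store it.
    pass.colorAttachments[0].storeAction = .store

    // Depth is used only within this pass: clear at the start, discard at the end.
    pass.depthAttachment.loadAction = .clear
    pass.depthAttachment.clearDepth = 1.0
    pass.depthAttachment.storeAction = .dontCare
}

let pass = MTLRenderPassDescriptor()
configureLoadStore(pass)
```

With `.dontCare` as the depth store action, the depth data never has to be written back to system memory at the end of the pass.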

You can also use block-based texture compression (ASTC) to reduce the size of texture resources and lossless compression for textures generated at runtime.
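To see why ASTC helps, it is worth working through the sizes. Every ASTC block occupies 128 bits (16 bytes) regardless of its footprint, so larger block footprints mean fewer bytes per pixel. The following sketch compares a 1024 x 1024 RGBA8 texture with two ASTC variants; the texture size and the helper name `astcSize` are illustrative assumptions.

```swift
// Every ASTC block occupies 16 bytes (128 bits), whatever its footprint.
let astcBytesPerBlock = 16

// Size in bytes of one ASTC-compressed mip level.
func astcSize(width: Int, height: Int, blockWidth: Int, blockHeight: Int) -> Int {
    let blocksAcross = (width + blockWidth - 1) / blockWidth    // round up
    let blocksDown = (height + blockHeight - 1) / blockHeight   // round up
    return blocksAcross * blocksDown * astcBytesPerBlock
}

let uncompressed = 1024 * 1024 * 4   // RGBA8: 4 bytes per pixel
let astc4x4 = astcSize(width: 1024, height: 1024, blockWidth: 4, blockHeight: 4)
let astc8x8 = astcSize(width: 1024, height: 1024, blockWidth: 8, blockHeight: 8)

// Sizes in KiB; prints "4096 1024 256"
print(uncompressed / 1024, astc4x4 / 1024, astc8x8 / 1024)
```

The 4x4 footprint is a quarter of the uncompressed size and the 8x8 footprint a sixteenth, at the cost of lower image quality for larger blocks.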

Occupancy

Occupancy measures how many threads are currently running out of the total thread pool. 100% occupancy means that a given GPU is currently maxed out in terms of the number of threads and overall work it can handle.

The overall occupancy counter measures the percentage of total thread capacity used by the GPU. This total is the sum of compute, vertex, and fragment occupancy.

Hidden Surface Removal

Hidden surface removal typically occurs in the middle of each render pass, before fragments are processed and shortly after the tiled vertex buffer is sent to the GPU for rasterization.

Depth buffers and hidden surface removal are used to eliminate any surfaces that are not visible to the view camera in the current scene. This improves performance because these surfaces don't need to be shaded.

For example, the surfaces on the back of opaque 3D objects don't need to be drawn because the camera (and viewer) never sees them, so there's no point in painting them.

Surfaces hidden by other 3D objects in front of them relative to the camera are also removed.

GPU counters for hidden surface removal report the total number of pixels rasterized, the number of fragment shader invocations, and the number of pixels stored.
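One way to read these counters together is to turn them into ratios. The sketch below uses made-up readings copied by hand from a profiling tool; the `HSRSample` type, its field names, and the numbers are assumptions for illustration, not a Metal API. It treats overdraw as fragment shader invocations divided by the number of pixels written out at the end of the pass.

```swift
// Hypothetical counter readings for one render pass, entered by hand.
// These are plain numbers, not values fetched through a Metal API.
struct HSRSample {
    let pixelsRasterized: Double
    let fragmentShaderInvocations: Double
    let pixelsStored: Double

    // Average number of times each stored pixel was shaded.
    var overdraw: Double { return fragmentShaderInvocations / pixelsStored }

    // Share of rasterized pixels that never reached a fragment shader.
    var rejectedFraction: Double {
        return 1 - fragmentShaderInvocations / pixelsRasterized
    }
}

let sample = HSRSample(pixelsRasterized: 8_000_000,
                       fragmentShaderInvocations: 3_000_000,
                       pixelsStored: 2_000_000)

print(sample.overdraw)          // 1.5
print(sample.rejectedFraction)  // 0.625
```

An overdraw value close to 1 suggests that most shaded pixels end up on screen, while a higher value points to shading work that is later covered up.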

GPU counters can also help you minimize blending, which carries a performance cost of its own.

To get the most out of hidden surface removal, draw objects in order of visibility state: opaque objects first, then translucent ones. Try to avoid interleaving opaque and non-opaque meshes.
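The ordering rule above can be sketched as a simple sort over a draw list. The `Visibility` and `DrawItem` types and the scene contents are hypothetical; a real renderer would carry meshes and pipeline state, and may need additional depth sorting for translucent objects.

```swift
// Hypothetical visibility states; the raw value doubles as the draw priority.
enum Visibility: Int {
    case opaque = 0
    case translucent = 1
}

// Hypothetical draw item; a real renderer would carry mesh and pipeline state.
struct DrawItem {
    let name: String
    let visibility: Visibility
}

// Groups draws so that all opaque meshes come before any translucent ones.
// The original index is part of the sort key, so order within a group is kept.
func orderedForHSR(_ items: [DrawItem]) -> [DrawItem] {
    return items.enumerated()
        .sorted { a, b in
            (a.element.visibility.rawValue, a.offset) < (b.element.visibility.rawValue, b.offset)
        }
        .map { $0.element }
}

let scene = [
    DrawItem(name: "window", visibility: .translucent),
    DrawItem(name: "wall", visibility: .opaque),
    DrawItem(name: "smoke", visibility: .translucent),
    DrawItem(name: "floor", visibility: .opaque),
]

// prints ["wall", "floor", "window", "smoke"]
print(orderedForHSR(scene).map { $0.name })
```

Keeping the submission order inside each group means any back-to-front ordering the caller already applied to translucent objects is preserved.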


There are many resources on Metal, including the Apple Metal Developer pages, WWDC videos, and the excellent third-party book The Metal Programming Guide: Tutorial and Reference via Swift by Janie Clayton.

To get started with Metal optimization, be sure to watch the WWDC videos “Optimize Metal apps and games with GPU counters” from WWDC20, “Harness Apple GPUs with Metal” also from WWDC20, and “Delivering Optimized Metal Apps and Games” from WWDC19.

Next, read “Capture a Metal Workload in Xcode” and “Metal Debugging Types” on the Metal Debugger pages on Apple's developer documentation website.

The Metal Debugger documentation also has a section on “Metal Workload Analysis”.

You'll definitely want to spend a lot of time with the Xcode Metal Debugger and Trace documentation to learn in detail how the various GPU counters and performance graphs work. Without them, you won't be able to get a detailed view of what's actually happening in your Metal code.

As for compressed textures, it's also worth reading about Adaptive Scalable Texture Compression (ASTC) and how it works in modern rendering pipelines.

ARM has an excellent overview of ASTC on its developer website.

Optimizing Metal performance is a broad and complex topic. We have only scratched the surface here and will continue to explore it in future articles.
