Semiconductor

ARM Mali GPUs: A Decade of Mobile Graphics, Dissected

SOKYO Labs Author

April 10, 2026

8 min read

The story of mobile graphics is, in large part, the story of ARM Mali. That is not a marketing claim; it is a structural reality of how the semiconductor industry works. ARM does not manufacture chips. It designs CPU and GPU architectures and licenses them to silicon vendors, who integrate those designs into their own system-on-chip products. In practice, when you hold a MediaTek-powered phone, a Kirin device from HiSilicon, or a Samsung Exynos phone from an earlier era, the GPU doing the rendering work beneath the surface almost certainly carries ARM's fingerprints. Over nearly two decades, from the Mali-200 in 2006 to the Immortalis-G925 in 2024, that GPU lineage has evolved through five distinct microarchitectural generations, each one solving specific problems that its predecessor exposed, each one carrying forward a design philosophy shaped by the unique constraints of mobile computing.

The earliest chapter belongs to the Utgard family, which began with the Mali-200 in 2006 after ARM acquired Falanx Microsystems, a Norwegian startup that had been developing GPU IP for embedded devices. Utgard was a non-unified shader architecture, meaning it maintained discrete pixel and vertex shader pipelines rather than a single programmable pool of execution resources. This design mirrored the desktop GPU paradigms of the early 2000s, which is unsurprising given when it was conceived, but it placed hard limits on how efficiently work could be balanced between vertex and pixel processing. The Mali-400 became the most commercially significant product of this era, appearing in Samsung's Exynos 4210 in 2011 and reaching a scale of deployment that made it, for a time, one of the most widely used mobile GPUs on the planet. It was capable and it was cheap to license, but it was already architecturally dated by the time it reached mass-market devices.

The Midgard architecture, introduced in 2010 with the Mali-T604, represented the first serious generational leap. Midgard moved to a unified shader model, where the same pool of execution units could handle vertex, fragment, and compute workloads interchangeably, with resources allocated dynamically according to what the pipeline needed at any given moment. This was a significant step toward fully programmable GPU design. Midgard also introduced Hierarchical Tiling, a refinement of the tile-based rendering that Mali had used since Utgard and that sets mobile GPUs apart from their desktop counterparts. Rather than rendering the entire frame in a single pass through a framebuffer too large to fit comfortably in fast on-chip memory, Mali uses a tile-based deferred rendering approach: the scene is divided into small tiles, typically 16x16 pixels, each rendered entirely within on-chip memory before being written out to DRAM. This dramatically reduces the number of expensive off-chip memory accesses per frame, which in a thermally constrained mobile device is not a performance optimization but a survival mechanism. The second iteration of Midgard added Forward Pixel Kill, a visibility determination technique that discards occluded pixels before they are shaded, reducing wasted work in scenes with significant depth complexity. The Mali-T760, T860, and T880 carried Midgard through to 2016 and powered devices ranging from budget tablets to flagship handsets. The architecture aged reasonably well, but its shader core design was beginning to show its limits as workloads shifted toward more compute-intensive tasks.
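The bandwidth argument for tiling can be made concrete with a rough sketch. The code below is a simplified model, not ARM's implementation: the function names, the 4-byte color and depth formats, and the assumption that an immediate-mode renderer touches DRAM for every fragment are all illustrative choices, and the model ignores the extra traffic a tiler spends writing and reading its per-tile primitive lists.

```python
from math import ceil


def tile_grid(width, height, tile=16):
    """Tiles across and down when the frame is cut into tile x tile squares."""
    return ceil(width / tile), ceil(height / tile)


def bin_bbox(bbox, width, height, tile=16):
    """Return the (tx, ty) tiles overlapped by an integer screen-space bounding box.

    bbox is (x0, y0, x1, y1), inclusive. Real tilers test the primitive more
    precisely; binning by bounding box is a conservative simplification.
    """
    x0, y0, x1, y1 = bbox
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(width - 1, x1), min(height - 1, y1)
    if x0 > x1 or y0 > y1:
        return []  # entirely off-screen
    return [(tx, ty)
            for ty in range(y0 // tile, y1 // tile + 1)
            for tx in range(x0 // tile, x1 // tile + 1)]


def framebuffer_traffic(width, height, overdraw, color_bytes=4, depth_bytes=4):
    """Estimate framebuffer DRAM bytes per frame under two idealized models.

    immediate: every fragment reads depth, then writes depth and color in DRAM
               (assumes no caching or compression, so this is a worst case).
    tiled:     depth and intermediate color stay on-chip; each pixel's final
               color is written to DRAM exactly once.
    """
    pixels = width * height
    immediate = pixels * overdraw * (2 * depth_bytes + color_bytes)
    tiled = pixels * color_bytes
    return immediate, tiled


if __name__ == "__main__":
    print(tile_grid(1920, 1080))                      # tiles across, tiles down
    print(bin_bbox((10, 10, 40, 20), 1920, 1080))     # tiles touched by one small box
    print(framebuffer_traffic(1920, 1080, overdraw=3))
```

Under these assumptions, a 1080p frame with an average overdraw of 3 generates nine times less framebuffer traffic in the tiled model. Real savings depend on caches, framebuffer compression, and geometry load, so the ratio should be read as an illustration of the mechanism rather than a measurement.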

Bifrost arrived in 2016 with the Mali-G71 and brought with it a fundamentally redesigned execution model. Where Midgard's shader cores ran each thread on its own using wide SIMD vector instructions, which left lanes idle whenever shader code was mostly scalar, Bifrost introduced a quad-based execution model in which four threads are grouped together and issued in lockstep across a four-lane engine executing scalar instructions. Keeping each group only four threads wide limited the divergence penalties that hurt performance in complex branching shader code. The Mali-G72 followed in 2017 with incremental improvements, but the architecturally significant Bifrost release was the G76 in 2018, which doubled execution lane width to eight per engine and scaled the maximum core count to twenty, meaningfully increasing raw throughput at comparable power budgets. The G76 appeared in HiSilicon's Kirin 980 and helped establish that ARM's GPU could be taken seriously in flagship territory, even if it still trailed Qualcomm's Adreno in many real-world benchmarks. Bifrost was competent and well-engineered, but its per-core efficiency and its thermal behavior under sustained load remained points of legitimate criticism.
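One way to see why group width matters for branching code is a toy lockstep model. The sketch below assumes threads in a group issue together, so a group pays for both sides of a branch whenever its threads disagree. The function name, the cycle costs, and the branch pattern are invented for illustration, and nothing here models Bifrost's actual scheduler.

```python
def lane_utilization(taken, width, cost_then=10, cost_else=10):
    """Fraction of issued lane-cycles that do useful work for one if/else.

    taken: non-empty list of booleans, one per thread (True = takes the 'then' side).
    width: threads that issue together in lockstep.
    A group executes the 'then' side if any of its threads needs it and the
    'else' side if any of its threads needs it; inactive lanes are wasted.
    """
    issued_cycles = 0
    for start in range(0, len(taken), width):
        group = taken[start:start + width]
        if any(group):
            issued_cycles += cost_then
        if not all(group):
            issued_cycles += cost_else
    useful = sum(cost_then if t else cost_else for t in taken)
    return useful / (issued_cycles * width)


if __name__ == "__main__":
    # Hypothetical pattern: two coherent runs followed by a checkerboard.
    pattern = [True] * 4 + [False] * 4 + [True, False] * 4
    for width in (1, 4, 8):
        print(width, round(lane_utilization(pattern, width), 3))
```

For this particular pattern, utilization falls from 1.0 with independent threads to about 0.67 at a width of four and 0.5 at a width of eight. Wider groups amortize instruction issue over more threads, which is why widening is still attractive when neighboring threads mostly take the same path.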

The Valhall generation began in 2019 with the Mali-G77 and constituted the deepest architectural rethink ARM had attempted in years. Valhall replaced Bifrost's quad-based model with a superscalar execution engine built around FMA units capable of executing multiple independent instructions per cycle, a design more influenced by modern compute GPU architectures than by the mobile graphics conventions that had shaped Midgard and Bifrost. The G77 and its successor the G78, which appeared in 2020, delivered roughly 30% and 15% performance improvements over their predecessors respectively, with power efficiency improving in roughly corresponding proportions. The G78 in particular reached up to 24 shader cores in its maximum configuration, appeared in HiSilicon's Kirin 9000 and Samsung's Exynos 2100, and represented the first time a Mali GPU genuinely competed with Adreno in sustained gaming workloads rather than just peak benchmark conditions. The third-generation Valhall G710, announced in 2021, introduced a new Command Stream Frontend to replace the older Job Manager architecture, which had been a source of driver complexity and occasional performance unpredictability. The CSF brought the scheduling model closer to that of modern discrete GPU driver stacks, giving developers more precise control over command submission and synchronization.

The fourth-generation Valhall iteration in 2022 brought the change that the mobile GPU space had been waiting for: the Immortalis-G715 and, with it, hardware-accelerated ray tracing. This launch formalized the Immortalis naming tier for ARM's highest-performing GPUs, distinguishing parts with dedicated ray tracing units from the Mali-branded parts that share the same core architecture but omit the RT hardware. The G715 also doubled the FMA block within each execution engine and added matrix multiply instruction support, directly targeting the inference workloads that were beginning to appear more frequently in mobile applications. ARM claimed 15% better performance and 15% better energy efficiency over the G710, which were credible incremental gains, but the ray tracing capability was the headline. Hardware RT on mobile had existed in name on some platforms, but the G715 was the first implementation that game developers could plausibly target for real-time workloads.

The 5th Generation architecture arrived in 2023 with the Immortalis-G720, Mali-G720, and Mali-G620, and introduced Deferred Vertex Shading as the central innovation. DVS delays the execution of vertex shaders until the tiling stage has determined which vertices actually contribute to visible pixels, avoiding the cost of processing geometry that will never appear on screen. In scenes with high geometric complexity, which mobile game engines were increasingly targeting, the bandwidth savings were substantial. ARM documented up to 40% reductions in memory bandwidth across real game titles such as Genshin Impact and Fortnite, numbers that translate directly into lower power consumption and extended thermal headroom. The tile size also grew from 32x32 to 64x64 pixels, four times the area per tile, improving efficiency for high-resolution rendering. The G720 landed first in MediaTek's Dimensity 9300, which paired an all-big-core CPU design with the new GPU in a package that genuinely shifted the competitive balance in Android's flagship tier.
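The idea of shading only the vertices that matter can be sketched in a few lines. This is an illustration of the principle as described above, not ARM's hardware pipeline: it assumes screen-space positions are already available (for example from a cheap position-only pass), treats counter-clockwise triangles as front-facing, and uses hypothetical function names and data.

```python
def signed_area(a, b, c):
    """Signed area of a 2D triangle; positive when a, b, c wind counter-clockwise."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def survives_culling(tri, positions, width, height):
    """True if the triangle is front-facing and its bounding box touches the screen."""
    a, b, c = (positions[i] for i in tri)
    if signed_area(a, b, c) <= 0:
        return False  # back-facing or degenerate
    xs = (a[0], b[0], c[0])
    ys = (a[1], b[1], c[1])
    if max(xs) < 0 or min(xs) >= width or max(ys) < 0 or min(ys) >= height:
        return False  # entirely off-screen
    return True


def vertices_to_shade(triangles, positions, width, height):
    """Indices of vertices referenced by at least one surviving triangle.

    Only these vertices would need the expensive part of the vertex shader
    (attributes such as normals and texture coordinates) in a deferred scheme.
    """
    needed = set()
    for tri in triangles:
        if survives_culling(tri, positions, width, height):
            needed.update(tri)
    return needed


if __name__ == "__main__":
    positions = [(0, 0), (10, 0), (0, 10), (10, 10),
                 (-50, -50), (-40, -50), (-50, -40)]
    triangles = [(0, 1, 2),   # front-facing and on-screen
                 (1, 2, 3),   # back-facing
                 (4, 5, 6)]   # front-facing but off-screen
    needed = vertices_to_shade(triangles, positions, 1920, 1080)
    print(sorted(needed), f"{len(needed)} of {len(positions)} vertices")
```

In this small example only three of the seven vertices belong to a triangle that survives culling, so the remaining four never need full shading and their outputs never have to be written to memory.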

The current generation arrived in 2024 with the Immortalis-G925, which ARM positions as a second iteration of the 5th Generation architecture rather than as a new architectural generation. The headline numbers are large: 37% better peak gaming performance versus the G720 in a 14-core configuration, 30% lower power at equivalent performance, and 46% average improvement across popular mobile titles. A 24-core maximum configuration was introduced, expanding headroom for SoC vendors who want to prioritize GPU performance at the cost of die area. Fragment Prepass, a new addition in this iteration, culls occluded fragments before the full fragment shading pipeline runs, reducing wasted arithmetic on geometry hidden behind other objects. The optimization reduces CPU overhead by up to 43% in geometry-heavy scenes while also cutting GPU cycles. On the AI side, a 24-core G925 delivers up to 34.4 TOPS of inference throughput, a number that reflects how seriously ARM is treating on-device machine learning as a GPU workload rather than something delegated entirely to a dedicated NPU. The Immortalis-G925 shipped first in MediaTek's Dimensity 9400, which powers the Vivo X200 Pro among others, and represents the current state of the art for ARM GPU design in mobile devices.
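The effect of resolving visibility before shading can be illustrated with a generic depth-prepass model for a single pixel. This is a textbook-style sketch, not a description of how Fragment Prepass is built in hardware: the function names are hypothetical, all fragments are assumed opaque, and smaller depth values are assumed to be nearer the camera.

```python
def shaded_with_early_z(depths):
    """Fragments shaded when a depth test runs in submission order.

    A fragment is shaded if it is nearer than everything drawn before it,
    so the count depends on the order in which geometry is submitted.
    """
    nearest = float("inf")
    shaded = 0
    for d in depths:
        if d < nearest:
            nearest = d
            shaded += 1
    return shaded


def shaded_with_prepass(depths):
    """Fragments shaded when a depth-only pass finds the nearest surface first.

    Only fragments that match the final nearest depth are shaded, whatever
    order the geometry arrived in.
    """
    if not depths:
        return 0
    nearest = min(depths)
    return sum(1 for d in depths if d == nearest)


if __name__ == "__main__":
    back_to_front = [0.9, 0.7, 0.5, 0.2]
    front_to_back = sorted(back_to_front)
    for name, order in (("back-to-front", back_to_front),
                        ("front-to-back", front_to_back)):
        print(name, shaded_with_early_z(order), shaded_with_prepass(order))
```

With four overlapping opaque fragments, the submission-order test shades all four when they arrive back to front and only one when they arrive front to back, while the prepass model shades one in both cases. That order independence suggests why an application may be able to skip sorting its draw calls on the CPU.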

What this entire lineage reveals, read as a continuous technical narrative, is the story of a company that began with a small and cheap embedded GPU and was forced, generation by generation, to solve increasingly hard problems: how to render efficiently without desktop-scale memory bandwidth, how to handle divergent shader workloads in thermally constrained environments, and how to compete with Qualcomm's in-house Adreno teams, who had the advantage of co-designing GPU and SoC together. The introduction of tile-based deferred rendering was the foundational insight. Unified shading was the structural maturation. Valhall's superscalar engine was the performance inflection point. DVS was the bandwidth breakthrough. Immortalis was the acknowledgment that mobile graphics had grown up enough to deserve hardware ray tracing. Each step was earned, not given, and the 10 billion ARM GPUs that have shipped across device categories as varied as flagship smartphones, set-top boxes, automotive dashboards, and wearables represent the aggregate of those engineering decisions reaching the real world. The Mali G1 family, which ARM has now introduced as the next platform with Ultra, Premium, and Pro tiers built on the 5th Generation architecture with an updated RT unit, suggests the lineage is not slowing. If anything, the surface area it is being asked to cover (gaming realism, on-device AI inference, and the kind of sustained compute workloads that were once reserved for discrete hardware) is only expanding.
