AMD’s next-gen APU unifies CPU/GPU memory, should appear in Kaveri, Xbox 720, PS4



Ever since it debuted its first generation of “Fusion” processors with a GPU and CPU on the same die, AMD has talked up its plans for a heterogeneous system architecture, or HSA. Today, the company is talking about the next step in that process (and giving a few hints about which processors will support it). AMD is calling the new approach hUMA, Heterogeneous Uniform Memory Access, but the company’s definitions of UMA and NUMA are rather tortured, so much so, in fact, that we’re going to ignore the term.

HSA is an attempt to solve a long-standing problem with system architectures. Your GPU, in a conventional system, is a specialized co-processor with its own pool of RAM. Integrated GPUs might bring the GPU aboard the CPU package or share access to local system memory, but there’s still a sharp delineation between GPU program space and CPU program space. AMD’s first APU, codenamed Llano, essentially combined a conventional CPU and GPU in a single die. Unifying the two components was a major step, but the underlying communication model between them didn’t look much different from the earlier generations of AMD motherboards.

Trinity moved the ball forward on this front by adding shared power management and support for C++ AMP. Now, AMD is talking up the next phase of HSA: adding shared pointers, a bi-directional coherent memory model, and pageable memory. The big-picture way to think about this next generation of HSA support is that it makes it much easier for the CPU and GPU to share data, perform tasks, and communicate with each other regarding the status of those tasks.


Even after Trinity improved the situation, the path between CPU and GPU is rather laborious. There’s no clean, simple way for both components to access the same areas of memory. This next generation of HSA capability fixes that.


On a system level, moving to a heterogeneous fully coherent memory model looks like this:


The takeaway here is that both CPU and GPU can read and modify the same areas of memory without one waiting on the other to handle the task. This should make it much easier to share resources between the two — it eliminates communication latency and bottlenecks that would otherwise make GPU offloading a needlessly complicated affair.
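To make the difference concrete, here is a minimal CPU-only sketch in Python. It is a toy model, not AMD's API: the `CopyingDevice` and `SharedDevice` classes are hypothetical stand-ins for a GPU with its own memory pool and a GPU that shares the CPU's address space, and the "kernel" simply doubles each byte. The point is only to show where the staging copies disappear.

```python
"""Toy model: copy-based offload vs. a shared address space (CPU-only simulation)."""


class CopyingDevice:
    """Hypothetical stand-in for a GPU with its own private memory pool."""

    def __init__(self):
        self.bytes_copied = 0

    def run(self, host_buffer: bytearray) -> None:
        # Stage the data into "device memory", which is a separate copy.
        device_buffer = bytearray(host_buffer)
        self.bytes_copied += len(device_buffer)

        # The "kernel": double every byte, wrapping at 256.
        for i in range(len(device_buffer)):
            device_buffer[i] = (device_buffer[i] * 2) % 256

        # Copy the result back so the CPU can see it.
        host_buffer[:] = device_buffer
        self.bytes_copied += len(device_buffer)


class SharedDevice:
    """Hypothetical stand-in for a GPU that sees the same memory as the CPU."""

    def __init__(self):
        self.bytes_copied = 0

    def run(self, shared_view: memoryview) -> None:
        # Same kernel, but it works in place through a zero-copy view,
        # so no staging copies are needed in either direction.
        for i in range(len(shared_view)):
            shared_view[i] = (shared_view[i] * 2) % 256


data_a = bytearray(range(8))
data_b = bytearray(range(8))

legacy = CopyingDevice()
legacy.run(data_a)

shared = SharedDevice()
shared.run(memoryview(data_b))

print("legacy bytes copied:", legacy.bytes_copied)
print("shared bytes copied:", shared.bytes_copied)
```

Both paths produce the same result, but the copying path moves every byte twice. On real hardware the cost also includes synchronization and driver overhead, which this sketch does not attempt to model.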

When will we see it? (And what will it mean?)

When asked, AMD stated that Kaveri (due in the second half of 2013) will be the first chip to use these second-generation HSA features. The G-series embedded parts announced last week, based on Kabini, will not. I’m going to go out on a limb and say I’ll be surprised if this new technology doesn’t show up in the Xbox Durango and PS4, even if the graphics cores in those products are otherwise based on GCN.

Why? Because it makes perfect sense for Microsoft and Sony to adopt this technology. The ability to exchange data and maintain coherency between CPU and GPU is a major benefit in console operations. A recent interview with Mark Cerny at Gamasutra seems to confirm that the PS4 at least will employ AMD’s hUMA tech.

AMD even spoke, at one point, about the idea of using an eDRAM chip as a cache for GPU memory, essentially speaking to the Xbox Durango’s expected memory structure. The following quote comes from AMD’s HSA briefing/seminar:

“Game developers and other 3D rendering programs have wanted to use extremely large textures for a number of years and they’ve had to go through a lot of tricks to pack pieces of textures into smaller textures, or split the textures into smaller textures, because of problems with the legacy memory model… Today, a whole texture has to be locked down in physical memory before the GPU is allowed to touch any part of it. If the GPU is only going to touch a small part of it, you’d like to only bring those pages into physical memory and therefore be able to accommodate other large textures.

With a hUMA approach to 3D rendering, applications will be able to code much more naturally with large textures and yet not run out of physical memory, because only the real working set will be brought into physical memory.”
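The working-set argument in that quote can be illustrated with a short Python sketch. This is a simplified simulation under stated assumptions: the 4 KiB page size, the 64 MiB texture, and the `PagedTexture` class are all hypothetical values and names chosen for illustration, not details from AMD's briefing. It compares how much memory must be resident when the whole texture is pinned versus when only the touched pages are brought in.

```python
"""Toy model of pageable textures: only touched pages become resident."""

PAGE_SIZE = 4096  # bytes per page; an assumed value for illustration


class PagedTexture:
    """Hypothetical texture whose pages are brought in on first access."""

    def __init__(self, size_bytes: int):
        self.size_bytes = size_bytes
        self.resident_pages = set()  # page numbers currently "in physical memory"

    @property
    def total_pages(self) -> int:
        # Ceiling division: a partial last page still occupies a full page.
        return -(-self.size_bytes // PAGE_SIZE)

    def touch(self, offset: int, length: int) -> None:
        """Simulate the GPU reading `length` bytes starting at `offset`."""
        if length <= 0 or offset < 0 or offset + length > self.size_bytes:
            raise ValueError("access outside texture")
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        self.resident_pages.update(range(first, last + 1))

    def resident_bytes(self) -> int:
        return len(self.resident_pages) * PAGE_SIZE


texture = PagedTexture(64 * 1024 * 1024)     # a 64 MiB texture
texture.touch(offset=10_000, length=20_000)  # the GPU samples a small region

legacy_resident = texture.total_pages * PAGE_SIZE  # whole texture locked down
paged_resident = texture.resident_bytes()          # only the touched pages

print("legacy model, resident bytes:", legacy_resident)
print("pageable model, resident bytes:", paged_resident)
```

In this example the 20,000-byte access spans six pages, so the pageable model keeps 24,576 bytes resident instead of the full 64 MiB. A real implementation would also need an eviction policy and would pay a latency cost on each page fault, neither of which is modeled here.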


