Thursday, August 3, 2017

Understanding I/O on the mid-2017 iMac

My wife recently bought me a brand new mid-2017 iMac to replace my ailing, nine-year-old HP desktop.  Back when I got the HP, I was just starting to learn how computers really worked and didn't understand much about how the CPU connected to all of the other ports that came off the motherboard--everything that sat between the SATA ports and the CPU itself was a no-man's land of mystery to me.

Between then and now though, I've somehow gone from being a poor graduate student doing molecular simulation to a supercomputer I/O architect.  Combined with the fact that my new iMac had a bunch of magical new ports that I didn't understand (USB-C ports that can tunnel PCIe, USB 3.1, and Thunderbolt??), I figured I'd sit down and see if I could actually work out exactly how the I/O subsystem on this latest Kaby Lake iMac was wired up.

I'll start out by saying that the odds were in my favor--over the last decade, the I/O subsystem of modern computers has gotten a lot simpler as more of the critical components (like the memory controllers and PCIe controllers) have moved on-chip.  As CPUs become more tightly integrated, individual CPU cores, system memory, and PCIe peripherals can all talk to each other without having to cross a bunch of proprietary middlemen like in days past.  Having to understand how the front-side bus clock is related to the memory channel frequency all gets swept under the rug that is the on-chip network, and I/O (that is, moving data between system memory and stuff outside of the CPU) is a lot easier.

With all that said, let's cut to the chase.  Here's a block diagram showing exactly how my iMac is plumbed, complete with bridges to external interfaces (like PCIe, SATA, and so on) and the bandwidths connecting them all:




Aside from the AMD Radeon GPU, just about every I/O device and interface hangs off of the Platform Controller Hub (PCH) through a DMI 3.0 connection.  When I first saw this, I was a bit surprised by how little I understood; PCIe makes sense since that is the way almost all modern CPUs (and their memory) talk to the outside world, but I'd never given the PCH a second thought, and I didn't even know what DMI was.

As with any complex system though, the first step towards figuring out how it all works is to break it down into simpler components.  Here's what I figured out.

Understanding the PCH

In the HPC world, all of the performance-critical I/O devices (such as InfiniBand channel adapters, NICs, SSDs, and GPUs) are directly attached to the PCIe controller on the CPU.  By comparison, the PCH is almost a non-entity in HPC nodes since all it does is provide low-level administration interfaces like USB and VGA ports for crash carts.  It had never occurred to me that desktops, which are usually optimized for universality over performance, would depend so heavily on the rinky-dink PCH.

Taking a closer look at the PCIe devices that talk to the Sunrise Point PCH:



we can see that the PCH chip provides PCIe devices that act as

  • a USB 3.0 controller
  • a SATA controller
  • a HECI controller (which acts as an SMBus controller)
  • an LPC controller (which acts as an ISA controller)
  • a PCI bridge (0000:00:1b) (to which the NVMe drive, not a real PCI device, is attached)
  • a PCIe bridge (0000:00:1c) that breaks out three PCIe root ports

Logically speaking, these PCIe devices are all directly attached to the same PCIe bus (domain #0000, bus #00; abbreviated 0000:00) as the CPU itself (that is, the host bridge device #00, or 0000:00:00).  However, we know that the PCH, by definition, is not integrated directly into the on-chip network of the CPU (that is, the ring that allows each core to maintain cache coherence with its neighbors).  So how can this be?  Shouldn't there be a bridge that connects the CPU's bus (0000:00) to a different bus on the PCH?

Clearly the answer is no, and this is a result of Intel's proprietary DMI interface which connects the CPU's on-chip network to the PCH in a way that is transparent to the operating system.  Exactly how DMI works is still opaque to me, but it acts like an invisible PCIe bridge that glues together physically separate PCIe buses into a single logical bus.  The major limitation to DMI as implemented on Kaby Lake is that it only has the bandwidth to support four lanes of PCIe Gen 3.

Given that DMI can only support the traffic of a 4x PCIe 3.0 device, there is an interesting corollary: the NVMe device, which itself attaches to the PCH via a 4x PCIe 3.0 link, can theoretically saturate the DMI link.  In such a case, all other I/O traffic (such as that coming from the SATA-attached hard drive and the gigabit NIC) is either choked out by the NVMe device or competes with it for bandwidth.  In practice, very few NVMe devices can actually saturate a PCIe 3.0 4x link though, so unless you replace the iMac's NVMe device with an Optane SSD, this shouldn't be an issue.
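As a back-of-the-envelope check on those numbers, the usable bandwidth of a PCIe 3.0 link follows from its 8 GT/s per-lane signaling rate and 128b/130b line encoding.  The function name below is made up for illustration, and the sketch ignores packet and protocol overhead, so real-world throughput will be somewhat lower than what it prints.

```python
# Rough PCIe 3.0 link bandwidth, ignoring packet/protocol overhead.
PCIE3_GT_PER_SEC = 8.0          # per-lane signaling rate of PCIe 3.0
PCIE3_ENCODING = 128.0 / 130.0  # 128b/130b line encoding efficiency


def pcie3_gbytes_per_sec(lanes: int) -> float:
    """Return the approximate one-way bandwidth of a PCIe 3.0 link in GB/s."""
    gbits = PCIE3_GT_PER_SEC * PCIE3_ENCODING * lanes
    return gbits / 8.0


# Assumption from this post: DMI 3.0 is treated as equivalent to PCIe 3.0 x4.
dmi_link = pcie3_gbytes_per_sec(4)
nvme_link = pcie3_gbytes_per_sec(4)   # the NVMe drive's own x4 link
print(f"DMI ~{dmi_link:.2f} GB/s, NVMe link ~{nvme_link:.2f} GB/s")
```

Because both links come out to the same figure, a drive that could fill its own x4 link would leave nothing on DMI for anything else.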

Understanding Alpine Ridge

The other mystery component in the I/O subsystem is the Thunderbolt 3 controller (DSL6540), called Alpine Ridge.  These are curious devices that I still admittedly don't understand fully (they play no role in HPC) because, among other magical properties, they can tunnel PCIe to external devices.  For example, the Thunderbolt to Ethernet adapters widely available for MacBooks are actually fully fledged PCIe NICs, wrapped in a neat white plastic package, that tunnel PCIe signaling over a cable.  In addition, they can somehow deliver this PCIe signaling, DisplayPort, and USB 3.1 through a single self-configuring physical interface.

It turns out that being able to run multiple protocols over a single cable is a feature of the USB-C physical specification, which is a completely separate standard from USB 3.1.  However, the PCIe magic that happens inside Alpine Ridge is a result of an integrated PCIe switch which looks like this:



The Alpine Ridge PCIe switch connects up to the PCH with a single PCIe 3.0 4x link and provides four downstream 4x ports for peripherals.  If you read the product literature for Alpine Ridge, it advertises two of these 4x ports for external connectivity; the remaining two 4x ports are internally wired up to two other controllers:

  • an Intel 15d4 USB 3.1 controller.  Since USB 3.1 runs at 10 Gbit/sec, this 15d4 USB controller should support at least two USB 3.1 ports that can talk to the upstream PCH at full speed
  • a Thunderbolt NHI controller.  According to a developer document from Apple, NHI is the native host interface for Thunderbolt and is therefore the true heart of Alpine Ridge.

The presence of the NHI on the PCIe switch is itself kind of interesting; it's not a peripheral device so much as a bridge that allows non-PCIe peripherals to speak native Thunderbolt and still get to the CPU memory via PCIe.  For example, Alpine Ridge also has a DisplayPort interface, and it's likely that DisplayPort signals enter the PCIe subsystem through this NHI controller.

Although Alpine Ridge delivers some impressive I/O and connectivity options, it has some pretty critical architectural qualities that limit its overall performance in a desktop.  Notably,

  • Apple recently added support for external GPUs that connect to MacBooks through Thunderbolt 3.  While this sounds really awesome in the sense that you could turn a laptop into a gaming computer on demand, note that the best bandwidth you can get between an external GPU and the system memory is about 4 GB/sec, or the performance of a single PCIe 3.0 4x link.  This pales in comparison to the 16 GB/sec bandwidth available to the AMD Radeon which is directly attached to the CPU's PCIe controller in the iMac.
  • Except in the cases where Thunderbolt-attached peripherals are talking to each other via DMA, they appear to all compete with each other for access to the host memory through the single PCIe 4x upstream link.  4 GB/sec is a lot of bandwidth for most peripherals, but this does mean that an external GPU and a USB 3.1 external SSD or a 4K display will be degrading each other's performance.

In addition, Thunderbolt 3 advertises 40 Gbit/sec performance, but PCIe 3.0 4x only provides 32 Gbit/sec.  Thus, it doesn't look like you can actually get 40 Gbit/sec from Thunderbolt all the way to system memory under any conditions; the peak Thunderbolt performance is only available between Thunderbolt peripherals.
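
To put rough numbers on that competition, here is a small sketch that compares what a set of Thunderbolt-attached devices might ask for against what the 4x upstream link can carry.  The device names and demand figures are made-up illustrations rather than measurements, and scaling every device back proportionally is purely an assumption about arbitration, not documented Alpine Ridge behavior.

```python
# Illustrative model of devices sharing Alpine Ridge's single PCIe 3.0 x4 uplink.
UPSTREAM_GBIT = 8.0 * (128.0 / 130.0) * 4   # ~31.5 Gbit/s after 128b/130b encoding


def share_upstream(demands_gbit: dict, capacity_gbit: float = UPSTREAM_GBIT) -> dict:
    """Scale each device's demand down proportionally if the total exceeds capacity.

    Proportional sharing is an assumption made for illustration only.
    """
    total = sum(demands_gbit.values())
    if total <= capacity_gbit:
        return dict(demands_gbit)
    scale = capacity_gbit / total
    return {name: demand * scale for name, demand in demands_gbit.items()}


# Hypothetical demands in Gbit/s; these are not measurements.
demands = {"external GPU": 30.0, "USB 3.1 SSD": 10.0}
for name, granted in share_upstream(demands).items():
    print(f"{name}: wants {demands[name]:.1f}, gets ~{granted:.1f} Gbit/s")

# Even a lone device running at Thunderbolt 3's advertised 40 Gbit/s is capped by the uplink.
print(f"uplink / Thunderbolt 3 line rate = {UPSTREAM_GBIT / 40.0:.0%}")
```

Whatever the real arbitration policy is, the total granted can never exceed the uplink capacity, which is the point the list above is making.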

Overall Performance Implications

The way I/O in the iMac is connected definitely introduces a lot of performance bottlenecks that would make this a pretty scary building block for a supercomputer.  The fact that the Alpine Ridge's PCIe switch has a 4:1 taper to the PCH, and the PCH then further tapers all of its peripherals to a single 4x link to the CPU, introduces a lot of cases where performance of one component (for example, the NVMe SSD) can depend on what another device (for example, a USB 3.1 peripheral) is doing.  The only component which does not compromise on performance is the Radeon GPU, which has a direct connection to the CPU and its memory; this is how all I/O devices in typical HPC nodes are connected.
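
One way to reason about this taper is to treat each device's route to the CPU as a chain of links and take the narrowest one.  The sketch below hard-codes nominal link speeds in GB/s using the round figures quoted in this post; the table layout and names are hypothetical, and it deliberately ignores contention between devices that share a link.

```python
# Nominal one-way link speeds in GB/s, using the round numbers from this post.
LINK_GBYTES = {
    ("USB 3.1 device", "Alpine Ridge"): 1.25,   # 10 Gbit/s USB 3.1
    ("Alpine Ridge", "PCH"): 4.0,               # PCIe 3.0 x4
    ("NVMe SSD", "PCH"): 4.0,                   # PCIe 3.0 x4
    ("PCH", "CPU"): 4.0,                        # DMI 3.0, roughly PCIe 3.0 x4
    ("Radeon GPU", "CPU"): 16.0,                # PCIe 3.0 x16
}

# Each component's next hop towards the CPU.
PARENT = {child: parent for child, parent in LINK_GBYTES}


def bottleneck_to_cpu(device: str) -> float:
    """Return the slowest link (GB/s) on the path from a device to the CPU."""
    slowest = float("inf")
    node = device
    while node != "CPU":
        parent = PARENT[node]
        slowest = min(slowest, LINK_GBYTES[(node, parent)])
        node = parent
    return slowest


for dev in ("Radeon GPU", "NVMe SSD", "USB 3.1 device"):
    print(f"{dev}: at most ~{bottleneck_to_cpu(dev)} GB/s to the CPU")
```

Every path except the GPU's passes through the PCH-to-CPU link, which is why those devices can affect one another.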

With all that being said, the iMac's I/O subsystem is a great design for its intended use.  It effectively trades peak I/O performance for extreme I/O flexibility; whereas a typical HPC node would ensure enough bandwidth to operate an InfiniBand adapter at full speed while simultaneously transferring data to a GPU, it wouldn't support plugging in a USB 3.1 hard drive or a 4K monitor.

Plugging USB 3 hard drives into an HPC node is surprisingly annoying.  I've had to do this for bioinformaticians, and it involves installing a discrete PCIe USB 3 controller alongside high-bandwidth network controllers.

Curiously, as I/O becomes an increasingly prominent bottleneck in HPC though, we are beginning to see very high-performance and exotic I/O devices entering the market.  For example, IBM's BlueLink is able to carry a variety of protocols at extreme speeds directly into the CPU, and NVLink over BlueLink is a key technology enabling scaled-out GPU nodes in the OpenPOWER ecosystem.  Similarly, sophisticated PCIe switches are now proliferating to meet the extreme on-node bandwidth requirements of NVMe storage nodes.

Ultimately though, PCH and Thunderbolt aren't positioned well to become HPC technologies.  If nothing else, I hope this breakdown helps illustrate how performance, flexibility, and cost drive the system design decisions that make desktops quite different from what you'd see in the datacenter.

Appendix: Deciphering the PCIe Topology

Figuring out everything I needed to write this up involved a little bit of anguish.  For the interested reader, here's exactly how I dissected my iMac to figure out how its I/O subsystem was plumbed.

Foremost, I had to boot my iMac into Linux to get access to dmidecode and lspci since I don't actually know how to get at all the detailed device information from macOS.  From this,

ubuntu@ubuntu:~$ lspci -t -v
-[0000:00]-+-00.0  Intel Corporation Device 591f
           +-01.0-[01]--+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480]
           |            \-00.1  Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
           +-14.0  Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller
           +-16.0  Intel Corporation Sunrise Point-H CSME HECI #1
           +-17.0  Intel Corporation Sunrise Point-H SATA controller [AHCI mode]
           +-1b.0-[02]----00.0  Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
           +-1c.0-[03]----00.0  Broadcom Limited BCM43602 802.11ac Wireless LAN SoC
           +-1c.1-[04]--+-00.0  Broadcom Limited NetXtreme BCM57766 Gigabit Ethernet PCIe
           |            \-00.1  Broadcom Limited BCM57765/57785 SDXC/MMC Card Reader
...

we see a couple of notable things right away:

  • there's a single PCIe domain, numbered 0000
  • everything branches off of PCIe bus number 00
  • there are a bunch of PCIe bridges hanging off of bus 00 (which connect to bus numbers 01, 02, etc.)
  • there are a bunch of PCIe devices hanging off both bus 00 and the other buses, such as device 0000:00:14 (a USB 3.0 controller) and device 0000:01:00 (the AMD/ATI GPU)
  • at least one device (the GPU) has multiple PCIe functions (0000:01:00.0, a video output, and 0000:01:00.1 an HDMI audio output)
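
The domain:bus:device.function addressing used above is regular enough to pick apart mechanically.  Here is a minimal sketch that does so; the PciAddress type and parse_pci_address helper are names made up for illustration.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


# Matches strings such as "0000:01:00.1"; every field is hexadecimal.
_ADDRESS_RE = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_pci_address(text: str) -> PciAddress:
    """Split a full PCI address into its domain, bus, device, and function."""
    match = _ADDRESS_RE.match(text)
    if match is None:
        raise ValueError(f"not a PCI address: {text!r}")
    return PciAddress(*(int(field, 16) for field in match.groups()))


gpu_audio = parse_pci_address("0000:01:00.1")
print(gpu_audio)  # the HDMI audio function of the GPU on bus 01
```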

But lspci -t -v actually doesn't list everything that we know about.  For example, we know that there are bridges that connect bus 00 to the other buses, but we need to use lspci -vD to actually see the information those bridges provide to the OS:

ubuntu@ubuntu:~$ lspci -vD
0000:00:00.0 Host bridge: Intel Corporation Device 591f (rev 05)
        DeviceName: SATA
        Subsystem: Apple Inc. Device 0180
        ...
0000:00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05) (prog-if 00 [Normal decode])
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
        ...
        Kernel driver in use: pcieport
0000:00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31) (prog-if 30 [XHCI])
        Subsystem: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller
        ...
        Kernel driver in use: xhci_hcd
0000:01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c0) (prog-if 00 [VGA controller])
        Subsystem: Apple Inc. Ellesmere [Radeon RX 470/480]
        ...
        Kernel driver in use: amdgpu
0000:01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf0
        ...
        Kernel driver in use: snd_hda_intel

This tells us more useful information:

  • Device 0000:00:00 is the PCIe host bridge--this is the endpoint that all PCIe devices use to talk to the CPU and, by extension, system memory (since the system memory controller lives on the same on-chip network that the PCIe controller and the CPU cores do)
  • The PCIe bridge connecting bus 00 and bus 01 (0000:00:01) is integrated into the PCIe controller on the CPU.  In addition, the PCI ID for this bridge is the same as the one used on Intel Skylake processors--not surprising, since Kaby Lake is an optimization (not re-architecture) of Skylake.
  • The two PCIe functions on the GPU--0000:01:00.0 and 0000:01:00.1--are indeed a video interface (as evidenced by the amdgpu driver) and an audio interface (snd_hda_intel driver).  Their bus id (01) also indicates that they are directly attached to the Kaby Lake processor's PCIe controller--and therefore enjoy the lowest latency and highest bandwidth available to system memory.

Finally, the Linux kernel's sysfs interface provides a very straightforward view of every PCIe device's connectivity by presenting them as symlinks:

ubuntu@ubuntu:/sys/bus/pci/devices$ ls -l
... 0000:00:00.0 -> ../../../devices/pci0000:00/0000:00:00.0
... 0000:00:01.0 -> ../../../devices/pci0000:00/0000:00:01.0
... 0000:00:14.0 -> ../../../devices/pci0000:00/0000:00:14.0
... 0000:00:16.0 -> ../../../devices/pci0000:00/0000:00:16.0
... 0000:00:17.0 -> ../../../devices/pci0000:00/0000:00:17.0
... 0000:00:1b.0 -> ../../../devices/pci0000:00/0000:00:1b.0
... 0000:00:1c.0 -> ../../../devices/pci0000:00/0000:00:1c.0
... 0000:00:1c.1 -> ../../../devices/pci0000:00/0000:00:1c.1
... 0000:00:1c.4 -> ../../../devices/pci0000:00/0000:00:1c.4
... 0000:00:1f.0 -> ../../../devices/pci0000:00/0000:00:1f.0
... 0000:00:1f.2 -> ../../../devices/pci0000:00/0000:00:1f.2
... 0000:00:1f.3 -> ../../../devices/pci0000:00/0000:00:1f.3
... 0000:00:1f.4 -> ../../../devices/pci0000:00/0000:00:1f.4
... 0000:01:00.0 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0
... 0000:01:00.1 -> ../../../devices/pci0000:00/0000:00:01.0/0000:01:00.1
... 0000:02:00.0 -> ../../../devices/pci0000:00/0000:00:1b.0/0000:02:00.0
... 0000:03:00.0 -> ../../../devices/pci0000:00/0000:00:1c.0/0000:03:00.0
... 0000:04:00.0 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.0
... 0000:04:00.1 -> ../../../devices/pci0000:00/0000:00:1c.1/0000:04:00.1
... 0000:05:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0
... 0000:06:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0
... 0000:06:01.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:01.0
... 0000:06:02.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0
... 0000:06:04.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:04.0
... 0000:07:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0/0000:07:00.0
... 0000:08:00.0 -> ../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:02.0/0000:08:00.0

This topology, combined with the lspci outputs above, reveals that most of the I/O peripherals are either directly provided by or hang off of the Sunrise Point chip (through root ports 0000:00:1b.0 and 0000:00:1c.{0,1,4}).  There is another fan-out of PCIe ports hanging off of the Alpine Ridge chip (the bus 06 ports behind 0000:00:1c.4), and what's not shown are the native Thunderbolt (NHI) connections, such as DisplayPort, on the other side of the Alpine Ridge.  Although I haven't looked very hard, I did not find a way to enumerate these Thunderbolt NHI devices.
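
The same nesting can be recovered programmatically, since each symlink target spells out the chain of bridges leading to a device.  The sketch below works on a handful of target strings copied from the listing above rather than reading /sys directly; the helper names are hypothetical.

```python
from collections import defaultdict

# A few symlink targets copied from the listing above.
TARGETS = [
    "../../../devices/pci0000:00/0000:00:01.0",
    "../../../devices/pci0000:00/0000:00:01.0/0000:01:00.0",
    "../../../devices/pci0000:00/0000:00:1c.4",
    "../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0",
    "../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0",
    "../../../devices/pci0000:00/0000:00:1c.4/0000:05:00.0/0000:06:00.0/0000:07:00.0",
]


def build_children(targets):
    """Map each device (or the root complex) to the devices directly below it."""
    children = defaultdict(list)
    for target in targets:
        # Keep only the path components from the root complex ("pci0000:00") down.
        parts = target.split("/")
        chain = parts[parts.index("devices") + 1:]
        parent, child = chain[-2], chain[-1]
        if child not in children[parent]:
            children[parent].append(child)
    return dict(children)


def print_tree(children, node="pci0000:00", depth=0):
    """Print the hierarchy with two spaces of indentation per level."""
    print("  " * depth + node)
    for child in children.get(node, []):
        print_tree(children, child, depth + 1)


print_tree(build_children(TARGETS))
```

Feeding it the full set of targets would reproduce the whole hierarchy, with the devices behind 0000:00:1c.4 nested several levels deep.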

There remain a few other open mysteries to me as well; for example, lspci -vv reveals the PCIe lane width of most PCIe-attached devices, but it does not obviously display the maximum lane width for each connection.  Furthermore, the USB, HECI, SATA, and LPC bridges hanging off the Sunrise Point do not list a lane width at all, so I still don't know exactly what level of bandwidth is available to these bridges.
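
One avenue that might help here, untested on this iMac: reasonably recent Linux kernels expose current_link_width and max_link_width attribute files for PCIe devices under /sys/bus/pci/devices.  Assuming those files are present, a sketch like the following would tabulate negotiated versus maximum width; the function name is made up, and devices that lack either attribute are simply skipped.

```python
import os


def read_link_widths(sysfs_root="/sys/bus/pci/devices"):
    """Return {address: (current_width, max_width)} for devices exposing both files."""
    widths = {}
    if not os.path.isdir(sysfs_root):
        return widths
    for address in sorted(os.listdir(sysfs_root)):
        values = []
        for attribute in ("current_link_width", "max_link_width"):
            path = os.path.join(sysfs_root, address, attribute)
            try:
                with open(path) as handle:
                    values.append(int(handle.read().strip()))
            except (OSError, ValueError):
                break  # attribute missing or unreadable: skip this device
        if len(values) == 2:
            widths[address] = (values[0], values[1])
    return widths


for address, (current, maximum) in read_link_widths().items():
    print(f"{address}: x{current} negotiated, x{maximum} maximum")
```

Devices that never appear in the output would be the ones that report no lane width at all.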

If anyone knows more about how to peel back the onion on some of these bridges, or if I'm missing any important I/O connections between the CPU, PCH, or Alpine Ridge that are not enumerated via PCIe, please do let me know!  I'd love to share the knowledge and make this more accurate if possible.