

OpenCAPI Unveiled: AMD, IBM, Google, Xilinx, Micron and Mellanox Join Forces in the Heterogeneous Computing Era

Some of you may remember AMD announcing its “Torrenza” technology 10 years ago. The idea was to offer a fast and coherent interface between the CPU and various types of “accelerators” (via HyperTransport). It was one of the first initiatives to enable “heterogeneous computing”.

We now have technology that could be labeled “heterogeneous computing”, the most popular form being GPU computing. There have also been encryption, compression and network accelerators, but the advantages of those accelerators were never really clear, as shifting data back and forth to the CPU was in many cases less efficient than letting the CPU process it with optimized instructions. In the professional world, heterogeneous computing was mostly limited to HPC; in the consumer world it was a “nice to have”.

But times are changing. The sensors of the Internet of Things, the semantic web and the good old WWW are creating a massive and exponentially growing flood of data that cannot be stored and analyzed by traditional means. Machine learning offers a way of classifying all that data and finding patterns “automatically”, and as a result we have witnessed a “machine learning renaissance” with quite a few breakthroughs. Google had to deal with this flood years before most other companies, and has released some of the Google Brain team’s AI breakthroughs into the open source world, one example being “TensorFlow”. And when Google releases important technology as open source, we have to pay attention: when Google published the Google File System and BigTable papers in the mid-2000s, for example, the big data revolution of Hadoop, HDFS and NoSQL databases erupted shortly afterwards.

Big Data thus needs big brains: we need more processing power than ever. As Moore’s law is dead (the end of CMOS scaling), we cannot expect much from process technology advancements. The processing power has to come from ASICs (see Google’s TPU), FPGAs (see Microsoft’s Project Catapult) and GPUs.

Those accelerators need a new “Torrenza technology”: a fast, coherent interconnect to the CPU. NVIDIA was first with NVLink, but an open standard would be even better. IBM, on the other hand, was willing to open up its CAPI interface.

To that end, Google, AMD, Xilinx, Micron and Mellanox have joined forces with IBM to create a “coherent high performance bus interface” based on a new bus standard called the “Open Coherent Accelerator Processor Interface” (OpenCAPI). Capable of a 25 Gbit/s per lane data rate, OpenCAPI outperforms the current PCIe specification, which offers a maximum data transfer rate of 8 Gbit/s per PCIe 3.0 lane. We expect the total bandwidth to be a lot higher for quite a few OpenCAPI devices, as OpenCAPI lanes will be bundled together.
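
As a rough back-of-the-envelope illustration of that per-lane gap, here is a small C++ sketch. The x8 bundle width is our assumption for illustration (the announcement only says lanes will be bundled), and both figures are raw signaling rates, ignoring encoding and protocol overhead.

```cpp
// Back-of-the-envelope link bandwidth comparison (raw rates, no overhead).
#include <cstdio>

int main() {
    const double opencapi_lane_gbps = 25.0; // per the announcement
    const double pcie3_lane_gbps    = 8.0;  // PCIe 3.0 per-lane rate
    const int lanes = 8;                    // hypothetical bundle width

    std::printf("OpenCAPI x%d: %.0f Gbit/s (%.1f GB/s)\n",
                lanes, opencapi_lane_gbps * lanes,
                opencapi_lane_gbps * lanes / 8.0);   // 200 Gbit/s, 25 GB/s
    std::printf("PCIe 3.0 x%d: %.0f Gbit/s (%.1f GB/s)\n",
                lanes, pcie3_lane_gbps * lanes,
                pcie3_lane_gbps * lanes / 8.0);      // 64 Gbit/s, 8 GB/s
    return 0;
}
```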

It is a win-win for everybody besides Intel. It is clear now that IBM’s OpenPOWER initiative is gaining a lot of traction and that IBM is deadly serious about offering an alternative to the Intel-dominated datacenter. IBM will implement the OpenCAPI interface in its POWER9 servers in 2017. Those POWER9s will not only have a very fast interface to NVIDIA GPUs (via NVLink), but also to Google’s ASICs and Xilinx FPGA accelerators.

Meanwhile, this benefits AMD, as it gains access to an NVLink alternative for linking its Radeon GPUs to the upcoming Zen-based server processors. Micron can link faster (and more profitable than DRAM) memory to the CPU. Mellanox can do the same for networking. OpenCAPI is even more important for Xilinx’s FPGAs, as a coherent interface can make FPGAs attractive for a much wider range of applications than today.

And guess what: Dell EMC joined this new alliance just a few days ago. Intel has to come up with an answer…

Update, courtesy of commenter Yojimbo: NVIDIA is a member of the OpenCAPI consortium at the “contributor level”, which is the same level as Xilinx. The same is true for HPE (Hewlett Packard Enterprise).

This is even bigger than we thought. Probably the biggest announcement in the server market this year.

New ARM IP Launched: CMN-600 Interconnect for 128 Cores and DMC-620, an 8Ch DDR4 IMC

You need much more than a good CPU core to conquer the server world. As more cores are added, the way data moves from one part of the silicon to another becomes more important. Today ARM announced a new and faster member of its SoC interconnect IP offering: the CMN-600 (CMN stands for “coherent mesh network”, as opposed to the “cache coherent network” of the CCN series). This is a direct update to the CCN-500 series, which we’ve discussed at AnandTech before.

The idea behind a coherent mesh between cores, as it stands in the ARM server SoC space, is that you can put a number of CPU clusters (e.g. four lots of 4x A53) and accelerators (custom or other IP) into one piece of silicon. Each part of the SoC has to work with everything else, and for that ARM offers a variety of interconnect licences for users who want to choose from ARM’s IP range. For ARM licensees who pick multiple ARM parts, this makes it easier to combine high core counts and accelerators in one large SoC.

The previous generation interconnect, the CCN-512, could support 12 clusters of 4 cores and maintain coherency, allowing for large 48-core chips. The new CMN-600 can support up to 128 cores (32 clusters of 4). As part of the announcement, there is also an “agile system cache”, which is a way for I/O devices to allocate memory and cache lines directly into the L3, reducing I/O latency without having to touch the cores.

Also in the announcement is a new memory controller. The old DMC-520, which was limited to four channels of DDR3, is being superseded by the DMC-620, which supports eight channels of DDR4. Each DMC-620 channel can address up to 1 TB of DDR4, giving a potential 8 TB of memory per SoC.

According to ARM’s simulations, the improved memory controller offers 50% lower latency and up to 5 times more bandwidth. The new DMC is also advertised as supporting DDR4-3200. DDR4 at 3200 MT/s offers twice as much bandwidth as DDR3 at 1600 MT/s, and doubling the number of channels doubles bandwidth again, which explains a 4x increase. It is interesting that ARM claims 5x, which would suggest efficiency improvements as well.
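
To make that arithmetic explicit, here is a quick sketch of the peak-bandwidth math. We assume standard 64-bit (8-byte) DDR channels, and the DDR3-1600 baseline for the DMC-520 is our assumption based on the 1600 MT/s figure above.

```cpp
// Rough peak-bandwidth arithmetic behind the "4x explained, 5x claimed" gap.
// Peak bytes/s per channel = transfer rate (MT/s) * 8 bytes per 64-bit bus.
#include <cstdio>

int main() {
    const double old_mts = 1600.0, new_mts = 3200.0;
    const int old_channels = 4, new_channels = 8;

    double old_gbs = old_mts * 1e6 * 8 * old_channels / 1e9; // DMC-520
    double new_gbs = new_mts * 1e6 * 8 * new_channels / 1e9; // DMC-620

    std::printf("DMC-520 peak: %.1f GB/s\n", old_gbs);       // 51.2 GB/s
    std::printf("DMC-620 peak: %.1f GB/s (%.1fx)\n",
                new_gbs, new_gbs / old_gbs);                  // 204.8 GB/s, 4.0x
    // ARM's 5x claim implies ~25% better achieved-vs-peak efficiency on top.
    return 0;
}
```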

If you double the number of cores and memory controllers, you expect twice as much performance in the almost perfectly scaling SPECint_rate2006. ARM claims that its simulations show 64 A72 cores running 2.5 times faster than 32 A72 cores, courtesy of the improved memory controller. If true, that is quite impressive; by comparison, we did not see such a jump in performance in the Xeon world when DDR3 was replaced by DDR4. Even more impressive is the claim that the maximum compute performance of a 64x A72 SoC can go up by a factor of six compared to a 16x A57 variant. But we must note that the A57 was not exactly a success in the server world: so far only AMD has cooked up a server SoC with it, and it was slower and more power hungry than the much older Atom C2000.
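
As a quick sanity check on that claim: doubling the cores accounts for a 2x speedup on a near-perfectly scaling workload, so the remainder of ARM’s 2.5x figure would have to come from per-core gains, presumably the memory subsystem.

```cpp
// What does "64 cores run 2.5x faster than 32 cores" imply per core,
// assuming SPECint_rate scales ~linearly with core count?
#include <cstdio>

int main() {
    double core_ratio      = 64.0 / 32.0; // 2x the cores
    double claimed_speedup = 2.5;         // ARM's simulation claim
    double per_core_gain   = claimed_speedup / core_ratio;
    std::printf("Implied per-core throughput gain: %.0f%%\n",
                (per_core_gain - 1.0) * 100.0); // ~25%, credited to the DMC
    return 0;
}
```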

We have little doubt we will find the new CMN-600 and/or DMC-620 in many server solutions. The big question will be one of application: who will use this interconnect technology in their server SoCs? As most licensees do not disclose this information, it is hard to find out. As far as we know, Cavium uses its own interconnect technology, which would suggest Qualcomm or Avago/Broadcom are the most likely candidates.

Xilinx Launches Cost-Optimized Portfolio: New Spartan, Artix and Zynq Solutions

Some of the key elements of the embedded market are cost, power and efficiency. As complexity grows, a number of embedded vision and IoT applications rely on adding fixed-function or variable-function hardware to accelerate throughput, often spread across multiple FPGA/SoC devices. The FPGA market is large; however, Xilinx is in the process of redefining its product ecosystem to include SoCs with FPGAs built in: silicon with both general purpose ARM processors (Cortex-A series) and programmable logic to handle algorithm acceleration, especially when it comes to sensor fusion, programmable I/O and low-cost devices. As a result, Xilinx is releasing a new single-core Zynq-7000 series SoC with an embedded FPGA, as well as new Spartan-7 and Artix-7 FPGAs focused on low cost.

The new Spartan-7, built on TSMC’s 28nm process and measuring 8x8 mm, is aimed at sensor fusion and connectivity, while the Artix-7 is aimed at the signal processing market. The Zynq-7000 is for the programmable SoC space, featuring a single ARM core (various models, starting with the Cortex-A9 and moving up to dual/quad Cortex-A53), allowing for on-chip analytics as well as host-less implementations and bitstream encryption. The all-in-one SoC with onboard FPGA brings a multi-chip floorplan down from three chips to one, with the potential to reduce power and offer improved security by keeping the inter-chip connections on silicon. While the Zynq family isn’t new, the 7000 series in this announcement is aimed squarely at embedded, integrated and industrial IoT platforms by providing a cost-optimized solution.

We spoke with Xilinx’s Steve Glaser, SVP of Corporate Strategy, who explained that Xilinx wants to be in a prime position to tackle four key areas: cloud, embedded vision, industrial IoT and 5G. The cloud use case covers highly focused workloads that land between ASICs and general purpose compute, but also networking (InfiniBand) and storage, with Xilinx IP in interfaces, SSD controllers, smart networking, video conversion and big data. This coincides with the announcement of the CCIX Consortium for coherent interconnects to accelerators.

Embedded vision is a big part of Xilinx’s vision of the future, particularly in automotive and ADAS. Part of this involves machine learning, and the ability to apply different implementations on programmable silicon as the algorithms adapt and change over time. Xilinx cites a considerable performance and efficiency benefit over GPU solutions, and a wider range of applicability than fixed-function hardware.

Industrial IoT (I-IoT) spans medical, factory, surveillance, robotics, transportation and other typical industry verticals where monitoring and programmability go hand in hand. Steve Glaser cited an 80% Xilinx market share in I-IoT, quoting billions of dollars in savings industry-wide from very small efficiency gains on the production line.

One thing worth noting is that FPGA and mixed SoC/FPGA implementations require substantial software on top to operate effectively. Xilinx supports all the major computer vision and neural network frameworks, and we were told it aims to streamline compilation with simple pragmas that identify code structures for FPGA implementation. This is where the mixed SoC/FPGA implementations work best, we are told, allowing the analytics on the ARM cores to adjust the FPGA on the fly as required, depending on sensor input or algorithm changes.
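
As a minimal sketch of what such pragma-guided code can look like, consider a fixed-size dot product annotated for high-level synthesis. The pragma spellings below follow Vivado HLS conventions and are illustrative, not a verified recipe for the SDx flow described above.

```cpp
// Pragma-annotated C++ for an HLS flow: a fixed-size dot product that the
// tool can turn into a pipelined hardware datapath. To a normal compiler
// the unknown pragmas are ignored and this is ordinary C++.
#define N 64

int dot_product(const int a[N], const int b[N]) {
    int acc = 0;
dot_loop:
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1 // accept a new loop iteration every clock cycle
        acc += a[i] * b[i];
    }
    return acc;
}
```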

Xilinx sits in the position of being a hardware provider for a solution, but not the end-goal solution provider. Its customers are the ones that implement what we see in automotive or industrial settings, so Xilinx typically discusses its hardware at a very general level; it still takes an understanding of the markets it focuses on to discuss which applications may benefit from FPGA or mixed SoC/FPGA implementations. Any journalist visiting an event about IoT or compute will run into some discussion of FPGA implementations and the transition from hardware to software to product. Xilinx is confident about its position in the FPGA market, and Intel’s acquisition of Altera has raised a lot of questions about FPGA roadmaps in the markets where the two companies used to compete, with a number of big customers now willing to work on both sides of the fence to keep their options open.

As for the new cost-optimized chips, the Spartan-7, Artix-7 and Zynq-7000 will be supported in the 2016.3 release of the Vivado Design Suite and the Xilinx SDx environments later this year, with production devices shipping in Q1 2017.