Vik


Razer Updates The DeathAdder Elite Gaming Mouse

Razer Updates The DeathAdder Elite Gaming Mouse

Although Razer has become one of the well known gaming computer companies, they got their start with gaming mice, and today Razer is launching their next iteration of the best selling gaming mouse of all time, the Razer DeathAdder Elite. The DeathAdde…

CEVA Launches Fifth-Generation Machine Learning Image and Vision DSP Solution: CEVA-XM6

CEVA Launches Fifth-Generation Machine Learning Image and Vision DSP Solution: CEVA-XM6

Deep learning, neural networks and image/vision processing is already a large field, however many of the applications that rely on it are still in their infancy. Automotive is the prime example that uses all of these areas, and solutions to the automotive ‘problem’ are require significant understanding and development in both hardware and software – the ability to process data with high accuracy in real-time opens up a number of doors for other machine learning codes, and all that comes afterwards is cost and power. The CEVA-XM4 DSP was aimed at being the first programmable DSP to support deep learning, and the new XM6 IP (along with the software ecosystem) is being launched today under the heading of stronger efficiency, compute, and new patents regarding power saving features.

Playing the IP Game

When CEVA launched the XM4 DSP, with the ability to infer pre-trained algorithms in fixed-point math to a similar (~1%) accuracy as the full algorithms, it won a number of awards from analysts in the field, claiming high performance and power efficiency over competing solutions and the initial progression for a software framework. The IP announcement was back in Q1 2015, with licensees coming on board over the next year and the first production silicon using the IP rolling off the line this year. Since then, CEVA has announced its CDNN2 platform, a one-button compilation tool for trained networks to be converted into suitable code for CEVA’s XM IPs. The new XM6 integrates the previous XM4 features, with improved configurations, access to hardware accelerators, new hardware accelerators, and still retains compatibility with the CDNN2 platform such that code suitable for XM4 can be run on XM6 with improved performance.

CEVA is in the IP business, like ARM, and works with semiconductor licensees that then sell to OEMs. This typically results in a long time-to-market, especially when industries such as security and automotive are moving at a rapid pace. CEVA is promoting the XM6 as a scalable, programmable DSP that can scale across markets with a single code base, while also using additional features to improve power, cost and performance.

The announcement today covers the new XM6 DSP, CEVA’s new set of imaging and vision software libraries, a set of new hardware accelerators and integration into the CDNN2 ecosystem. CDNN2 is a one-button compilation tool, detecting convolution and applying the best methodology for data transfer over the logic blocks and accelerators.

XM6 will support OpenCL and C++ development tools, and the software elements include CEVA’s computer vision, neural network and vision processing libraries with third-party tools as well. The hardware implements an AXI interconnect for the processing parts of the standard XM6 core to interact with the accelerators and memory. Along with the XM6 IP, there are hardware accelerators for convolution (CDNN assistance) allowing lower power fixed function hardware to cope with difficult parts of neural network systems such as GoogleNet, De-Warp for adjusting images taken on fish-eye or distorted lenses (once the distortion of an image is known, the math for the transform is fixed-function friendly), as well as other third party hardware accelerators.

The XM6 promotes two new specific hardware features that will aid the majority of image processing and machine learning algorithms. The first is scatter-gather, or the ability to read values from 32 addresses in L1 cache into vector registers in a single cycle. The CDNN2 compilation tool identifies serial code loading and implements vectorization to allow this feature, and scatter-gather improves data loading time when the data required is distributed through the memory structure. As the XM6 is configurable IP, the size/associativity of the L1 data store is adjustable at the silicon design level, and CEVA has stated that this feature will work with any size L1. The vector registers for processing at this level are 8-wide VLIW implementations, meaning ‘feed the beast’ is even more important than usual.

The second feature is called ‘sliding-window’ data processing, and this specific technique for vision processing has been patented by CEVA. There are many ways to process an image for either processing or intelligence, and typically an algorithm will use a block or tile of pixels at once to perform what it needs to. For the intelligence part, a number of these blocks will overlap, resulting in areas of the image being reused at different parts of the computation. CEVA’s method is to retain that data, resulting in fewer bits being needed in the next step of analysis. If this sounds straightforward (I was doing something similar with 3D differential equation analysis back in 2009), it is, and I was surprised that it had not been implemented in vision/image processing before. Reusing old data (assuming you have somewhere to store it) saves time and saves energy.

CEVA is claiming up to a 3x performance gain in heavy vector workloads for XM6 over XM4, with an average of 2x improvement for like-for-like ported kernels. The XM6 is also more configurable than the XM4 from a code perspective, offering ‘50% more control’.

With the specific CDNN hardware accelerator (HWA), CEVA cites that convolution layers in ecosystems such as GoogleNet consume the majority of cycles. The CDNN HWA takes this code and implements fixed hardware for it with 512 MACs using 16-bit support for up to an 8x performance gain (and 95% utilization). CEVA mentioned that a 12-bit implementation would save die area and cost for a minimal reduction in accuracy, however there are a number of developers requesting full 16-bit support for future projects, hence the choice.

Two of the big competitors for CEVA in this space, for automotive image/visual processing, is MobilEye and NVIDIA, with the latter promoting the TX1 for both training and inference for neural networks. Based on TX1 on a TSMC 20nm Planar process at 690 MHz, CEVA states that their internal simulations give a single XM6 based platform as 25x the efficiency and 4x the speed based on AlexNet and GoogleNet (with the XM6 also at 20nm, even though it will most likely be implemented at 16nm FinFET or 28nm). This would mean, extrapolating the single batch TX1 data published, that XM6 using AlexNet at FP16 can perform 268 images a second compared to 67, at around 800 mW compared to 5.1W. At 16FF, this power number is likely to be significantly lower (CEVA told us that their internal metrics were initially done at 28nm/16FF, but were redone on 20nm for an apples-to-apples with the TX1). It should be noted that TX1 numbers were provided for multi-batch which offered better efficiency over single batch, however other comparison numbers were not provided. CEVA also implements power gating with a DVFS scheme that allows low power modes when various parts of the DSP or accelerators are idle.

Obviously the advantage that NVIDIA has with their solution is availability and CUDA/OpenCL software development, both of which CEVA is attempting to address with one-button software platforms like CDNN2 and improved hardware such as XM6. It will be interesting to see which semiconductor partners and future implementations will combine this image processing with machine learning in the future. CEVA states that smartphones, automotive, security and commercial (drones, automation) applications are prime targets.

CEVA Launches Fifth-Generation Machine Learning Image and Vision DSP Solution: CEVA-XM6

CEVA Launches Fifth-Generation Machine Learning Image and Vision DSP Solution: CEVA-XM6

Deep learning, neural networks and image/vision processing is already a large field, however many of the applications that rely on it are still in their infancy. Automotive is the prime example that uses all of these areas, and solutions to the automotive ‘problem’ are require significant understanding and development in both hardware and software – the ability to process data with high accuracy in real-time opens up a number of doors for other machine learning codes, and all that comes afterwards is cost and power. The CEVA-XM4 DSP was aimed at being the first programmable DSP to support deep learning, and the new XM6 IP (along with the software ecosystem) is being launched today under the heading of stronger efficiency, compute, and new patents regarding power saving features.

Playing the IP Game

When CEVA launched the XM4 DSP, with the ability to infer pre-trained algorithms in fixed-point math to a similar (~1%) accuracy as the full algorithms, it won a number of awards from analysts in the field, claiming high performance and power efficiency over competing solutions and the initial progression for a software framework. The IP announcement was back in Q1 2015, with licensees coming on board over the next year and the first production silicon using the IP rolling off the line this year. Since then, CEVA has announced its CDNN2 platform, a one-button compilation tool for trained networks to be converted into suitable code for CEVA’s XM IPs. The new XM6 integrates the previous XM4 features, with improved configurations, access to hardware accelerators, new hardware accelerators, and still retains compatibility with the CDNN2 platform such that code suitable for XM4 can be run on XM6 with improved performance.

CEVA is in the IP business, like ARM, and works with semiconductor licensees that then sell to OEMs. This typically results in a long time-to-market, especially when industries such as security and automotive are moving at a rapid pace. CEVA is promoting the XM6 as a scalable, programmable DSP that can scale across markets with a single code base, while also using additional features to improve power, cost and performance.

The announcement today covers the new XM6 DSP, CEVA’s new set of imaging and vision software libraries, a set of new hardware accelerators and integration into the CDNN2 ecosystem. CDNN2 is a one-button compilation tool, detecting convolution and applying the best methodology for data transfer over the logic blocks and accelerators.

XM6 will support OpenCL and C++ development tools, and the software elements include CEVA’s computer vision, neural network and vision processing libraries with third-party tools as well. The hardware implements an AXI interconnect for the processing parts of the standard XM6 core to interact with the accelerators and memory. Along with the XM6 IP, there are hardware accelerators for convolution (CDNN assistance) allowing lower power fixed function hardware to cope with difficult parts of neural network systems such as GoogleNet, De-Warp for adjusting images taken on fish-eye or distorted lenses (once the distortion of an image is known, the math for the transform is fixed-function friendly), as well as other third party hardware accelerators.

The XM6 promotes two new specific hardware features that will aid the majority of image processing and machine learning algorithms. The first is scatter-gather, or the ability to read values from 32 addresses in L1 cache into vector registers in a single cycle. The CDNN2 compilation tool identifies serial code loading and implements vectorization to allow this feature, and scatter-gather improves data loading time when the data required is distributed through the memory structure. As the XM6 is configurable IP, the size/associativity of the L1 data store is adjustable at the silicon design level, and CEVA has stated that this feature will work with any size L1. The vector registers for processing at this level are 8-wide VLIW implementations, meaning ‘feed the beast’ is even more important than usual.

The second feature is called ‘sliding-window’ data processing, and this specific technique for vision processing has been patented by CEVA. There are many ways to process an image for either processing or intelligence, and typically an algorithm will use a block or tile of pixels at once to perform what it needs to. For the intelligence part, a number of these blocks will overlap, resulting in areas of the image being reused at different parts of the computation. CEVA’s method is to retain that data, resulting in fewer bits being needed in the next step of analysis. If this sounds straightforward (I was doing something similar with 3D differential equation analysis back in 2009), it is, and I was surprised that it had not been implemented in vision/image processing before. Reusing old data (assuming you have somewhere to store it) saves time and saves energy.

CEVA is claiming up to a 3x performance gain in heavy vector workloads for XM6 over XM4, with an average of 2x improvement for like-for-like ported kernels. The XM6 is also more configurable than the XM4 from a code perspective, offering ‘50% more control’.

With the specific CDNN hardware accelerator (HWA), CEVA cites that convolution layers in ecosystems such as GoogleNet consume the majority of cycles. The CDNN HWA takes this code and implements fixed hardware for it with 512 MACs using 16-bit support for up to an 8x performance gain (and 95% utilization). CEVA mentioned that a 12-bit implementation would save die area and cost for a minimal reduction in accuracy, however there are a number of developers requesting full 16-bit support for future projects, hence the choice.

Two of the big competitors for CEVA in this space, for automotive image/visual processing, is MobilEye and NVIDIA, with the latter promoting the TX1 for both training and inference for neural networks. Based on TX1 on a TSMC 20nm Planar process at 690 MHz, CEVA states that their internal simulations give a single XM6 based platform as 25x the efficiency and 4x the speed based on AlexNet and GoogleNet (with the XM6 also at 20nm, even though it will most likely be implemented at 16nm FinFET or 28nm). This would mean, extrapolating the single batch TX1 data published, that XM6 using AlexNet at FP16 can perform 268 images a second compared to 67, at around 800 mW compared to 5.1W. At 16FF, this power number is likely to be significantly lower (CEVA told us that their internal metrics were initially done at 28nm/16FF, but were redone on 20nm for an apples-to-apples with the TX1). It should be noted that TX1 numbers were provided for multi-batch which offered better efficiency over single batch, however other comparison numbers were not provided. CEVA also implements power gating with a DVFS scheme that allows low power modes when various parts of the DSP or accelerators are idle.

Obviously the advantage that NVIDIA has with their solution is availability and CUDA/OpenCL software development, both of which CEVA is attempting to address with one-button software platforms like CDNN2 and improved hardware such as XM6. It will be interesting to see which semiconductor partners and future implementations will combine this image processing with machine learning in the future. CEVA states that smartphones, automotive, security and commercial (drones, automation) applications are prime targets.

AMD Announces Embedded Radeon E9260 & E9550 - Polaris for Embedded Markets

AMD Announces Embedded Radeon E9260 & E9550 – Polaris for Embedded Markets

While it’s AMD’s consumer products that get the most fanfare with new GPU launches – and rightfully so – AMD and their Radeon brand also have a solid (if quiet) business in the discrete embedded market. Here, system designers utilize discrete video cards for commercial, all-in one products. And while the technology is much the same as on the consumer side, the use cases differ, as do the support requirements. For that reason, AMD offers a separate lineup of products just for this market under the Radeon Embedded moniker.

Now that we’ve seen AMD’s new Polaris architecture launch in the consumer world, AMD is taking the next step by refreshing the Radeon Embedded product lineup to use these new parts. To that end, this morning AMD is announcing two new Radeon Embedded video cards: the E9260 and the E9550. Based on the Polaris 11 and Polaris 10 GPUs respectively, these parts are updating the “high performance” and “ultra-high performance” segments of AMD’s embedded offerings.

AMD Embedded Radeon Discrete Video Cards
  Radeon E9550 Radeon E9260 Radeon E8950 Radeon E8870
Stream Processors 2304 896 2048 768
GPU Base Clock 1.12GHz ? 750MHz 1000MHz
GPU Boost Clock ~1.26GHz ~1.4GHz N/A N/A
Memory Clock 7Gbps GDDR5 7Gbps GDDR5? 6Gbps GDDR5 6Gbps GDDR5
Memory Bus Width 256-bit 128-bit 256-bit 128-bit
VRAM 8GB 4GB 8GB 4GB
Displays 6 5 6 6
TDP Up To 95W Up To 50W 95W 75W
GPU Polaris 10 Polaris 11 Tonga Bonaire
Architecture GCN 4 GCN 4 GCN 1.2 GCN 1.1
Form Factor MXM MXM & PCIe MXM MXM & PCIe

We’ll start things off with the Embedded Radeon E9550, which is the new top-performance card in AMD’s embedded lineup. Based on AMD’s Polaris 10 GPU, this is essentially an embedded version of the consumer Radeon RX 480, offering the same number of SPs at roughly the same clockspeed. This part supersedes the last-generation E8950, which is based on AMD’s Tonga GPU, and is rated to offer around 93% better performance, thanks to the slightly wider GPU and generous clockspeed bump.

The E9550 is offered in a single design, an MXM Type-B card that’s rated for 95W. These embedded-class MXM cards are typically based on AMD’s mobile consumer designs, and while I don’t have proper photos for comparison – AMD’s supplied photos are stock photos of older products – I’m sure it’s the same story here. Otherwise, the card is outfitted with 8GB of GDDR5, like the E8950 before it, and is capable of driving up to 6 displays. Finally, AMD will be offering the card for sale for 3 years, which again is par for the course here for AMD.

Following up behind the E9550 is the E9260, the next step down in the refreshed Embedded Radeon lineup. This card is based on AMD’s Polaris 11 GPU, and is similar to the consumer Radeon RX 460, meaning it’s not quite a fully enabled GPU. Within AMD’s lineup it replaces the E8870, offering 2.5 TFLOPS of single precision floating point performance to the former’s 1.5 TFLOPS. AMD doesn’t list official clockspeeds for this card, but based on the throughput rating this puts its boost clock at around 1.4GHz. The card is paired with 4GB of GDDR5 on a 128-bit bus.

Meanwhile on the power front, the E9260 is being rated for up to 50W. Notably, this is down from the 75W designation of its predecessor, as the underlying Polaris 11 GPU aims for lower power consumption. And unlike its more powerful sibling, the E9260 is being offered in two form factors: an MXM Type-A card, and a half height half length (HHHL) PCIe card. Both cards have identical performance specifications, differing only in their form factor and display options. Both cards can support up to 5 displays, though the PCIe card only has 4 physical outputs (so you’d technically need an MST hub for the 5th). Finally, both versions of the card will be offered by AMD for 5 years, which at this point would mean through 2021.

Moving on, besides the immediate performance benefits of Polaris, AMD is also looking to leverage Polaris’s updated display controller and multimedia capabilities for the embedded market. Of particular note here is support for full H.265 video encoding and decoding, something the previous generation products lacked. And display connectivity is greatly improved too, with both HDMI 2.0 support and DisplayPort 1.3/1.4 support.

The immediate market for these cards will be the same general markets that previous generation products have been pitched at, including digital signage, casino gaming, and medical, all of whom make use of GPUs in various degrees and need parts to be available for a defined period of time. Across all of these markets AMD is especially playing up the 4K and HDR capabilities of the new cards, along of course with overall improved performance.

At the same time however, AMD’s embedded group is also looking towards the future, trying to encourage customers to make better use of their GPUs for compute tasks, a market AMD considers to be in its infancy. This includes automated image analysis/diagnosis, machine learning inferencing to allow a casino machine or digital sign to react to a customer, and GPU beamforming for medical. And of course, AMD always has an eye on VR and AR, though for the embedded market in particular that’s going to be more off the beaten path.

Wrapping things up, AMD tells us that the new Embedded Radeon cards will be shipping in the next quarter. The E9260 will be shipping in production in the next couple of weeks, while the E9550 will be coming towards the end of Q4.

AMD Announces Embedded Radeon E9260 & E9550 - Polaris for Embedded Markets

AMD Announces Embedded Radeon E9260 & E9550 – Polaris for Embedded Markets

While it’s AMD’s consumer products that get the most fanfare with new GPU launches – and rightfully so – AMD and their Radeon brand also have a solid (if quiet) business in the discrete embedded market. Here, system designers utilize discrete video cards for commercial, all-in one products. And while the technology is much the same as on the consumer side, the use cases differ, as do the support requirements. For that reason, AMD offers a separate lineup of products just for this market under the Radeon Embedded moniker.

Now that we’ve seen AMD’s new Polaris architecture launch in the consumer world, AMD is taking the next step by refreshing the Radeon Embedded product lineup to use these new parts. To that end, this morning AMD is announcing two new Radeon Embedded video cards: the E9260 and the E9550. Based on the Polaris 11 and Polaris 10 GPUs respectively, these parts are updating the “high performance” and “ultra-high performance” segments of AMD’s embedded offerings.

AMD Embedded Radeon Discrete Video Cards
  Radeon E9550 Radeon E9260 Radeon E8950 Radeon E8870
Stream Processors 2304 896 2048 768
GPU Base Clock 1.12GHz ? 750MHz 1000MHz
GPU Boost Clock ~1.26GHz ~1.4GHz N/A N/A
Memory Clock 7Gbps GDDR5 7Gbps GDDR5? 6Gbps GDDR5 6Gbps GDDR5
Memory Bus Width 256-bit 128-bit 256-bit 128-bit
VRAM 8GB 4GB 8GB 4GB
Displays 6 5 6 6
TDP Up To 95W Up To 50W 95W 75W
GPU Polaris 10 Polaris 11 Tonga Bonaire
Architecture GCN 4 GCN 4 GCN 1.2 GCN 1.1
Form Factor MXM MXM & PCIe MXM MXM & PCIe

We’ll start things off with the Embedded Radeon E9550, which is the new top-performance card in AMD’s embedded lineup. Based on AMD’s Polaris 10 GPU, this is essentially an embedded version of the consumer Radeon RX 480, offering the same number of SPs at roughly the same clockspeed. This part supersedes the last-generation E8950, which is based on AMD’s Tonga GPU, and is rated to offer around 93% better performance, thanks to the slightly wider GPU and generous clockspeed bump.

The E9550 is offered in a single design, an MXM Type-B card that’s rated for 95W. These embedded-class MXM cards are typically based on AMD’s mobile consumer designs, and while I don’t have proper photos for comparison – AMD’s supplied photos are stock photos of older products – I’m sure it’s the same story here. Otherwise, the card is outfitted with 8GB of GDDR5, like the E8950 before it, and is capable of driving up to 6 displays. Finally, AMD will be offering the card for sale for 3 years, which again is par for the course here for AMD.

Following up behind the E9550 is the E9260, the next step down in the refreshed Embedded Radeon lineup. This card is based on AMD’s Polaris 11 GPU, and is similar to the consumer Radeon RX 460, meaning it’s not quite a fully enabled GPU. Within AMD’s lineup it replaces the E8870, offering 2.5 TFLOPS of single precision floating point performance to the former’s 1.5 TFLOPS. AMD doesn’t list official clockspeeds for this card, but based on the throughput rating this puts its boost clock at around 1.4GHz. The card is paired with 4GB of GDDR5 on a 128-bit bus.

Meanwhile on the power front, the E9260 is being rated for up to 50W. Notably, this is down from the 75W designation of its predecessor, as the underlying Polaris 11 GPU aims for lower power consumption. And unlike its more powerful sibling, the E9260 is being offered in two form factors: an MXM Type-A card, and a half height half length (HHHL) PCIe card. Both cards have identical performance specifications, differing only in their form factor and display options. Both cards can support up to 5 displays, though the PCIe card only has 4 physical outputs (so you’d technically need an MST hub for the 5th). Finally, both versions of the card will be offered by AMD for 5 years, which at this point would mean through 2021.

Moving on, besides the immediate performance benefits of Polaris, AMD is also looking to leverage Polaris’s updated display controller and multimedia capabilities for the embedded market. Of particular note here is support for full H.265 video encoding and decoding, something the previous generation products lacked. And display connectivity is greatly improved too, with both HDMI 2.0 support and DisplayPort 1.3/1.4 support.

The immediate market for these cards will be the same general markets that previous generation products have been pitched at, including digital signage, casino gaming, and medical, all of whom make use of GPUs in various degrees and need parts to be available for a defined period of time. Across all of these markets AMD is especially playing up the 4K and HDR capabilities of the new cards, along of course with overall improved performance.

At the same time however, AMD’s embedded group is also looking towards the future, trying to encourage customers to make better use of their GPUs for compute tasks, a market AMD considers to be in its infancy. This includes automated image analysis/diagnosis, machine learning inferencing to allow a casino machine or digital sign to react to a customer, and GPU beamforming for medical. And of course, AMD always has an eye on VR and AR, though for the embedded market in particular that’s going to be more off the beaten path.

Wrapping things up, AMD tells us that the new Embedded Radeon cards will be shipping in the next quarter. The E9260 will be shipping in production in the next couple of weeks, while the E9550 will be coming towards the end of Q4.