Commercially deployable AI compute has enabled applications like advanced driver assistance, robotics, smart manufacturing, and medical imaging to take exciting steps forward. AI workloads, initially better suited to GPU-type processors than to conventional CPUs, have driven the development of new, optimized architectures, including highly parallelized structures in programmable devices.
Accelerators based on programmable, adaptive compute have been deployed in the cloud to execute AI algorithms efficiently at high speed. As AI is increasingly adopted in autonomous systems that must deliver real-time response and determinism, relying on cloud connectivity to deliver intelligence to edge applications is becoming less viable. Instead, moving intelligence to the edge reduces latency, enhances mission-critical reliability, and increases data privacy.
Intensive edge-AI compute comes with its own challenges, however. Power consumption is usually critical, there are often thermal constraints, and human safety must be assured. Moreover, as well as handling AI processing locally, an industrial robot or autonomous drive system must also manage communications and safety mechanisms, handle input-data streams from multiple sensor channels such as cameras, inertial sensors, and proximity/distance sensors, and manage interactions with other peripherals such as a display. This calls for hardware architectures that enable whole-application acceleration.
Automotive and Industrial Demands
In the automotive sector, advanced autonomous-drive systems in today’s market are typically Level 3 (SAE J3016) systems such as park assist and highway autopilot. These are already highly complex, with multiple sensors including LiDAR, radar, and external cameras for forward/rear-looking and surround-view coverage, sometimes with additional cameras and time-of-flight sensing for driver monitoring. The next generations of driver-assistance systems and, eventually, self-driving vehicles will require the integration of yet more sensing channels for contextual awareness while also demanding increased compute, lower power, smaller footprint, and a reduced bill of materials.
In industrial robotics, on the other hand, AI brings capabilities such as pattern recognition to assist fast, accurate picking, placement, and inspection. To move freely and safely in unstructured environments, tomorrow’s robots must analyze and anticipate the movements of nearby humans to maximize safety as well as productivity. Real-time response and determinism, along with managing sensors for contextual awareness, motion control, positioning, and vision, will be mission critical as well as safety critical, all while capturing copious data for traceability and long-term trend analysis.
As a category of mobile robots, UAVs are a high-growth opportunity in commercial and defense markets. Vision-based AI is generally key to autonomous operation, which depends on real-time image recognition and object classification to optimize navigation paths. Defense-related applications are adopting AI-augmented tactical communications, with machine-learning techniques such as “human-in-the-loop” assisting decision making. These, and drones for emergency services, will also use advanced cognitive RF to autonomously self-reconfigure for any frequency band, modulation, and access standard, leveraging AI to adapt and learn.
Beyond AI processing itself, handling the multiple sensing modalities, such as radio, radar, and vision, needed for detailed situational awareness is potentially the greatest performance bottleneck. Adaptable compute can meet these needs and host the complete system, resulting in lower size, weight, and power.
Accelerating Edge AI
The processing architectures that have delivered high-performance AI in the cloud can be harnessed to enable these new intelligent edge devices. Xilinx Versal® (figure 1) brings together multiple differently optimized processing engines on one chip, with memory, peripherals, and platform-wide high-bandwidth interconnect.
One of Versal’s most important innovations is the AI-Engine array (figure 2), which brings together hundreds of vector processing units (VPUs).
Each VPU has RAM located directly alongside it, allowing low-latency access across a wide-bandwidth interface. Neighboring VPUs can share this RAM, creating a cache-less memory hierarchy that can be flexibly allocated. An interconnect with bandwidth equivalent to tens of terabytes per second links the VPUs. Together, the vector core array, distributed memory, and flexible interconnect create a software-programmable architecture that is hardware adaptable and delivers high bandwidth and parallelism with low power and latency. Signal-processing throughput is up to ten times that of GPUs when accelerating cloud AI, vision processing, and wireless processing.
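To make the data-locality idea concrete, the sketch below shows, in plain NumPy, how a matrix multiply can be split into tiles small enough to live in a vector core's adjacent RAM, with INT8 inputs accumulated into INT32. The tile size and function names are illustrative assumptions; this is a conceptual model of the dataflow, not the Versal AI Engine programming interface.

```python
# Conceptual sketch only: tiled INT8 matrix multiply with INT32 accumulation,
# illustrating how a workload can be partitioned into per-tile working sets
# that fit in a core's local memory. Not the actual AI Engine API.
import numpy as np

TILE = 32  # hypothetical tile edge chosen to fit a core's local data memory

def tiled_int8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Multiply INT8 matrices tile by tile, accumulating in INT32."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    c = np.zeros((m, n), dtype=np.int32)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE), dtype=np.int32)  # per-tile accumulator
            for p in range(0, k, TILE):
                # Each (a_tile, b_tile) pair is the kind of working set that
                # would be staged in a core's adjacent RAM, or shared with a
                # neighboring core, rather than fetched from external memory.
                a_tile = a[i:i+TILE, p:p+TILE].astype(np.int32)
                b_tile = b[p:p+TILE, j:j+TILE].astype(np.int32)
                acc += a_tile @ b_tile
            c[i:i+TILE, j:j+TILE] = acc
    return c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(-128, 127, size=(64, 64), dtype=np.int8)
    b = rng.integers(-128, 127, size=(64, 64), dtype=np.int8)
    assert np.array_equal(tiled_int8_matmul(a, b),
                          a.astype(np.int32) @ b.astype(np.int32))
```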
Now, Versal AI Edge adaptive platforms further optimize this architecture to meet the special needs of intelligent edge devices. In particular, the AI Engine-ML (figure 3) is a variation of the AI-Engine, optimized for machine learning.
Compared with the standard Versal AI Engine, the AI Engine-ML natively supports faster and more efficient handling of INT4 integer quantization and BFLOAT16 16-bit floating-point calculations, both common in machine-learning applications. Extra multipliers in the core also double INT8 performance. In addition, extra data memory further improves data localization, and a new memory tile provides up to 38MB across the AI-Engine array with high memory-access bandwidth. Overall, machine-learning compute is increased four-fold and latency is halved.
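For readers less familiar with these reduced-precision formats, the sketch below shows the textbook arithmetic: affine (scale plus zero-point) quantization of FP32 values to INT8, and BFLOAT16 obtained by truncating an FP32 value to its upper 16 bits. It illustrates the number formats only; it is not the hardware's implementation or any vendor API.

```python
# Illustrative reduced-precision arithmetic: affine INT8 quantization and
# BFLOAT16-by-truncation. Purely a numerical sketch of the formats involved.
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine (scale + zero-point) quantization of FP32 values to INT8."""
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-x.min() / scale) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

def to_bfloat16(x: np.ndarray) -> np.ndarray:
    """Truncate FP32 to BFLOAT16: keep the sign, the 8-bit exponent, and the
    top 7 mantissa bits, i.e. the upper 16 bits of the IEEE-754 single."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

if __name__ == "__main__":
    x = np.linspace(-1.0, 1.0, 8, dtype=np.float32)
    q, s, zp = quantize_int8(x)
    print("max INT8 quantization error:",
          np.abs(dequantize_int8(q, s, zp) - x).max())
    print("BFLOAT16 approximation:", to_bfloat16(x))
```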
The AIE-ML can handle heterogeneous workloads as well as machine learning, while the existing AI Engines’ native 32-bit processing accelerates arithmetic such as INT32 and FP32. This maintains balanced performance across diverse workloads like beamforming in ultrasound applications, and LiDAR/radar processing.
Interactions with memory are a notorious performance bottleneck in processing systems. The new edge-focused platform addresses this by adding accelerator RAM, a 4MB block with a high-bandwidth interface that is accessible to all processing engines on the chip. This can save the AI Engines from storing critical neural-network data in external DDR RAM during high-speed inference, reducing power consumption and latency, and it lets the real-time processors keep safety-critical code on-chip for lower latency and safer operation.
Designers, whether seasoned FPGA developers or data scientists with little experience of designing hardware, can use these devices with their preferred tools, languages such as C or Python, and AI frameworks like Caffe, TensorFlow, and PyTorch. There are large numbers of AI and vision libraries to run on Versal Adaptable Engines or Intelligent Engines, support for Robot Operating System (ROS) and ROS 2, and support for safety-critical design with RTOSes and hypervisors such as QNX, VxWorks, and Xen, catering for functional-safety standards such as ISO 26262 (automotive), IEC 61508 (industrial), and DO-254 (defense avionics). Design to meet the IEC 62443 industrial cybersecurity standards is also supported.
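As a concrete illustration of that framework-level starting point, the generic PyTorch snippet below defines a small toy model and applies the framework's own post-training dynamic quantization to INT8. The model and class names are hypothetical, and the device-specific compilation and deployment steps are deliberately not shown; this is only the standard-framework entry point the text refers to, not a vendor tool flow.

```python
# Minimal, generic PyTorch sketch: a small vision-style model plus
# framework-level post-training dynamic quantization to INT8. The
# device-specific compile/deploy steps are not shown here.
import torch
import torch.nn as nn

class SmallClassifier(nn.Module):  # hypothetical toy model
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = SmallClassifier().eval()

# Post-training dynamic quantization: linear layers get INT8 weights.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    logits = quantized(torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```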
With its distributed architecture, the family is also scalable and can address applications from edge sensors and endpoints to CPU accelerators. Devices scale from eight AI Engines, drawing less than 10W total power, to over 300 AI Engines that deliver over 400 TOPS of INT4 throughput, in addition to Adaptable Engines and DSP Engines.
Whole-Application Acceleration
To understand the effects of all this on system performance, let’s consider the ADAS example described earlier, implemented on three Zynq devices. A single mid-range Versal AI Edge IC can host the same application (figure 4) and deliver 17.4 TOPS, more than four times the processing throughput, to handle increased camera resolution and extra video channels within a comparable power budget of 20W. The IC and associated external circuitry occupy 529mm², more than 58% less area.
The various workloads can be assigned appropriately among the available engines, using the Adaptable Engines for sensor fusion and pre-processing, including time-of-flight ranging, radar and LiDAR point-cloud analysis, vision image conditioning, environment characterization, and data conditioning such as tiling. Simultaneously, the Scalar Engine can focus on decision making and vehicle control.
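One way to picture such a partitioning is as a staged dataflow in which each stage is tagged with the engine type it would conceptually target. The Python pseudostructure below mirrors the assignments just described; the stage names, the Stage/Pipeline helpers, and the assumption that the AI Engines run the inference stage are all illustrative, not a vendor toolchain API.

```python
# Purely illustrative pseudostructure: the workload split described above,
# expressed as a staged pipeline tagged by conceptual engine type.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Stage:
    name: str
    target: str                      # "Adaptable", "AI", or "Scalar" engines
    run: Callable[[Any], Any]        # placeholder processing function

PIPELINE = [
    Stage("tof_ranging",        "Adaptable", lambda frame: frame),
    Stage("point_cloud",        "Adaptable", lambda frame: frame),
    Stage("image_conditioning", "Adaptable", lambda frame: frame),
    Stage("tiling",             "Adaptable", lambda frame: frame),
    Stage("inference",          "AI",        lambda frame: frame),  # assumed mapping
    Stage("decision_control",   "Scalar",    lambda frame: frame),
]

def process(frame):
    """Push one fused sensor frame through the staged pipeline."""
    for stage in PIPELINE:
        frame = stage.run(frame)
    return frame
```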
As a further advantage of this adaptive compute platform, the hardware configuration can be changed on the fly, within milliseconds, to let one device handle multiple functions. An automotive application may load a hardware configuration optimized for features such as lane-departure correction or anti-collision during highway driving, then exchange it for a new configuration such as park assist as the vehicle enters an urban area. After parking, another function could be loaded, perhaps a differentiating feature like climate-control management to protect pets left in the vehicle.
Developers can also establish a highly optimized hardware implementation of their target application, perhaps bringing together solutions for LiDAR and radar sensing, or a front camera, on the same platform, each with custom AI and sensor-processing capabilities. This cannot be achieved with CPUs and GPUs. Moreover, as automakers face rapidly changing requirements, the adaptive platform can be updated quickly over the air, much like applying a software update.
Safety- and Mission-Critical Design
In industrial control and robotics, the Adaptable Engines can handle tasks like perception, control, networking, and navigation, while the AI Engines augment control for dynamic execution and predictive maintenance, and the Scalar Engines handle safety-critical control, user-interface management, and cybersecurity.
In multi-mission and situationally aware UAVs, developers can use the Adaptable Engines for sensor fusion, waveform processing, and signal conditioning. The Intelligent Engines can focus on low-power, low-latency AI and signal conditioning for navigation and target tracking, as well as waveform classification for cognitive RF and spectrum awareness. The Scalar Engines handle command and control and can run in lock-step to ensure safety and security.
A single Versal AI Edge device can support multiple inputs including radio communication, navigation, target-tracking radar, and electro-optical and infrared sensors for visual reconnaissance (figure 5). Intensive sensor fusion like this would be difficult to achieve with a typical machine-learning processor.
Conclusion
AI has a central role in enabling the next generations of autonomous systems. Safety and performance depend on real-time, deterministic compute, while size, weight, and power are tightly constrained. Adaptive compute, proven in cloud AI applications, addresses the challenge with edge-optimized machine learning and whole-application acceleration.