NVIDIA AI Inference Platform

Giant Leaps in Performance and Efficiency for AI Services, from the Data Center to the Network’s Edge


The artificial intelligence revolution surges forward, igniting opportunities for businesses to reimagine how they solve customer challenges. We’re racing toward a future where every customer interaction, every product, and every service offering will be touched and improved by AI. Realizing that future requires a computing platform that can accelerate the full diversity of modern AI, enabling businesses to create new customer experiences, reimagine how they meet—and exceed—customer demands, and cost-effectively scale their AI-based products and services.

While the machine learning field has been active for decades, deep learning (DL) has boomed over the last six years. In 2012, Alex Krizhevsky of the University of Toronto won the ImageNet image recognition competition using a deep neural network trained on NVIDIA GPUs—beating all human expert algorithms that had been honed for decades. That same year, recognizing that larger networks can learn more, Stanford’s Andrew Ng and NVIDIA Research teamed up to develop a method for training networks using large-scale GPU computing systems. These seminal papers sparked the “big bang” of modern AI, setting off a string of “superhuman” achievements. In 2015, Google and Microsoft both beat the best human score in the ImageNet challenge. In 2016, DeepMind’s AlphaGo recorded its historic win over Go champion Lee Sedol and Microsoft achieved human parity in speech recognition.

GPUs have proven to be incredibly effective at solving some of the most complex problems in deep learning, and while the NVIDIA deep learning platform is the standard industry solution for training, its inference capability is not as widely understood. Some of the world’s leading enterprises from the data center to the edge have built their inference solution on NVIDIA GPUs.
Some examples include:

  • SAP’s Brand Impact Service achieved a 40X performance increase while reducing costs by 32X.
  • Bing Visual Search improved latency by 60X and reduced costs by 10X.
  • Cisco’s Spark Board and Spark Room Kit, powered by NVIDIA® Jetson™ GPU, enable wireless 4K video sharing and use deep learning for voice and facial recognition.

The Deep Learning Workflow

The two major operations from which deep learning produces insights are training and inference. While similar, there are significant differences. Training feeds examples of objects to be detected or recognized, like animals, traffic signs, etc., into a neural network, allowing it to make predictions as to what these objects are. The training process reinforces correct predictions and corrects the wrong ones. Once trained, a production neural network can achieve upwards of 90 to 98 percent correct results. “Inference” is the deployment of a trained network to evaluate new objects and make predictions with similar predictive accuracy.

Fig 1

Both training and inference start with the forward propagation calculation, but training goes further. After forward propagation during training, the results are compared against the “ground truth” correct answer to compute an error value. The backward propagation phase then sends the error back through the network’s layers and updates the weights using stochastic gradient descent to improve the network’s performance on the task it’s trying to learn. It’s common to batch hundreds of training inputs (for example, images in an image classification network or spectrograms for speech recognition) and operate on them simultaneously during deep neural network (DNN) training to amortize loading weights from GPU memory across many inputs, greatly increasing computational efficiency.
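The difference between the two operations can be illustrated with a deliberately tiny model. The sketch below is not how a production DNN is trained; it assumes a one-weight linear model, a squared-error loss, and a made-up toy dataset, and every function name in it is hypothetical. It shows only the shape of the workflow: training runs a forward pass, measures the error against the ground truth, and updates the weights batch by batch, while inference runs the forward pass alone.

```python
import random

def forward(w, b, x):
    """Forward propagation for a one-weight linear model."""
    return w * x + b

def train_step(w, b, batch, lr):
    """One mini-batch SGD step: forward pass, error, gradients, update."""
    grad_w = 0.0
    grad_b = 0.0
    for x, target in batch:
        error = forward(w, b, x) - target   # compare with the ground truth
        grad_w += 2.0 * error * x           # gradient of error^2 w.r.t. w
        grad_b += 2.0 * error               # gradient of error^2 w.r.t. b
    n = len(batch)
    # Average the gradients over the batch and step against them.
    return w - lr * grad_w / n, b - lr * grad_b / n

def train(data, epochs=200, batch_size=4, lr=0.05, seed=0):
    """Shuffle the data each epoch and apply one SGD step per mini-batch."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            w, b = train_step(w, b, data[i:i + batch_size], lr)
    return w, b

# Toy ground truth (illustrative only): y = 3x + 1 on 21 points in [-1, 1].
samples = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = train(samples)

# Inference: forward propagation only, with no error or weight update.
prediction = forward(w, b, 0.5)
```

On this noise-free toy data the learned weight and bias should approach 3 and 1, so the prediction for an input of 0.5 should be close to 2.5.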

Inference can also batch hundreds of samples to achieve optimal throughput on jobs run overnight in data centers to process substantial amounts of stored data. These jobs tend to emphasize throughput over latency. However, for real-time usages, high batch sizes also carry a latency penalty. For these usages, lower batch sizes (as low as a single sample) are used, trading off throughput for lowest latency. A hybrid approach, sometimes referred to as “auto-batching,” sets a time threshold—say, 10 milliseconds (ms)—and batches as many samples as possible within those 10ms before sending them on for inference. This approach achieves better throughput while maintaining a set latency amount.
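As a rough illustration of the auto-batching idea, the sketch below groups a list of request arrival times into batches using a time window. It is a simplified offline simulation rather than a real serving loop, and the function name, its parameters, and the arrival times are all hypothetical. The optional batch-size cap is a common variation on the same idea.

```python
def auto_batch(arrival_times_ms, window_ms=10.0, max_batch=None):
    """Group request arrival times (in ms) into batches.

    A batch opens when its first request arrives. It closes when a later
    request arrives after the window has elapsed or, if max_batch is set,
    when the batch is already full.
    """
    batches = []
    current = []
    for t in sorted(arrival_times_ms):
        window_expired = bool(current) and t - current[0] >= window_ms
        batch_full = max_batch is not None and len(current) >= max_batch
        if current and (window_expired or batch_full):
            batches.append(current)   # send this batch on for inference
            current = []
        current.append(t)
    if current:
        batches.append(current)       # flush whatever is left at the end
    return batches

# Hypothetical arrival times in milliseconds.
arrivals = [0, 1, 3, 9, 12, 13, 30, 31, 32, 33]
batches = auto_batch(arrivals, window_ms=10.0)
capped = auto_batch(arrivals, window_ms=10.0, max_batch=3)
```

With the 10ms window alone, the ten requests fall into three batches. Adding a cap of three samples per batch splits them into four smaller batches, which trades some throughput for a tighter bound on how long any one request waits.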

TensorRT Hyperscale Inference Platform

The NVIDIA TensorRT™ Hyperscale Inference Platform is designed to make deep learning accessible to every developer and data scientist anywhere in the world. It all starts with the world’s most advanced AI inference accelerator, the NVIDIA Tesla® T4 GPU featuring NVIDIA Turing™ Tensor Cores. Based on NVIDIA’s new Turing architecture, Tesla T4 accelerates all types of neural networks for images, speech, translation, and recommender systems, to name a few. Tesla T4 supports a wide variety of precisions and accelerates all major DL frameworks, including TensorFlow, PyTorch, MXNet, Chainer, and Caffe2.

Since great hardware needs great software, NVIDIA TensorRT, a platform for high-performance deep learning inference, delivers low-latency, high-throughput inference for applications such as image classification, segmentation, object detection, machine language translation, speech, and recommendation engines. It can rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data centers, embedded, or automotive GPU platforms. The TensorRT optimizer and runtime unlock the power of Turing GPUs across a wide range of precisions, from FP32 all the way down to INT8. In addition, TensorRT integrates with TensorFlow and supports all major frameworks through the ONNX format.

NVIDIA TensorRT Inference Server, available as a ready-to-run container at no charge from NVIDIA GPU Cloud, is a production-ready deep learning inference server for data center deployments. It reduces costs by maximizing utilization of GPU servers and saves time by integrating seamlessly into production architectures. NVIDIA TensorRT Inference Server simplifies workflows and streamlines the transition to a GPU-accelerated infrastructure for inference.

And for large-scale, multi-node deployments, Kubernetes on NVIDIA GPUs enables enterprises to scale up training and inference deployment to multi-cloud GPU clusters seamlessly. It lets software developers and DevOps engineers automate deployment, maintenance, scheduling, and operation of multiple GPU-accelerated application containers across clusters of nodes. With Kubernetes on NVIDIA GPUs, they can build and deploy GPU-accelerated deep learning training or inference applications to heterogeneous GPU clusters at scale seamlessly.

The Tesla T4 Tensor Core GPU, Based on NVIDIA Turing Architecture

The NVIDIA Tesla T4 GPU is the world’s most advanced accelerator for all AI inference workloads. Powered by NVIDIA Turing™ Tensor Cores, T4 provides revolutionary multi-precision inference performance to accelerate the diverse applications of modern AI. T4 is a part of the NVIDIA AI Inference Platform that supports all AI frameworks and provides comprehensive tooling and integrations to drastically simplify the development and deployment of advanced AI.

Turing Tensor Cores are purpose-built to accelerate AI inference, and Turing GPUs also inherit all of the enhancements introduced to the NVIDIA CUDA® platform with the NVIDIA Volta™ architecture, improving the capability, flexibility, productivity, and portability of compute applications. Features such as independent thread scheduling, hardware-accelerated Multi-Process Service (MPS) with address space isolation for multiple applications, unified memory with address translation services, and Cooperative Groups are all part of the Turing GPU architecture.

NVIDIA Turing Innovations

Turing Key Features

New Streaming Multiprocessor (SM) with Turing Tensor Cores
The Turing SM builds on the major SM advancements of the Volta GV100 architecture and delivers major boosts in performance and energy efficiency compared to the previous-generation NVIDIA Pascal™ GPUs. Turing Tensor Cores not only provide FP16/FP32 mixed-precision matrix math like Volta Tensor Cores; they also add new INT8 and INT4 precision modes, massively accelerating a broad spectrum of deep learning inference applications.

Similar to Volta, the Turing SM provides independent floating-point and integer data paths, allowing a more efficient execution of common workloads with a mix of computation and address calculations. Also, independent thread scheduling enables finer-grain synchronization and cooperation among threads. Lastly, the combined shared memory and L1 cache improves performance significantly while simplifying programming.

Deep Learning Features for Inference
Turing GPUs deliver exceptional inference performance, versatility, and efficiency. Turing Tensor Cores, along with continual improvements in TensorRT, CUDA, and cuDNN libraries, enable Turing GPUs to deliver outstanding performance for inference applications. Turing also includes experimental features such as support for INT4 and INT1 formats to further research and development in deep learning.

GDDR6 High-Performance Memory Subsystem
Turing is the first GPU architecture to utilize GDDR6 memory, which represents the next big advance in high-bandwidth GDDR DRAM memory design that can deliver up to 320GB/sec. GDDR6 memory interface circuits in Turing GPUs have been completely redesigned for speed, power efficiency, and noise reduction. Turing’s GDDR6 memory subsystem delivers a 40 percent speedup and a 20 percent power efficiency improvement over GDDR5X memory used in Pascal GPUs.

Twice the Video Decode Performance
Video continues on its explosive growth trajectory, comprising over two-thirds of all internet traffic. Accurate video interpretation through AI is driving the most relevant content recommendations, finding the impact of brand placements in sports events, and delivering perception capabilities to autonomous vehicles, among other usages.

Tesla T4 delivers breakthrough performance for AI video applications, with dedicated hardware transcoding engines that bring twice the decoding performance of prior-generation GPUs. T4 can decode up to 38 full-HD video streams, making it easy to integrate scalable deep learning into the video pipeline to deliver innovative, smart video services. It features performance and efficiency modes to enable either fast encoding or the lowest bit-rate encoding without loss of video quality.

TensorRT 5 Features

The NVIDIA TensorRT Hyperscale Inference Platform is a complete inference solution that includes the cutting-edge Tesla T4 inference accelerator, the TensorRT 5 high-performance deep learning inference optimizer and runtime, and TensorRT Inference Server. This power trio delivers low latency and high throughput for deep learning inference applications and allows them to be quickly deployed. It can also leverage tools like Kubernetes, which can quickly scale containerized applications across multiple hosts. With TensorRT 5, neural network models can be optimized, calibrated for lower precision with high accuracy, and finally deployed to hyperscale data centers, embedded, or automotive product platforms. TensorRT-based applications on GPUs perform up to 50X faster than CPU during inference for models trained in all major frameworks.

TensorRT Optimizations

TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation, and natural language processing. Reduced-precision inference significantly lowers application latency while preserving model accuracy, which is a requirement for many real-time services as well as auto and embedded applications.
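To give a feel for what reduced precision means numerically, here is a minimal sketch of symmetric linear quantization to INT8. It is a generic textbook scheme and not a description of how TensorRT calibrates or quantizes a network; the function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using a single scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]

# Hypothetical FP32 weights, chosen only for illustration.
weights = [0.02, -0.51, 0.33, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The round-trip error of each value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in 8 bits instead of 32, and the largest round-trip error stays within half of the scale. How well a real model keeps its accuracy depends on how the scales are chosen, and that calibration step is not shown here.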

TensorRT and TensorFlow are now tightly integrated, giving developers the flexibility of TensorFlow with the powerful optimizations of TensorRT. MATLAB is integrated with TensorRT through GPU Coder so that engineers and scientists using MATLAB can automatically generate high-performance inference engines for Jetson, NVIDIA DRIVE™, and Tesla platforms.

TensorRT accelerates a wide diversity of usages, including images, video, speech recognition, neural machine translation, and recommender systems.

While it’s possible to do inference operations within a deep learning framework, TensorRT easily optimizes networks to deliver far more performance, and it includes new layers for multilayer perceptrons (MLP) and recurrent neural networks (RNNs). TensorRT also takes full advantage of the Turing architecture. Later in this paper, data will show how this combination delivers up to 45X more throughput than a CPU-only server.

Tesla GPUs, paired with the TensorRT inference optimizer, deliver massive performance gains, both on convolutional neural networks (CNNs), often used for image-based networks, as well as RNNs that are frequently used for speech and translation applications.

Inference Performance: Getting the Complete Picture

Measured performance in computing tends to fixate on speed of execution. But in deep learning inference performance, speed is one of seven critical factors that come into play. A simple acronym, PLASTER, captures these seven factors. They are:

  • Programmability
  • Latency
  • Accuracy
  • Size of Network
  • Throughput
  • Energy Efficiency
  • Rate of Learning

Deep learning is a complex undertaking, and so is choosing the right deep learning platform. All seven of these elements should be included in any decision analysis, and many of these factors are interrelated. Consider these seven factors and the role each plays.

Programmability: Machine learning is experiencing explosive growth, not only in the size and complexity of the models, but also in the burgeoning diversity of neural network architectures. NVIDIA addresses training and inference challenges with two key tools—CUDA and TensorRT, NVIDIA’s programmable inference accelerator.

In addition, NVIDIA’s deep learning platform accelerates ALL deep learning frameworks, both for training and inference.

Latency: Humans and machines need a response to an input to make decisions and take action. Latency is the time between requesting something and receiving a response. While AI continues to evolve rapidly, the latency targets for real-time services remain constant. For example, there is wide demand for digital assistants in both consumer and customer service applications. But when humans try to interface with digital assistants, a lag of even a few seconds starts to feel unnatural.

Accuracy: While accuracy is important in every industry, healthcare needs especially high accuracy. Medical imaging has advanced significantly in the last couple of decades, increasing usage and requiring more analysis to identify medical issues. Medical imaging advancements and usage also mean that large volumes of data must be transmitted from medical machines to medical specialists for analysis. An advantage of deep learning is that it can be trained at high precision and implemented at lower precision.

Size of Network: The size of a deep learning model and the capacity of the physical network between processors have impacts on performance, especially in the latency and throughput aspects of PLASTER. Deep learning network models are exploding in numbers. Their size and complexity are also increasing, enabling far more detailed analysis and driving the need for more powerful systems for training.

Throughput: Developers are increasingly optimizing inference within a specified latency threshold. While the latency limit ensures good customer experience, maximizing throughput within that limit is critical to maximizing data center efficiency and revenue. There’s been a tendency to use throughput as the only performance metric, as more computations per second generally lead to better performance across other areas. However, without the appropriate balance of throughput and latency, the result can be poor customer service, missing service-level agreements (SLAs), and potentially a failed service.
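One way to picture maximizing throughput within a latency limit is as a simple selection over measured batch sizes. The sketch below assumes a profile that maps batch size to measured latency; the function name and the numbers in the profile are hypothetical placeholders, not measurements from any platform in this paper.

```python
def best_batch_size(profile, latency_limit_ms):
    """Pick the batch size with the highest throughput within the limit.

    profile maps batch size to measured latency in milliseconds.
    Returns (batch_size, samples_per_second), or None if nothing fits.
    """
    best = None
    for batch_size, latency_ms in profile.items():
        if latency_ms > latency_limit_ms:
            continue  # this batch size would violate the latency limit
        throughput = batch_size * 1000.0 / latency_ms
        if best is None or throughput > best[1]:
            best = (batch_size, throughput)
    return best

# Hypothetical profile (batch size -> latency in ms), for illustration only.
example_profile = {1: 2.0, 2: 2.5, 4: 4.0, 8: 6.5, 16: 12.0}
choice = best_batch_size(example_profile, latency_limit_ms=7.0)
```

With these placeholder numbers, a batch size of 16 would give the most raw throughput but misses a 7ms limit, so the selection settles on a batch size of 8.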

Energy Efficiency: As DL accelerator performance improves, power consumption escalates. Providing a return on investment (ROI) for deep learning solutions involves more than looking at just the inference performance of a system. Power consumption can quickly increase the cost of delivering a service, driving a need to focus on energy efficiency in both devices and systems. Therefore, the industry measures operational success in inferences per watt (higher is better). Hyperscale data centers seek to maximize energy efficiency for as many inferences as they can deliver within a fixed power budget.
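The inferences-per-watt metric is a plain ratio. The short sketch below applies it to the Tesla P4 figure quoted later in this paper (1,676 images per second in a 75W solution); the function name is hypothetical.

```python
def inferences_per_watt(inferences_per_second, watts):
    """Energy efficiency as throughput divided by power draw."""
    return inferences_per_second / watts

# Tesla P4 on ResNet-50, using the figures quoted later in this paper.
p4_efficiency = inferences_per_watt(1676, 75)
```

This works out to a little over 22 images per second per watt. The same ratio can be computed for any platform once its throughput and power draw are measured on the same workload.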

Rate of Learning: One of the two words in “AI” is “intelligence,” and users want neural networks to learn and adapt within a reasonable time frame. For complex DL systems to gain traction in business, software tool developers must support the DevOps movement. DL models must be retrained periodically as inference services gather new data and as services grow and change. Therefore, IT organizations and software developers must increase the rate at which they can retrain models as new data arrives.


Image-based networks are used for image and video search, video analytics, object classification and detection, and a host of other usages. Looking at an image-based dataset (ImageNet) run on three different networks, a single Tesla P4 GPU is 12X faster than a CPU-only server, while the Tesla V100 Tensor Core GPU is up to 45X faster than that same CPU-only server.


RNNs are used for time-series or sequential data and are often applied as solutions for translation, speech recognition, natural language processing, and even speech synthesis. The data shown here are from the OpenNMT (neural machine translation) network, translating a dataset from German to English. The Tesla P4 is delivering 81X more throughput, while the Tesla V100 Tensor Core GPU is an even more impressive 352X faster than a CPU-only server.


Low-Latency Throughput

While achieving high throughput is critical to inference performance, so is achieving low latency. Many real-time use cases actually involve multiple inferences per query. For instance, a spoken question will traverse automatic speech recognition (ASR), speech-to-text, natural language processing, a recommender system, text-to-speech, and then speech synthesis. Each of these steps is a different inference operation. And while some pipelining is possible, the latency of each of those inference operations contributes to the latency of the overall experience. Shown here are low-latency throughputs for both CNNs and an RNN.

Developers have generally approached low-latency inference two ways: 1) processing requests immediately as they come in with no batching (also called batch size 1) or 2) using an “auto-batching” technique where a latency limit is set (e.g., 7ms), samples are batched until either that limit is hit or a certain batch size is achieved (e.g., batch size 8), and then the work is sent through the network for inference processing. The former approach is easier to implement, while the latter approach has the advantage of delivering more inference throughput while preserving the prescribed latency limit. To that end, we present CNN results using the 7ms latency budget approach, while the RNN results are shown using a batch size of one.


There have been some misconceptions that GPUs are unable to achieve very low latency at a batch size of one. However, as the chart below reflects, Tesla P4 and Tesla V100 are delivering 1.8 and 1.1ms, respectively, at a batch size of one, whereas a CPU server is at 6ms. In addition, the CPU server is delivering only 163 images per second, whereas Tesla P4 is at 562 images per second and the Tesla V100 is delivering 870 images per second.
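The latency and throughput figures above can be cross-checked with simple arithmetic. The sketch below assumes that requests are processed strictly one after another at a batch size of one, so throughput is roughly the reciprocal of latency. That is a simplifying assumption, and the quoted latencies are rounded, so only approximate agreement should be expected. The function name is hypothetical.

```python
def serial_throughput(latency_ms, batch_size=1):
    """Samples per second when batches run back to back, one at a time."""
    return batch_size * 1000.0 / latency_ms

# (latency in ms, reported images per second), as quoted in the text above.
reported = {
    "CPU server": (6.0, 163),
    "Tesla P4": (1.8, 562),
    "Tesla V100": (1.1, 870),
}

for name, (latency_ms, images_per_sec) in reported.items():
    estimate = serial_throughput(latency_ms)
    print(f"{name}: estimated {estimate:.0f}/s, reported {images_per_sec}/s")
```

The estimates come out at roughly 167, 556, and 909 images per second, each within about 5 percent of the reported figures, which suggests the latency and throughput numbers describe the same batch-size-one operating point.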


Performance Efficiency

We’ve covered maximum throughput already, and while very high throughput on deep learning workloads is a key consideration, so is how efficiently a platform can deliver that throughput.
Here we offer a first look at the Turing-based Tesla T4, whose efficiency far exceeds either the Tesla P4 or the Tesla V100. With its small form factor and 70-watt (W) footprint design, Tesla T4 is the world’s most advanced universal inference accelerator. Powered by Turing Tensor Cores, T4 will bring revolutionary multi-precision inference performance to efficiently accelerate the diverse applications of modern AI. In addition, Tesla T4 is poised to more than double the efficiency of its predecessor, the Tesla P4.


GPU Inference: Business Implications

Tesla V100 and P4 deliver massive performance boosts and power efficiency, but how does that benefit acquisition and operations budgets? Simply put, big performance translates into big savings.

The image below shows that a single server with 16 Tesla T4 GPUs running speech, NLP and video usages provides the same throughput performance as 200 CPU-based servers that take up four entire racks and require 60kW of power. The result? This single Tesla T4–equipped server will deliver a 30X reduction in power and a 200X reduction in server count.
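The consolidation claim can be unpacked with back-of-the-envelope arithmetic. The sketch below uses only figures quoted in this paper (200 CPU servers, 60kW, a 30X power reduction, 16 GPUs, and the 70W T4 footprint); the derived values are implications of those rounded figures, not measurements, and the variable names are hypothetical.

```python
cpu_servers = 200
cpu_total_kw = 60.0
power_reduction = 30.0
gpus_per_server = 16
t4_watts = 70.0  # T4 power footprint quoted earlier in this paper

# Average power per CPU-only server.
kw_per_cpu_server = cpu_total_kw / cpu_servers

# Power implied for the single GPU server by the 30X reduction.
gpu_server_kw = cpu_total_kw / power_reduction

# Share of that power attributable to the 16 GPUs themselves.
gpu_only_kw = gpus_per_server * t4_watts / 1000.0

# Whatever remains is the implied budget for the host system.
host_kw = gpu_server_kw - gpu_only_kw
```

Under these assumptions each CPU server averages 0.3kW, the GPU server is implied to draw about 2kW, and roughly 1.12kW of that would go to the sixteen T4 GPUs.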


Jetson: Inference at the Edge

NVIDIA Jetson TX2 is a credit card–sized open platform that delivers AI computing at the edge—opening the door to powerfully intelligent factory robots, commercial drones, and smart cameras for AI cities. Based on the NVIDIA Pascal architecture, Jetson TX2 offers twice the performance of its predecessor, or it can run at more than twice the power efficiency while drawing less than 7.5 watts (W) of power. This allows Jetson TX2 to run larger, deeper neural networks on edge devices. The result: smarter devices with higher accuracy and faster response times for tasks like image classification, navigation, and speech recognition. Deep learning developers can use the very same development tools for Jetson that they use on the Tesla platform, such as CUDA, cuDNN, and TensorRT.

Jetson TX2 was designed for peak processing efficiency at 7.5W of power. This level of performance, referred to as Max-Q, represents the maximum performance and maximum power efficiency range on the power/performance curve. Every component on the module, including the power supply, is optimized to provide the highest efficiency at this point. The Max-Q frequency for the GPU is 854MHz. For the ARM A57 CPUs, it’s 1.2GHz. While dynamic voltage and frequency scaling (DVFS) permits the NVIDIA Tegra® “Parker” system on a chip (SoC), which Jetson TX2 is based on, to adjust clock speeds at run time according to user load and power consumption, the Max-Q configuration sets a cap on the clocks to ensure that the application operates in the most efficient range only.

Jetson enables real-time inference when connectivity to an AI data center is either not possible (e.g., in the case of remote sensing) or the end-to-end latency is too high for real-time use (e.g., in the case of autonomous drones). Although most platforms with a limited power budget will benefit most from Max-Q behavior, others may prefer maximum clocks to attain peak throughput, albeit with higher power consumption and reduced efficiency. DVFS can be configured to run at a range of other clock speeds, including underclocking and overclocking. Max-P, the other preset platform configuration, enables maximum system performance in less than 15W. The Max-P frequency is 1.12GHz for the GPU. For the CPU, it is 2GHz when either the ARM A57 cluster or the Denver 2 cluster is enabled, and 1.4GHz when both clusters are enabled.


For many network-edge applications, low latency is a must-have. Executing inference on device is a far better approach than trying to send this work over a wireless network and in and out of a CPU-based server in a remote data center. In addition to its on-device locality, Jetson TX2 also delivers outstandingly low latency on small-batch workloads, usually under 10ms. For comparison, a CPU-based server has a latency of around 23ms, and adding roundtrip network and data center travel time, that figure can be well over 100ms.

The Rise of Accelerated Computing

Google has announced its Cloud Tensor Processing Unit (TPU) and its applicability to deep learning training and inference. And while Google and NVIDIA chose different development paths, there are several themes common to both approaches. Specifically, AI requires accelerated computing. Accelerators provide the significant data processing necessary to keep up with the growing demands of deep learning in an era when Moore’s law is slowing. Tensor processing—a major new workload that enterprises must consider when building modern data centers—is at the core of delivering performance for deep learning training and inference. Accelerating tensor processing can dramatically reduce the cost of building modern data centers.

According to Google, the TPUv2 (also referred to as “TPU 2.0”) has become available as a “Cloud TPU,” which consists of four TPUv2 chips. But comparing chip to chip, a single TPU chip can deliver 45TFLOPS of computing horsepower. NVIDIA’s Tesla V100 can deliver 125TFLOPS of deep learning performance for both training and inference. An 8-GPU configuration such as NVIDIA DGX-1™ can now deliver a petaflop (PFLOP) of deep learning computing power.
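The aggregate figures follow directly from the per-chip numbers. The sketch below simply multiplies the peak values quoted above; these are vendor peak ratings rather than measured application performance, and the variable names are hypothetical.

```python
V100_TFLOPS = 125         # peak deep learning TFLOPS per Tesla V100
TPU_V2_CHIP_TFLOPS = 45   # peak TFLOPS per TPUv2 chip

# An 8-GPU system: 8 x 125 TFLOPS.
dgx1_tflops = 8 * V100_TFLOPS
dgx1_pflops = dgx1_tflops / 1000

# A four-chip Cloud TPU: 4 x 45 TFLOPS.
cloud_tpu_tflops = 4 * TPU_V2_CHIP_TFLOPS
```

Eight V100 GPUs add up to 1,000 TFLOPS, which is the petaflop cited above, while a four-chip Cloud TPU adds up to 180 TFLOPS.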

NVIDIA’s approach democratizes AI computing for every company, every industry, and every computing platform and accelerates every development framework—from the cloud, to the enterprise, to cars, and to the edge. Google and NVIDIA are the clear leaders, collaborating closely while taking different approaches to enable the world with AI.

Note on FPGAs

As the deep learning field continues to grow rapidly, other types of hardware have been proposed as potential solutions for inference, such as field-programmable gate arrays (FPGAs). FPGAs are used for specific functions in network switches, 4G base stations, motor control in automotive, and test equipment in semiconductors, among other use cases. An FPGA is a sea of general-purpose programmable logic gates designed to simulate an application-specific integrated circuit (ASIC) for various usages, so long as the problem fits on the chip. But because these are programmable gates rather than a hard-wired ASIC, FPGAs are inherently less efficient.

At its recent Build conference, Microsoft claimed its FPGA-based Project BrainWave inference platform could deliver about 500 images per second on the ResNet-50 image network. However, to put this in perspective, a single Tesla P4 GPU can deliver more than 3X that throughput, or 1,676 images per second in a 75W solution. To compare further, shown here are projections made in a recent Intel whitepaper regarding their Altera and Stratix FPGAs. Note these results are run on the GoogLeNet network.


Programmability and Time to Solution Considerations

The speed of deep learning innovation drives the need for a programmable platform that enables developers to quickly try new network architectures, and iterate as new findings come to light. Recall that Programmability is the “P” in the PLASTER framework. There has been a Cambrian explosion of new network architectures that have emerged over the last several years, and this rate of innovation shows no signs of slowing.


Another challenge posed by FPGAs is that, in addition to software development, they must be reconfigured at the hardware level to run each iteration of new neural network architectures. This complex hardware development slows time to solution and, hence, innovation by weeks and sometimes months. On the other hand, GPUs continue to be the most programmable platform of choice for quickly prototyping, testing, and iterating cutting-edge network designs, thanks to robust framework acceleration support, dedicated deep learning logic like Tesla V100’s Tensor Cores, and TensorRT to optimize trained networks for deployed inference.


Deep learning is revolutionizing computing, impacting enterprises across multiple industrial sectors. The NVIDIA deep learning platform is the industry standard for training, and leading enterprises are already deploying GPUs for their inference workloads, leveraging its powerful benefits. Neural networks are rapidly becoming exponentially larger and more complex, driving massive computing demand and cost. In cases where AI services need to be responsive, modern networks are too compute-intensive for traditional CPUs.

Inference performance has seven aspects best remembered by PLASTER: programmability, latency, accuracy, size of network, throughput, efficiency, and rate of learning. All are critical to delivering both data center efficiency and great user experiences. This paper demonstrates how Tesla GPUs can deliver up to a 200X reduction in servers needed in the data center for “offline inference” use cases. In fact, the savings in energy costs alone more than pays for the Tesla-powered server. And at the network’s edge, Jetson TX2 brings server-class inference performance in less than 10W of power and enables device-local inference to significantly cut inference latency times. These big improvements will enable state-of-the-art AI to be used end-to-end in real-time services that include speech recognition, speech-to-text, recommender systems, text-to-speech and speech synthesis.

An effective deep learning platform must have three distinct qualities: 1) It must have a processor custom-built for deep learning. 2) It must be software-programmable. 3) And industry frameworks must be optimized for it, powered by a developer ecosystem that is accessible and adopted around the world. The NVIDIA deep learning platform is designed around these three qualities and is the only end-to-end deep learning platform. From training to inference. From data center to the network’s edge.
