AI-HPC is Happening Now
InsideHPC Special Report by Rob Farber
People speak of the Artificial Intelligence (AI) revolution, in which computers are used to create data-derived models with better-than-human predictive accuracy. The result has been explosive adoption of the technology, which in turn has fueled a number of extraordinary scientific, algorithmic, software, and hardware developments.
Pradeep Dubey, Intel Fellow and Director of Parallel Computing Lab, notes: “The emerging AI community on HPC infrastructure is critical to achieving the vision of AI: machines that don’t
just crunch numbers, but help us make better and more informed complex decisions.”
Using AI to make better and more informed decisions as well as autonomous decisions means bigger and more complex data sets are being used for training. Self-driving cars are a popular
example of autonomous AI, but AI is also being used to make decisions to highlight scientifically relevant data such as cancer cells in biomedical images, find key events in High Energy Physics (HEP) data, search for rare biomedical events, and much more.
“The emerging AI community on HPC infrastructure is critical to achieving the vision of AI: machines that don’t just crunch numbers, but help us make better and more informed complex decisions.”
– Pradeep Dubey, Intel Fellow and Director of Parallel Computing Lab
Scalability is the key to AI-HPC: it lets scientists address the big compute and big data challenges facing them, and make sense of the wealth of measured, modeled, and simulated data now available to them.
- In the HPC community, scientists tend to focus on modeling and simulation, using Monte Carlo and other techniques to generate accurate representations of what goes on in nature. In the high-energy physics and astrophysics communities, this means examining images using deep learning. In other areas, it means examining time series, performing signal processing, bifurcation analysis, harmonic analysis, nonlinear system modeling and prediction, and much more.
- Similarly, empirical AI efforts attempt to generate accurate representations of what goes on in nature from unstructured data gathered from remote sensors, cameras, electron microscopes, sequencers, and the like. The uncontrolled nature of this unstructured data means that data scientists spend much of their time extracting and cleansing meaningful information with which to train the AI system. The good news is that current Artificial Neural Network (ANN) systems are relatively robust (meaning they exhibit reasonable behavior) when presented with noisy and conflicting data. Future technologies such as neuromorphic chips promise self-taught, self-organizing ANN systems that learn directly from observed data.
So, HPC and the data-driven AI communities are converging, as they arguably run the same types of data- and compute-intensive workloads on HPC hardware, be it a leadership-class supercomputer, a small institutional cluster, or the cloud.
The focus needs to be on the data, which means the software and hardware need to “get out of the way”. Major AI software efforts are focusing on assisting the data scientist so they can use familiar software tools that can run anywhere and at scale on both current and future hardware solutions. Much of the new hardware that is being developed addresses expanding user needs caused by the current AI explosion.
The demand for performant and scalable AI solutions has stimulated a convergence of science, algorithm development, and affordable technologies to create a software ecosystem
designed to support the data scientist—no ninja programming required!
This white paper is a high level educational piece that will address this convergence, the existing ecosystem of affordable technologies, and the current software ecosystem that everyone can
use. Use cases will be provided to illustrate how people apply these AI technologies to solve HPC problems.
Section 1: An Overview of AI in the HPC Landscape
Understanding the terms and value proposition
It is important to understand what we mean by AI, as this is a highly overloaded term. It is truly remarkable that machines, for the first time in human history, can deliver better-than-human accuracy on complex ‘human’ activities such as facial recognition, and further that this better-than-human capability was realized solely by providing the machine with example data in what is called a training set. The AI explosion has been driven by an increase in compute capability and architectural advancements, innovation in AI technologies, and the increase in available data.
The term ‘artificial intelligence’ has been associated with machine learning and deep learning technology for decades. Machine learning describes the process of a machine programming itself (i.e. learning) from data. The popular phrase ‘deep learning’ encompasses a subset of the more general term ‘machine learning’.
Originally, deep learning described the many hidden layers that scientists used to mimic the many neuronal layers in the brain. Since then, it has become a technical term for certain types of artificial neural networks (ANNs) that have many hidden, or computational, layers between the input neuron layer, where data is presented for training or inference, and the output neuron layer, where the numerical results can be read. The numerical values of the “weights” (also known as the parameters of the ANN) that guide the numerical results close to the “ground truth” contain the information that companies use to identify faces, recognize speech, read text aloud, and provide a plethora of new and exciting capabilities.
The forms of AI that use machine learning require a training process that fits the ANN model to the training set with low error. This is done by adjusting the values or parameters of the ANN. Inferencing refers to the application of that trained model to make predictions, classify data, make money, etcetera. In this case, the parameters are kept fixed based on a fully trained model.
Inferencing can be done quickly, even in real time, to solve valuable problems. For example, the following graphic shows the rapid spread of speech recognition (which uses AI) in the data center, as reported by Google Trends. Basically, people are using the voice features of their mobile phones more and more.
Figure 1: Google Voice Search Queries
Inferencing also consumes very little power, which is why forward-thinking companies are incorporating inferencing into edge devices, smart sensors, and IoT (Internet of Things) devices. Many data centers exploit the general-purpose nature of CPUs like Intel® Xeon® processors to perform volume inferencing in the data center or cloud. Others are adding FPGAs for extremely low-latency volume inferencing.
Training, on the other hand, is very computationally expensive and can run 24/7 for very long periods of time (e.g. months or longer). It is reasonable to think of training as the process of adjusting the model weights of the ANN: during each step of the optimization procedure, a large number of inferencing operations are performed in a highly parallel fashion over a fixed training set, and the weights are readjusted to minimize the error from the ground truth. Thus, parallelism and scalability, as well as floating-point performance, are key to quickly finding a model that accurately represents the training data with low error.
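The loop described above (inference over the training set, then a weight readjustment that reduces the error against the ground truth) can be sketched in a few lines of NumPy. The tiny network, the XOR training set, and all hyperparameters below are illustrative assumptions, not details from the report.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # training inputs
y = np.array([[0.], [1.], [1.], [0.]])                  # ground truth (XOR)

# One hidden layer of 8 neurons; the weights are the parameters being fitted.
W1, b1 = rng.normal(0.0, 1.0, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0.0, 1.0, (8, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(10000):
    # Inference (forward pass) over the fixed training set.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    err = out - y                          # distance from the ground truth
    # Readjust the weights to reduce the error (backpropagation).
    d_out = err * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

# After training the weights stay fixed; inference is now cheap.
pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
```

Each pass of the loop is exactly the pattern the text describes: many parallel inference operations over the whole training set, followed by one weight update.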
Now you should understand that all the current wonders of AI are the result of this model fitting, which means we are really just performing math. It’s best to discard the notion of a software version of C3PO from Star Wars running in the computer; that form of AI still resides in the future. Also, we will limit ourselves to the current industry interest in ANNs, but note that there are other, less popular, forms of AI such as Genetic Algorithms, Hidden Markov Models, and more.
Scalability is a requirement, as training is considered a big data problem: as demonstrated in the paper How Neural Networks Work, the ANN is essentially fitting a complex, multidimensional surface.
People are interested in using AI to solve complex problems, which means the training process needs to fit very rough, convoluted, bumpy surfaces. Think of a boulder field. There are lots and lots of points of inflection, both big and small, that define where all the rocks and holes are, as well as their shapes. Now increase that exponentially as we attempt to fit a hundred- or thousand-dimensional ‘boulder field’.
Thus, it takes lots of data to represent all the important points of inflection (e.g. bumps and crevasses) because there are simply so many of them. This explains why training is generally
considered a big data problem. Succinctly: smooth, simple surfaces require little data, while most complex real-world data sets require lots of data.
Figure 2: Image courtesy of wikimedia commons
Accuracy and time-to-model are all that matter when training
It is very important to understand that time-to-model and the accuracy of the resulting model are really the only performance metrics that matter when training because the goal is to quickly develop a model that represents the training data with high accuracy.
For machine learning models in general, it has been known for decades that training scales in a near-linear fashion, which means that people have used tens and even hundreds of thousands of computational nodes to achieve petaflop/s training performance. Further, there are well-established numerical methods, such as Conjugate Gradient and L-BFGS, that help find the set of model parameters that represents the data with low error during training. The most popular machine learning packages make it easy to call these methods.
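As one concrete illustration of these numerical methods, here is a minimal Conjugate Gradient solver for the simplest case: a quadratic loss f(x) = ½xᵀAx − bᵀx, whose minimizer solves Ax = b. The small test matrix is an illustrative assumption; production machine learning packages wrap far more robust implementations.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Minimize 0.5 x^T A x - b^T x for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x              # residual = negative gradient at x
    p = r.copy()               # first search direction
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)   # exact minimizing step along p
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p         # next A-conjugate direction
        r = r_new
    return x

# Small SPD system: CG converges in at most n steps in exact arithmetic.
A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])
x = conjugate_gradient(A, b)
```

The same few dozen lines, scaled up and parallelized, are what the popular packages invoke under the hood when asked to use a CG-style optimizer.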
A breakthrough in deep learning scaling
For various reasons, distributed scaling has been limited for deep learning training applications. As a result, people have focused on hardware metrics like floating-point, cache, and memory subsystem performance to distinguish which hardware platforms are likely to perform better than others and deliver that desirable fast time-to-model.
How fast the numerical algorithm used during training reaches (or converges to) a solution greatly affects time-to-model. Today, most people use the same packages (e.g. Caffe*, TensorFlow*, Theano, or the Torch middleware), so the convergence rate and the accuracy of the resulting solution have largely been neglected as points of comparison. The assumption is that the same software and algorithms shouldn’t differ much in convergence behavior across platforms, hence the accuracy of the models should be the same or very close. New approaches to distributed training have disrupted this assumption.
A recent breakthrough in deep-learning training occurred as a result of work by collaborators with the Intel® Parallel Computing Center (Intel® PCC) program that now brings fast and accurate distributed deep-learning training to everyone, regardless of whether they run on a leadership-class supercomputer or a workstation.
“Scaling deep-learning training from single-digit nodes just a couple of years back to almost 10,000 nodes now, adding up to more than ten petaflop/s is big news.”
– Pradeep Dubey, Intel Fellow and Director of Parallel Computing Lab
Succinctly, a multiyear collaboration between Stanford, NERSC, and Intel established that the training of deep-learning ANNs can scale to 9,600 computational nodes and deliver 15 PF/s of deep-learning training performance. Prior to this result, the literature reported scaling of deep-learning training to only a few nodes. Dubey observed that “scaling deep-learning from a few nodes to more than ten petaflop/s is big news”: deep-learning training is now a member of the petascale club.
Putting it all together, the beauty of the math being used to power the current AI revolution is that once the data and model of the ANN have been specified, performance only depends on the training software and underlying computer hardware. Thus, the best “student” is the computer or compute cluster that can deliver the fastest time to solution while finding acceptable accuracy solutions. This means that those lucky enough to run on leadership class supercomputers can address the biggest and most challenging problems in the world with AI.
Joe Curley, Director of HPC Platform and Ecosystem Enabling at Intel, explained the importance of these accomplishments for everyone: “AI is now becoming practical to deploy at scale”.
In short, neither a leadership-class supercomputer nor specialized hardware is required to achieve fast time-to-model and high accuracy—even on big data sets. This means HPC scientists can work locally on fast workstations and small clusters using optimized software. If required, HPC scientists can then run at scale on a large supercomputer like Cori to solve leadership class scientific problems using AI. Thus, it is critical to have both optimized and scalable tools in the AI ecosystem.
Inferencing is the operation that makes data-derived models valuable, because they can predict the future and perform recognition tasks better than humans.
Inferencing works because once the model is trained (meaning the bumpy surface has been fitted) the ANN can interpolate between known points on the surface to correctly make predictions for data points it has never seen before—meaning they were not in the original training data. Without getting too technical, ANNs perform this interpolation on a nonlinear (bumpy) surface, which means that ANNs can perform better than a straight line interpolation like a conventional linear method. Further, ANNs are also able to extrapolate from known points on the fitted surface to make predictions for data points that are outside of the range of data the ANN saw during training. This occurs because the surface being fitted is continuous. Thus, people say the trained model has generalized the data so it can make correct predictions.
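The interpolation idea can be made concrete with a stand-in model: below, a cubic polynomial fit (an illustrative simplification, not an ANN) is trained on sparse samples of a smooth curve and then queried at a point it never saw during fitting. The curve and query point are arbitrary choices.

```python
import numpy as np

x_train = np.linspace(0.0, 3.0, 10)       # sparse known points on the surface
y_train = np.sin(x_train)                 # "ground truth" training values
coeffs = np.polyfit(x_train, y_train, 3)  # fit a smooth nonlinear curve

x_unseen = 1.37                           # never present in the training data
y_pred = np.polyval(coeffs, x_unseen)
error = abs(y_pred - np.sin(x_unseen))    # small, because the fitted curve
                                          # interpolates between known points
```

Because the fitted curve is continuous and nonlinear, the prediction at the unseen point is far better than a naive lookup of the nearest training sample, which is the essence of generalization.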
Don’t limit your thinking about what AI can do
The popularity of “better than human” classification on tasks that people do well (such as recognizing faces, Internet image search, self-driving cars, etcetera) has reached a mass audience. What has been lacking is coverage that machine learning is also fantastic at performing tasks that humans tend to do poorly.
For example, machine learning can be orders of magnitude better at signal processing and nonlinear system modeling and prediction than other methods. As a result, machine learning has been used for decades to model electrochemical systems, perform bifurcation analysis, perform Independent Component Analysis, and much, much more.
Similarly, an autoencoder can be used to efficiently encode data using analog encoding (which can be much more compact than traditional digital compression methods), perform PCA (Principal Component Analysis) and NLPCA (Nonlinear Principal Component Analysis), and perform dimension reduction. Many of these methods are part and parcel of the data scientist’s toolbox.
Dimension reduction is of particular interest to anyone who uses a database.
Basically, an autoencoder addresses the curse-of-dimensionality problem, where every point effectively becomes a nearest neighbor of every other point in the database as the dimension of the data increases. People quickly discover that they can put high-dimensional data into a database, but their queries return either no data or all the data. An autoencoder trained to represent the high-dimensional data in a lower dimension with low error preserves the relationships between the data points, allowing database searches to find similar items in the lower-dimensional space. In other words, autoencoders can let people perform meaningful searches on high-dimensional data, which can be a big win for many people. Autoencoders are also very useful for those who wish to visualize high-dimensional data.
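The curse of dimensionality described above is easy to demonstrate numerically: as the dimension grows, the nearest and farthest neighbors of a random query point become almost equally distant, so raw similarity queries lose meaning. The sample sizes and dimensions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_contrast(dim, n_points=2000):
    """Ratio of nearest to farthest neighbor distance for a random query.

    A ratio near 1.0 means every point is effectively a nearest neighbor.
    """
    points = rng.random((n_points, dim))        # uniform points in [0,1]^dim
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)  # Euclidean distances
    return d.min() / d.max()

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```

In low dimensions the ratio is tiny (neighbors are meaningfully closer than strangers); in high dimensions it approaches 1, which is exactly why a trained autoencoder's low-dimensional representation makes similarity search useful again.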
Platform perceptions are changing
Popular perceptions about the hardware requirements for deep learning are changing as CPUs continue to be the desired training platform in the data center. The reason is that CPUs deliver the parallelism plus the memory and cache performance required to support the massively parallel, FLOP/s-intensive nature of training, yet they can also efficiently support both general-purpose workloads and machine learning data-preprocessing workloads. Thus, data scientists and data centers are converging on similar hardware solutions, namely hardware that can meet all their needs rather than accelerate just one aspect of machine learning. This reflects recent data center procurement trends like MareNostrum 4 at Barcelona Supercomputing Center and the TACC Stampede2 upgrade, both of which were established to provide users with general workload support.
“Lessons Learned”: Pick a platform that supports all your workloads
In particular, don’t ignore data preprocessing as the extraction and cleaning of training data from unstructured data sources can be as big a computational problem as the training process
itself. Most deep learning articles and tutorials neglect this “get your hands dirty with the data” issue, but the importance of data preprocessing cannot be overstated!
Data scientists tend to spend most of their time working with the data. Most are not ninja programmers, so support for the productivity languages they are familiar with is critical to having them work effectively. After that, the hardware must provide big memory and a performant solid-state storage subsystem before time-to-model performance considerations even come into play.
Thus, the popularity of AI today reflects the convergence of scalable algorithms, distributed training frameworks, hardware, software, data preprocessing, and productivity languages so
people can use deep learning to address their computation models, regardless of how much data they might have—and regardless of what computing platform they may have.
Integrating AI into infrastructure and applications
Intel is conducting research to help bridge the gap and bring about the much-needed HPC-AI convergence. The IPCC collaboration that achieved 15 PF/s of performance using 9,600 Intel® Xeon Phi™ processor-based nodes on the NERSC Cori supercomputer is one example.
An equally important challenge in the convergence of HPC and AI is the gap between programming models. HPC programmers can be “parallel programming ninjas”, but deep learning and machine learning are mostly programmed using MATLAB-like frameworks. The AI community is evolving to address the challenge of delivering scalable, HPC-like performance for AI applications without the need to train data scientists in low-level parallel programming.
Further, vendors like Intel are addressing the software issue at the library, language, and runtime levels. More specifically, in collaboration with two university partners, Intel has achieved a significant (more than 10x) performance improvement through library calls, and also by enabling MPI libraries to be called efficiently from Apache Spark*, an approach described in Bridging the Gap Between HPC and Big Data Frameworks at the Very Large Data Bases Conference (VLDB) earlier this year. Additionally, Intel, in collaboration with Julia Computing and MIT, has managed to significantly speed up Julia* programs both at the node level and on clusters. Underneath the source code, ParallelAccelerator and the High Performance Analytics Toolkit (HPAT) turn programs written in productivity languages (such as Julia and Python*) into highly performant code. These have been released as open source projects to help those in academia and industry push advanced AI runtime capabilities even further.
Section 2: Customer Use Cases
Boston Children’s Hospital
Simon Warfield, Director of the Computational Radiology Laboratory at Boston Children’s Hospital and Professor of Radiology at Harvard Medical School, is working to use AI to better diagnose various forms of damage in the human brain including concussions, autism spectrum disorder, and Pediatric Onset Multiple Sclerosis. The end goal will be a tool that allows clinicians to look directly at the brain rather than attempting to diagnose problems—and efficacy of treatment—by somewhat subjective assessments of cognitive behavior.
Collecting and processing the data
Professor Warfield and his team have created a Diffusion Compartment Imaging (DCI) technique which is able to extract clinically relevant information about soft tissues in the brain. DCI is
an improvement to Diffusion-Weighted Imaging (DWI) that works by using pulsing magnetic field gradients during an MRI scan to measure the atomic-scale movements of water molecules
(called Brownian motion). The differential magnitude and direction of these movements provides the raw data needed to identify microstructures in the brain and to diagnose the integrity and health of neural tracts and other soft-tissue components.
However, there are several challenges to make DCI a clinically useful tool. For example, it can take about an hour to perform a single DCI scan when aiming at a very high spatial resolution, or about 10 minutes for most clinical applications. Before optimization, it took over 40 hours to process the tens of gigabytes of resulting water diffusion data.
Such long processing times made DCI completely unusable in many emergency situations, and further made DCI challenging to fit into the workflow of today’s radiology departments. After optimization, it now takes about an hour to process the data.
Professor Warfield and Boston Children’s Hospital have been an Intel Parallel Computing Center (IPCC) for several years. This gave Professor Warfield’s team an extraordinary opportunity to optimize their code.
The results of the optimization work were transformative, providing a remarkable 75X performance improvement when running on Intel Xeon processors and a 161X improvement running on Intel Xeon Phi processors. A complete DCI study can now be completed in 16 minutes on a workstation, which means DCI can now be used in emergency situations, in a clinical setting, and to evaluate the efficacy of treatment. Even better, higher resolution images can be produced because the optimized code scales.
Volume inferencing to find brain damage
Data processing isn’t the only challenge with using DCI in clinical settings as a scan typically generates tens to hundreds of images. Many radiologists are finding it difficult to keep up with the increasing volume of work.
Professor Warfield believes AI is a solution to this problem. The vision is to train a model that can automatically and quickly sort through hundreds of images to pinpoint those that differ from comparable images of a healthy brain. The goal is to provide the radiologist with the essential highlights of a complex study, identifying not only the most relevant images, but also pinpointing the critical areas on those images.
Such pinpoint inferencing can be seen in the next case study.
Detecting cancer cells more efficiently
Cancer cells will be detected and classified more efficiently and accurately using groundbreaking artificial intelligence, thanks to a new collaboration between the University of Warwick, Intel Corporation, the Alan Turing Institute, and University Hospitals Coventry & Warwickshire NHS Trust (UHCW). The collaboration is part of The Alan Turing Institute’s strategic partnership with Intel.
Scientists at the University of Warwick’s Tissue Image Analytics (TIA) Laboratory—led by Professor Nasir Rajpoot from the Department of Computer Science —are creating a large, digital
repository of a variety of tumour and immune cells found in thousands of human tissue samples, and are developing algorithms to recognize these cells automatically.
Figure 4: Cancer tumor cells are highlighted in red
(Source: University of Warwick)
Professor Rajpoot commented: “The collaboration will enable us to benefit from world-class computer science expertise at Intel with the aim of optimizing our digital pathology image analysis software pipeline and deploying some of the latest cutting-edge technologies developed in our lab for computer-assisted diagnosis and grading of cancer.”
The digital pathology imaging solution aims to enable pathologists to increase their accuracy and reliability in analyzing cancerous tissue specimens over what can be achieved with existing methods.
“We have long known that important aspects of cellular pathology can be done faster with computers than by humans,” said Professor David Snead, clinical lead for cellular pathology and
director of the UHCW Centre of Excellence.
Carnegie Mellon University and learning in a limited information environment
Being able to collect large amounts of data from a simulation or via automated empirical observations like gene sequencers is a luxury that is not available to some HPC scientists. AI that can compete against and beat humans in limited information games has great potential, because so many activities between humans happen in the context of limited information such as financial trading and negotiations and even the much simpler task of buying a home.
Some scientists are utilizing games and game theory to address the challenges of learning in a limited-information environment. In particular, the Nash equilibrium concept makes it possible to reason about behavior from the assumption of rational play, rather than by the traditional AI approach of collecting extensive amounts of data. In addition, this approach can address issues of imperfect recall, which poses a real problem for data-driven machine learning because human forgetfulness can invalidate much of a training set of past sampled behavioral data.
“In the area of game theory,” said Professor Tuomas Sandholm (Professor, Carnegie Mellon University), “No-Limit Heads-Up Texas Hold ‘em is the holy grail of imperfect-information games.” In Texas Hold ‘em, players have a limited amount of information available about the opponents’ cards because some cards are played on the table and some are held privately in each player’s hand. The private cards represent hidden information that each player uses to devise their strategy. Thus, it is reasonable to assume that each player’s strategy is rational and designed to win the greatest amount of chips.
In January of this year, some of the world’s top poker champions were challenged to 20 days of No-limit Heads-up Texas Hold ‘em poker at the Brains versus Artificial Intelligence tournament in Pittsburgh’s Rivers Casino. Libratus, the artificial intelligence (AI) engine designed by Professor Sandholm and his graduate student, beat all four opponents, taking away more than $1.7 million in chips.
Libratus is an AI system designed to learn in a limited information environment. It consists of three modules:
- A module that uses game-theoretic reasoning, just with the rules of the game as input, ahead of the actual game, to compute a blueprint strategy for playing the game. According
to Sandholm, “It uses Nash equilibrium approximation, but with a new version of the Monte Carlo counterfactual regret minimization (MCCRM) algorithm that makes the MCCRM
faster, and it also mitigates the issue of involving imperfect recall abstraction.”
- A subgame solver that recalculates strategy on the fly so the software can refine its blueprint strategy with each move.
- A post-play analyzer that reviews the opponent’s plays to find where they exploited holes in Libratus’s strategies. “Then, Libratus recomputes a more refined strategy for those parts of the state space, essentially finding and fixing the holes in its strategy,” stated Sandholm.
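The regret-minimization idea underneath these modules can be sketched in miniature with regret matching on rock-paper-scissors, where self-play drives the average strategies toward the game's unique Nash equilibrium (uniform play). This toy sketch is only illustrative; Libratus's actual algorithms are vastly more sophisticated.

```python
import numpy as np

# Payoff to the row player; rows/columns are (rock, paper, scissors).
PAYOFF = np.array([[ 0., -1.,  1.],
                   [ 1.,  0., -1.],
                   [-1.,  1.,  0.]])

def get_strategy(regrets):
    """Regret matching: play actions in proportion to positive regret."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(3, 1.0 / 3.0)

def self_play(iters=50000):
    """Both players run regret matching; average strategies approach Nash."""
    r1 = np.zeros(3)
    r2 = np.array([0.1, 0.0, 0.0])  # tiny asymmetry so play actually evolves
    s1_sum, s2_sum = np.zeros(3), np.zeros(3)
    for _ in range(iters):
        s1, s2 = get_strategy(r1), get_strategy(r2)
        s1_sum += s1
        s2_sum += s2
        u1 = PAYOFF @ s2             # row player's expected value per action
        u2 = -(PAYOFF.T @ s1)        # column player's (zero-sum) values
        r1 += u1 - s1 @ u1           # regret = action value - strategy value
        r2 += u2 - s2 @ u2
    return s1_sum / iters, s2_sum / iters

avg1, avg2 = self_play()
```

Against rational opposition no pure strategy survives, so the averaged play converges toward one-third rock, paper, and scissors each, which is the kind of equilibrium reasoning the blueprint module performs at enormous scale.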
This is definitely an HPC problem. During tournament play, Libratus consumed 600 nodes of the Pittsburgh Supercomputing Center’s Bridges supercomputer: 400 nodes for the subgame solver and 200 for the self-learning module (i.e. the third module). At night, Libratus consumed all 600 nodes for the post-play analysis.
Sandholm also mentioned the general applicability of Libratus beyond poker: “The algorithms in Libratus are not poker-specific at all, the research involved in finding, designing, and developing application independent algorithms for imperfect information games is directly applicable to other two player, zero sum games, like cyber security, bilateral negotiation in an adversarial situation, or military planning between us and an enemy.” For example, Sandholm and his team have been working since 2012 to steer evolutionary and biological adaptation to trap diseases. One example is to use a sequence of treatments so that T-cells have a better chance of fighting a disease.
Deepsense.ai and reinforcement learning
A team from deepsense.ai utilized asynchronous reinforcement learning and TensorFlow running on CPUs to learn to play classic Atari games. The idea, described on the deepsense.ai blog, was to use multiple concurrent environments to speed up the training process while interacting with these real-time games. “This work provides a clear indication of the capabilities of modern AI and has implications for using deep learning and image recognition techniques to learn in other live visual environments”, explains Robert Bogucki, CSO at deepsense.ai.
The key finding is that by distributing the BA3C reinforcement learning algorithm, the team was able to make an agent rapidly teach itself to play a wide range of Atari games, by just looking at the raw pixel output from the game emulator. The experiments were distributed across 64 machines, each of which had 12 Intel CPU cores. In the game of Breakout, the agent achieved a
superhuman score in just 20 minutes, which is a significant reduction of the single-machine implementation learning time.
Training for Breakout on a single computer takes around 15 hours, bringing the distributed implementation very close to the theoretical scaling (assuming computational power is maximized, using 64 times more CPUs should yield a 64-fold speed-up). The graph below shows the scaling for different numbers of machines. Moreover, the algorithm exhibits robust results on many Atari environments, meaning that it is not only fast, but can also adapt to various learning tasks.
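The scaling claim can be checked directly from the numbers quoted above: 15 hours on a single machine versus 20 minutes on 64 machines gives roughly a 45x speedup out of an ideal 64x.

```python
single_machine_min = 15 * 60  # 15 hours of single-machine training, in minutes
distributed_min = 20          # distributed training time on 64 machines
machines = 64

speedup = single_machine_min / distributed_min  # achieved speedup (45x)
efficiency = speedup / machines                 # fraction of the ideal 64x
print(f"speedup: {speedup:.0f}x, parallel efficiency: {efficiency:.0%}")
```

This back-of-the-envelope check is the same comparison the graph in Figure 5 makes against the theoretical scaling line.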
Figure 5: Mean time the DBA3C algorithm needed to achieve a score of 300 in Breakout (an average score of 300 must be obtained in 50 consecutive tries). The green line shows the theoretical scaling in reference to a single machine.
Section 3: The Software Ecosystem
The AI software ecosystem is rapidly expanding with research breakthroughs being quickly integrated into popular software packages (TensorFlow, Caffe, etcetera) and productivity languages (Python, Julia, R, Java*, and more) in a scalable and hardware agnostic fashion. In short, AI software must be easy to deploy, should run anywhere, and should leverage human expertise rather than forcing the creation of a “one-off” application.
This whitepaper briefly touched on how Intel is conducting research to help bridge the gap and bring about the much needed HPC-AI convergence. Additional IPCC research insights and scientific publications are available on the IPCC Web resources website.
Even the best research is for naught if HPC and data scientists cannot use the new technology. This is why scalability and performance breakthroughs are being quickly integrated into
performance libraries such as the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and Intel® Nervana™ Graph.
The performance of productivity languages such as Python, Julia, Java, R and more is increasing by leaps and bounds. These performance increases benefit data scientists and all aspects of AI from data preprocessing to training as well as inference and interpretation of the results. Julia, for example, recently delivered a peak performance of 1.54 petaflops using 1.3 million threads on 9,300 Intel Xeon Phi processor nodes of the Cori supercomputer at NERSC. The Celeste project utilized a code written entirely in Julia that processed approximately 178 terabytes of celestial image data and produced estimates for 188 million stars and galaxies in 14.6 minutes.
Figure 6: Speedup of Julia for deep learning when using Intel®
Math Kernel Library (Intel® MKL) vs. the standard software stack
Jeff Regier, a postdoctoral researcher in UC Berkeley’s Department of Electrical Engineering and Computer Sciences, explained the Celeste AI effort: “Both the predictions and the uncertainties are based on a Bayesian model, inferred by a technique called Variational Bayes. To date, Celeste has estimated more than 8 billion parameters based on 100 times more data than any previous reported application of Variational Bayes.” Bayesian models are a form of machine learning used by data scientists in the AI community.
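To give a flavor of what Variational Bayes does, here is a toy illustration (not the Celeste code, which is written in Julia, and at vastly larger scale): mean-field coordinate-ascent inference for a single Gaussian with unknown mean and precision. The priors and data below are illustrative assumptions.

```python
# Toy Variational Bayes sketch: mean-field inference for a Gaussian with
# unknown mean mu and precision tau, using the standard conjugate setup.
# q(mu) is Normal, q(tau) is Gamma; priors and data are assumptions.

def variational_bayes(x, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, iters=50):
    n = len(x)
    xbar = sum(x) / n
    e_tau = 1.0  # initial guess for E[tau]
    for _ in range(iters):
        # Update q(mu) = Normal(mu_n, 1/lam_n)
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) = Gamma(a_n, b_n) using expectations under q(mu)
        sq = sum((xi - mu_n) ** 2 for xi in x) + n / lam_n
        a_n = a0 + (n + 1) / 2.0
        b_n = b0 + 0.5 * (lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n) + sq)
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

data = [4.8, 5.2, 5.1, 4.9, 5.0, 5.3, 4.7, 5.0]
mu_n, lam_n, a_n, b_n = variational_bayes(data)
print(f"posterior mean of mu ~ {mu_n:.3f}, E[tau] ~ {a_n / b_n:.1f}")
```

The alternating updates converge quickly; the appeal for projects like Celeste is that the same variational machinery scales to billions of parameters where exact Bayesian inference is intractable.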
Intel® Nervana™ Graph: a scalable intermediate language
Intel Nervana Graph is being developed as a common intermediate representation for popular machine learning packages that is scalable and able to run across a wide variety of hardware: CPUs, GPUs, FPGAs, Neural Network Processors, and more. Jason Knight (CTO office, Intel Nervana) wants people to view Intel Nervana Graph as a form of LLVM (Low Level Virtual Machine). Many people use LLVM without knowing it when they compile their software, as it supports a wide range of language frontends and hardware backends.
Knight writes, “We see the Intel Nervana Graph project as the beginning of an ecosystem of optimization passes, hardware backends and frontend connectors to popular deep learning
frameworks,” as shown in Figure 7 using TensorFlow as an example.
Intel Nervana Graph also supports Caffe models with command line emulation and Python converter. Support for distributed training is currently being added along with support for
multiple hosts so data and HPC scientists can address big, complex training sets even on leadership class supercomputers.
High performance can be achieved across a wide range of hardware devices because optimizations can be performed on the hardware agnostic frontend dataflow graphs when generating runnable code for the hardware specific backend.
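The idea of optimizing a hardware-agnostic dataflow graph before code generation can be sketched in miniature. The tuple-based graph format below is invented for illustration and is not Nervana Graph's actual IR; the pass shown is simple constant folding.

```python
# Hypothetical sketch of frontend graph optimization: constant subexpressions
# are folded before any backend-specific code generation. The ('op', lhs, rhs)
# tuple format is invented for illustration, not Nervana Graph's IR.

def constant_fold(node):
    """Recursively fold constant subexpressions in ('op', lhs, rhs) trees."""
    if not isinstance(node, tuple):
        return node  # leaf: a constant number or a named input
    op, lhs, rhs = node
    lhs, rhs = constant_fold(lhs), constant_fold(rhs)
    if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
        return {"add": lhs + rhs, "mul": lhs * rhs}[op]
    return (op, lhs, rhs)

# (x * (2 * 3)) + (4 + 1) folds to ('add', ('mul', 'x', 6), 5)
graph = ("add", ("mul", "x", ("mul", 2, 3)), ("add", 4, 1))
print(constant_fold(graph))
```

Because the pass operates on the graph rather than on generated code, the same optimization benefits every backend.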
Figure 8 shows how memory usage can be reduced by five to six times. Memory performance is arguably the biggest limitation of AI performance so a 5x to 6x reduction in memory use is significant. Nervana cautions that these are preliminary results and “there are more improvements still to come”.
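One memory optimization of the kind reported above can be sketched as follows: once an intermediate tensor is no longer needed, its buffer is recycled for a later result. The schedule format is invented for illustration.

```python
# Illustrative sketch of buffer reuse: after a tensor's last use, its buffer
# is returned to a free list and recycled. The (output, inputs) schedule
# format is invented for illustration.

def assign_buffers(schedule):
    """schedule: list of (output, inputs) ops in execution order.
    Returns ({tensor: buffer_id}, total buffers), reusing dead buffers."""
    last_use = {}
    for i, (out, ins) in enumerate(schedule):
        for t in ins:
            last_use[t] = i
    free, assignment, next_id = [], {}, 0
    for i, (out, ins) in enumerate(schedule):
        if free:
            assignment[out] = free.pop()   # recycle a dead tensor's buffer
        else:
            assignment[out] = next_id
            next_id += 1
        for t in ins:
            if last_use.get(t) == i and t in assignment:
                free.append(assignment[t])
    return assignment, next_id

# a = f(x); b = g(a); c = h(b)  -- 'a' dies after step 2, so 'c' reuses it
buffers, n = assign_buffers([("a", ["x"]), ("b", ["a"]), ("c", ["b"])])
print(buffers, "buffers used:", n)
```

A naive allocator would use one buffer per tensor (three here); reuse cuts that to two, and the savings grow with graph depth.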
Intel Nervana Graph also leverages the highly optimized Intel MKL-DNN library both through direct calls and pattern matching operations that can detect and generate fused calls to Intel® Math Kernel Library (Intel® MKL) and Intel MKL-DNN even in very complex data graphs. To help even further, Intel has introduced a higher level language called neon™ that is both
powerful in its own right, and can be used as a reference implementation for TensorFlow and other developers of AI frameworks.
Figure 8: Memory optimizations in Intel® Nervana™ Graph
Figure 9 shows performance improvements that can be achieved on the new Intel® Xeon® Scalable Processors using neon.
Figure 9: Intel® Xeon® Scalable processor performance improvements
An equally important challenge in the convergence of HPC and AI is closing the gap between data scientists and AI programming models. This is why incorporating scalable and efficient AI into productivity languages is a requirement. Most data scientists use Python, Julia, R, Java* and others to perform their work.
HPC programmers can be “parallel programming ninjas,” but data scientists mainly use popular frameworks and productivity languages. Dubey observes, “Software must address the challenge of delivering scalable, HPC-like performance for AI applications without the need to train data scientists in low-level parallel programming.”
Unlike a traditional HPC programmer who is well versed in low-level APIs for parallel and distributed programming, such as OpenMP* or MPI, a typical data scientist who trains neural networks on a supercomputer is likely only familiar with high-level scripting-language based frameworks like Caffe or TensorFlow.
However, the hardware and software for these devices is rapidly evolving, so it is important to procure wisely for the future without incurring technology or vendor lock in. This is why the
AI team at Intel is focusing their efforts on having Intel Nervana Graph called from popular machine learning libraries and packages. Productivity languages and packages that support Intel
Nervana Graph will have the ability to support future hardware offerings from Intel ranging from CPUs to FPGAs, custom ASIC offerings, and more.
Section 4: Hardware to Support the AI Software Ecosystem
Balance ratios are key to understanding the plethora of hardware solutions that are being developed or will soon become available. Future-proofing procurements to support run-anywhere solutions—rather than hardware-specific solutions—is key!
The basic idea behind balance ratios is to keep what works and improve on those hardware characteristics when possible. Current hardware solutions tend to be memory bandwidth bound.
Thus, the flop/s per memory bandwidth balance ratio is critical for training. So long as there is sufficient arithmetic capability to support the memory (and cache) bandwidth, the hardware will deliver the best average sustained performance possible. Similarly, the ratio of flop/s to network bandwidth is critical for scaling data preprocessing and training runs. Storage IOP/s (IO operations per second) are critical for performing irregular accesses in storage when working with unstructured data.
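A balance ratio can be turned into a simple roofline-style estimate of attainable performance. The hardware numbers below are hypothetical placeholders, not any specific product's specification.

```python
# Sketch of how a flop/s-per-bandwidth balance ratio informs procurement:
# a roofline-style estimate of attainable performance. The peak and
# bandwidth figures are hypothetical, not any Intel product's specs.

def attainable_gflops(peak_gflops, mem_bw_gb_s, flops_per_byte):
    """Roofline: performance is bound by compute or by memory bandwidth."""
    return min(peak_gflops, mem_bw_gb_s * flops_per_byte)

peak, bw = 3000.0, 400.0        # hypothetical: 3 Tflop/s peak, 400 GB/s
balance = peak / bw             # machine balance in flops per byte
for ai in (1.0, balance, 16.0): # kernel arithmetic intensity, flops/byte
    gf = attainable_gflops(peak, bw, ai)
    print(f"intensity {ai:5.2f} flops/byte -> {gf:7.1f} Gflop/s")
```

Kernels whose arithmetic intensity falls below the machine balance are memory-bandwidth bound, which is why the flop/s per memory bandwidth ratio, rather than peak flop/s alone, predicts sustained training performance.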
The future of AI includes CPUs, accelerators/purpose-built hardware, FPGAs and future neuromorphic chips. For example, Intel’s CEO Brian Krzanich said the company is fully committed to making its silicon the “platform of choice” for AI developers. Thus, Intel efforts include:
- CPUs: including the Intel Xeon Scalable processor family for evolving AI workloads, as well as Intel Xeon Phi processors.
- Special purpose-built silicon for AI training such as the Intel® Neural Network Processor family.
- Intel FPGAs, which can serve as programmable accelerators for inference.
- Neuromorphic chips, such as the Loihi self-learning research chip.
- Intel 17-Qubit Superconducting Chip, Intel’s next step in quantum computing.
Although quantum computing is a nascent technology, it is worth noting because machine learning can potentially be mapped to quantum computers. If this comes to fruition, such a hybrid system could revolutionize the field.
CPUs such as Intel® Xeon® Scalable processors and Intel® Xeon Phi™
The new Intel Xeon Scalable processors deliver an average 1.65x performance increase over previous Intel Xeon processors for a range of HPC workloads, due to a host of microarchitecture improvements that include: support for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) wide vector instructions, up to 28 cores (56 threads) per socket, support for up to eight-socket systems, two additional memory channels, support for DDR4-2666 memory, and more.
The Intel Xeon Phi product family presses the limits of CPU-based power efficiency coupled with many-core parallelism: two per-core Intel AVX-512 vector units that can make full use of the high-bandwidth stacked memory. The 15 PF/s distributed deep learning result utilized the Intel Xeon Phi processor nodes on the NERSC Cori supercomputer.
Intel® Neural Network Processor
Intel acquired Nervana Systems in August 2016. As previously discussed, Intel Nervana Graph will act as a hardware agnostic software intermediate language that can provide hardware specific optimized performance.
On October 17, 2017, Intel announced it will ship the industry’s first silicon for neural network processing, the Intel® Nervana™ Neural Network Processor (NNP), before the end of this year. Intel’s CEO Brian Krzanich stated at the WSJDlive global technology conference, “We have multiple generations of Intel Nervana NNP products in the pipeline that will deliver higher performance and enable new levels of scalability for AI models. This puts us on track to exceed the goal we set last year of achieving 100 times greater AI performance by 2020”.
Intel’s R&D investments include hardware, data algorithms and analytics, acquisitions, and technology advancements. At this time, Intel has invested $1 billion in the AI ecosystem.
Figure 10: Intel’s broad investments in AI
The Intel® Neural Network Processor is an ASIC featuring 32 GB of HBM2 memory with 8 terabits per second of memory access bandwidth. A new architecture will be used to increase the parallelism of the arithmetic operations by an order of magnitude. Thus, the arithmetic capability will balance the performance capability of the stacked memory.
The Intel Neural Network Processor will also be scalable as it will feature 12 bidirectional high-bandwidth links and seamless data transfers. These proprietary inter-chip links will provide bandwidth up to 20 times faster than PCI Express* links.
FPGAs are natural candidates for high speed, low-latency inference operations. Unlike CPUs or GPUs, FPGAs can be programmed to implement just the logic required to perform the inferencing operation and with the minimum necessary arithmetic precision. This is referred to as a persistent neural network.
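The reduced-precision idea behind FPGA inference can be illustrated with a generic 8-bit quantized dot product: weights and activations are mapped to small integers, the accumulation runs in integer arithmetic, and the result is rescaled. This is a generic sketch, not Intel's FPGA toolflow.

```python
# Hedged sketch of reduced-precision inference: symmetric int8 quantization
# of a dot product, the core operation of a neural network layer. Generic
# illustration only; not Intel's FPGA toolflow or numeric format.

def quantize(values, bits=8):
    """Symmetric linear quantization to signed ints; returns (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def int8_dot(w, x):
    wq, ws = quantize(w)
    xq, xs = quantize(x)
    acc = sum(a * b for a, b in zip(wq, xq))      # integer accumulate
    return acc * ws * xs                          # rescale to real units

w, x = [0.5, -1.0, 0.25], [1.0, 2.0, -4.0]
exact = sum(a * b for a, b in zip(w, x))
print(f"exact {exact}, int8 approx {int8_dot(w, x):.3f}")
```

The approximation error is small for inference purposes, and on an FPGA the narrow integer datapaths consume far less logic and power than full-precision floating point.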
Figure 11: From Intel® Caffe to FPGA (source: Intel)
Intel provides a direct path from a machine learning package like Intel® Caffe through Intel MKL-DNN to simplify specification of the inference neural network. For large scale, low-latency FPGA deployments, see Microsoft Azure.
As part of an effort within Intel Labs, Intel has developed a self-learning neuromorphic chip—codenamed Loihi—that draws inspiration from how neurons in biological brains learn to operate based on various modes of feedback from the environment. The self-learning chip uses asynchronous spiking instead of the activation functions used in current machine and deep learning neural networks.
Loihi has digital circuits that mimic the basic mechanics of the brain, Dr. Michael Mayberry, corporate vice president and managing director of Intel Labs, said in a blog post, requiring less compute power while making machine learning more efficient. These chips can help “computers to self-organize and make decisions based on patterns and associations,” Mayberry explained. Thus, neuromorphic chips hold the potential to leapfrog current technologies.
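The spiking idea can be shown with a minimal leaky integrate-and-fire neuron: the membrane potential accumulates input and leaks over time, and the neuron emits a discrete spike when a threshold is crossed. All constants are illustrative assumptions, not Loihi's neuron model.

```python
# Minimal leaky integrate-and-fire sketch of the spiking idea behind
# neuromorphic chips: discrete spikes replace continuous activation
# functions. Constants are illustrative, not Loihi's model.

def lif_neuron(input_current, leak=0.9, threshold=1.0):
    """Simulate one LIF neuron over a list of inputs; return its spike train."""
    v, spikes = 0.0, []
    for i in input_current:
        v = v * leak + i          # leaky integration of input
        if v >= threshold:        # fire and reset on threshold crossing
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return spikes

print(lif_neuron([0.6, 0.6, 0.0, 0.3, 0.9]))
```

Because a neuron only produces events when it fires, computation and communication are sparse and asynchronous, which is the source of the efficiency claims for neuromorphic hardware.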
Figure 12: A Loihi chip manufactured by Intel
Network: Intel® Omni-Path Architecture (Intel® OPA)
Intel® Omni-Path Architecture (Intel® OPA), a building block of Intel® Scalable System Framework (Intel® SSF), is designed to meet the performance, scalability, and cost models
required for both HPC and deep learning systems. Intel OPA delivers high bandwidth, high message rates, low latency, and high reliability for fast communication across multiple nodes
efficiently, reducing time to train.
To reduce cluster-related costs, Intel has launched a series of Intel Xeon Platinum and Intel Xeon Gold processor SKUs with Intel OPA integrated onto the processor package to provide access to this fast, low-latency 100 Gbps fabric. Further, the on-package fabric interface sits on a dedicated internal PCI Express bus, which should provide more IO flexibility. AI applications tend to be network-latency limited due to the many small messages that are communicated during the reduction operation.
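Why many small messages make a workload latency-bound can be seen from a simple alpha-beta communication model: time = latency per message × message count + bytes / bandwidth. The fabric numbers below are rough assumptions for illustration, not Intel OPA specifications.

```python
# Alpha-beta sketch of why small-message traffic is latency-bound:
# time = messages * latency + total_bytes / bandwidth.
# Latency and bandwidth figures are illustrative assumptions.

def transfer_time_us(n_messages, bytes_per_msg, latency_us, bw_bytes_per_us):
    return n_messages * latency_us + n_messages * bytes_per_msg / bw_bytes_per_us

lat, bw = 1.0, 12500.0   # assume ~1 us latency and 12.5 GB/s bandwidth
# Same 1 MB payload, sent as one message vs. 1024 small messages:
one_big = transfer_time_us(1, 1 << 20, lat, bw)
many_small = transfer_time_us(1024, 1 << 10, lat, bw)
print(f"1 x 1 MB: {one_big:.1f} us, 1024 x 1 KB: {many_small:.1f} us")
```

Both cases move the same bytes, yet the fragmented transfer is dominated by per-message latency, which is why high message rates and low latency matter for reduction-heavy training traffic.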
Figure 13: Intel® Scalable Systems Framework
Just as with main memory, storage performance is dictated by throughput and latency. Solid-state storage (coupled with distributed file systems such as Lustre) is one of the biggest developments in unstructured data analysis.
Instead of being able to perform a few hundred IOP/s, an SSD device can perform over half a million random IO operations per second. This makes managing big data feasible as discussed
in ‘Run Anywhere’ Enterprise Analytics and HPC Converge at TACC where researchers are literally working with an exabyte of unstructured data on the TACC Wrangler data intensive supercomputer. Wrangler provides a large-scale storage tier for analytics that delivers a bandwidth of 1TB/s and 250M IOP/s.
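A back-of-envelope calculation shows why the jump in IOP/s matters for irregular accesses. The rates below are rough illustrative figures in line with the text (a few hundred IOP/s for spinning disk versus over half a million for an SSD), not measurements of any particular device.

```python
# Back-of-envelope sketch of why IOP/s matters for unstructured data:
# time to service many small random reads at disk-class vs. SSD-class
# rates. The rates are rough illustrative figures, not measurements.

def seconds_for_random_reads(n_reads, iops):
    return n_reads / iops

n = 10_000_000                     # ten million random accesses
hdd_hours = seconds_for_random_reads(n, 200) / 3600
ssd_seconds = seconds_for_random_reads(n, 500_000)
print(f"disk @ 200 IOP/s: {hdd_hours:.0f} h; SSD @ 500K IOP/s: {ssd_seconds:.0f} s")
```

Turning a half-day of seek-bound waiting into seconds is what makes irregular access into large unstructured data sets feasible.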
Succinctly, “data is king” in the AI world, which means that many data scientists—especially those in the medical and enterprise fields—need to run locally to ensure data security. When required, the software should support a “run anywhere capability” that allows users to burst into the cloud when additional resources are required.
Training is a “big data,” computationally intense operation. Hence the focus on the development of software packages and libraries that can run anywhere, as well as intermediate representations of AI workloads that can create highly optimized, hardware-specific executables. This, coupled with new AI-specific hardware such as the Intel Neural Network Processor, promises performance gains beyond the capabilities of both CPUs and GPUs. Even with greatly accelerated roadmaps, the demand for and capabilities of AI solutions are strong enough that it still behooves data scientists to procure high-performance hardware now—rather than waiting for the ultimate hardware solutions. This is one of the motivations behind Intel Xeon Scalable processors.
Lessons learned: don’t forget to include data preprocessing in the procurement decision, as this portion of the AI workload can consume significant human time, computer memory, and both storage performance and storage capacity. Productivity languages such as Julia and Python that can create hundreds-of-terabyte training sets are important for leveraging both hardware and data scientist expertise. Meanwhile, FPGA solutions and ‘persistent neural networks’ can perform inferencing with lower latencies than either CPUs or GPUs, and support even massive cloud-based inferencing workloads, as demonstrated by Microsoft.
For more information about Intel technologies and solutions for HPC and AI, visit intel.com/HPC.
To learn more about Intel AI solutions, go to intel.com/AI.
Rob Farber is a global technology consultant and author with an extensive background in HPC and in developing machine learning technology that he applies at national labs and commercial organizations. Rob can be reached at email@example.com