Magazine Article | July 2, 2018

How To Capitalize On GPU Computing

By Craig Swanson, President, Applied Data Systems

As machine learning, analytics, and AI expand, channel partners can get in early and leverage a hidden opportunity to tune storage systems for the desired ROI – and competitive advantage.

Graphics processing unit (GPU) computing is hot – driven by the demand for high-performance computing resources to support high-growth applications including machine learning, advanced analytics, genomics, and artificial intelligence (AI). New GPU processors are delivering four to 40 times the performance of traditional CPUs – while major chip vendors are rapidly debuting GPUs of their own. As an example, an entry-level GPU provides 32 times the processing performance of a standard CPU – with more advanced GPUs providing 40 times the performance of a CPU and additional features for specific uses such as inference capability and enhanced energy efficiency.

Since switching to a GPU server affects the networking and storage resources around it, enterprises need a trusted partner that can identify and eliminate bottlenecks to provide a competitive advantage. Channel players are exactly those trusted experts, able to optimize performance for each customer’s unique environment. Midtier partners in particular have an excellent opportunity to ride the adoption wave, since target enterprises are often small to midsize spinouts from larger research labs or VC-backed startups offering a better way via technology.

With AI spending in 2017 growing 54 percent over the previous year (IDC) and machine learning projected to grow 44 percent by 2020 (Markets and Markets), the channel can ride the wave of a genuinely transformational deployment category — with the right preparation.

WHY THE CHANNEL?

GPU chips were originally used for 3D game rendering. Today, enterprise-grade GPUs are being harnessed more broadly to accelerate computational workloads. They’re optimized for taking huge batches of data and performing the same operation over and over very quickly, much more efficiently than CPUs. When a server uses GPUs, it can deliver breakthrough performance that can decrease the number of servers to be purchased — meaning teams can achieve faster time to insights, often at dramatically lower costs.

This veritable firehose of processing performance is perfect for a channel sale for several reasons:

  • Total systems focus is essential. Anybody can build and ship a server or sell high-performance computing (HPC) infrastructure components. However, harnessing that raw power requires balancing CPU and GPU servers, the DRAM in each, networking, primary and secondary storage, and the software elements that let users actually experience the performance.
  • Internal IT resources are often lean, and many system integrators find that an end-user manager or a business unit lead is the key decision maker on a highly technical solution. It takes a channel partner with a focus on GPU computing, storage, and networking to properly deliver a total solution that takes into account the nuances of GPU computing.
  • Incumbent vendors, even those with professional services units, can suffer from the hammer-and-nail syndrome, as in “I sell hammers, and everything looks like a nail to me.” Channel partners, in contrast, specialize in turnkey approaches to solving business problems in a timely manner and have cross-vendor integration specialties that are essential in going beyond delivering the promised performance, to achieving the underlying business goals that drove the technology purchase in the first place.

THE GOTCHA THAT GETS OVERLOOKED

The market tends to place so much emphasis on GPUs’ raw performance that it forgets about the other essentials: Customers need to store all the data they generate from computation and access it rapidly enough to run these advanced applications. High-performance computing (HPC) systems like these typically involve petabytes of data. Beyond solving for capacity, you must also ensure that storage performance keeps up with the requirements of the GPUs.

Why? Again, the “drinking from the firehose” metaphor applies. System architects strive to keep those GPUs busy 100 percent of the time, yet the storage often can’t keep up with the processing speed. Bottlenecks created by storage systems quickly start lengthening time to completion. The larger the data sets grow, the more likely GPUs are to sit idle waiting for storage reads and writes, which compounds the problem caused by poor storage performance.
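To make the idle-GPU problem concrete, here is a back-of-the-envelope sketch in Python. It assumes storage bandwidth is the only limiting factor, and every name and number in it (the function `gpu_utilization`, eight GPUs, 2 GB/s per GPU, 4 GB/s of storage) is hypothetical and chosen purely for illustration; real workloads would need to be measured.

```python
def gpu_utilization(storage_gb_per_s: float,
                    gpu_count: int,
                    per_gpu_demand_gb_per_s: float) -> float:
    """Rough upper bound on the fraction of time GPUs stay busy.

    Simplifying assumption: storage throughput is the only bottleneck,
    and every GPU needs data at the same steady rate.
    """
    total_demand = gpu_count * per_gpu_demand_gb_per_s
    if total_demand <= 0:
        raise ValueError("GPU count and per-GPU demand must be positive")
    # GPUs cannot be more than 100 percent busy, so cap the ratio at 1.0.
    return min(1.0, storage_gb_per_s / total_demand)


# Hypothetical cluster: 8 GPUs that each want 2 GB/s of input data.
slow = gpu_utilization(storage_gb_per_s=4.0, gpu_count=8,
                       per_gpu_demand_gb_per_s=2.0)
fast = gpu_utilization(storage_gb_per_s=20.0, gpu_count=8,
                       per_gpu_demand_gb_per_s=2.0)

print(f"4 GB/s storage:  GPUs busy at most {slow:.0%} of the time")
print(f"20 GB/s storage: GPUs busy at most {fast:.0%} of the time")
```

In this made-up case, storage that delivers a quarter of what the GPUs can consume leaves them idle three-quarters of the time, no matter how fast the processors are.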

NVME FLASH AND EXISTING SPINNING-DISK STORAGE

The storage industry has witnessed an explosion of change, key among the changes being the growing use of solid state disks (SSDs), whose lack of moving parts enables a superior reliability record and superior performance. In addition, the emergence of NVMe (non-volatile memory express) SSDs, a technology that accelerates performance and reduces latency, is key to solving the data storage issues in a GPU computing environment. NVMe delivers better streaming performance and 40 times better latency compared to hard disk drives (HDDs). Most major storage vendors have embraced NVMe, and a number of startups have arisen that enable distributed NVMe, further boosting system efficiencies.

In an increasing number of HPC scenarios, architectures are using GPU-based systems that require higher input/output operations per second (IOPS) as well as greater bandwidth. All-flash arrays (AFAs) are often considered as a solution; however, they do not solve all of the storage challenges. While they help address the performance issues, they are not cost-effective at scale. With increasingly large data sets, a hybrid approach is required to deliver performance at a reasonable price point. This involves multiple tiers of storage under a single file system, with intelligent tiering to ensure hot data is stored on the fastest tier when needed. Inactive data is then stored on low-cost spinning disk until it is called for GPU processing.
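The tiering idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any particular file system or vendor product decides placement: the `DataSet` type, the `assign_tiers` function, the seven-day "hot" window, and the capacities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    name: str
    size_tb: float
    days_since_access: int


def assign_tiers(datasets, flash_capacity_tb, hot_window_days=7):
    """Place recently used data on flash while it fits; the rest goes to disk."""
    placement = {}
    free_flash_tb = flash_capacity_tb
    # Consider the most recently accessed data first so it gets flash first.
    for ds in sorted(datasets, key=lambda d: d.days_since_access):
        is_hot = ds.days_since_access <= hot_window_days
        if is_hot and ds.size_tb <= free_flash_tb:
            placement[ds.name] = "flash"
            free_flash_tb -= ds.size_tb
        else:
            placement[ds.name] = "disk"
    return placement


# Hypothetical data sets and a hypothetical 60 TB flash tier.
datasets = [
    DataSet("training-current", size_tb=40, days_since_access=1),
    DataSet("training-last-quarter", size_tb=120, days_since_access=90),
    DataSet("validation", size_tb=10, days_since_access=3),
]

for name, tier in assign_tiers(datasets, flash_capacity_tb=60).items():
    print(f"{name}: {tier}")
```

Here the two recently used data sets land on flash and the 120 TB archive stays on spinning disk, which is the cost argument in miniature: only a fraction of total capacity needs to be bought at flash prices.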

There is a sizable role for the channel to architect and deliver custom GPU-based systems depending on the customer’s specific use. AI, machine learning, and other demanding analytics applications all have unique requirements, and an experienced and trusted partner is crucial to success. A tiered approach of this kind avoids performance bottlenecks and enables almost 100 percent GPU utilization. This is made possible with the Tier 0 storage layer, where Excelero NVMesh software aggregates distributed NVMe storage to deliver near-local NVMe drive performance in a centralized storage system. The design accommodates all stages of the machine learning process, from data collection to GPU processing and finally persistent storage of data.

HARNESSING THE REVOLUTION

As more companies deploy AI, machine learning, and advanced analytics, the volumes of data being stored and orchestrated are growing exponentially. How can channel partners leverage the emerging GPU computing opportunity?

  1. By providing expertise to deliver balanced systems of processors, servers, networking, and primary and secondary storage — systems that have been designed without bottlenecks so they smoothly orchestrate all of the data flowing in and out of these HPC/AI clusters.
  2. By sharing best practices in key industries where GPU computing is taking off — practices that involve a deep understanding of the characteristics of both on-premise and cloud/XaaS solutions, how to make them perform optimally, and how to integrate them into existing technology.
  3. By empowering the customer to focus on what they know best — achieving scientific, medical, engineering, and other breakthroughs — and ensuring their GPU computing infrastructure is not another science experiment.

CRAIG SWANSON, president at Applied Data Systems, has spent over 17 years helping build supercomputers and storage systems for major researchers and enterprises.