The Challenge
Exxact Corporation benchmarks multiple High Performance Computing (HPC) software applications to characterize how specific applications will perform on its systems, particularly systems that include multiple NVIDIA Tesla Graphics Processing Units (GPUs). Recently, Exxact engineers have been characterizing the performance of life-science applications such as RELION, GROMACS, NAMD and Amber, all of which are molecular dynamics simulation applications that model biochemical processes for life science research. These applications run best when leveraging NVIDIA’s CUDA-enabled GPUs. In most cases, application processing is divided across multiple NVIDIA Tesla GPUs in a single system. The test system originally had a single Samsung solid state drive (SSD), but test results showed that scaling from one to two GPUs and from four to eight GPUs was nowhere near linear. Furthermore, performance with four or eight GPUs fell far short of four or eight times single-GPU performance. In particular, gains achieved when scaling from four to eight GPUs were incremental at best.
The Solution
After a fair amount of investigative work, Exxact engineers found that performance was being limited by the storage subsystem. They decided to add multiple Samsung 860 EVO SSDs in a RAID-0 configuration to improve overall performance. RAID-0 stripes data across multiple drives in a storage system, spreading file system read and write operations across multiple devices, which reduces latency and increases throughput.
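The striping idea can be sketched in a few lines of Python. This is only an illustration of round-robin striping in general, not Exxact's actual configuration: the 64 KiB stripe size and the function names `locate_stripe` and `drives_touched` are assumptions made for the example.

```python
# Illustrative RAID-0 address mapping. The 64 KiB stripe size is an assumed
# value for this sketch; the article does not state the size that was used.
STRIPE_SIZE = 64 * 1024


def locate_stripe(logical_offset, num_drives, stripe_size=STRIPE_SIZE):
    """Map a logical byte offset to (drive index, byte offset on that drive)."""
    if logical_offset < 0 or num_drives < 1 or stripe_size < 1:
        raise ValueError("offset must be >= 0; drives and stripe size must be >= 1")
    stripe_index = logical_offset // stripe_size  # which stripe unit overall
    drive = stripe_index % num_drives             # units rotate across the drives
    row = stripe_index // num_drives              # how many units precede it on that drive
    return drive, row * stripe_size + logical_offset % stripe_size


def drives_touched(start, length, num_drives, stripe_size=STRIPE_SIZE):
    """Return the sorted drive indexes that one contiguous request touches."""
    if length < 1:
        raise ValueError("length must be >= 1")
    first = start // stripe_size
    last = (start + length - 1) // stripe_size
    return sorted({unit % num_drives for unit in range(first, last + 1)})
```

With three drives, a request that spans three stripe units lands on all three devices and can be serviced in parallel, while a request smaller than one stripe unit still goes to a single drive and gains little from striping.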
Using Samsung 860 EVO SSDs in a RAID-0 configuration removed the storage bottleneck that was preventing the clustered application from scaling beyond four GPUs.
The Results
The new configuration removed the file system I/O bottleneck and enabled the system to get more from the additional GPUs, enabling greater performance gains in the eight-GPU configuration. Scaling moved closer to linear: where the single-drive system showed almost no gain from four to eight GPUs, the RAID-0 configuration delivered about 30 percent more performance with eight GPUs than with four.
Exxact Corporation is a value-added reseller that creates custom high-performance computing systems, Big Data systems, Cloud and Audio-Visual systems for labs, research universities, online retailers and Fortune 100 and Fortune 1000 companies. Among other things, Exxact specializes in servers that support up to ten NVIDIA Tesla GPUs in a single system.
The Challenge
Overcoming Storage Bottlenecks in Complex Simulations
Exxact sells custom systems to labs and universities doing research into life sciences, real-time modeling of biological processes, deep learning, Big Data and more. Being able to sell these systems successfully requires expertise in setting up and optimizing both the hardware systems and software used.
Paul Del Vecchio, a Sr. Sales Engineer, was given the job of creating a demo system that could run various molecular dynamics applications utilizing CUDA-enabled GPUs.
CUDA software is used to run simulations of biological processes across multiple NVIDIA Tesla GPUs in a single system. Specialized motherboards and PCIe bus expansion systems allow for eight or more x16 PCIe slots in a single system. Since communications between the GPUs run over the PCIe bus, some of the usual challenges of clustered systems involving the network that connects the nodes are eliminated.
RELION, GROMACS, NAMD and Amber are all applications that simulate different biological and chemical processes. These simulations are so complex that one of the standard measurements is days per nanosecond (days/ns), which measures how many days it takes to simulate one billionth of a second of a biological system in operation. The results reported here use its reciprocal, nanoseconds per day (ns/day), where higher is better.
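Because the metric can be quoted either way round, a short conversion helps when comparing figures. The sketch below is only arithmetic; the function names and the 100 ns/day rate are illustrative values chosen for the example, not measurements from Exxact's tests.

```python
def days_per_ns(ns_per_day):
    """Convert a throughput in ns/day to a cost in days/ns."""
    if ns_per_day <= 0:
        raise ValueError("ns_per_day must be positive")
    return 1.0 / ns_per_day


def days_to_simulate(target_ns, ns_per_day):
    """Wall-clock days needed to simulate target_ns nanoseconds."""
    if ns_per_day <= 0:
        raise ValueError("ns_per_day must be positive")
    return target_ns / ns_per_day


# Illustrative rate only: at 100 ns/day, one nanosecond costs 0.01 days,
# and one microsecond (1,000 ns) of simulated time takes 10 days.
example_cost = days_per_ns(100)
example_duration = days_to_simulate(1000, 100)
```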
Isolating bottlenecks is an ongoing process; once one bottleneck is found and ameliorated, a new bottleneck is usually discovered. Once that choke point is resolved, the next lowest-performing component becomes the bottleneck. Particularly with complex systems like HPC software, optimizing performance becomes a lengthy process, and often must be tailored to not only the clustered operating system, but the specific application that runs on top of it.
The Solution
Additional Samsung SSDs With Simultaneous Write Configuration
Del Vecchio recognized the need for a complete and thorough characterization of the RELION application, including profiling with other tools and methods to determine exactly where time is spent within the RELION code.
“During the initial phase of testing it was obvious the application would not scale past four GPUs, regardless of what parameters were used at runtime,” Del Vecchio said. “This suggested there was some I/O bottleneck present that was keeping RELION from obtaining better performance with the eight GPU configuration.”
To test this, he added two more 500GB SSDs to the system and configured all three drives as a RAID-0 stripe set, so that data is written simultaneously to all three drives. In the ideal case, this improves storage throughput by up to three times compared to a single drive.
Upon rerunning the benchmark, he was able to achieve a modest performance gain when moving from four to eight GPUs, indicating that the storage bottleneck had been resolved with the additional drives.
While the application does scale up with the addition of GPU compute resources, the gains still aren’t linear; the ideal case would be to double the performance with eight GPUs instead of four. Reconfiguring the disk I/O subsystem did enable scaling up to eight GPUs, but the performance improvement was only incremental even though the GPU compute resources were essentially doubled at each iteration. Fortunately, further work with additional drives or NVMe SSDs may yield greater gains when moving from four to eight GPUs.
The Technology
860 EVO 500GB
The Samsung 860 EVO has sequential write speeds up to 520 MB/s with TurboWrite technology and sequential read speeds up to 540 MB/s with capacities ranging up to 4TB.
970 EVO
The Samsung 970 EVO has capacities up to 2TB, offers Intelligent TurboWrite technology, and has sequential read/write speeds of up to 3,500/2,500 MB/s.
The Results
Scalable Performance Enhances GPU — and User — Productivity
Using five Samsung 860 EVO SSDs in a RAID-0 configuration improved the scalability of the application, as shown in the results below:
- 1 GPU: 131 ns/day
- 2 GPUs: 236 ns/day
- 4 GPUs: 294 ns/day
- 8 GPUs: 384 ns/day
Moving from one to three to five drives enabled the system to make better use of the Tesla cards, and the increase in performance came much closer to a linear increase at each step. Initial testing with one drive showed almost no increase in performance going from four to eight GPUs; here, the performance increase is about 30 percent. These measurements were taken with the latest release of GROMACS using the NVIDIA Tesla GPUs. While performance does not double, the increase is much larger than with the initial one-drive and three-drive systems.
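The scaling behind these figures can be checked with a few lines of Python. The ns/day values are the ones listed above; the helper names and the definition of efficiency as speedup divided by GPU count are choices made for this sketch.

```python
# ns/day results for the five-drive RAID-0 configuration, from the list above.
RESULTS = {1: 131, 2: 236, 4: 294, 8: 384}


def scaling_table(results):
    """Return (gpus, speedup vs. fewest GPUs, parallel efficiency) rows."""
    base_gpus = min(results)
    base_rate = results[base_gpus]
    rows = []
    for gpus in sorted(results):
        speedup = results[gpus] / base_rate
        efficiency = speedup / (gpus / base_gpus)  # 1.0 would be perfectly linear
        rows.append((gpus, speedup, efficiency))
    return rows


def step_gain(results, fewer, more):
    """Fractional gain when moving from `fewer` to `more` GPUs."""
    return results[more] / results[fewer] - 1.0


for gpus, speedup, efficiency in scaling_table(RESULTS):
    print(f"{gpus} GPU(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
print(f"4 -> 8 GPUs: {step_gain(RESULTS, 4, 8):.1%} gain")
```

Relative to one GPU, the speedups work out to roughly 1.8x, 2.2x and 2.9x for two, four and eight GPUs, and the four-to-eight step gains about 30 percent, matching the figure quoted above.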
“Running multiple RELION instances on the same node with multiple GPUs by assigning an instance to a specific GPU through use of the gpu_id command line parameter, a user can process multiple RELION 3D Classification jobs in parallel and assign those jobs to dedicated GPU resources that will greatly increase the amount of work that can be done and thus, fully utilize their compute resource,” said Del Vecchio.
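One way to sketch this job-per-GPU pattern is shown below. It does not use the parameter Del Vecchio mentions, since the article does not give its exact syntax. Instead it swaps in a different mechanism, the `CUDA_VISIBLE_DEVICES` environment variable, which restricts a process to the listed GPU. The commands passed in are placeholders, and `plan_jobs` and `launch` are hypothetical helper names.

```python
import os
import subprocess


def plan_jobs(commands, gpu_ids):
    """Pair each command with an environment exposing one GPU, round-robin."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    plans = []
    for index, command in enumerate(commands):
        env = dict(os.environ)
        # CUDA applications started with this environment see only this GPU.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_ids[index % len(gpu_ids)])
        plans.append((list(command), env))
    return plans


def launch(plans):
    """Start every planned job at once, then wait and return the exit codes."""
    processes = [subprocess.Popen(command, env=env) for command, env in plans]
    return [process.wait() for process in processes]


# Usage (placeholder command lines; substitute the real job invocations):
# exit_codes = launch(plan_jobs([["job_a"], ["job_b"]], gpu_ids=[0, 1]))
```

Launching all jobs before waiting on any of them is what lets the instances run in parallel, each on its own dedicated GPU.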
Further testing will be done with newer Samsung SSDs such as the 970 PRO and 970 EVO NVMe drives to see if the higher performance NVMe drive can substitute for the five SATA SSDs running in RAID-0. Such a substitution would simplify system design and reduce overall cost, while the NVMe drive would provide throughput and latency improvements over the RAID-0 setup. The single NVMe SSD would not only cost less than five SATA SSDs, but would also produce less heat and draw less power.
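A rough paper comparison of the two options can be made from the sequential speeds quoted in The Technology section. The sketch assumes ideal RAID-0 scaling, where throughput is the per-drive figure multiplied by the number of drives. Real arrays lose some of this to controller and file system overhead, so these are upper bounds, not predictions.

```python
# Sequential speeds in MB/s, as quoted in The Technology section.
SATA_860_EVO = {"read": 540, "write": 520}
NVME_970_EVO = {"read": 3500, "write": 2500}


def raid0_ideal(drive, count):
    """Ideal-case RAID-0 throughput: per-drive speed times the drive count."""
    if count < 1:
        raise ValueError("count must be >= 1")
    return {operation: speed * count for operation, speed in drive.items()}


def compare(array, single):
    """Return single-drive throughput as a fraction of the array's, per operation."""
    return {operation: single[operation] / array[operation] for operation in array}


five_sata = raid0_ideal(SATA_860_EVO, 5)    # 2700 MB/s read, 2600 MB/s write
ratios = compare(five_sata, NVME_970_EVO)   # above 1.0 means the NVMe drive is faster
```

On these paper figures, a single 970 EVO exceeds the ideal five-drive array on sequential reads and comes in slightly below it on sequential writes, which is one reason the planned testing is needed to settle the question.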