What is big data?
Big data isn’t defined simply by the amount of data in a system, but by a group of qualifying characteristics. These include a variety of structured and unstructured data: both database data and individual files, like Word documents or PDF files. Amounts of data can vary from hundreds of gigabytes to hundreds of terabytes, with really large systems reaching hundreds of petabytes. Big data can include real-time data, transactional databases, batch data from ERP and CRM systems, and data from processes and streams. The data should also be evaluated for trustworthiness, authenticity, origin, reputation and value.
Once you’ve collected data from many sources, you’ll need to run analysis, not only to verify data and weed out duplicates or data that doesn’t match your needs, but also to collect the good data and search through it for correlations, discover or predict trends, collect statistics and more. Typical big data architectures consist of anywhere from dozens to hundreds of nodes, with data spread across multiple nodes as needed.
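As a rough illustration of the cleanup step described above, the following minimal sketch drops records that are missing required values and then removes duplicates. The function name, the record layout and the sample values are all hypothetical, and a real pipeline would run this kind of logic in a distributed framework rather than in a single Python process.

```python
def clean_records(records, required_fields):
    """Drop records missing required fields, then drop duplicates
    (two records count as duplicates if their required fields match)."""
    seen = set()
    cleaned = []
    for record in records:
        # Skip records that lack any value the analysis needs.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # The required field values act as a hashable identity for the record.
        key = tuple(record[field] for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical sample data, for illustration only.
sample = [
    {"id": 1, "source": "crm", "amount": 20.0},
    {"id": 1, "source": "crm", "amount": 20.0},  # duplicate of the first record
    {"id": 2, "source": "erp", "amount": None},  # missing a required value
    {"id": 3, "source": "erp", "amount": 12.5},
]
print(clean_records(sample, ["id", "source", "amount"]))
```

Only the records with id 1 and id 3 survive; the duplicate and the incomplete record are filtered out before any analysis runs.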
Where Are SSDs Most Useful in Big Data Architecture?
While it might be tempting to replace all the hard disk drives (HDDs) in a big data system with solid state drives (SSDs), this would usually be prohibitively expensive, and the result would not necessarily be much faster than putting SSDs only where they’ll do the most good. Typically, both data files and the processes that access the data are spread across multiple nodes.
Some processes are relatively IO intensive, while others are compute intensive. IO-intensive operations can be accelerated substantially by using SSDs. For compute-intensive operations, if the data fits in memory, SSDs will typically not produce a large benefit. However, if there is more data to process than fits in memory, swapping that data between memory and hard disk causes the CPUs to spend large amounts of time waiting for the next piece of data. In that case, the higher speeds of PCIe-based non-volatile memory express (NVMe) SSDs such as the Samsung 950 PRO can keep computation from becoming disk bound.
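The reasoning above can be turned into a back-of-the-envelope estimate. The sketch below is a deliberately simple model, not a benchmark: it assumes that data beyond available RAM is read from disk exactly once, that disk reads and computation do not overlap, and that the throughput figures are illustrative placeholders rather than measured drive specifications. The function and parameter names are hypothetical.

```python
def io_wait_fraction(data_gb, ram_gb, disk_mb_per_s, compute_s):
    """Estimate the share of job time spent waiting on disk.

    Simplifying assumptions: only the data that does not fit in RAM
    is read from disk, it is read once, and IO does not overlap compute.
    """
    spill_gb = max(0.0, data_gb - ram_gb)      # data that cannot stay in memory
    io_s = spill_gb * 1024 / disk_mb_per_s     # seconds spent reading the spill
    total_s = compute_s + io_s
    return io_s / total_s if total_s else 0.0


# Illustrative placeholder throughputs (MB/s), not vendor specifications.
slow_disk = io_wait_fraction(data_gb=500, ram_gb=100, disk_mb_per_s=200, compute_s=1000)
fast_disk = io_wait_fraction(data_gb=500, ram_gb=100, disk_mb_per_s=2000, compute_s=1000)
fits_in_ram = io_wait_fraction(data_gb=100, ram_gb=128, disk_mb_per_s=200, compute_s=1000)

print(f"slower disk: {slow_disk:.0%} of time waiting on IO")
print(f"faster disk: {fast_disk:.0%} of time waiting on IO")
print(f"fits in RAM: {fits_in_ram:.0%} of time waiting on IO")
```

Under these assumptions, a tenfold increase in disk throughput moves the job from mostly waiting on disk to mostly computing, while a job whose data fits in memory sees no change at all, which matches the point made in the text.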
How to Improve Big Data Performance
There is no single best practice for improving performance with big data. With Hadoop, for example, there are several types of operations. One involves small jobs (under 100 GB) that can complete on a single small node. Another type of operation (such as Bayesian classification and machine learning) involves chained Hadoop jobs, which will run faster on larger nodes with a terabyte or more of RAM and lots of CPU cores. A third type of operation involves large processes that are split up and run on many nodes at once.
Each type of operation can benefit from SSDs of different types, capacities and speeds. For compute-intensive tasks on large nodes, NVMe SSDs such as the Samsung 950 PRO bring the speed necessary to keep storage from slowing the processing. For small processes on a small number of nodes, the higher capacity of SATA or SAS SSDs can be more effective. For large processes split across large numbers of nodes, transfers between nodes tend to be limited by network speeds rather than disk IO, so a hybrid approach of a few SSDs and more HDDs, or hybrid drives with both SSD and HDD in the same form factor, may be the most efficient and cost-effective option.
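The three recommendations above can be restated as a simple lookup. This is only a sketch of the article's rules of thumb: the enum members, the dictionary and the function name are hypothetical, and a real sizing decision would also weigh cost, capacity and measured workload behavior.

```python
from enum import Enum


class Workload(Enum):
    """Hypothetical labels for the workload types discussed in the text."""
    SMALL_JOB_FEW_NODES = "small job on one or a few small nodes"
    COMPUTE_HEAVY_LARGE_NODE = "compute-intensive job on a large node"
    SPREAD_ACROSS_MANY_NODES = "job spread across many nodes"


# Rules of thumb taken from the paragraph above, not a sizing guide.
SUGGESTED_STORAGE = {
    Workload.COMPUTE_HEAVY_LARGE_NODE: "NVMe SSD",
    Workload.SMALL_JOB_FEW_NODES: "SATA or SAS SSD",
    Workload.SPREAD_ACROSS_MANY_NODES: "mix of SSDs and HDDs, or hybrid drives",
}


def suggest_storage(workload: Workload) -> str:
    """Return the storage type the text suggests for a workload."""
    return SUGGESTED_STORAGE[workload]


for workload in Workload:
    print(f"{workload.value}: {suggest_storage(workload)}")
```

Keeping the rules in a table rather than in branching logic makes it easy to adjust a recommendation once real measurements from a cluster are available.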
Identifying Pain Points
Big data systems that use virtualization are more flexible in the types of jobs they can run, but can also be more affected by disk IO, since multiple virtual machines may be trying to access the same data at the same time. Reducing contention for resources with really fast SSDs can help reduce performance degradation. Some systems go even further than SSDs with non-volatile RAM, which has the potential to provide nodes with terabytes of memory at near-RAM speeds, allowing more data to be processed without having to swap to disk.
Using the right kind of SSD in the right place in a big data system can produce big results: improvements in overall performance of 70 percent are not unusual when replacing HDDs with SSDs. Removing bottlenecks in compute-intensive loads with NVMe or other fast SSD technologies could result in even higher gains.