Apache Spark: the Next Logical Step in Big Data Analytics
Some folks are never satisfied. No matter how quickly Hadoop processes their Big Data stores, they could use faster results, easier coding and sophisticated analytics. Is that you? If so, no worries! There’s a framework designed for developers who want more advanced analytics with less time and effort.
It’s Apache™ Spark.
More of What You Want with Less Time and Effort
Better analytics with fewer resources—that’s a big claim, but Spark backs it up with faster speeds and advanced processing capabilities.
But what is Spark? It:
- Is an open-source cluster computing framework.
- Is written in the Scala programming language.
- Runs in the Java Virtual Machine (JVM) environment.
- Uses the Resilient Distributed Dataset (RDD) application programming interface.
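The RDD idea is easy to see in miniature: transformations such as map and filter are recorded lazily, and only an action such as collect triggers computation. The sketch below imitates that behavior in plain Python with no Spark installation required; the MiniRDD class and its methods are invented for illustration and are not Spark’s actual API:

```python
# Conceptual sketch of the RDD model in plain Python (not Spark's API):
# transformations (map, filter) are recorded lazily; an action (collect)
# triggers evaluation. Real RDDs also partition data across a cluster.
class MiniRDD:
    def __init__(self, data, ops=None):
        self._data = data            # the underlying dataset (treated as immutable)
        self._ops = ops or []        # lazily recorded transformations

    def map(self, fn):
        # Returns a NEW MiniRDD; nothing is computed yet (laziness).
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return MiniRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # Action: replay the recorded transformations over the data.
        out = list(self._data)
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

numbers = MiniRDD(range(10))
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```

Because nothing runs until an action is called, Spark can plan and pipeline a whole chain of transformations at once, which is part of what makes it fast.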
Spark was designed to improve on Big Data analytics capabilities available in other software. These improvements include:
- Iterative algorithms used for machine learning.
- Interactive data mining and processing capabilities.
- Faster processing speeds: queries run up to 100 times faster than on Apache Hive, enabled by an in-memory, fully Hive-compatible data warehousing system.
- Streaming data analysis, including log processing and fraud detection on live streams, with alerts, aggregates and analysis.
- Sensor data processing, in which data is fetched and joined from many sources.
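The payoff for iterative machine learning algorithms comes from keeping the working dataset in memory across passes, which is what Spark’s caching enables. As a rough illustration, here is a tiny gradient descent loop in plain Python, where an in-memory list stands in for a cached dataset; the data points and learning rate are invented for the example:

```python
# Hedged sketch: why in-memory data matters for iterative ML algorithms.
# In Spark, caching keeps a dataset in memory so every iteration of an
# algorithm such as gradient descent re-reads RAM instead of disk. Here the
# "cached" dataset is simply a Python list reused across iterations.
data = [(x, 2.0 * x) for x in range(1, 6)]  # points on the line y = 2x

w = 0.0                       # model weight to learn
lr = 0.01                     # learning rate
for _ in range(200):          # each iteration scans the cached dataset
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```

Each of the 200 iterations rescans the same data; re-loading it from disk every pass is exactly the overhead Spark’s in-memory model avoids.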
But what’s under the hood? A powerful API and plenty of libraries.
An API that Does the Heavy Lifting and Libraries Galore
Many companies have adopted the data lake concept, in which they gather all enterprise data from disparate sources into one data management structure. Processing this data with Spark is the next logical step in Big Data analytics, because Spark’s in-memory computation makes that processing fast. As part of the data lake concept, several companies have made significant commitments to HDFS, HBase, the ORC file format and other storage abstractions in the Hadoop ecosystem.
Spark has first-class support for external data sources, and it can run directly on the cluster under YARN, which is where companies want to perform their data analysis.
Efficient data movement. With the DataSource API, Spark provides a powerful way to move data from external sources and to push processing such as filtering (predicate pushdown) down to those systems. You can use the DataSource API to bring data into Spark very efficiently. The value of a data lake is that you can bring more data under one roof, which opens new opportunities for insight and efficiency; Spark helps you tap into that value.
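Predicate pushdown is the key trick: the filter runs inside the external system, so only matching rows ever move into Spark. The plain-Python sketch below uses sqlite3 as a stand-in for the external source; the table name and rows are invented for illustration, and no Spark API appears here:

```python
# Hedged sketch of predicate pushdown, the idea behind Spark's DataSource
# API: instead of loading an entire external table and filtering it in
# Spark, the predicate is pushed to the source so only matching rows move.
# sqlite3 stands in for the external system; the data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 250.0), ("carol", 999.0)])

# Pushdown: the WHERE clause executes inside the source, so the caller
# receives only the two qualifying rows rather than the whole table.
rows = conn.execute("SELECT user, amount FROM events WHERE amount > 100").fetchall()
print(rows)  # [('bob', 250.0), ('carol', 999.0)]
```

The less data crosses the wire, the less Spark has to scan, which is why pushdown matters so much at data-lake scale.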
Spark is fast becoming a cornerstone of many enterprises’ data platform strategy. As such, it needs to meet the familiar enterprise DevOps, security, stability, installation and upgrade requirements. Hortonworks has already delivered an Ambari stack definition to provide an easy install and upgrade experience for Spark on HDP.
Versatile, built-in libraries. Spark also includes these libraries which are part of the Spark ecosystem:
- Spark Streaming uses the Spark core’s fast-scheduling capability to perform streaming analytics.
- Spark SQL exposes Spark datasets over the JDBC API and enables you to run SQL-like queries on Spark with traditional BI and visualization tools.
- Spark MLlib is the scalable Spark machine learning library. It consists of underlying optimization primitives as well as common learning algorithms and utilities such as classification, regression, clustering, collaborative filtering and dimensionality reduction.
- Spark GraphX is the new Spark API for graphs and graph parallel computation. GraphX exposes a set of fundamental operators as well as an optimized variant of the Pregel API.
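To make the Pregel model behind GraphX concrete, here is a minimal superstep loop in plain Python that propagates the maximum vertex value across a toy directed graph. The graph and the message-merge rule are invented for illustration; this is not the GraphX API itself:

```python
# Hedged sketch of the Pregel model that GraphX optimizes: computation runs
# in supersteps. In each superstep, vertices send messages along edges, and
# each vertex merges its incoming messages and updates its value; the loop
# stops when no value changes. This toy graph propagates the maximum value.
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]   # invented directed edges
values = {v: v for v in (1, 2, 3, 4)}      # initial vertex values

changed = True
while changed:                             # one loop pass = one superstep
    changed = False
    messages = {}
    for src, dst in edges:                 # vertices message their neighbors
        messages[dst] = max(messages.get(dst, 0), values[src])
    for v, msg in messages.items():        # merge messages, update values
        if msg > values[v]:
            values[v] = msg
            changed = True

print(values)  # {1: 3, 2: 3, 3: 3, 4: 4}
```

Vertices 1, 2 and 3 form a cycle, so they all converge to 3; vertex 4 receives only smaller values and keeps its own.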
Spark offers many capabilities. But what can it do in the field?
Life in the Fast Lane: Five Spark Use Cases
Although Spark is not a mature technology, it’s already shown promise in real-life applications. The following use cases show how Spark provides more processing speed and capabilities with less developer time and effort.
- Banking: Credit card fraud detection
Credit card companies and banks are constantly on the lookout to root out fraud. They have plenty of mathematical models. What they lack is the ability to apply those models in real time. Apache Spark Streaming on Hadoop enables credit card companies to check purchase information against a database of bogus transactions. If a match is found, a real-time trigger notifies an employee to follow up the transaction. When not in use, the Hadoop-stored data is used to automatically update the mathematical models that run in the background.
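The matching step in this pattern can be sketched in a few lines of plain Python: a micro-batch of incoming transactions is checked against a known-fraud lookup, and matches become alerts. The card numbers and batch layout below are invented for illustration:

```python
# Hedged sketch of the streaming fraud-check pattern described above.
# A micro-batch of transactions is matched against a known-fraud set;
# matches would trigger a real-time follow-up. All data here is invented.
known_fraud = {"4111-1111", "4222-2222"}   # stand-in for the fraud database

def process_batch(transactions):
    """Return alerts for any transaction matching the fraud set."""
    return [txn for txn in transactions if txn["card"] in known_fraud]

batch = [
    {"card": "4000-0000", "amount": 25.00},
    {"card": "4111-1111", "amount": 900.00},   # should trigger an alert
]
alerts = process_batch(batch)
print(alerts)  # [{'card': '4111-1111', 'amount': 900.0}]
```

In Spark Streaming the same logic would run on each micro-batch of the live transaction stream, with the fraud set refreshed from the Hadoop-stored models.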
- Spark SQL on HBase – This technology provides scalable and reliable Spark SQL/DataFrame access to NoSQL data in HBase through HBase’s “native” data access APIs. HBase pushdown capabilities, in the form of projection pruning, coprocessors and custom filtering, are optimally utilized to support ultra-low-latency processing. A novel technique based on partial evaluation is introduced to process virtually arbitrarily complex logical predicates, precisely prune partitions, generate partition-specific predicates and enable intelligent jumps in scans over multidimensional data sets. Overall, the system is designed for ad hoc, interactive queries. The current version, 0.1.0, runs on the Apache Spark 1.4.0 release.
- Banking: Network security
A global managed security services provider takes advantage of robust Spark capabilities to detect malicious activity in real time. By using components such as Spark SQL, the security provider looks for traces of potentially harmful activity. First, it uses Spark Streaming to compare and check against existing threats. Then, it passes the information on to the storage platform for further processing with Spark MLlib and Spark GraphX.
- Biotechnology: Genomic sequencing
By using the distributed storage and compute power of Spark on Hadoop, biotech companies drastically reduce the time it takes to process genome data. With Spark’s lightning-fast processing power, these companies have reduced the time it takes to analyze genetic information from weeks to hours. Although these are not real-time capabilities, they deliver significant benefits to medical researchers and product developers. Specifically, Spark Streaming collects the genome codes and logs them for data modeling. The data flow in Spark Streaming may vary in this use case. The logs are then examined using Spark SQL. The next step is to create a training model with a specific machine learning algorithm, which can be easily applied using the Spark MLlib library. After the training model has been generated, it is deployed to obtain a particular genome sequence.
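One representative step in such a pipeline is feature extraction before model training, for example counting k-mers (length-k substrings) in a DNA read. The plain-Python sketch below uses a single invented sequence; in Spark this would typically be a flatMap plus reduceByKey over many reads in parallel:

```python
# Hedged sketch of a genomics feature-extraction step: counting overlapping
# k-mers in one DNA read. The sequence is invented; at Spark scale this
# computation would be distributed across millions of reads.
from collections import Counter

def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in one DNA read."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = kmer_counts("ATGATGCAT")
print(counts["ATG"])  # 2
```

The resulting counts can serve as input features for the machine learning stage described above.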
- Advertising: Customer behavior targeting
An advertising company has built a real-time ad-targeting system. Using a Spark platform based on the MapR-DB database, the firm matches customer behavior with previously observed patterns and decides what type of ad to show site visitors online. This enables more precise targeting and a higher return on investment, because the users exposed to an ad are the ones most likely to make a purchase.
- Marketing: User recommendations
Data scientists can use Spark Streaming and Spark MLlib capabilities to make user recommendations that fit consumer tastes and buying histories. Companies such as Spotify and Netflix use live-stream clicks and user preferences to update their recommendation engines every few seconds. Spark’s in-memory computation helps organizations process huge chunks of data in a few seconds, and Spark SQL reads various data schemas and stores and retrieves data easily.
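A simple way to see the recommendation idea is item-based collaborative filtering over user histories. The plain-Python sketch below counts item co-occurrences and ranks neighbors; the users and songs are invented, and a production system would typically use something like MLlib’s ALS on far larger data:

```python
# Hedged sketch of item-based collaborative filtering: items that often
# appear together in user histories are recommended alongside one another.
# The histories below are invented for illustration.
from itertools import combinations
from collections import Counter

histories = {
    "u1": {"songA", "songB", "songC"},
    "u2": {"songA", "songB"},
    "u3": {"songB", "songC"},
}

# Count how often each pair of items co-occurs across user histories.
co_counts = Counter()
for items in histories.values():
    for a, b in combinations(sorted(items), 2):
        co_counts[(a, b)] += 1

def recommend(item):
    """Items most often consumed alongside the given item, best first."""
    scores = Counter()
    for (a, b), n in co_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [i for i, _ in scores.most_common()]

print(recommend("songA"))  # ['songB', 'songC']
```

Refreshing the co-occurrence counts on each streaming micro-batch is what lets such an engine update its recommendations every few seconds.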
Sources:
- MapR: https://www.mapr.com/blog/game-changing-real-time-use-cases-apache-spark-on-hadoop
- Hortonworks: http://hortonworks.com/blog/spark-hdp-perfect-together/
- Spark: http://spark.apache.org/documentation.html
Get a deeper understanding of the Apache Spark framework for Big Data analysis with our white paper: