Waiting to Join the Mainstream: Apache Spark Use Cases

Until now, you had to be a really big company such as Amazon to have the skills and resources to plow through Big Data and get fast results. Apache Spark™ combines the performance and sophistication that usually come from expensive systems and delivers them to commodity Hadoop clusters.

Now that Spark has been dubbed the “next shiny thing in Big Data,” industry talk is all about Apache Spark use cases and adoption—who will adopt it and when it might enter the Big Data mainstream.

But to survive the inevitable competition, every new software platform must find a niche in which it excels. Realistically, Spark is still an unproven technology. What is its application sweet spot?

Where does Spark excel?

Spark is an open source alternative to MapReduce. It was designed to make it easier to build and run fast, sophisticated applications on Hadoop. But now, it’s making an impressive name for itself as a standalone app.

Spark simplifies the difficult and machine-intensive task of processing high volumes of real-time or archived data. Here are some industry-specific examples in which it works very well:

  • Gaming industry. Spark discovers patterns of real-time, in-game events and responds to them immediately. It provides value in areas such as player retention, targeted advertising and auto-adjusted game complexity levels.
  • E-commerce. Spark can pass real-time transaction information to a streaming clustering algorithm such as k-means or a collaborative filtering algorithm such as ALS. Results from these operations can be combined with other unstructured data sources, such as customer comments or product reviews.
  • Financial services. In fraud detection, intrusion detection or risk-based authentication scenarios, Spark can harvest huge amounts of archived logs and combine that log data with external data sources, such as information about data breaches and compromised accounts.
  • Online news service. The Yahoo news personalization effort uses ML algorithms running on Spark. The app figures out what individual users are interested in and categorizes news topics. Developers wrote a Spark ML algorithm in 120 lines of Scala. (Previously, this ML news personalization algorithm was written in 15,000 lines of C++.) With just 30 minutes of training on a large, hundred-million-record data set, the Scala ML algorithm was ready for business.
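To make the e-commerce example above a little more concrete, here is a rough plain-Python sketch of how a streaming k-means update can fold each new batch of points into running cluster centers. It illustrates the general idea only; it does not use Spark or MLlib, and the function name and sample data are made up for this example.

```python
def update_centroids(centroids, counts, batch):
    """Fold one batch of points into running centroids (online k-means)."""
    for point in batch:
        # Pick the closest centroid by squared Euclidean distance.
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
        )
        counts[nearest] += 1
        # Move the centroid toward the point; the step shrinks as the
        # cluster accumulates more points (running mean).
        rate = 1.0 / counts[nearest]
        centroids[nearest] = tuple(
            c + rate * (p - c) for p, c in zip(point, centroids[nearest])
        )
    return centroids, counts


# Hypothetical data: two starting centers, then one small batch of
# (basket size, order value) style points arriving from a stream.
centers = [(0.0, 0.0), (10.0, 10.0)]
seen = [1, 1]
centers, seen = update_centroids(centers, seen, [(2.0, 0.0), (10.0, 12.0)])
print(centers)
```

Each incoming batch can be passed to the same function, so the centers keep adapting as new transactions arrive.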

And for the technically inclined…

In Apache Spark, each component engine works best for a class of technical use cases. Here are some component-related capabilities.

  • Interactive queries. Data analysts want to use their existing BI tools to view and query advertising analytic data collected in Hadoop. Shark runs the standard Hive server API, so any tool that plugs into Hive automatically works with Shark. This enables users to query their ad visit data interactively.
  • Streaming data. More and more analysts want to stream and analyze data in real time. Spark Streaming, a Spark API extension, enables easy integration of real-time data from disparate event streams.

Spark Streaming uses include sensors, alarms, cluster or fleet management, cyber security, telemetry and diagnostics.

To make these capabilities even more powerful, it’s easy to add Spark MLlib for machine learning pipelines to the streaming data pathways. The table at right shows even more use cases of Spark Streaming.
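Spark Streaming handles a live stream by cutting it into small batches and processing each batch in turn. The plain-Python sketch below imitates that micro-batch idea for a handful of timestamped events. It is an illustration under simplified assumptions, not Spark's API; the function name, the event tuples and the five-second interval are all hypothetical.

```python
from collections import Counter, defaultdict


def micro_batches(events, interval):
    """Group (timestamp, key) events into fixed-width batches and count keys."""
    batches = defaultdict(Counter)
    for timestamp, key in events:
        # Every event lands in the batch whose window contains its timestamp.
        batch_start = (timestamp // interval) * interval
        batches[batch_start][key] += 1
    # Return batches in time order, as plain dictionaries.
    return {start: dict(batches[start]) for start in sorted(batches)}


# Hypothetical events: (seconds since start, event source).
events = [(0, "alarm"), (1, "sensor"), (4, "sensor"), (5, "alarm"), (9, "alarm")]
for start, counts in micro_batches(events, interval=5).items():
    print(start, counts)
```

In a real deployment the batches would arrive continuously rather than from a list, but the per-batch counting logic is the same kind of work.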

  • Machine learning. MLlib is a Spark implementation of some common machine learning (ML) capabilities and related tests and data generators. MLlib currently supports four common types of machine learning problems—binary classification, regression, clustering and collaborative filtering, as well as a gradient descent optimization primitive.
  • Fog computing. Fogs are clouds in which the primary processing nodes are network-edge endpoints, such as sensor-laden Internet of Things (IoT) devices. Fogs distribute the storage, bandwidth and other cloud resources out to the IoT endpoints, most of which are embedded deeply in the hardware infrastructure of the end applications.

This works in favor of Spark, which includes an interactive real-time query tool (Shark), a machine-learning library (MLlib), a streaming-analytics engine (Spark Streaming), and a graph-analysis engine (GraphX).
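The machine learning bullet above mentions a gradient descent optimization primitive. For readers who have not met the technique, here is a small plain-Python sketch of batch gradient descent fitting a straight line. It is not MLlib code; the function name, learning rate, step count and data points are assumptions chosen for the illustration.

```python
def fit_line(points, rate=0.05, steps=2000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        # Step against the gradient.
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b


# Hypothetical points that lie exactly on y = 2x + 1.
w, b = fit_line([(0, 1), (1, 3), (2, 5), (3, 7)])
print(round(w, 3), round(b, 3))
```

A distributed library applies the same loop, except that the gradient sums are computed in parallel across the cluster.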

So, what’s the verdict?

It’s good to keep in mind that enterprises committed to the Hadoop framework (often newly acquired) and to various NoSQL platforms as their strategic big-data platforms will not be eager to sign up for a Spark solution.

At least not until it has truly proved its value in plenty of real-world deployments.

Some analysts look to the next two to three years, when Spark will have a chance to mature and provide field-proven, enterprise-grade platforms and a well-developed ecosystem of Spark tools, libraries, and applications.

It seems that Spark enthusiasts will have to wait until field performance boosts it into the mainstream.



About the Author

Shikha is a tech leader with deep expertise in emerging technologies such as Big Data analytics using MapR, Hortonworks, Tableau, and Spotfire. Her experience includes working with Fortune 500 companies on solution design, architecture, and project management. Shikha leads Technology for Syntelli and is passionate about non-profit causes and giving back to the community. Connect with Shikha on LinkedIn: https://www.linkedin.com/in/shikhabkashyap

Shikha Kashyap

Chief Technology Officer