Apache Spark is a data processing engine that works through huge volumes of diverse data at high speed. It was a breakout success in 2015 and is likely to remain popular this year. To explain what the fuss is about, we look at the trends behind it and four ways Apache Spark contributes to Big Data analytics.
What’s happening with Spark
For some time now, batch processing has been giving way to in-memory, micro-batched computation. This trend, among others, is driving the widespread and growing interest in Apache Spark.
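To make "micro-batched" concrete, here is a minimal plain-Python sketch of the idea. It is not Spark code and does not use Spark's API. The function names `micro_batches` and `running_totals` are hypothetical, and a fixed event count stands in for the time interval that streaming engines typically use to cut batches.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive small batches from a (possibly unbounded) stream."""
    iterator = iter(events)
    while True:
        # Pull at most batch_size events; an empty list means the stream ended.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_totals(events: Iterable[int], batch_size: int) -> List[int]:
    """Process each micro-batch in memory and carry a running total forward."""
    total = 0
    totals = []
    for batch in micro_batches(events, batch_size):
        total += sum(batch)  # the whole batch is handled as one in-memory unit
        totals.append(total)
    return totals


if __name__ == "__main__":
    # Seven events cut into batches of three: [1, 2, 3], [4, 5, 6], [7]
    print(running_totals(range(1, 8), batch_size=3))  # prints [6, 21, 28]
```

Results arrive after every small batch rather than after one long job over the full data set, which is the practical difference from classic batch processing.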
Analysts with their Big Data crystal balls say it’s a real trend, not just a fad. Here’s why:
- Apache Spark has moved from being a component of the Hadoop ecosystem to the Big Data platform of choice for a growing number of enterprises. A recent Databricks survey shows that 48% of Spark deployments now run on Spark’s standalone cluster manager, while Hadoop YARN use is only around 40%.
- Spark is no longer a tech-only phenomenon. The Databricks survey also documents that more than 15% of Spark users are in the financial services, healthcare and telecommunications sectors.
- Scores of vendors, hundreds of enterprise companies and hundreds of developers worked with Spark in 2015. The number of developers contributing code to Spark has grown to more than 1,000, twice the number in 2014.
- IBM invested $300 million in the technology after Cloudera moved to Spark from Hadoop MapReduce. Many other data integration and Big Data platform companies joined the party. Users include NBCUniversal, Netflix, Uber, Capital One and Baidu.
- Technology companies such as Salesforce.com embed Spark into their data analysis software to provide customers with faster data processing.
Better Quality and Big Data Processing Muscle
Why the lovefest? The rush to Spark is no mystery to developers: 600 contributors have made Spark the hottest project under the Apache Software Foundation umbrella. IT buyers, for their part, have discovered that open source software is getting much better, and is often ahead of proprietary software in innovation and quality.
So what are the ways Apache Spark contributes to Big Data analytics?
- Speed. Spark is significantly faster and easier to manage than other Big Data processing methods, because its engine processes data in memory. Benchmark tests show that Apache Spark can run up to 100 times faster than Hadoop MapReduce when the data fits in memory.
- A versatile toolkit. Spark components are all well-suited to achieving enterprise business goals. Spark combines in-memory processing, SQL, data streaming, machine learning, graph and R-based data analysis capabilities.
- Ability to keep up with IT evolution. Companies generate diverse types of data and need to analyze them. Spark’s graph processing, machine learning libraries and other advanced capabilities work well with these new data types. With the growth of the Internet of Things and the use of unstructured data, Big Data is growing fast, and companies need technology that keeps up. Spark development kept pace in 2015: Databricks went through four releases in a year, each of which added hundreds of improvements.
- Rapid development. Spark accelerates the development process, which makes it faster and easier to test new ideas and launch new solutions. More than 80 high-level operators enable programmers to write code more quickly.
Spark is a versatile application development framework. It works with programming languages such as Scala, Python, Java, Clojure and R. Each Spark component contributes to rapid application development (RAD).
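The appeal of high-level operators is easier to see in code. The sketch below is plain Python rather than Spark: it imitates the style of chaining small transformations, with comments naming the Spark RDD operators (`flatMap`, `filter`, `reduceByKey`) that play a similar role. The function `word_counts` and its `min_length` parameter are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def word_counts(lines: Iterable[str], min_length: int = 1) -> Dict[str, int]:
    """Count words with a short chain of transformations."""
    # Split every line into lowercase words (similar in spirit to flatMap).
    words = (word.lower() for line in lines for word in line.split())
    # Keep only words that are long enough (similar in spirit to filter).
    kept = (word for word in words if len(word) >= min_length)
    # Add up occurrences per word (similar in spirit to reduceByKey with addition).
    return dict(Counter(kept))


if __name__ == "__main__":
    sample = ["Spark is fast", "Spark is versatile"]
    print(word_counts(sample))
    # prints {'spark': 2, 'is': 2, 'fast': 1, 'versatile': 1}
```

Each step is one short expression, so trying a new idea usually means swapping or adding a single transformation rather than rewriting a whole job.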
Next Time: We’ll highlight a growing number of compelling use cases around Spark.
About Vishwas: Vishwas provides solutions to big data problems such as real-time streaming data, traditional SQL vs. NoSQL, Hadoop vs. Spark, and Amazon Web Services (AWS) vs. a personal cluster. He focuses on analysing and providing the optimum solution for a particular use case. Prior to Syntelli, Vishwas was a Research Assistant at the University of North Carolina at Charlotte, where he worked on integrating Big Data with mobile devices (IoT) and deploying a language classifier on a pseudo-distributed Spark cluster. Vishwas received an M.S. in Electrical Engineering from the University of North Carolina at Charlotte. His research interests are Spark development, visual analytics and Android devices.