Since its initial release, Apache Spark™ has garnered a lot of support from within the tech community. It is now supported by the major Hadoop vendors, including Cloudera, MapR, and Hortonworks. DataStax, a Cassandra vendor, also supports Spark, and Spark can be run on Amazon Web Services. In terms of the number of commits and committers, Spark has seen a threefold increase in activity within the past year.

Development effort in 2014 was dedicated to a multitude of goals, including:

  • Enhanced API stability with Spark 1.0
  • Vastly expanded UI & instrumentation
  • Integration with Hadoop Security
  • Disk spilling and shuffle optimizations
  • GraphX library
  • MLlib expansion

Spark 1.2, released toward the end of 2014, introduced many new features. A new developer-oriented Spark SQL Data Source API was added, and JDBC, CSV, Cassandra, and Hive data source providers were built on top of it. MLlib also received substantial additions: Spark 1.2 adds support for new machine learning algorithms such as random forests, gradient-boosted trees, and streaming k-means, and it expands Python support.
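As a rough sketch of one of the newly added algorithms, the snippet below trains a random forest classifier with MLlib. The dataset path and the parameter values (number of trees, depth, bins) are placeholders chosen purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

object RandomForestExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RandomForestExample"))

    // Load a LIBSVM-formatted dataset; the path below is a placeholder.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3))

    // Train a random forest classifier, one of the algorithms added in Spark 1.2.
    val model = RandomForest.trainClassifier(
      training,
      2,                // number of classes
      Map[Int, Int](),  // no categorical features
      10,               // number of trees
      "auto",           // feature subset strategy
      "gini",           // impurity measure
      4,                // maximum tree depth
      32)               // maximum number of bins

    // Measure the error rate on the held-out split.
    val testErr = test.filter(p => model.predict(p.features) != p.label)
      .count().toDouble / test.count()
    println(s"Test error = $testErr")

    sc.stop()
  }
}
```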

With the start of 2015, one should start seeing the establishment of SchemaRDD (a schema-aware Resilient Distributed Dataset) as the common interchange format. SchemaRDD is an additional abstraction on top of a plain RDD: it carries type information for each field within the RDD. Beyond the obvious benefits, such as better optimization and performance, SchemaRDD paves the way for a pandas-like data frame API. A mature data frame API should make it easier for those who do not possess extensive software engineering skills to interact with Spark, an arrangement that should appeal to data scientists in particular.
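The following is a minimal sketch of how a SchemaRDD attaches field types to ordinary data, assuming a Spark 1.2-style SQLContext. The Person case class, the table name, and the query are hypothetical and exist only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// A hypothetical record type; its fields become typed columns in the SchemaRDD.
case class Person(name: String, age: Int)

object SchemaRDDExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SchemaRDDExample"))
    val sqlContext = new SQLContext(sc)

    // Implicitly convert an RDD of case classes into a SchemaRDD,
    // which carries per-field type information alongside the data.
    import sqlContext.createSchemaRDD
    val people = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 29)))

    // Register the SchemaRDD as a table so it can be queried with Spark SQL.
    people.registerTempTable("people")
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 30")
    adults.collect().foreach(println)

    sc.stop()
  }
}
```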

Various APIs can be layered on top of the SchemaRDD abstraction. This arrangement allows developers to write their own code, customized for their particular needs, and run it on top of the Spark platform. For example, the Data Source API allows developers to write code for any data source and then operate on that source with Spark SQL; the Pipeline API allows developers to write custom algorithms and workflow components for MLlib; and the Receiver API allows developers to process streaming data from within Spark.
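As one sketch of this extensibility, the custom receiver below plugs into Spark Streaming's Receiver API, reading newline-delimited text from a socket and handing each line to Spark. The class name, host, and port are illustrative choices, not part of any Spark API.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A minimal custom receiver: connects to a socket and stores each line it reads.
class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Receive data on a separate thread so onStart() can return immediately.
    new Thread("Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // Nothing to clean up; the receive loop exits once isStopped returns true.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line) // hand each record to Spark Streaming
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      // Ask Spark to restart the receiver and reconnect.
      restart("Connection closed, trying to reconnect")
    } catch {
      case e: Throwable => restart("Error receiving data", e)
    }
  }
}
```

A streaming job would then consume it with something like `ssc.receiverStream(new SocketLineReceiver("localhost", 9999))`, where `ssc` is a StreamingContext.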

Work has been ongoing to add flexibility to Spark's column types. A vector type makes it easier to model sparse as well as dense data, and a metadata field has been added to the column schema, providing space for storing machine-learning-generated metadata.
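A small sketch of both ideas follows: the same feature vector in dense and sparse form, and a column schema carrying metadata. The metadata key name is illustrative, and the org.apache.spark.sql.types package path reflects later Spark releases; exact package locations vary across versions.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

object ColumnTypeExample {
  def main(args: Array[String]): Unit = {
    // The same three-element feature vector, in dense and sparse form.
    val dense = Vectors.dense(1.0, 0.0, 3.0)
    val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    println(dense)   // [1.0,0.0,3.0]
    println(sparse)  // (3,[0,2],[1.0,3.0])

    // Attach machine-learning metadata to a column schema; the key name
    // "numCategories" is illustrative, not a fixed convention.
    val meta = new MetadataBuilder().putLong("numCategories", 3L).build()
    val field = StructField("color", StringType, nullable = false, meta)
    println(field.metadata)
  }
}
```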

With Databricks' relentless focus on establishing Spark as a platform with easy-to-use APIs, the day when a vibrant mashup of Spark packages springs forth from all corners of the community should not be too far off. spark-packages.org provides a central location for sharing and discovering these packages.