
Spark Snowflake - Generated with Stable Diffusion

Introduction

This article is intended for data engineers, solution architects and others who deal with distributed computation engines for analytical batch workloads in their day-to-day work. It is an opinionated comparison between Apache Spark and seemingly easier-to-use cloud native SQL engines. By the end of this article, you will be challenged to think about why they are complementary and when each has its justification.

Distinctions

First, let us focus on three relevant differences between Apache Spark and its contenders.

SQL and DataFrame API

SQL is, alongside languages like C, one of the oldest programming languages still in active use today. Moreover, SQL has gained even wider adoption in recent years due to the rise of big data. Despite being half a century old, SQL has been picked up by modern cloud native computation engines as their primary interface to translate business logic into data pipelines (e.g., AWS Athena, Azure Synapse SQL, Databricks SQL, Snowflake).

SQL was designed specifically for managing and processing relational data. It was conceived as a human-readable, declarative language with an emphasis on high-level capabilities to create, read, update and delete data items. While SQL excels in this regard with its ease of use, it falls short on the flexibility to create powerful abstractions. Common programming language features like control flow structures, object-oriented concepts, error handling and non-scalar types were not first-class citizens of the language. Rather, many of these concepts were added incrementally over time. Major relational database management system (RDBMS) vendors implemented their own procedural language extensions on top of SQL to address these shortcomings even before the SQL ANSI standard provided the corresponding specification (e.g., Oracle’s PL/SQL or PostgreSQL’s PL/pgSQL). Most notably, SQL lacks orthogonality, i.e., the ability to compose more complex constructs from simple building blocks in a consistent way [1]. New functionality is difficult to derive from existing features. In consequence, the SQL specification grew enormously to cover new requirements instead of providing libraries that could otherwise fulfill the need [2]. As an example, the SQL:2016 specification includes more than 360 keywords [3]. In contrast, C has only 32 keywords.

Apache Spark offers an ANSI SQL compliant interface, too. However, it also provides a sophisticated DataFrame API that constitutes an independent approach to formulating data transformations. Importantly, the DataFrame API is embedded in general purpose languages like Python or Scala. It maps concepts like tables, projections, filters and transformations onto explicit object representations (e.g., a dataframe) with attributes (e.g., column names and dtypes) and methods (e.g., filter, join and aggregate) within the host programming language. Hence, it is called a DataFrame Application Programming Interface (API): it allows developers to interact directly with a table-like object representation. In contrast to SQL, the DataFrame API exposes all the functionality and flexibility that the embedding general purpose languages provide, along with their rich ecosystems.
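
To make this concrete, here is a minimal PySpark sketch of a filter-join-aggregate pipeline expressed through the DataFrame API. The table and column names (orders, customers, amount, etc.) are invented for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("dataframe-api-demo").getOrCreate()

# Hypothetical input tables; in practice these would come from a catalog or files.
orders = spark.read.table("orders")        # columns: order_id, customer_id, amount, status
customers = spark.read.table("customers")  # columns: customer_id, country

# The same projection/filter/join/aggregation concepts as in SQL,
# expressed as method calls on explicit DataFrame objects.
revenue_per_country = (
    orders
    .filter(F.col("status") == "COMPLETED")
    .join(customers, on="customer_id", how="inner")
    .groupBy("country")
    .agg(F.sum("amount").alias("total_revenue"))
)

revenue_per_country.show()
```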

Runtime flexibility

As the name indicates, cloud native SQL engines operate in the cloud only. Their runtimes can be executed neither on-premises nor on a local machine. In contrast, Apache Spark runs on a local machine, can be deployed to on-premise clusters and is supported as a managed service by all major hyperscalers.

Vendor independence

Cloud native SQL engines are governed by specific cloud providers (e.g., Azure’s Synapse SQL) or cloud-based enterprises (e.g., Databricks’ SQL). In contrast, Apache Spark is coupled neither to an individual cloud provider nor to a single company. It is open source software developed by a large and diverse community under the umbrella of the Apache Software Foundation (ASF). The ASF governance model strives to “ensure long-lived and stable projects by having a broad-enough community to maintain the project even in the absence of any individual volunteer or any sea change at a major vendor in that area” [4].

However, note that Apache Spark receives major support from Databricks, since some of the originators of Apache Spark are also founders of Databricks.

Spark - Generated with Stable Diffusion

Implications

Next, let us focus on the consequences that can be derived from the distinctions regarding the DataFrame API, runtime flexibility and vendor independence.

Managing complexity

Data pipelines often inherit complexity from business demands such as time series analysis. This typically involves multiple joins, aggregations, sophisticated window functions and more. Handling such complexity is a difficult task and calls for simplification via appropriate abstractions.

Since Apache Spark’s DataFrame API is embedded in a general purpose programming language, it allows engineers to leverage software engineering concepts such as inheritance, composition, introspection and higher-order functions. This provides data engineers with a toolbox of abstractions to properly handle complexity. More precisely, it allows them to structure, modularize, reuse, and dynamically generate the building blocks of data pipelines.
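
As a small sketch of what such abstractions can look like, the following snippet composes reusable, parameterized transformation functions via DataFrame.transform (available since Spark 3.0). It reuses the hypothetical orders DataFrame from the earlier example; the transformation logic itself is invented.

```python
from pyspark.sql import DataFrame, functions as F

# Reusable building blocks expressed as plain Python functions.
def with_processing_date(df: DataFrame) -> DataFrame:
    return df.withColumn("processing_date", F.current_date())

def deduplicate(*keys: str):
    # Higher-order function: returns a transformation configured with the given keys.
    def _transform(df: DataFrame) -> DataFrame:
        return df.dropDuplicates(list(keys))
    return _transform

# Compose the building blocks into a pipeline.
cleaned_orders = (
    orders
    .transform(deduplicate("order_id"))
    .transform(with_processing_date)
)
```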

In contrast, the commonly used SQL interface largely falls short on these capabilities. Due to the lack of abstraction mechanisms, SQL queries often degenerate into complex, cluttered and hard-to-read walls of text. Stored procedures may improve modularity and reusability but introduce new difficulties themselves [5].

Frameworks like dbt, Matillion and others address these shortcomings by delegating SQL code generation to function-like abstractions (e.g., the Jinja templating language [6]).
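
To illustrate the idea (not dbt’s actual macro system, which is richer), here is a plain Jinja template rendered from Python that generates a SQL query from parameters instead of hand-writing one statement per metric column. Table and column names are made up.

```python
from jinja2 import Template

# A function-like abstraction over SQL: the query text is generated from
# parameters instead of being written out by hand for every metric column.
sql_template = Template(
    """
    SELECT
        country,
        {%- for col in metrics %}
        SUM({{ col }}) AS total_{{ col }}{% if not loop.last %},{% endif %}
        {%- endfor %}
    FROM {{ source_table }}
    GROUP BY country
    """
)

query = sql_template.render(source_table="orders", metrics=["amount", "discount"])
print(query)
```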

Interestingly, Snowflake has recently added a DataFrame API via Snowpark that closely resembles Apache Spark’s DataFrame API [7]. Snowflake’s team probably realized the value that a DataFrame API embedded in a programming language provides in addition to the standard SQL interface. Additionally, Snowpark was surely intended to provide an easy transition path for Apache Spark users to Snowflake’s data platform.

Testing

Since Apache Spark’s DataFrame API lives within the wider ecosystem of a programming language like Python, rich testing frameworks such as pytest can be leveraged, offering great versatility to:

  • dynamically generate and reuse tests
  • provide readable comparisons between expected and actual results
  • combine and structure tests into categories
  • centrally collect test results and rerun failed tests

To ensure semantic correctness, data engineers can develop unit tests for individual transformations or end-to-end tests for entire pipelines, in the classical software development tradition. Moreover, advanced techniques like property-based testing [8] or mutation testing [9] can be leveraged to automatically reveal edge cases and harden test reliability. Once the test suite is set up, refactorings can be done safely.
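
A minimal sketch of such a unit test is shown below. It assumes a pytest fixture named spark that provides a local session (a sketch of such a fixture follows in the next section); the transformation under test and its column names are invented for illustration.

```python
from pyspark.sql import DataFrame, functions as F

# Transformation under test: a small, pure function from DataFrame to DataFrame.
def add_gross_amount(df: DataFrame, vat_rate: float) -> DataFrame:
    return df.withColumn("gross_amount", F.round(F.col("net_amount") * (1 + vat_rate), 2))

def test_add_gross_amount(spark):
    # Arrange: small, explicit in-memory test data instead of real tables.
    input_df = spark.createDataFrame([(1, 100.0), (2, 10.0)], ["order_id", "net_amount"])

    # Act
    result = add_gross_amount(input_df, vat_rate=0.2).orderBy("order_id")

    # Assert: compare collected rows against the expected values.
    actual = [(row.order_id, row.gross_amount) for row in result.collect()]
    assert actual == [(1, 120.0), (2, 12.0)]
```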

Surprisingly, there is no general, vendor-independent, easy-to-use testing framework in the SQL space. This is not specific to cloud-based SQL engines but holds for SQL databases in general. While programming languages treat testability as a design principle, SQL makes testing business logic inherently difficult. SQL statements do not behave like functions with parametrizable inputs that can be easily mocked. Instead, the referenced tables are statically defined inside the FROM clause, which makes them hard to replace with test data. Moreover, setting up test data and comparing actual and expected results via SQL can be cumbersome.

dbt attempts to solve this issue by providing a testing framework that works across multiple popular database vendors [10]. Out of the box, dbt core covers data quality tests such as checks for nulls and duplicates. These do not test the semantic correctness of a data pipeline but rather provide quality gates for input data and downstream transformations. Lately, there has been considerable progress towards a proper unit testing framework in dbt core [11] and related extensions [12]. However, it is somewhat astonishing that this feature has not yet reached first-class support, considering how important it is to ensure the correct implementation of critical business logic.

Iteration speed

Due to Apache Spark’s runtime flexibility, tests can be run locally in standalone mode on a developer’s machine or in a continuous integration (CI) environment. This provides fast feedback and quick iterations during development. In contrast, cloud native SQL engines cannot be run locally or in a CI environment without cloud access. Every test run requires spinning up cloud compute and typically involves creating many temporary input and output tables. Cloud SQL testing is therefore both slower and likely more expensive than local or CI-based testing.
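
As a sketch, a shared pytest fixture like the following can provide a local Spark session for the unit test shown earlier, both on a developer’s machine and in CI, without any cloud resources. The specific configuration values are just assumptions for keeping small test suites fast.

```python
# conftest.py -- shared fixture providing a local Spark session for unit tests.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[2]")                           # standalone local mode, no cluster needed
        .appName("unit-tests")
        .config("spark.sql.shuffle.partitions", "2")  # small shuffle fan-out keeps tests fast
        .config("spark.ui.showConsoleProgress", "false")
        .getOrCreate()
    )
    yield session
    session.stop()
```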

Debugging

With increasing complexity, errors and incorrect results will eventually occur. When using the DataFrame API, finding and fixing bugs in a data pipeline follows the same principles as in traditional software development. Using the debugger of your favourite IDE lets you step through every line of your code while interactively following and modifying the state of the data pipeline for full observability. Even though Apache Spark’s execution model is lazy, you can always call show on intermediate dataframe representations to inspect results.
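
For instance, while paused in a debugger you can materialize and inspect any intermediate DataFrame. The snippet below reuses the hypothetical orders and customers DataFrames from the earlier sketch.

```python
from pyspark.sql import functions as F

# Inspect intermediate results at any point despite lazy execution.
filtered = orders.filter(F.col("status") == "COMPLETED")
filtered.printSchema()   # inspect the schema of the intermediate result
filtered.show(5)         # trigger execution for a small sample and print it

joined = filtered.join(customers, "customer_id")
joined.explain()         # inspect the physical plan Spark intends to execute
```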

This is not possible with cloud native SQL engines. For Spark’s DataFrame API, the smallest observable unit is a line of code or a dataframe statement. For SQL, the smallest observable unit is a complete view or table. Consequently, one needs to create many intermediate views or tables to properly debug in SQL, which is a manual and tedious process.

Declarative and imperative paradigm

SQL is declarative at heart. The user describes what result is desired. The underlying engine determines how this result should be computed, with best performance in mind.

This works well in more than 95% of cases, for two reasons. First, the typical end user only cares about the resulting output. Second, vendors of SQL computation engines have decades of experience in optimizing the physical execution plan for a given logical query.

However, in the remaining 5% of cases, especially as complexity increases, even well optimized engines will eventually fail to find the best execution plan. This is particularly true in a distributed environment where the correct partitioning and physical layout of the data is key. If not handled correctly, data skew and inappropriate parallelism may result in insufficient memory (e.g., disk spills or even out-of-memory exceptions) or low CPU utilization (e.g., idle or waiting CPUs).

While Apache Spark’s DataFrame API is also declarative, it additionally provides the possibility to imperatively adjust the computation graph. In practical terms, Apache Spark allows you to influence how the computation plan is generated at a lower level. For example, the partitioning of data can be manually enforced via repartition. Intermediate results can be stored explicitly via cache to prevent expensive re-computations. Last but not least, join strategies can be provided directly via join hints (e.g., broadcast joins).
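
A minimal sketch of these three knobs in PySpark, assuming a spark session as in the earlier examples; the file paths, column names and partition count are purely illustrative.

```python
from pyspark.sql import functions as F

# Explicitly control the physical plan where the optimizer needs help.
events = spark.read.parquet("/data/events")          # hypothetical large fact table
countries = spark.read.parquet("/data/countries")    # hypothetical small dimension table

# Enforce a partitioning scheme to mitigate skew before an expensive aggregation.
events = events.repartition(200, "customer_id")

# Cache an intermediate result that is reused by several downstream branches.
completed = events.filter(F.col("status") == "COMPLETED").cache()

# Provide a join strategy directly: broadcast the small dimension table.
enriched = completed.join(F.broadcast(countries), on="country_code", how="left")
```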

Future proof

Adoption

Due to Apache Spark’s vendor independence and its wide industry penetration, managed Apache Spark runtimes exist across all major hyperscalers (e.g., AWS Glue/EMR, Azure Synapse/HDInsight, GCP Dataproc). Likewise, dedicated cloud platform solutions such as Databricks that place Apache Spark at their core have emerged. If you ever need to change your cloud provider, you can be sure to run your Apache Spark workloads elsewhere, too. What is more, Apache Spark runtimes can be deployed on-premises under your own control without any cloud provider dependencies at all.

In contrast, cloud native SQL engines are bound to individual cloud providers or companies. Essentially, this raises the risk of vendor lock-in along with its associated negative consequences (e.g., unfavourable pricing changes). Migrating your SQL-based pipelines to a different SQL engine should be feasible as long as you stick to standard SQL and avoid vendor-specific functionality and dialect.

Performance

While Apache Spark runs at least an order of magnitude faster than its predecessor, Hadoop’s MapReduce, other more specialized compute engines have emerged since Apache Spark’s inception. For example, Databricks has recently released a proprietary C++ compute engine called Photon that effectively moves compute-heavy workloads outside the JVM and away from Apache Spark. Compared to native Apache Spark, Photon shows a roughly fourfold speedup on TPC-H benchmarks [13]. There have been similar developments in the open source space, such as blaze, which takes an optimized execution plan from Apache Spark and delegates its execution to Apache Arrow based compute kernels [14]. Both examples point towards a next generation of compute engines that replace Apache Spark’s JVM-based compute. However, Apache Spark’s distributed planning and scheduling framework persists in both cases. Hence, we might envision a future Apache Spark with a state-of-the-art computation engine.

Required skill and ease of use

The above-mentioned advantages of Apache Spark’s DataFrame API do not come for free. In contrast to SQL, Apache Spark’s DataFrame API requires proficiency in a programming language. Setting up tests for Apache Spark on a local machine and in CI can be difficult at first. Moreover, Apache Spark forces its users to explicitly think about its distributed computation and lazy execution model to make effective use of parallelism and caching. This is hidden and abstracted away in SQL. Last but not least, setting up an Apache Spark cluster is still more complicated and requires more thought than using cloud native SQL compute backends, even though this has greatly improved via managed cloud services.

Spark Cloud - Generated with Stable Diffusion

Conclusion

Finally, let me revisit the claim I made in the introduction about the complementary nature of Apache Spark and cloud native SQL engines.

Best of both worlds

While SQL is the lingua franca for data analysis, it might not be the best choice for every problem. In my opinion, SQL is great for dashboarding and business intelligence (BI) due to its ease of use, with simple queries involving first-level filters and aggregations. However, business-critical data pipelines with high complexity are better implemented with Apache Spark, particularly due to its DataFrame API, because:

  • complexity can be handled with well-established concepts from software engineering
  • semantic correctness can be guaranteed with dedicated testing frameworks
  • developer productivity is greater due to better abstractions, higher automation, and better debugging support
  • future-proofing is ensured due to its wide adoption, runtime flexibility and vendor independence

I argue that these advantages outweigh the increased expertise required for Apache Spark in the long run. Once the processed data has reached a sufficient level of analytical readiness, SQL can be safely used to derive actionable insights that answer business-relevant questions.