Snowflake vs. Databricks


The Snowflake vs. Databricks breakdown
Which data platform fits best with the needs of your organization?

Two of the most dynamic and fastest-growing companies in the big data
world, Snowflake and Databricks, were built around innovative concepts.

Both companies offer expansive sets of consistently updated features
within a unique design and architecture. Simply put, each platform stores
data, ingests and transforms data, and produces analytics.

Within those main functions, Snowflake and Databricks have a range of
capabilities that better fit the strategies of individual organizations.

Wavicle's expert cloud data consultants created a comprehensive guide for
you to use as a comparison-based starting point for evaluating which
platform better suits your needs.

Snowflake vs. Databricks

1 Value & Architecture

2 Storage

3 Ingestion & transformation

4 Data analytics

5 Additional features

©2021 All rights reserved | Privacy Policy 1


1 Value & Architecture

Platform values

Snowflake: The Data Cloud
▪ One platform
▪ One copy of data
▪ Many workloads
▪ Near-zero maintenance
▪ Near-unlimited performance and scale
▪ Secured and governed access to data

Databricks: The Lakehouse Platform
▪ Unified Analytics platform
▪ Open format storage layer with Delta Lake
▪ Structured transactional layer
▪ High performance query engine
▪ One platform for every use case

Wavicle insights
Snowflake’s values of speed, scalability, and sharing are built throughout. For a rapidly expanding
organization that needs to handle a significant number of concurrent workloads and share data across
multiple partners efficiently and securely, Snowflake is a strong choice.

For Databricks, the foundation of data science is evident in the platform's value pillars. Databricks is suited for
a wide variety of machine learning use cases. Organizations focused on scalable data engineering, collaborative
data science, and transforming large volumes of unstructured data should be intrigued by Databricks.

Architecture

Snowflake: A three-tier design
1. Centralized storage
2. Multi-cluster compute
3. Cloud services
▪ Optimization
▪ Management
▪ Transactions
▪ Security and governance
▪ Metadata
▪ Sharing and collaboration

Databricks:
Infrastructure and governance supported with Data Mesh, an organizational and architectural
paradigm. It has an emphasis on decentralized data ownership.
▪ Decentralized data teams and ownership
▪ Data products driven by domain driven design
▪ Self serve data infrastructure
▪ Global federated governance

Data Mesh blueprint:
▪ Streaming phases
– Bronze: Raw operational data
– Silver: Curated atomic and cleansed data
– Gold: Aggregated data (data marts) for specific datasets
▪ Data catalog
▪ Integration
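
The Bronze, Silver, and Gold phases can be illustrated with a minimal plain-Python sketch. It does not use any Databricks or Delta Lake API; the record fields, the validation rule, and the function names `to_silver` and `to_gold` are hypothetical and only show how each phase refines the one before it.

```python
from collections import defaultdict

# Bronze: hypothetical raw operational records, kept exactly as received.
bronze = [
    {"order_id": "1", "region": " east ", "amount": "10.50"},
    {"order_id": "2", "region": "WEST", "amount": "4.00"},
    {"order_id": "2", "region": "WEST", "amount": "4.00"},  # duplicate row
    {"order_id": "3", "region": "east", "amount": "oops"},  # invalid amount
]


def to_silver(rows):
    """Silver: cleanse, type, and deduplicate raw rows into atomic records."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation (assumed rule)
        if row["order_id"] in seen:
            continue  # keep only the first copy of each order
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().lower(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Gold: aggregate curated records into a small data mart (total per region)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a real lakehouse each phase would typically be a separate table rather than an in-memory list, so that downstream consumers can read whichever level of refinement they need.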


Architecture features

Snowflake:
▪ Snowflake is available on AWS, Azure, or GCP
▪ Data stored in Snowflake storage
▪ Data can be accessed from S3, Azure Blob Storage, or GCS
▪ Data loaded to Snowflake is indexed and partitioned during ingestion
▪ De-coupled compute and storage
▪ Virtual warehouses (VWH) can be instantly scaled from SQL command line or web-based
GUI. VWH can also be configured to autoscale.

Databricks:
▪ Databricks is a native component of Azure and is also available on AWS
▪ Delta Lake sits on top of your existing data lake, delivering reliability, security, and
performance
▪ Accessed by SQL / ML layers
▪ Data can be accessed from Amazon S3, Azure Storage, or GCS

Wavicle insights
Both platforms can be spun up on AWS, Azure, and GCP. Snowflake does not require any pre-planning
or maintenance to start, eliminating the need for a database administrator in many cases. It automatically
runs across three availability zones, allowing for replication to an alternate cloud.

Fully elastic autoscaling, a hallmark feature of Snowflake, means increasing or decreasing the size of an instance can
be completed easily.

When creating a Databricks cluster, there are three different cluster modes: Standard, High Concurrency, and Single
Node. For the user, deciding which cluster mode to use can be a challenge, but it is the key to managing cost and
performance.

Databricks also features autoscaling, leveraging reporting statistics to scale up or to remove workers in the cluster.
To use and maintain Databricks, users need some knowledge of cloud infrastructure
components and how they work together.

Snowflake’s architecture means a rapid rollout to start, with levels of automation. This makes it a great choice for an
organization that may not have the initial bandwidth or expertise in the platform.

The customizable clustering options for Databricks are a very attractive feature, but they require strong competency in
the platform, and users must choose between cost and performance during configuration.



2 Storage

Data warehouses and data lakes

Snowflake Cloud Data Warehouse:
Snowflake’s Cloud Data Warehouse is SaaS-based and built on top of Amazon Web Services (AWS),
Microsoft Azure, or Google Cloud infrastructure. Users do not need to install, configure, or manage
hardware or software. Storage, compute, and services are independently elastic and give users
flexibility for what they need most.

With the introduction of Snowpark, Snowflake customers can write queries using procedural
programming languages.

Snowflake for Data Lakes:
The company touts its answer to data lakes as a flexible solution to enable or enhance data lake
strategy. What does that mean for potential customers? A centralized repository for structured
and unstructured data alike, with the latter functionality currently in Public Preview.

Databricks Delta Lake:
Databricks’s answer to a data warehouse is well beyond the traditional model. The Delta Lake is an
open format storage layer and a single home for structured, semi-structured, and unstructured
data.

It provides traditional warehouse features like schemas for each table. All data in Delta Lake is
stored in open Apache Parquet format, allowing data to be read by any compatible reader.

Wavicle insights
Both platforms are leading the collision of traditional data warehouses and data lakes. The overlapping
capabilities and names can become blurred. Organizations that don’t have the time or resources for setup,
maintenance, and support of servers should consider Snowflake.

If management of a data lake and data warehouse is an issue for an organization, Databricks can help solve
the problem, along with its advanced analytics and AI/ML capabilities.

Access

Snowflake:
Democratized data access and simplified, controllable data governance are hallmark
features of Snowflake. The flexibility and security policies are designed to boost innovation.

Snowflake’s governance is designed into the platform, featuring access control to accounts and
users, column-level security, row access policies, audit logging for access history, and object tagging
for sensitive data for compliance, discovery, protection, and resource usage.

Databricks:
Databricks provides access control down to the storage layer by leveraging AWS security controls
within the platform. At the same time, Databricks provides access control to compute resources, API
provisioning and permission management, and audit logging with AWS CloudTrail and Amazon
CloudWatch.

Wavicle insights
Snowflake’s emphasis on democratized access and security is a big plus for the platform. However, that
strength comes with a trade-off: operational governance and compute cost can be difficult to manage. With easier
control of compute resources, Databricks provides more transparent cost and relies on AWS for its security
functions.

If an organization needs day-one access to sensitive data across various units at scale, Snowflake is a great
choice. If more efficiently managed spend and familiar AWS features are appealing, Databricks can be quickly
operationalized.



3 Ingestion & transformation

Pipeline

Snowflake Data Engineering and Snowpipe:
As with most features of Snowflake, building the pipeline into the platform is about speed,
efficiency, and ease of use. With Snowflake, data engineers can spend little to no time managing
workloads, making it scalable to handle concurrency and computation requirements.

Snowflake’s Snowpipe enables loading data from files as soon as they’re available in a stage for
event-based real time ingestion into the table.
▪ Amazon: Snowflake can ingest data from all AWS.
▪ Microsoft: Snowflake can ingest data directly from Azure storage.
▪ Google: Snowflake can ingest data directly from GCS.

Databricks Auto Loader:
An efficient solution that incrementally processes new data files as they arrive into cloud storage.
▪ Amazon: Databricks can ingest data from all AWS.
▪ Microsoft: Databricks can ingest data directly from Azure storage.
▪ Google: Databricks can ingest data directly from GCS.

Databricks Auto Loader

Before:
[Figure: streaming ingestion through a notification service and message queue; batch ingestion through a delayed schedule, an external trigger, or an Airflow file sensor]
▪ Gets too complicated for multiple jobs

After:
▪ Pipe data from cloud storage into Delta Lake as it arrives
▪ “Set and forget” model eliminates complex setup
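
The idea of incrementally processing only newly arrived files can be sketched in plain Python. This is not the Auto Loader or Snowpipe API; the in-memory `fake_storage` dictionary stands in for a cloud storage listing, and the names `discover_new_files` and `ingest_increment` are hypothetical. The sketch assumes a checkpoint of already-loaded paths is what lets each run skip files it has seen before.

```python
def discover_new_files(listing, checkpoint):
    """Return paths present in the listing but not yet recorded in the checkpoint."""
    return sorted(set(listing) - checkpoint)


def ingest_increment(listing, checkpoint, table, load_file):
    """Load only unseen files into `table`, then record them in the checkpoint."""
    new_files = discover_new_files(listing, checkpoint)
    for path in new_files:
        table.extend(load_file(path))
        checkpoint.add(path)
    return new_files


# Hypothetical stand-in for a cloud storage location: path -> rows in that file.
fake_storage = {
    "landing/a.csv": [{"id": 1}],
    "landing/b.csv": [{"id": 2}, {"id": 3}],
}

checkpoint, table = set(), []

# First run picks up both existing files.
first = ingest_increment(fake_storage.keys(), checkpoint, table, fake_storage.__getitem__)

# A new file arrives; the second run loads only that file.
fake_storage["landing/c.csv"] = [{"id": 4}]
second = ingest_increment(fake_storage.keys(), checkpoint, table, fake_storage.__getitem__)

print(first, second, [row["id"] for row in table])
```

Managed services replace the repeated directory listing with storage event notifications where available, which is what removes the scheduling and sensor setup shown in the "Before" picture.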


Pipeline integrations

Snowflake:
Snowflake features Snowpipe integrations from the following cloud storage services. Snowpipe loads
data into Snowflake as soon as that data is available in the staging layer.
▪ Amazon Web Services: Amazon S3
▪ Google Cloud Storage
▪ Microsoft Azure Blob Storage
▪ Microsoft Data Lake Storage Gen2
▪ Microsoft General-purpose v2

Snowflake account host:               Amazon Web Services   Google Cloud Platform   Microsoft Azure
Amazon S3                             ✔                     –                       –
Google Cloud Storage                  ✔                     ✔                       –
Microsoft Azure Blob Storage          ✔                     –                       ✔
Microsoft Data Lake Storage Gen2      ✔                     –                       ✔
Microsoft Azure General-purpose v2    ✔                     –                       ✔

Databricks:
Databricks automates streaming data ingestion and transformation with StreamSets. The
partnership provides a fast and easy to use drag and drop interface. It allows users to design, test,
and monitor batch and streaming ETL pipelines without the need for coding or specialized skills.

Databricks + StreamSets:
▪ Control Hub: Multi-tenant, CI/CD, Provenance
▪ StreamSets Transformer: Visual ETL and push down transformations on Delta Lake
▪ Dataflow Sensors: Data Drift Detection
▪ Databricks: Unified Data Analytics Engine, Reliable Data Lakes at Scale

Supported cloud storage services

The following table indicates the cloud storage service support for Snowpipe REST API calls from
Snowflake accounts hosted on each cloud platform:

Snowflake account host:               Amazon Web Services   Google Cloud Platform   Microsoft Azure
Amazon S3                             ✔                     ✔                       ✔
Google Cloud Storage                  ✔                     ✔                       ✔
Microsoft Azure Blob Storage          ✔                     ✔                       ✔
Microsoft Azure General-purpose v2    ✔                     ✔                       ✔
Microsoft Data Lake Storage Gen2      ✔                     ✔                       ✔

Wavicle insights
Both platforms are designed for fast, easy, multi-source ingestion. ELT/ETL tools like Matillion,
Talend, and SnapLogic can be used on both platforms to easily ingest and migrate data.

The multitude of ingestion capabilities for both platforms means excellent flexibility for the major cloud
providers. Customers of Amazon, Microsoft, or Google should be comfortable with either platform.


Performance

Snowflake:
In head-to-head comparisons conducted by independent companies and with minimal
configurations or tuning, Snowflake outperformed other cloud data warehouses on query time and
related costs. With so little tuning required, Snowflake is almost a serverless solution.

Databricks:
According to the Transaction Processing Performance Council, Databricks SQL is now the
record holder for data warehouse performance. For data scientists, the performance clusters of
Databricks allow large-scale data batch processing and real-time stream data processing.
Its ML, deep learning, and graph analysis capabilities are exactly what you would expect from
the founders of Apache Spark and MLflow.

Wavicle insights
As both platforms continue to improve at a rapid pace, performance will be a continued debate. While the
tests may show contradicting results, they are impacted by use case, configuration of systems, code, and
structure of underlying data.
Both platforms are top-of-class performers. For pure speed involving query time, Snowflake’s near serverless
solution continues as a standard of pure performance. The performance of large batch processing and ML for
Databricks makes it the pinnacle of data science related performance.

Data sharing

Snowflake:
Another one of the pillars for the creation of Snowflake was data sharing. Data can be
cloned fast and within one or more data warehouses. This allows for sharing data
without copying or moving.

Secured data sharing across Snowflake objects:
▪ Tables
▪ External tables
▪ Secure views
▪ Secure materialized views
▪ Secure user-defined functions

Databricks:
Databricks in its current form does not allow for cloning of data, only copying. With the
introduction of Delta Sharing, Databricks users can securely share large datasets in real time
across products. This allows for sharing any data set in Delta Lake or Apache Parquet formats.

Wavicle insights
The exciting addition of the first version of Delta Sharing is a major upgrade in this category for Databricks,
but it’s still limited in its scope. The future plans call for sharing objects, such as streams, SQL views, or
arbitrary files.

Snowflake’s cloning and widespread data sharing capabilities make it a great choice for organizations that
need to share data with a wide variety of partners, vendors, or customers.

Data format

Snowflake:
Snowflake handles structured and unstructured data natively. Semi-structured JSON, Avro, ORC,
Parquet, or XML can be loaded into a single field. The Query API allows parsing unstructured data at
speed and scale.

Databricks:
Databricks’ default data format is Parquet, and all data stored in the Delta Lake is stored in Parquet
format. Databricks can read semi-structured data like JSON. By using the combination of Databricks
& Labelbox, you can effectively handle unstructured data. With Sparser, Databricks users can rapidly
parse unstructured data formats in Apache Spark.

Wavicle insights
Both platforms are able to handle a wide range of data formats. This really comes down to preference and
experience.



4 Data analytics

BI and visualization

Snowflake:
Snowflake is compatible with several BI and visualization tools such as Tableau, Power BI, and
ThoughtSpot.

Databricks:
Databricks comes with built-in BI functionality but it is not the strongest feature. It’s compatible with
tools like Tableau and ThoughtSpot for analyzing data lakes at scale.

Wavicle insights
Both platforms integrate well with leading BI and visualization tools. There isn’t a distinct advantage for either
unless you need to handle significant numbers of concurrent users, in which case Snowflake would be a better choice.

AI/ML

Snowflake:
Snowflake is designed to support machine learning in conjunction with tight integrations with Spark,
R, Qubole, and Python. Snowflake’s performance comes from scaling up or down, but it also takes on data
curation responsibilities and reduces data-related burdens on ML tools.

In 2021, Snowflake introduced Snowpark, so developers can build queries using DataFrames
right in their code, without having to create and pass along SQL strings.

Built-in:
▪ Spark
▪ Python
▪ Java
▪ Node.js

Integration is available with:
▪ DataIku
▪ Data Robot
▪ Amazon Sagemaker
▪ Seashop

Databricks:
Built on top of MLflow, Managed MLflow, Databricks’ open-source platform, manages the
complete ML lifecycle, including experimentation, reproducibility, deployment, and a central model
registry with enterprise reliability, security, and scale.
▪ Built-in Spark
▪ Managed MLflow
▪ ML Runtime
▪ Collaborative Notebooks
▪ Feature Store
▪ AutoML

[Figure: Data foundation for the full ML lifecycle. All kinds of data sources feed Managed MLflow (data logging and versioning, model experiment tracking, registry and servicing, metric tracking), alongside Feature Store, ML Runtime, ML Deployment, and ML Logging & Monitoring of production metrics.]


Wavicle insights
Databricks was designed from its creation to be the most powerful, efficient, and collaborative environment
for machine learning, and that remains true. Even with the introduction of a model like Snowpark for
additional developer languages, Databricks is still the premier platform for AI/ML. Organizations with a strong
need for ML within their workloads should look to Databricks or a combination of the two.

ML integration

Snowflake:
Snowflake can access code directly from Jupyter, Notebooks, or JAR files from within the platform.

Databricks:
For the more hands-on-the-keys crowd, Databricks has built-in ML functionality for Jupyter
and Notebooks.

Wavicle insights
The built-in ML functionality of Databricks makes it the most efficient and collaborative environment for
developers with heavy use of ML.

UI
With Snowflake’s platform meant for a variety of end users, the UI is easier to navigate. As for
Databricks, it is designed for ultimate function over form.

Scalability

Snowflake:
Storage, compute, and services are independently elastic. Users can spin up separate virtual
warehouses instantly to support ETL, ELT, and BI workloads with no resource contention.

Databricks:
Users can enable clusters for auto-scaling based on workload with serverless pools to deal with
concurrency.

Wavicle insights
For scalability, each platform has very distinct characteristics. As mentioned, the independent elasticity of
Snowflake creates a top-of-class model for scalability. For organizations where scalability is a top priority,
Snowflake is a strong choice.



5 Additional features

Snowflake:
▪ Time travel to query data from different points in time
▪ Clone and restore data from tables, schemas, or entire databases for a point in time
▪ Restore tables from a point in time or before updates were made
▪ Geo-spatial data for calculating distance is built into Snowflake

Databricks:
▪ Supports Python, Scala, R, and SQL out of the box
▪ Optimized for machine and deep learning
▪ Manage a machine learning pipeline

Pricing and cost optimization

Snowflake:
▪ Usage based on a combination of time and compute
▪ Auto-scaling and increasing VM sizing during SQL processing can streamline costs

Databricks:
▪ Minimal users model – lower cost
▪ Enterprise level users – higher cost
▪ Auto-scaling configurations
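
To make "usage based on a combination of time and compute" concrete, here is a small Python sketch of a time-times-compute cost estimate. It assumes a credit model in which each warehouse size doubles the credits consumed per hour and each run is billed per second with a 60-second minimum; those rates, the `price_per_credit` value, and the function name `estimate_cost` are illustrative assumptions, so check current vendor pricing before relying on any figure.

```python
# Assumed credits consumed per hour for each warehouse size (doubling per size).
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

# Assumed minimum billed duration each time a warehouse runs.
MIN_BILLED_SECONDS = 60


def estimate_cost(size, run_seconds, price_per_credit):
    """Estimate cost as (credits per hour) x (billed hours) x (price per credit)."""
    billed_seconds = max(run_seconds, MIN_BILLED_SECONDS)
    credits = CREDITS_PER_HOUR[size] * billed_seconds / 3600
    return credits * price_per_credit


# Hypothetical price of 3.00 per credit.
small_cost = estimate_cost("S", 3600, price_per_credit=3.0)  # one hour on Small
large_cost = estimate_cost("L", 900, price_per_credit=3.0)   # 15 minutes on Large
print(small_cost, large_cost)
```

Under these assumptions, a job that finishes four times faster on a warehouse four times larger costs the same, which is one way larger sizing during SQL processing can streamline costs rather than raise them.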

Interoperability
Despite the differences, Snowflake and Databricks have a high level of interoperability. Snowflake
can read data from Databricks for analysis and visualization. Databricks fills the role of a
connector that can read and process data within the platform and push results to Snowflake. In
an ideal world, organizations across the board could utilize both platforms for their advantages.

Wavicle insights
Organizations across various industries utilize both platforms for their distinct advantages. This “best of both
worlds” stack sets up data engineers and data scientists alike in fast, scalable, and collaborative
environments. Wavicle has experience enacting this powerful stack simultaneously for clients.

The choice
Well, the truth is that it will take much more than a guide to determine which platform, Snowflake or
Databricks, is the best fit for your organization. Many organizations leverage both platforms for their unique
capabilities in a powerful stack.

Each platform is a pathway for storing, ingesting, transforming, and analyzing data. Regardless of which way
you are leaning, Wavicle can help you make the best possible choice based on your business strategy and goals.

With our deep technical expertise and our proprietary accelerators, we migrate data quickly and integrate
Snowflake, Databricks, or both into your technology stack.

Are you looking to add Snowflake, Databricks, or both into your organization? Our expert cloud consultants
bring proven experience with each to ensure you get the most out of the platforms.

It’s time to grow with us
Learn how Wavicle can help you choose and implement the data
analytics architecture that will meet your business goals.
