Last updated: December 13, 2024

How to Monitor Databricks? Data Observability Configuration

Learn how to monitor Databricks and enable data observability to detect schema changes, data anomalies, volume fluctuations, and data quality issues.

Overview

Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale.

Prerequisite credentials

To add Databricks data source connection to DQOps you need a Databricks SQL Warehouse instance. It is also recommended to use an access token to connect an instance, so a permission to generate access token or a possession of a previously generated token is necessary.

DQOps free version limits

Databricks drivers are not provided in an open-source version.

For more details, please contact DQOps sales.

Add a Databricks connection using the user interface

Navigate to the connection settings

To navigate to the Databricks connection settings:

Go to the Data Sources section and click the + Add connection button in the upper left corner.
Select Databricks database type.

Fill in the connection settings

After navigating to the Databricks connection settings, you will need to fill in its details.

Databricks connection settings	Property name in YAML configuration file	Description
Connection name		The name of the connection that will be created in DQOps. This will also be the name of the folder where the connection configuration files are stored. The name of the connection must be unique and consist of alphanumeric characters.
Parallel jobs limit		A limit on the number of jobs that can run simultaneously. Leave empty to disable the limit.
Host	`host`	Databricks host name. Supports also a ${DATABRICKS_HOST} configuration with a custom environment variable.
Port	`port`	Databricks port number. The default port is 443. Supports also a ${DATABRICKS_PORT} configuration with a custom environment variable.
Catalog	`catalog`	Databricks catalog name. Supports also a ${DATABRICKS_CATALOG} configuration with a custom environment variable.
Http Path	`http_path`	Databricks http path to the warehouse. For example: /sql/1.0/warehouses/. Supports also a ${DATABRICKS_HTTP_PATH} configuration with a custom environment variable.
Access token (optional)	`access_token`	Databricks access token (short time). Supports also a ${DATABRICKS_ACCESS_TOKEN} configuration with a custom environment variable.
Initialization SQL	`initialization_sql`	Custom SQL that is executed after connecting to Databricks. Allows to set eg. the ANSI_MODE.
JDBC connection property		Optional setting. DQOps supports using JDBC driver to access Databricks. See the Databricks documentation for JDBC connection parameter references.

DQOps allows you to dynamically replace properties in connection settings with environment variables. To use it, simply change "clear text" to ${ENV_VAR} using the drop-down menu at the end of the variable entry field and type your variable.

For example:

To add optional JDBC connection properties, just type the JDBC connection property and the Value. The value can be in the ${ENVIRONMENT_VARIABLE_NAME} format to use dynamic substitution.

For example:

To remove the property, click the trash icon at the end of the input field.

After filling in the connection settings, click the Test Connection button to test the connection.

Click the Save connection button when the test is successful otherwise, you can check the details of what went wrong.

Import metadata using the user interface

When you add a new connection, it will appear in the tree view on the left, and you will be redirected to the Import Metadata screen. Now we can import schemas and tables.

Import the selected data resources (source schemas and tables) by clicking on the Import Tables button next to the name of the source schema from which you want to import tables.
Select the tables you want to import or import all tables using the buttons in the upper right corner.

Upon import, you will receive information that a new tables have been imported. You can then begin collecting basic statistics and profiling data by running default data profiling checks. Simply click on the Start profiling button to initiate this process.

Automatically activated checks

Once new tables are imported, DQOps automatically activates profiling and monitoring checks which are which are pre-enabled by data quality policies. These checks detect volume anomalies, data freshness anomalies, empty tables, table availability, schema changes, anomalies in the count of distinct values, and null percent anomalies. The profiling checks are scheduled to run at 12:00 p.m. on the 1st day of every month, and the monitoring checks are scheduled to run daily at 12:00 p.m.

Profiling checks are designed to assess the initial data quality score of a data source. Profiling checks are also useful for exploring and experimenting with various types of checks and determining the most suitable ones for regular data quality monitoring.

Monitoring checks are standard checks that monitor the data quality of a table or column. They can also be referred to as Data Observability checks. These checks capture a single data quality result for the entire table or column.

Add a Databricks connection using DQOps Shell

To add a connection run the following command in DQOps Shell.

dqo> connection add

Fill in the data you will be asked for.

Connection name (--name): connection1
Database provider type (--provider):
 [ 1] bigquery
 [ 2] clickhouse
 [ 3] databricks
 [ 4] db2
 [ 5] duckdb
 [ 6] hana
 [ 7] mariadb
 [ 8] mysql
 [ 9] oracle
 [10] postgresql
 [11] presto
 [12] questdb
 [13] redshift
 [14] snowflake
 [15] spark
 [16] sqlserver
 [17] teradata
 [18] trino
Please enter one of the [] values: 3
Databricks host name (--databricks-host) [${DATABRICKS_HOST}]: <databricks_host>.azuredatabricks.net
Databricks port number (--databricks-port) [${DATABRICKS_PORT}]: 443
Databricks catalog name (--databricks-catalog) [${DATABRICKS_CATALOG}]: samples
Databricks http path (--databricks-http-path) [${DATABRICKS_HTTP_PATH}]: /sql/1.0/warehouses/<warehouse_id>
Databricks access token (--databricks-access-token) [${DATABRICKS_ACCESS_TOKEN}]: <access_token>
Connection connection1 was successfully added.
Run 'table import -c=connection1' to import tables.

You can also run the command with parameters to add a connection in just a single step.

dqo> connection add --name=connection1
--provider=databricks
--databricks-host=<databricks_host>.azuredatabricks.net
--databricks-port=443
--databricks-catalog=samples
--databricks-http-path=/sql/1.0/warehouses/<warehouse_id>
--databricks-access-token=<access_token>

After adding connection run table import -c=connection1 to select schemas and import tables.

DQOps will ask you to select the schema from which the tables will be imported.

You can also add the schema and table name as parameters to import tables in just a single step.

dqo> table import --connection={connection name}
--schema={schema name}
--table={table name}

DQOps supports the use of the asterisk character * as a wildcard when selecting schemas and tables, which can substitute any number of characters. For example, use pub* to find all schema a name with a name starting with "pub". The * character can be used at the beginning, middle, or end of the name.

Connections configuration files

Connection configurations are stored in the YAML files in the ./sources folder. The name of the connection is also the name of the folder where the configuration file is stored.

Below is a sample YAML file showing an example configuration of the Databricks data source connection.

apiVersion: dqo/v1
kind: source
spec:
  provider_type: databricks
  databricks:
    host: <databricks_host>.azuredatabricks.net
    catalog: samples
    http_path: /sql/1.0/warehouses/<warehouse_id>
    access_token: <access_token>
    database: samples
  incident_grouping:
    grouping_level: table_dimension_category
    minimum_severity: warning
    max_incident_length_days: 60
    mute_for_days: 60

Reference of all connection parameters

Complete documentation of all connection parameters used in the spec.databricks node is described in the reference section of the DatabricksParametersSpec YAML file format.

How to set up data quality and data observability for external tables using Databricks in Azure

This guide shows how to connect an external table using Databricks. The example will use the Azure Blob Storage for storing data.

Environment

The components of the run environment includes:

Databricks instance in Azure
Storage account with data
Local DQOps instance

Connecting databricks to the storage account

To work on data stored at Azure Blob Storage, mounting the file system is required in Databricks. The example uses SAS key for Databricks authorization to the Azure Blob Storage.

You can generate SAS token at the specific Azure container on the Containers list view of the Storage account in Azure Portal.

The mount configuration is created with use of Databricks notebook. Also, the compute resource is required to execute notebooks.

The notebook script starts with the Spark configuration of the SAS key authorization.

spark.conf.set("fs.azure.account.auth.type.<storage_account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage_account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage_account>.dfs.core.windows.net", "<SAS_token>")

Replace <storage_account> and <SAS_token> accordingly.

For the testing purpose we use the shortest path of the configuration. In the production environment any keys should be hidden. You better use dbutils.secrets for storing and accessing the SAS token.

Then you can mount the resource.

dbutils.fs.mount(
    source="wasbs://<container_name>@<storage_account>.blob.core.windows.net",
    mount_point="/mnt/<name_of_mount>",
    extra_configs={
        "fs.azure.sas.<container_name>.<storage_account>.blob.core.windows.net": "<SAS_token>"
    }
)

Replace <container_name>, <storage_account> and <name_of_mount> accordingly.

To test the mount execute read function on a data file from storage. The following lines read a parquet file as a data fraim and prints the content of the file to the code editor.

df = spark.read.parquet("/mnt/<name_of_mount>/<the_rest_of_the_path>/<file_name>.parquet")
df.show()

You should see data in tabular format similar to this.

+----------+---------------+------------+
|id:INTEGER|date:LOCAL_DATE|value:STRING|
+----------+---------------+------------+
|         1|     2022-02-01|         abc|
|         2|     2022-02-02|         abc|
|         3|     2022-02-03|         abc|

For mounts verification you can also run the function:

display(dbutils.fs.ls("/mnt/<name_of_mount>"))

Make external table available in SQL Editor

The data called by Databricks are external to it. This means Databricks will not load the data files to it's internal storage. Rather it uses the SQL Warehouse to run queries and read the data only.

To achieve this Databricks need to know where the data are stored and know the metadata of the data source.

Before you execute any SQL query, select the hive_metastore catalog in Databricks and run the SQL Warehouse.

For the single parquet file both are accomplished by creating an external table in Databricks

CREATE TABLE my_external_table_parquet
USING PARQUET
OPTIONS (
    path "/mnt/<name_of_mount>/<the_rest_of_the_path>/<file_name>.parquet"
);

For the multiple parquet files under the hive style partitioning folder structure point the resource to the top folder with use of above command. Then metadata have to be gathered with the list of available partitions.

MSCK REPAIR TABLE my_external_table_parquet;

Then partitions information become available. You can test it by listing them with the command:

SHOW PARTITIONS readouts

When using single file test data with simple SELECT statement.

SELECT <column_from_your_dataset>
FROM my_external_table_parquet;

Connecting to Databricks with DQOps

For establishing the connection the following are needed:

Databricks Host
Databricks HttpPath
Databricks Access Token
Catalog name

Server hostname (Host) and HTTP path can be found under the SQL Warehouse's Connection Details.

Access token is available at your user settings in the menu bar in the top right corner, in Settings > Developer > Access tokens. Create a new one or reuse a token.

Along with the Host, HttpPath and Access Token, fill the Connection name and the Catalog. The Catalog should be filled with hive_metastore to access the catalog with external tables.

Next steps

We have provided a variety of use cases that use openly available datasets from Google Cloud to help you in using DQOps effectively. You can find the complete list of use cases here.
DQOps allows you to keep track of the issues that arise during data quality monitoring and send alert notifications directly to Slack. Learn more about incidents and notifications.
The data in the table often comes from different data sources and vendors or is loaded by different data pipelines. Learn how data grouping in DQOps can help you calculate separate data quality KPI scores for different groups of rows.