This article outlines how to use the copy activity in a data pipeline to copy data from and to Azure Databricks.
Prerequisites
To use this Azure Databricks connector, you need to set up a cluster in Azure Databricks.
- To copy data to Azure Databricks, the Copy activity invokes the Azure Databricks cluster to read data from Azure Storage, which is either your original source or a staging area to which the service first writes the source data through the built-in staged copy. Learn more from Azure Databricks as the destination.
- Similarly, to copy data from Azure Databricks, the Copy activity invokes the Azure Databricks cluster to write data to Azure Storage, which is either your original destination or a staging area from which the service continues to write the data to the final destination through the built-in staged copy. Learn more from Azure Databricks as the source.
The Databricks cluster needs access to the Azure Blob storage or Azure Data Lake Storage Gen2 account: both the storage container/file system used for the source, destination, or staging, and the container/file system where you want to write the Azure Databricks tables.
To use Azure Data Lake Storage Gen2, you can configure a service principal on the Databricks cluster as part of the Apache Spark configuration. Follow the steps in Access directly with service principal.
To use Azure Blob storage, you can configure a storage account access key or SAS token on the Databricks cluster as part of the Apache Spark configuration. Follow the steps in Access Azure Blob storage using the RDD API.
During copy activity execution, if the cluster you configured has been terminated, the service automatically starts it. If you author the pipeline by using the authoring UI, you need a live cluster for operations like data preview; the service won't start the cluster on your behalf.
Specify the cluster configuration
In the Cluster Mode drop-down, select Standard.
In the Databricks Runtime Version drop-down, select a Databricks runtime version.
Turn on Auto Optimize by adding the following properties to your Spark configuration:
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
Configure your cluster depending on your integration and scaling needs.
For cluster configuration details, see Configure clusters.
Supported configuration
For the configuration of each tab under copy activity, go to the following sections respectively.
General
For General tab configuration, go to General.
Source
The following properties are supported for Azure Databricks under the Source tab of a copy activity.
The following properties are required:
Connection: Select an Azure Databricks connection from the connection list. If no connection exists, then create a new Azure Databricks connection.
Use query: Select Table or Query.
If you select Table:
Catalog: A catalog serves as the highest-level container within the Unity Catalog framework; it allows you to organize your data into databases and tables.
Database: Select your database from the drop-down list or type the database.
Table: Specify the name of the table to read data. Select the table from the drop-down list or type the table name.
If you select Query:
- Query: Specify the SQL query to read data. For the time travel control, follow the pattern below:
SELECT * FROM events TIMESTAMP AS OF timestamp_expression
SELECT * FROM events VERSION AS OF version
Under Advanced, you can specify the following fields:
- Date format: Format date type to string with a date format. Custom date formats follow the formats at datetime pattern. If not specified, it uses the default value `yyyy-MM-dd`.
- Timestamp format: Format timestamp type to string with a timestamp format. Custom date formats follow the formats at datetime pattern. If not specified, it uses the default value `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`.
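For reference, here is a minimal sketch of how these source settings might appear in a copy activity JSON definition. The property names (connection, query, dateFormat, timestampFormat) come from the Source information table later in this article; the angle-bracket placeholders, the source type name, and the surrounding structure are illustrative assumptions and can differ in your environment.

```json
{
  "name": "CopyFromAzureDatabricks",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "<Azure Databricks source type>",
      "connection": "<your Azure Databricks connection>",
      "query": "SELECT * FROM events VERSION AS OF version",
      "dateFormat": "yyyy-MM-dd",
      "timestampFormat": "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]"
    },
    "sink": {
      "type": "<your destination type>"
    }
  }
}
```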
Direct copy from Azure Databricks
If your destination data store and format meet the criteria described in this section, you can use the Copy activity to directly copy from Azure Databricks to the destination. The service checks the settings and fails the Copy activity run if the following criteria are not met:
The destination connection is Azure Blob storage or Azure Data Lake Storage Gen2. The account credential should be pre-configured in the Azure Databricks cluster configuration; learn more from Prerequisites.
The destination data format is Parquet, DelimitedText, or Avro with the following configurations, and it points to a folder instead of a file.
- For Parquet format, the compression codec is None, snappy, or gzip.
- For DelimitedText format:
  - `rowDelimiter` is any single character.
  - `compression` can be None, bzip2, gzip.
  - `encodingName` UTF-7 is not supported.
- For Avro format, the compression codec is None, deflate, or snappy.
- If copying data to DelimitedText, in the copy activity sink, `fileExtension` needs to be ".csv".
- In the Copy activity mapping, type conversion is not enabled.
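The following sketch shows what a direct copy from Azure Databricks to a DelimitedText folder destination might look like in the copy activity JSON, with `fileExtension` set to ".csv" as required above. The type names, placeholders, and overall structure are illustrative assumptions rather than exact values.

```json
{
  "name": "DirectCopyFromAzureDatabricks",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "<Azure Databricks source type>",
      "connection": "<your Azure Databricks connection>",
      "database": "<your database>",
      "table": "<your table name>"
    },
    "sink": {
      "type": "<DelimitedText sink type on Azure Blob storage or Azure Data Lake Storage Gen2>",
      "fileExtension": ".csv"
    }
  }
}
```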
Staged copy from Azure Databricks
When your sink data store or format doesn't match the direct copy criteria mentioned in the last section, enable the built-in staged copy using an interim Azure storage instance. The staged copy feature also provides you with better throughput. The service exports data from Azure Databricks into the staging storage, then copies the data to the sink, and finally cleans up your temporary data from the staging storage.
To use this feature, create an Azure Blob storage or Azure Data Lake Storage Gen2 connection that refers to the storage account as the interim staging. Then specify the `enableStaging` and `stagingSettings` properties in the Copy activity.
Note
The staging storage account credential should be pre-configured in the Azure Databricks cluster configuration; learn more from Prerequisites.
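As a rough sketch, enabling staged copy in the copy activity JSON might look like the following. Only `enableStaging` and `stagingSettings` are named by this article; the fields inside `stagingSettings` (a staging connection reference and a folder path), the type names, and everything else shown are assumptions for illustration.

```json
{
  "name": "StagedCopyFromAzureDatabricks",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "<Azure Databricks source type>",
      "connection": "<your Azure Databricks connection>",
      "query": "SELECT * FROM events"
    },
    "sink": {
      "type": "<your destination type>"
    },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": "<your interim Azure Blob storage or Azure Data Lake Storage Gen2 connection>",
      "path": "<your staging container or file system>/<folder>"
    }
  }
}
```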
Destination
The following properties are supported for Azure Databricks under the Destination tab of a copy activity.
The following properties are required:
Connection: Select an Azure Databricks connection from the connection list. If no connection exists, then create a new Azure Databricks connection.
Catalog: A catalog serves as the highest-level container within the Unity Catalog framework; it allows you to organize your data into databases and tables.
Database: Select your database from the drop-down list or type the database.
Table: Specify the name of the table to write data. Select the table from the drop-down list or type the table name.
Under Advanced, you can specify the following fields:
- Pre-copy script: Specify a script for Copy Activity to execute before writing data into destination table in each run. You can use this property to clean up the pre-loaded data.
- Timestamp format: Format timestamp type to string with a timestamp format. Custom date formats follow the formats at datetime pattern. If not specified, it uses the default value `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`.
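For reference, a minimal sketch of how these destination settings might appear in the copy activity JSON follows, using the property names from the Destination information table later in this article (connection, catalog, database, table, preCopyScript, timestampFormat). The TRUNCATE statement is only one example of a pre-copy cleanup script, and the sink type name and surrounding structure are illustrative placeholders.

```json
{
  "name": "CopyToAzureDatabricks",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "<your source type>"
    },
    "sink": {
      "type": "<Azure Databricks destination type>",
      "connection": "<your Azure Databricks connection>",
      "catalog": "<your catalog>",
      "database": "<your database>",
      "table": "<your table name>",
      "preCopyScript": "TRUNCATE TABLE <your database>.<your table name>",
      "timestampFormat": "yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]"
    }
  }
}
```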
Direct copy to Azure Databricks
If your source data store and format meet the criteria described in this section, you can use the Copy activity to directly copy from the source to Azure Databricks. The service checks the settings and fails the Copy activity run if the following criteria are not met:
The source connection is Azure Blob storage or Azure Data Lake Storage Gen2. The account credential should be pre-configured in the Azure Databricks cluster configuration; learn more from Prerequisites.
The source data format is Parquet, DelimitedText, or Avro with the following configurations, and it points to a folder instead of a file.
- For Parquet format, the compression codec is None, snappy, or gzip.
- For DelimitedText format:
  - `rowDelimiter` is default, or any single character.
  - `compression` can be None, bzip2, gzip.
  - `encodingName` UTF-7 is not supported.
- For Avro format, the compression codec is None, deflate, or snappy.
- In the Copy activity source:
  - `wildcardFileName` only contains wildcard `*` but not `?`, and `wildcardFolderName` is not specified.
  - `prefix`, `modifiedDateTimeStart`, `modifiedDateTimeEnd`, and `enablePartitionDiscovery` are not specified.
- In the Copy activity mapping, type conversion is not enabled.
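The sketch below illustrates a source configuration that would satisfy these criteria when copying DelimitedText files directly into Azure Databricks: only a `*` wildcard file name is used, and none of the restricted properties are set. The placement of `wildcardFileName` under a `storeSettings` object, the type names, and the overall structure are assumptions for illustration and may differ from your actual pipeline JSON.

```json
{
  "name": "DirectCopyToAzureDatabricks",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "<DelimitedText source type>",
      "storeSettings": {
        "type": "<Azure Blob storage or Azure Data Lake Storage Gen2 read settings>",
        "wildcardFileName": "*.csv"
      }
    },
    "sink": {
      "type": "<Azure Databricks destination type>",
      "connection": "<your Azure Databricks connection>",
      "database": "<your database>",
      "table": "<your table name>"
    }
  }
}
```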
Staged copy to Azure Databricks
When your source data store or format doesn't match the direct copy criteria mentioned in the last section, enable the built-in staged copy using an interim Azure storage instance. The staged copy feature also provides you with better throughput. The service automatically converts the data to meet the data format requirements, writes it into the staging storage, and then loads the data into Azure Databricks from there. Finally, it cleans up your temporary data from the storage.
To use this feature, create an Azure Blob storage or Azure Data Lake Storage Gen2 connection that refers to the storage account as the interim staging. Then specify the `enableStaging` and `stagingSettings` properties in the Copy activity.
Note
The staging storage account credential should be pre-configured in the Azure Databricks cluster configuration; learn more from Prerequisites.
Mapping
For Mapping tab configuration, go to Configure your mappings under mapping tab.
Settings
For Settings tab configuration, go to Configure your other settings under settings tab.
Table summary
The following tables contain more information about the copy activity in Azure Databricks.
Source information
| Name | Description | Value | Required | JSON script property |
| --- | --- | --- | --- | --- |
| Connection | Your connection to the source data store. | < your Azure Databricks connection > | Yes | connection |
| Use query | The way to read data. Apply Table to read data from the specified table or apply Query to read data using queries. | • Table<br>• Query | No | / |
| For Table | | | | |
| Catalog | A catalog serves as the highest-level container within the Unity Catalog framework, it allows you to organize your data into databases and tables. | < your catalog > | No (choose default catalog if it’s null) | catalog |
| Database | Your database that you use as source. | < your database > | No | database |
| Table | Your source data table to read data. | < your table name > | No | table |
| For Query | | | | |
| Query | Specify the SQL query to read data. For the time travel control, follow the below pattern:<br>• `SELECT * FROM events TIMESTAMP AS OF timestamp_expression`<br>• `SELECT * FROM events VERSION AS OF version` | < your query > | No | query |
| Date format | Format string to date type with a date format. Custom date formats follow the formats at datetime pattern. If not specified, it uses the default value `yyyy-MM-dd`. | < your date format > | No | dateFormat |
| Timestamp format | Format string to timestamp type with a timestamp format. Custom date formats follow the formats at datetime pattern. If not specified, it uses the default value `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`. | < your timestamp format > | No | timestampFormat |
Destination information
| Name | Description | Value | Required | JSON script property |
| --- | --- | --- | --- | --- |
| Connection | Your connection to the destination data store. | < your Azure Databricks connection > | Yes | connection |
| Catalog | A catalog serves as the highest-level container within the Unity Catalog framework, it allows you to organize your data into databases and tables. | < your catalog > | No (choose default catalog if it’s null) | catalog |
| Database | Your database that you use as destination. | < your database > | Yes | database |
| Table | Your destination data table to write data. | < your table name > | Yes | table |
| Pre-copy script | Specify a script for Copy Activity to execute before writing data into destination table in each run. You can use this property to clean up the pre-loaded data. | < your pre-copy script > | No | preCopyScript |
| Timestamp format | Format string to timestamp type with a timestamp format. Custom date formats follow the formats at datetime pattern. If not specified, it uses the default value `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`. | < your timestamp format > | No | timestampFormat |