Note
This article applies to Databricks Connect 15.4 LTS and above.
This article describes how to create a project in your IDE, set up your virtual environment, install Databricks Connect for Python, and run code on serverless compute in your Databricks workspace.
This tutorial uses Python 3.12 and Databricks Connect 16.4 LTS. To use other versions of Python or Databricks Connect, they must be compatible. See the version support matrix.
Requirements
To complete this tutorial, the following requirements must be met:
- Python 3.12 is installed on your local machine.
- Your target Databricks workspace must have Unity Catalog enabled.
- You have an IDE installed, such as Visual Studio Code.
- Your local environment and compute meet the Databricks Connect for Python installation version requirements.
- Serverless compute is enabled in your workspace. See Connect to serverless compute.
- You have the Databricks CLI installed on your local machine. See Install or update the Databricks CLI.
Step 1: Configure Databricks authentication
This tutorial uses Databricks OAuth user-to-machine (U2M) authentication and a Databricks configuration profile for authenticating to your Databricks workspace.
Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace. In the following command, replace `<workspace-url>` with your Databricks workspace instance URL, for example `https://dbc-a1b2345c-d6e7.cloud.databricks.com`:

```bash
databricks auth login --host <workspace-url>
```
The Databricks CLI prompts you to save the information that you entered as a Databricks configuration profile. Press `Enter` to accept the suggested profile name, or enter the name of a new or existing profile. Databricks recommends using `DEFAULT` as your profile name.

In your web browser, complete the on-screen instructions to log in to your Databricks workspace.
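After you log in, the Databricks CLI saves the profile to your `.databrickscfg` file. As a sketch of what to expect (your workspace URL will differ), a freshly created profile looks similar to this:

```ini
[DEFAULT]
host = https://dbc-a1b2345c-d6e7.cloud.databricks.com
auth_type = databricks-cli
```

You can list your saved profiles and check whether they are still valid by running `databricks auth profiles`.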
Step 2: Create a new Python virtual environment
Create your project folder and open it in your IDE. For example, in the Visual Studio Code main menu, click File > Open Folder, select your project folder, and click Open.
Open a terminal window at the project folder root. For example, in the Visual Studio Code main menu, click View > Terminal.
Create a virtual environment for the project called `.venv` at the root of the project folder by running the following command in the terminal:

```bash
python3.12 -m venv .venv
```
Activate your virtual environment:
```bash
# Linux/macOS
source .venv/bin/activate

# Windows
.venv\Scripts\activate
```
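To confirm that the virtual environment is active and points at the expected interpreter, you can run a quick sanity check (optional; the exact output varies by machine):

```bash
python --version   # Should report Python 3.12.x
which python       # Should point into .venv (on Windows, use: where python)
```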
Step 3: Install Databricks Connect
Install Databricks Connect. For information about the latest released version of Databricks Connect 16.4, see Databricks Connect for Databricks Runtime 16.4.
```bash
pip install "databricks-connect==16.4.*"
```
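If the environment already contains PySpark, uninstall it first, because the `databricks-connect` package conflicts with `pyspark`. You can then confirm which Databricks Connect version pip resolved:

```bash
pip uninstall -y pyspark   # Only if pyspark is already installed
pip show databricks-connect
```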
Step 4: Add code and run
Add a new Python file `main.py` to your project.

Enter the following code into the file, replacing the placeholder `<profile-name>` with the name of your configuration profile from Step 1, then save the file. The default configuration profile name is `DEFAULT`.

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.serverless().profile("<profile-name>").getOrCreate()

df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
```
Run the code using the following command:
```bash
python3 main.py
```
Five rows of the table are returned:
```
+--------------------+---------------------+-------------+-----------+----------+-----------+
|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|
+--------------------+---------------------+-------------+-----------+----------+-----------+
| 2016-02-16 22:40:45|  2016-02-16 22:59:25|         5.35|       18.5|     10003|      11238|
| 2016-02-05 16:06:44|  2016-02-05 16:26:03|          6.5|       21.5|     10282|      10001|
| 2016-02-08 07:39:25|  2016-02-08 07:44:14|          0.9|        5.5|     10119|      10003|
| 2016-02-29 22:25:33|  2016-02-29 22:38:09|          3.5|       13.5|     10001|      11222|
| 2016-02-03 17:21:02|  2016-02-03 17:23:24|          0.3|        3.5|     10028|      10028|
+--------------------+---------------------+-------------+-----------+----------+-----------+
```
You have successfully run your first query on Databricks serverless compute using Databricks Connect from your IDE.
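Because `df` is a standard PySpark DataFrame, you can push heavier transformations to serverless compute before results come back to your IDE. A minimal sketch that builds on the same `samples.nyctaxi.trips` table and `<profile-name>` placeholder:

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.serverless().profile("<profile-name>").getOrCreate()

# Filtering and aggregation run on serverless compute;
# only the final rows are returned to the client.
df = spark.read.table("samples.nyctaxi.trips")
busiest_pickups = (
    df.filter(df.trip_distance > 5)
      .groupBy("pickup_zip")
      .count()
      .orderBy("count", ascending=False)
)
busiest_pickups.show(5)
```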
Step 5: Make your code production-ready
For production scenarios, avoid hard-coding compute specifications in the Spark session builder. For example, if you deploy code that calls the `serverless()` API in its Spark session builder to a classic cluster (Standard or Dedicated access mode), a new serverless Spark session is created, with the classic cluster acting only as the client.
To make your code flexible and ready for production, the Spark session builder should not specify any compute parameters:

```python
spark = DatabricksSession.builder.getOrCreate()
```

When this code is run on Databricks, it uses the default global Spark session of the Databricks compute.
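If the same file must also run in environments where Databricks Connect is not installed, one approach is to fall back to the regular PySpark session. This is a sketch of that pattern, not the only valid layout:

```python
def get_spark():
    # Prefer Databricks Connect when it is available (for example, in your IDE).
    try:
        from databricks.connect import DatabricksSession
        return DatabricksSession.builder.getOrCreate()
    except ImportError:
        # Fall back to the active Spark session, for example when the code
        # runs directly on Databricks compute without databricks-connect.
        from pyspark.sql import SparkSession
        return SparkSession.builder.getOrCreate()

spark = get_spark()
```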
To enable serverless compute in your IDE, use the `DEFAULT` configuration profile, which is selected by `DatabricksSession.builder` when no parameters are specified:
Create a configuration profile named `DEFAULT` using the instructions from Step 1.

Use a text editor to open the `.databrickscfg` file, which is found in your `$HOME` user home folder on Unix, Linux, or macOS (`~/.databrickscfg`), or in your `%USERPROFILE%` user home folder on Windows. For example, on macOS:

```bash
nano ~/.databrickscfg
```

Add `serverless_compute_id = auto` to the `DEFAULT` profile:

```ini
[DEFAULT]
host = https://my-workspace.cloud.databricks.com
auth_type = databricks-cli
serverless_compute_id = auto
```
Save the changes and exit your editor.
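To double-check how the CLI resolves the profile you just edited, you can print its settings as environment variables (this command ships with recent versions of the Databricks CLI):

```bash
databricks auth env --profile DEFAULT
```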
Modify your code to use a general Spark session and run it:
```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
```

```bash
python3 main.py
```
You have successfully run your production-ready code on Databricks serverless compute using Databricks Connect from your IDE with the `DEFAULT` configuration profile.
Tip
You can also use environment variables to set the connection to a specific Databricks compute:
- Serverless: `DATABRICKS_SERVERLESS_COMPUTE_ID=auto`
- Classic: `DATABRICKS_CLUSTER_ID=<your_cluster_id>`
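For example, to force serverless compute for a single run without editing `.databrickscfg`, you could set the variable inline (Linux/macOS shell syntax shown):

```bash
DATABRICKS_SERVERLESS_COMPUTE_ID=auto python3 main.py
```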