You can use the `unzip` Bash command to expand Zip (`.zip`) compressed files or directories of files. The Azure Databricks `%sh` magic command enables execution of arbitrary Bash code, including the `unzip` command.
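If you prefer to stay in Python rather than shell out with `%sh`, the standard-library `zipfile` module can do the same expansion. A minimal sketch; the helper name and paths are illustrative, not part of any Databricks API:

```python
import zipfile

def expand_zip(archive_path: str, dest_dir: str) -> list[str]:
    """Expand a .zip archive into dest_dir and return the extracted member names."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

This behaves like `unzip archive.zip -d dest_dir`, but keeps the logic in the same Python process as the rest of the notebook.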
Apache Spark provides native codecs for interacting with compressed Parquet files. Most Parquet files written by Azure Databricks end with `.snappy.parquet`, indicating that they use Snappy compression.
Download and unzip the file
Use `curl` to download the compressed file, and then use `unzip` to expand the data. The following example uses a zipped CSV file downloaded from the internet. See Download data from the internet.
```bash
%sh curl https://resources.lendingclub.com/LoanStats3a.csv.zip --output /tmp/LoanStats3a.csv.zip
unzip /tmp/LoanStats3a.csv.zip
```
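The same two steps can also be combined in Python using only the standard library, which avoids the `%sh` cell entirely. A sketch; the function name is illustrative:

```python
import urllib.request
import zipfile

def download_and_unzip(url: str, zip_path: str, dest_dir: str) -> None:
    """Download a .zip file from url to zip_path, then expand it into dest_dir."""
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
```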
Move the file to a volume
Now move the expanded file to a Unity Catalog volume:
```bash
%sh mv /tmp/LoanStats3a.csv /Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv
```
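Because Unity Catalog volumes are exposed as ordinary paths under `/Volumes`, the move can also be done from Python with the standard library. A minimal sketch; the helper name is illustrative:

```python
import shutil

def move_to_volume(src_path: str, dest_path: str) -> str:
    """Move src_path to dest_path (e.g. a /Volumes/... path) and return the destination."""
    return shutil.move(src_path, dest_path)
```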
In this example, the downloaded data has a comment in the first row and a header in the second. Now that you have moved and expanded the data, use standard options for reading CSV files, for example:
```python
df = spark.read.format("csv").option("skipRows", 1).option("header", True).load("/Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv")
display(df)
```
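To see why skipping one row matters, here is a plain-Python sketch of the same file layout using the standard-library `csv` module. The sample content is invented for illustration; only the structure (comment row, then header row, then data) mirrors the downloaded file:

```python
import csv
import io

# First row is a comment, second is the header, third onward is data.
sample = (
    "This first row is a comment, not data\n"  # hypothetical comment row
    "id,loan_amnt\n"
    "1,5000\n"
)

lines = sample.splitlines()[1:]  # drop the comment row, like skipRows=1
rows = list(csv.DictReader(io.StringIO("\n".join(lines))))  # first remaining row is the header, like header=True
```

After skipping the comment, `DictReader` treats the next row as column names, which is exactly what the `skipRows` and `header` options do in the Spark read above.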