Expand and read Zip compressed files

You can use the unzip Bash command to expand Zip-compressed (.zip) files or directories of files. In Azure Databricks notebooks, the %sh magic command runs arbitrary Bash code, including the unzip command.

Zip archives differ from compressed Parquet files, which do not need a separate expansion step: Apache Spark provides native codecs for reading and writing them. Most Parquet files written by Azure Databricks end with .snappy.parquet, indicating that they use snappy compression.

Download and unzip the file

Use curl to download the compressed file and then unzip to expand the data. The following example uses a zipped CSV file downloaded from the internet. See Download data from the internet.

%sh curl https://resources.lendingclub.com/LoanStats3a.csv.zip --output /tmp/LoanStats3a.csv.zip
# Extract into /tmp; without -d, unzip writes to the current working directory
unzip /tmp/LoanStats3a.csv.zip -d /tmp
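
If you would rather stay in a Python cell, the expansion step can be approximated with the standard library zipfile module. This is a minimal sketch, not the method this article prescribes: the helper name expand_zip is hypothetical, and the commented paths simply mirror the shell example above.

```python
import zipfile
from pathlib import Path


def expand_zip(zip_path, dest_dir):
    """Extract every member of a Zip archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    dest = Path(dest_dir)
    # Create the destination directory if it does not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Hypothetical usage, assuming the archive was already downloaded to /tmp:
# expand_zip("/tmp/LoanStats3a.csv.zip", "/tmp")
```

Passing the destination directory explicitly avoids depending on the current working directory of the notebook process.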

Move the file to a volume

Now move the expanded file to a Unity Catalog volume:

%sh mv /tmp/LoanStats3a.csv /Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv
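
The same move can be sketched in Python with shutil.move. The helper name move_to_volume is hypothetical, and the sketch assumes the volume directory already exists and is writable from the notebook; it has not been verified against any particular volume configuration.

```python
import shutil
from pathlib import Path


def move_to_volume(src_path, volume_dir):
    """Move src_path into volume_dir, keeping the file name.

    Returns the new path as a string.
    """
    target_dir = Path(volume_dir)
    # Fail early with a clear message instead of creating a stray file.
    if not target_dir.is_dir():
        raise FileNotFoundError(f"Target directory not found: {target_dir}")
    target = target_dir / Path(src_path).name
    shutil.move(str(src_path), str(target))
    return str(target)


# Hypothetical usage, mirroring the paths in the shell example:
# move_to_volume("/tmp/LoanStats3a.csv", "/Volumes/my_catalog/my_schema/my_volume")
```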

In this example, the downloaded data has a comment in the first row and a header in the second. Now that you have expanded and moved the data, use standard options for reading CSV files. Here, skipRows skips the comment row and header treats the next row as the column names:

df = (
    spark.read.format("csv")
    .option("skipRows", 1)
    .option("header", True)
    .load("/Volumes/my_catalog/my_schema/my_volume/LoanStats3a.csv")
)
display(df)
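
To make the effect of skipping one row and then reading a header concrete, the following sketch reproduces the idea with the Python standard library csv module on a small in-memory sample. It illustrates the row handling only and is not how Spark implements these options. The function name read_with_skipped_rows and the sample rows, including the comment text and column names, are made up for illustration.

```python
import csv
import io


def read_with_skipped_rows(text, skip_rows=1):
    """Discard the first skip_rows lines, then parse the rest as CSV,
    using the next line as the header."""
    buffer = io.StringIO(text)
    for _ in range(skip_rows):
        next(buffer, None)  # drop one leading line; ignore if already exhausted
    return list(csv.DictReader(buffer))


# Hypothetical sample: one comment row, one header row, two data rows.
sample = (
    "This first line is a comment, not data\n"
    "id,loan_amnt\n"
    "1,5000\n"
    "2,2500\n"
)

rows = read_with_skipped_rows(sample)
print(rows[0])  # {'id': '1', 'loan_amnt': '5000'}
```

Without the skip, the comment line would be taken as the header and the real column names would show up as a data row.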