How to read a CSV file from Synapse link service to a storage data lake

MrFlinstone 706 Reputation points
2025-08-07T15:52:16.0633333+00:00

I am trying to use PySpark in Synapse to read a table/CSV file that resides in a storage account. I have the code below, but there are two issues: it is failing due to access, and I am not sure the syntax is correct either. I have a Synapse Spark pool, and it is not clear whether it uses the managed service principal or my account to log in to the storage account. I know it is not my account, because my account has access and I am able to query the data with OPENROWSET.

This doesn't work:


# Define storage variables
storage_account_name = "my-storage-account"
container_name = "mydataverse-xxxxxxxxxxxxxxxx"
file_pattern = "*2025-05.csv*"

# Build full path (abfss is the secure DFS path for ADLS Gen2)
path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/mytable/{file_pattern}"

# Read CSV with Spark
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(path)

# Show first 100 rows
df.show(100)


And this works:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://my-storage-account.dfs.core.windows.net/mydataverse-xxxxxxxxxxxxxxxx/mytable/*2025-05.csv*',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]

Thanks in advance.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

  1. Pratyush Vashistha 900 Reputation points Microsoft External Staff Moderator
    2025-08-08T05:48:16.4433333+00:00

    Hello MrFlinstone!

    Thanks for sharing the details and posting your query on Microsoft Q&A.

    I understand you're trying to read a CSV file from your storage account using PySpark in Synapse, and while your SQL OPENROWSET query works, the PySpark code is failing due to access issues.

    Here’s a simplified explanation of what’s happening and how you can fix it:

    When you run the SQL query, it uses your user account, which has access to the storage account—so it works.

    But when you use PySpark in Synapse, it doesn’t use your account. Instead, it uses the managed identity of the Synapse workspace or Spark pool.

    Could you check whether that identity currently has permission to access the storage account? If not, please follow the steps below.

    1. Check the Managed Identity
    2. Give It Access
    3. Network Settings
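
    For the "Give It Access" step, one possible approach is to have Spark authenticate through a Synapse linked service that uses the workspace managed identity, after that identity has been granted a data role on the storage account (typically Storage Blob Data Reader). The sketch below is only an illustration under assumptions: the helper name `configure_linked_service_auth` and the linked service name `my_linked_service` are hypothetical, and the two configuration keys are the ones I recall from the Synapse linked-service token provider documentation, so please verify them against the current docs before relying on them.

```python
def configure_linked_service_auth(spark, storage_account_name, linked_service_name):
    """Tell Spark to fetch storage tokens through a Synapse linked service.

    `spark` is the SparkSession that a Synapse notebook provides.
    `linked_service_name` is a linked service you have created yourself
    (the name used in the usage comment below is hypothetical).
    """
    account_fqdn = f"{storage_account_name}.dfs.core.windows.net"

    # Assumed key: binds this storage account to the named linked service.
    spark.conf.set(
        f"spark.storage.synapse.{account_fqdn}.linkedServiceName",
        linked_service_name,
    )
    # Assumed key: token provider used when the linked service authenticates
    # with a managed identity or a service principal.
    spark.conf.set(
        f"fs.azure.account.oauth.provider.type.{account_fqdn}",
        "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider",
    )
    return account_fqdn


# Usage inside a Synapse notebook, before spark.read (not executed here):
#   configure_linked_service_auth(spark, storage_account_name, "my_linked_service")
#   df = spark.read.format("csv").option("header", "true").load(path)
```

    If the read still returns an authorization error after this, the role assignment or the storage firewall is the more likely cause than the PySpark syntax.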

    Finally, once you have granted the right permissions and fixed the code, try running the notebook again. If it still fails, check the Synapse job logs for any access errors and share the error snapshot or logs with us.
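
    Before rerunning, it may also help to rule out a path mismatch by deriving the Spark path from the exact URL that already works in OPENROWSET. The helper below is a minimal sketch using only the Python standard library; the function name `https_to_abfss` is my own and not part of any Synapse API.

```python
from urllib.parse import urlsplit


def https_to_abfss(url: str) -> str:
    """Convert an https://<account>.dfs.core.windows.net/<container>/<path>
    URL into the equivalent abfss://<container>@<account>.../<path> URI."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.netloc.endswith(".dfs.core.windows.net"):
        raise ValueError(f"Not an ADLS Gen2 DFS endpoint URL: {url}")

    # The first path segment is the container; the rest is the file path.
    container, _, file_path = parts.path.lstrip("/").partition("/")
    if not container:
        raise ValueError(f"No container found in URL: {url}")

    return f"abfss://{container}@{parts.netloc}/{file_path}"


# The URL from the working OPENROWSET query in the question.
openrowset_url = (
    "https://my-storage-account.dfs.core.windows.net/"
    "mydataverse-xxxxxxxxxxxxxxxx/mytable/*2025-05.csv*"
)
print(https_to_abfss(openrowset_url))
```

    For the URL in the question this yields the same abfss path that the notebook code builds, which suggests the path is fine and the failure is down to permissions.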

    Let me know how it goes or if you'd like help walking through any of these steps!

    Please "Accept the Answer" if the response is helpful.

    Thanks

    Pratyush

