How to read a csv file from synapse link service to a storage data lake

Question

MrFlinstone 706

I am trying to use PySpark in Synapse to read a table/CSV file that resides in a storage account. I have the code below, but there are two problems: it is failing due to access, and I am not sure the syntax is correct either. I have a Synapse Spark pool, and it is not clear whether it uses the managed service principal or my own account to log in to the storage account. I assume it is not my account, because my account has access and I am able to query the data with OPENROWSET.

This doesn't work:


# Define storage variables
storage_account_name = "my-storage-account"
container_name = "mydataverse-xxxxxxxxxxxxxxxx"
file_pattern = "*2025-05.csv*"

# Build full path (abfss is the secure DFS path for ADLS Gen2)
path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/mytable/{file_pattern}"

# Read CSV with Spark
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(path)

# Show first 100 rows
df.show(100)

And this works:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://my-storage-account.dfs.core.windows.net/mydataverse-xxxxxxxxxxxxxxxx/mytable/*2025-05.csv*',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0'
    ) AS [result]

Thanks in advance.

1 answer

Answer 1

Pratyush Vashistha 900 Microsoft External Staff Moderator

Hello MrFlinstone!

Thanks for sharing the details and posting your query on Microsoft Q&A.

I understand you're trying to read a CSV file from your storage account using PySpark in Synapse, and while your SQL OPENROWSET query works, the PySpark code is failing due to access issues.

Here’s a simplified explanation of what’s happening and how you can fix it:

When you run the SQL query, it uses your user account, which has access to the storage account—so it works.

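But when you use PySpark in Synapse, the request is not necessarily made with your account. Depending on how the notebook runs (for example, from a pipeline or through a linked service), it can use the managed identity of the Synapse workspace instead.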

Could you check whether that identity currently has permission to access the storage account? If not, please follow the steps below.

Check the Managed Identity

- Go to your Synapse workspace in the Azure portal.
- Under Identity, make sure System Assigned is turned on.
- Copy the Object ID of that identity. Learn how: https://docs.azure.cn/en-us/data-factory/credentials?tabs=data-factory

Give It Access

- Go to your storage account → Access Control (IAM).
- Add a role assignment: choose Storage Blob Data Reader or Storage Blob Data Contributor. Learn how: https://learn.microsoft.com/en-us/azure/storage/blobs/assign-azure-role-data-access?tabs=portal
- Assign it to the managed identity you just found.

Network Settings
- If your storage account has firewall rules or is behind a VNet, make sure Synapse can reach it. You might need to allow Trusted Microsoft Services or set up a Private Endpoint. See: https://learn.microsoft.com/en-us/azure/synapse-analytics/security/connect-to-a-secure-storage-account
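
If the aim is for the Spark session to authenticate through a linked service (and so through the identity configured on it) rather than relying on whichever identity the notebook picks up by default, a minimal sketch might look like the following. It assumes a linked service to the storage account already exists in the workspace and is set to authenticate with the managed identity. The helper `use_linked_service_auth` and the name `MyAdlsLinkedService` are hypothetical, and the two configuration keys should be checked against the current Synapse documentation for your Spark runtime.

```python
# Sketch only: the helper name and the linked service name are illustrative.
TOKEN_PROVIDER = (
    "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider"
)


def use_linked_service_auth(spark, linked_service_name):
    """Tell the ABFS driver to obtain tokens through a Synapse linked service.

    `spark` is the session that a Synapse notebook provides; anything with a
    `conf.set(key, value)` method works, which keeps the helper easy to test.
    Returns the settings that were applied.
    """
    settings = {
        "spark.storage.synapse.linkedServiceName": linked_service_name,
        "fs.azure.account.oauth.provider.type": TOKEN_PROVIDER,
    }
    for key, value in settings.items():
        spark.conf.set(key, value)
    return settings


# In a Synapse notebook (assumed usage, not run here):
# use_linked_service_auth(spark, "MyAdlsLinkedService")
# df = spark.read.option("header", "true").csv(path)
```

With these settings applied before `spark.read`, storage requests should be issued with the identity behind the linked service, so the role assignment above is the one that matters.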

Finally, once you’ve given the right permissions and fixed the code:

Try running the notebook again.

If it still fails, check the Synapse job logs for any access errors and share the error snapshot or logs with us.
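
Before rerunning, it may also be worth confirming that the abfss path in the notebook points at the same location as the URL that already works in OPENROWSET. The sketch below converts one form to the other so the two can be compared. The function name `https_to_abfss` is hypothetical, and it assumes the URL uses the `dfs.core.windows.net` endpoint with the container as the first path segment, as in the question.

```python
from urllib.parse import urlparse


def https_to_abfss(url):
    """Convert https://<account-host>/<container>/<path> to an abfss:// URI."""
    parsed = urlparse(url)
    host = parsed.netloc
    # The first path segment is the container; the rest is the file path.
    container, _, file_path = parsed.path.lstrip("/").partition("/")
    if not host or not container:
        raise ValueError(f"Cannot find account host and container in {url!r}")
    return f"abfss://{container}@{host}/{file_path}"


openrowset_url = (
    "https://my-storage-account.dfs.core.windows.net/"
    "mydataverse-xxxxxxxxxxxxxxxx/mytable/*2025-05.csv*"
)
print(https_to_abfss(openrowset_url))
```

If the printed value matches the `path` variable built in the notebook, the path syntax is unlikely to be the cause, and the failure is more likely an access or network issue.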

Let me know how it goes or if you'd like help walking through any of these steps!

Please "Accept the Answer" if the response is helpful.

Thanks

Pratyush

Pratyush Vashistha 900 Reputation points Microsoft External Staff Moderator

2025-08-11T03:10:15.6833333+00:00

Hello MrFlinstone!

Just checking whether you have had a chance to look at my previous response and whether it helped. Do let me know if you have any further questions on this. If it helped, kindly "Accept as an Answer".

Thanks

Pratyush

MrFlinstone 706

I am still getting errors. The same piece of code works when I run it in a different Synapse environment. I have granted myself the Storage Blob Data Contributor role and am running the code below from a Synapse notebook.

from pyspark.sql import SparkSession

# Storage location of the file to read
account = 'storage-account'
container = 'testcontainer'
filename = 'sales.csv'
read_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % (container, account, filename)
read_path

# The option name is "delimiter"; a misspelled option name is silently ignored
df = spark.read.option("header", "false") \
        .option("delimiter", ",") \
        .csv(read_path)
display(df)

I then get the error below, which appears to be permission related, but I am not sure what other permission I need to assign myself on the storage account. Is there anywhere on the storage account where I can check which principal is trying to access the resource and failing? That way I could work backwards and know which principal to grant access.

: java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD, https://storage-account.dfs.core.windows.net/testcontainer/?upn=false&action=getAccessControl&timeout=90
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1443)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:652)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:640)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1760)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.exists(AzureBlobFileSystem.java:1236)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4(DataSource.scala:757)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$4$adapted(DataSource.scala:755)
	at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:393)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: Operation failed: "This request is not authorized to perform this operation.", 403, HEAD,
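
The first line of the trace already carries most of the useful detail: the HTTP status, the verb, the target URL and the `action` being attempted. The sketch below pulls those fields out of an ABFS error message so they can be compared with the storage account's diagnostic logs. The helper name `summarize_abfs_error` is hypothetical, and the pattern is written against the message format shown above, so it may need adjusting for other driver versions.

```python
import re
from urllib.parse import parse_qs, urlparse

# Matches: Operation failed: "<message>", <status>, <verb>, <url>
ABFS_ERROR = re.compile(
    r'Operation failed: "(?P<message>[^"]*)", '
    r"(?P<status>\d{3}), (?P<verb>[A-Z]+), (?P<url>\S+)"
)


def summarize_abfs_error(text):
    """Return the key fields of an ABFS 'Operation failed' message, or None."""
    match = ABFS_ERROR.search(text)
    if match is None:
        return None
    parsed = urlparse(match["url"])
    query = parse_qs(parsed.query)
    return {
        "message": match["message"],
        "status": int(match["status"]),
        "verb": match["verb"],
        "account_host": parsed.netloc,
        "path": parsed.path,
        "action": query.get("action", [None])[0],
    }


error_line = (
    'java.nio.file.AccessDeniedException: Operation failed: '
    '"This request is not authorized to perform this operation.", 403, HEAD, '
    "https://storage-account.dfs.core.windows.net/testcontainer/"
    "?upn=false&action=getAccessControl&timeout=90"
)
print(summarize_abfs_error(error_line))
```

Here the failing call is a `getAccessControl` request on the container root, not on the file itself. The error text does not name the principal. If diagnostic logging is enabled on the storage account, the matching 403 entry should show the requester, which is probably the closest thing to "which principal is failing". As a hint only: this exact wording, without "using this permission" at the end, is often associated with network rules on the storage account rather than a missing role, so the firewall settings are worth checking as well.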
