How to loop through parquet files using pyarrow to extract metadata within a job on a cluster?

Moritz Damm 0 Reputation points
2025-08-11T13:08:54.2+00:00

Hi,

my scenario is the following:

"Loop through parquet files stored in different sub-directories and extract their metadata to build a master table."

During interactive development this worked fine: I used the pyarrow package and looped through the parquet files in separate sub-directories with AzureMachineLearningFileSystem and wildcards:

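from azureml.fsspec import AzureMachineLearningFileSystem

# path_fs is an AzureML datastore URI (azureml://...)
fs = AzureMachineLearningFileSystem(path_fs)
list_files = fs.glob("**.parquet")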

However, when I run the script as a component in a pipeline, it fails with the following error message:

"Error Message: /mnt/azureml/cr/j/xxx/cap/data-capability/wd/INPUT_raw_iba is not a valid datastore uri, data asset uri, or registry uri."

How can I get this scenario running? It is a common use case in our projects.

Azure Machine Learning

1 answer

  1. Aryan Parashar 150 Reputation points Microsoft External Staff Moderator
    2025-08-12T10:45:03.3733333+00:00

    Hi Moritz,
    When a script runs as a pipeline component, AzureMachineLearningFileSystem still requires a valid datastore URI, data asset URI, or registry URI. The path in your error message is a local path on the compute node, not a valid AzureML URI.

    A valid AzureML URI looks like this:

    azureml://subscriptions/<subscription-id>/resourcegroups/<resource-group>/workspaces/<workspace-name>/datastores/<datastore-name>/paths/<path-to-file>
    #This is the correct AzureML URI
    

    Make sure the URI you are using is the correct AzureML URI.
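
As a quick sanity check, the URI format above can be assembled and validated in plain Python before it is handed to AzureMachineLearningFileSystem. This is only a minimal sketch: the helper names `build_datastore_uri` and `is_datastore_uri` are hypothetical, and the pattern covers only the long-form datastore URI shown above, not data asset or registry URIs.

```python
import re

# Long-form datastore URI as shown above. Assumption: the individual
# segments (subscription, resource group, ...) contain no "/" characters.
_DATASTORE_URI = re.compile(
    r"^azureml://subscriptions/[^/]+/resourcegroups/[^/]+"
    r"/workspaces/[^/]+/datastores/[^/]+/paths/.*$",
    re.IGNORECASE,
)


def build_datastore_uri(subscription_id, resource_group, workspace, datastore, path=""):
    """Assemble a datastore URI from its parts (hypothetical helper)."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path.lstrip('/')}"
    )


def is_datastore_uri(candidate):
    """Return True if candidate has the long-form datastore URI shape."""
    return _DATASTORE_URI.match(candidate) is not None
```

A local path such as the one in the error message does not match this pattern, which is consistent with the error being raised.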

    The AzureMachineLearningFileSystem helper expects an Azure ML datastore URI.

    Here is the documentation for your reference: https://learn.microsoft.com/en-us/azure/machine-learning/concept-data?view=azureml-api-2

    Also, make sure the YAML definition of the pipeline component declares an input that matches the data you are passing in.
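
The path in the error message looks like the location where the pipeline made the input folder available on the compute node. Assuming that is the case (for example, a `uri_folder` input that is mounted), one option is to skip AzureMachineLearningFileSystem inside the component and walk that local folder directly. The sketch below is not a confirmed fix for this job; the function names are hypothetical, and it assumes pyarrow is installed in the component's environment.

```python
from pathlib import Path


def find_parquet_files(root):
    """Return all .parquet files below root, sorted for a stable order."""
    return sorted(Path(root).rglob("*.parquet"))


def collect_metadata(root):
    """Read footer metadata of every parquet file below root.

    Only the file footers are read, not the data itself.
    """
    # Imported here so that file discovery works without pyarrow installed.
    import pyarrow.parquet as pq

    rows = []
    for file in find_parquet_files(root):
        meta = pq.read_metadata(str(file))
        rows.append(
            {
                "file": str(file.relative_to(root)),
                "num_rows": meta.num_rows,
                "num_row_groups": meta.num_row_groups,
                "num_columns": meta.num_columns,
                "created_by": meta.created_by,
            }
        )
    return rows
```

Inside the component script, `collect_metadata` would be called with the input path the job passes in on the command line, and the resulting list of dictionaries can be turned into the master table.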

    Feel free to accept this as an answer.

    Thank you for reaching out to the Microsoft QNA Portal.
