Fast copy in Dataflow Gen2

Fast copy helps you move large amounts of data faster in Dataflow Gen2. Think of it as switching to a more powerful engine when you need to handle terabytes of data.

When you're working with dataflows, you ingest data first and then transform it. Dataflow scale-out with SQL DW compute lets you transform data at scale; fast copy takes care of the ingestion part by pairing the familiar dataflow authoring experience with the powerful backend of the pipeline Copy activity.

Here's how it works: after you enable fast copy, dataflows automatically switch to the faster backend when your data size exceeds the thresholds described in the Prerequisites section. You don't need to change anything while building your dataflows. After your dataflow refreshes, you can check the refresh history to see whether fast copy was used by looking at the Engine type listed there.

If you enable the Require fast copy option, the dataflow refresh stops if fast copy can't be used for some reason. This helps you avoid waiting for a timeout and can be helpful when debugging. You can use the fast copy indicators in the query steps pane to check if your query can run with fast copy.

Screenshot showing where the fast copy indicator appears in the query steps pane.

Prerequisites

Before you can use fast copy, you'll need:

  • A Fabric capacity
  • For file data: CSV or Parquet files that are at least 100 MB and stored in Azure Data Lake Storage (ADLS) Gen2 or blob storage
  • For databases (including Azure SQL DB and PostgreSQL): 5 million rows or more of data in the data source

Note

You can bypass the threshold to force fast copy by selecting the Require fast copy setting.

Connector support

Fast copy works with these Dataflow Gen2 connectors:

  • ADLS Gen2
  • Blob storage
  • Azure SQL DB
  • Lakehouse
  • PostgreSQL
  • On-premises SQL Server
  • Warehouse
  • Oracle
  • Snowflake

Transformation limitations

When connecting to file sources, copy activity only supports these transformations:

  • Combine files
  • Select columns
  • Change data types
  • Rename a column
  • Remove a column
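
As a minimal sketch in Power Query M, a file-source query that stays within these transformations might look like the following. The storage account, container, file, and column names are placeholders, and the navigation steps approximate what the Get data experience generates; the fast copy indicators confirm whether each step actually qualifies.

    let
        // Navigate to a CSV file in an ADLS Gen2 container (account, container, and file names are placeholders)
        Source = AzureStorage.DataLake("https://contosoadls.dfs.core.windows.net/sales"),
        OrdersCsv = Csv.Document(Source{[Name = "orders.csv"]}[Content]),
        Promoted = Table.PromoteHeaders(OrdersCsv),
        // Select columns
        Selected = Table.SelectColumns(Promoted, {"OrderId", "OrderDate", "Amount", "Region"}),
        // Change data types
        Typed = Table.TransformColumnTypes(Selected, {{"OrderDate", type date}, {"Amount", type number}}),
        // Rename a column
        Renamed = Table.RenameColumns(Typed, {{"Amount", "SalesAmount"}}),
        // Remove a column
        Trimmed = Table.RemoveColumns(Renamed, {"Region"})
    in
        Trimmed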

If you need other transformations, you can split your work into separate queries. Create one query to get the data and another query that references the first one. This way, you can use DW compute for the transformations.

For SQL sources, any transformation that's part of the native query works fine.
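
For example, here's a sketch of an Azure SQL DB source where the filtering and aggregation are part of the native query itself (server, database, table, and column names are placeholders):

    let
        // The WHERE clause and GROUP BY run inside the native query,
        // so they don't count against the transformation limits above
        Source = Sql.Database(
            "contoso-sql.database.windows.net",
            "SalesDb",
            [Query = "SELECT Region, SUM(Amount) AS TotalAmount FROM dbo.Orders WHERE OrderDate >= '2024-01-01' GROUP BY Region"]
        )
    in
        Source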

Output destinations

Right now, fast copy only supports loading directly to a Lakehouse destination. If you want to use a different output destination, you can stage the query first and reference it in a later query with your preferred destination.

How to use fast copy

Here's how to set up and use fast copy:

  1. In Fabric, go to a premium workspace and create a Dataflow Gen2.

  2. On the Home tab of your new dataflow, select Options:

    Screenshot showing where to select the Options for Dataflow Gen2 on the Home tab.

  3. In the Options dialog, select the Scale tab, then turn on Allow use of fast copy connectors. Close the Options dialog when you're done.

    Screenshot showing where to enable fast copy on the Scale tab of the Options dialog.

  4. Select Get data, choose the ADLS Gen2 source, and fill in the details for your container.

  5. Select the Combine button.

    Screenshot showing the Preview folder data window with the Combine option highlighted.

  6. To make sure fast copy works, only apply transformations listed in the Transformation limitations section. If you need other transformations, stage the data first and reference the staged query in a later query. Apply your other transformations to the referenced query. (A sketch of a query that stays within these limits follows these steps.)

  7. (Optional) You can require fast copy for the query by right-clicking the query and selecting Require fast copy.

    Screenshot showing where to select the Require fast copy option on the right-click menu for a query.

  8. (Optional) Right now, you can only set up a Lakehouse as the output destination. For any other destination, stage the query and reference it later in another query where you can output to any destination.

  9. Check the fast copy indicators to make sure your query can run with fast copy.

  10. Publish the dataflow.

  11. After the refresh completes, check the refresh history to confirm that fast copy was used. If it was, the Engine type shows CopyActivity.

    Screenshot showing the refresh details indicating the pipeline CopyActivity engine was used.
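
For reference, the query produced by steps 4 through 6 might look roughly like the following M sketch. The Combine button actually builds helper queries behind the scenes, so this is a simplified approximation, and the account, container, and column names are placeholders.

    let
        // List the files in the ADLS Gen2 container (account and container names are placeholders)
        Source = AzureStorage.DataLake("https://contosoadls.dfs.core.windows.net/sales"),
        // Keep only the CSV files, as the Combine experience does
        CsvFiles = Table.SelectRows(Source, each Text.EndsWith([Name], ".csv")),
        // Parse each file and combine the results into a single table
        Parsed = Table.AddColumn(CsvFiles, "Data", each Table.PromoteHeaders(Csv.Document([Content]))),
        Combined = Table.Combine(Parsed[Data])
        // Any further steps should stay within the Transformation limitations listed earlier
    in
        Combined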

How to split your query to use fast copy

When you're working with large amounts of data, you can get the best performance by using fast copy to ingest data into staging first, then transform it at scale with SQL DW compute.

Fast copy indicators help you figure out how to split your query into two parts: data ingestion to staging and large-scale transformation with SQL DW compute. Try to push as much of your query evaluation to fast copy as possible for data ingestion. When the fast copy indicators show that the remaining steps can't run with fast copy, you can split the rest of the query with staging enabled.

Step diagnostics indicators

  • This step will be evaluated with fast copy (yellow icon): the query up to this step supports fast copy.
  • This step isn't supported by fast copy (red icon): this step doesn't support fast copy.
  • One or more steps in your query aren't supported by fast copy: some steps in this query support fast copy, while others don't. To optimize, split the query: yellow steps (potentially supported by fast copy) and red steps (not supported).

Step-by-step guidance

After you complete your data transformation logic in Dataflow Gen2, the fast copy indicator evaluates each step to figure out how many steps can use fast copy for better performance.

In this example, the last step shows a red icon, which means the Group By step isn't supported by fast copy. However, all the previous steps with yellow icons can potentially be supported by fast copy.

Screenshot showing the contents of the first query with the last step in red.

If you publish and run your Dataflow Gen2 at this point, it won't use the fast copy engine to load your data.

Screenshot showing the result of the query without fast copy enabled.

To use the fast copy engine and improve your Dataflow Gen2 performance, you can split your query into two parts: data ingestion to staging and large-scale transformation with SQL DW compute. Here's how:

  1. Delete any transformations showing red icons (which means they aren't supported by fast copy) along with the destination (if you defined one).

    Screenshot showing the first query, where you deleted any steps that don't support fast copy.

  2. The fast copy indicator now shows green for the remaining steps, which means your first query can use fast copy for better performance.

    Right-click on your first query, select Enable staging, then right-click on your first query again and select Reference.

    Screenshot showing the selections required to reference your fast copy query with a second query.

  3. In your new referenced query, add back the Group By transformation and the destination (if applicable). A sketch of what the two resulting queries might look like follows these steps.

  4. Publish and refresh your Dataflow Gen2. You now have two queries, and the overall refresh duration is shorter.

    • The first query ingests data into staging using fast copy.

    • The second query does large-scale transformations using SQL DW compute.

      Screenshot of the details of the run state showing the results of the query.

    The first query details:

    Screenshot showing the results of the data ingestion.

    The second query details:

    Screenshot showing the results of the transform steps.
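
Putting it together, the two queries from this walkthrough might look something like the following M sketch. Query, account, and column names are placeholders; staging for the first query and the Lakehouse destination for the second are configured in the UI rather than in M.

    // Query1: ingestion only, with every step supported by fast copy and staging enabled
    let
        Source = AzureStorage.DataLake("https://contosoadls.dfs.core.windows.net/sales"),
        CsvFiles = Table.SelectRows(Source, each Text.EndsWith([Name], ".csv")),
        Parsed = Table.AddColumn(CsvFiles, "Data", each Table.PromoteHeaders(Csv.Document([Content]))),
        Combined = Table.Combine(Parsed[Data]),
        Typed = Table.TransformColumnTypes(Combined, {{"OrderDate", type date}, {"Amount", type number}})
    in
        Typed

    // Query2: applies the Group By at scale with SQL DW compute
    let
        Source = Query1,  // reference to the first (staged) query by name
        Grouped = Table.Group(Source, {"Region"}, {{"TotalAmount", each List.Sum([Amount]), type number}})
    in
        Grouped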

Known limitations

Here are the current limitations for fast copy:

  • Fast copy through an on-premises data gateway requires gateway version 3000.214.2 or newer.
  • Fixed schema isn't supported.