Access mirrored Cosmos DB data from Lakehouse in Microsoft Fabric

Microsoft Fabric Lakehouse is a data architecture platform for storing, managing, and analyzing structured and unstructured data in a single location. In this guide, you access mirrored data from a Cosmos DB in Microsoft Fabric database in a lakehouse. You then use a notebook to run a basic query against that data.

Prerequisites

Open the SQL analytics endpoint for the database

Start by accessing the SQL analytics endpoint for the Cosmos DB in Fabric database to ensure that mirroring ran successfully at least once.

  1. Open the Fabric portal (https://app.fabric.microsoft.com).

  2. Navigate to your existing Cosmos DB database.

    Important

    For this guide, the existing Cosmos DB database has the sample data set already loaded. The remaining query examples in this guide assume that you're using the same data set for this database.

  3. In the menu bar, select the Cosmos DB list and then select SQL Endpoint.

    Screenshot of the endpoint selection option in the menu bar for a database in Cosmos DB in Fabric.

  4. Confirm that the SQL analytics endpoint opens. Successfully navigating to the endpoint confirms that mirroring ran at least once.

Connect database to a lakehouse

Next, use Lakehouse to extend the number of tools you can use to analyze your Cosmos DB data. In this step, create a lakehouse and connect it to your mirrored data.

  1. Navigate to the Fabric portal home page.

  2. Select the Create option.

    Screenshot of the option to 'Create' a new resource in the Fabric portal.

  3. If the option to create a lakehouse isn't initially available, select See all.

  4. Within the Data Engineering category, select Lakehouse.

    Screenshot of the option to specifically create a lakehouse in the Fabric portal.

  5. Give the lakehouse a unique name and then select Create.

    Screenshot of the dialog to name a new lakehouse in the Fabric portal.

  6. In the newly created lakehouse's menu, select the Get data option, and then select New shortcut.

  7. Follow the sequential instructions in the various New shortcut dialogs to select your existing mirrored Cosmos DB database, and then select your target table.

    Important

    This guide assumes that you're selecting the SampleData table that's available when you mirror a Cosmos DB database that has the sample data set preloaded.

Run a Spark query in a notebook

Finally, use Spark within a notebook to query the mirrored data that's connected to the lakehouse. For this last step, create a notebook and then run a baseline query, written in Spark SQL syntax, from a PySpark (Python) cell.

  1. In the lakehouse menu, select the Open notebook category and then select New notebook.

  2. In the newly created notebook, create a new PySpark (Python) cell.

  3. Test a SQL query using a combination of the display and spark.sql functions in PySpark. Enter this code into the cell.

    display(spark.sql("""
    SELECT countryOfOrigin AS geography, COUNT(*) AS itemCount
    FROM SampleData
    GROUP BY countryOfOrigin
    ORDER BY itemCount DESC
    LIMIT 5
    """))
    

    Important

    This query uses data found in the sample data set. For more information, see sample data set.

  4. Run the notebook cell.

  5. Observe the output from running the notebook cell. The results are rendered in tabular format.

    geography    itemCount
    Nigeria      21
    Egypt        20
    France       18
    Japan        18
    Argentina    17

    Screenshot of the notebook interface with a single cell and query results in tabular format.
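
The query in the notebook step can also be assembled from parameters, which might help if you want to try the same aggregation on another column or with a different row limit. The following is a minimal sketch, not part of the original guide. The helper name build_top_groups_query and its identifier check are hypothetical and introduced only for illustration. The table name SampleData and the column countryOfOrigin come from the sample data set used in this guide.

```python
import re

# Conservative pattern for unquoted SQL identifiers (an assumption of this sketch).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_top_groups_query(table: str, column: str, limit: int = 5) -> str:
    """Return a Spark SQL string that counts rows per distinct value of `column`."""
    # Reject anything that isn't a plain identifier so that arbitrary text
    # can't be spliced into the SQL string.
    for name in (table, column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"Unsupported identifier: {name!r}")
    if isinstance(limit, bool) or not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        f"SELECT {column} AS geography, COUNT(*) AS itemCount\n"
        f"FROM {table}\n"
        f"GROUP BY {column}\n"
        "ORDER BY itemCount DESC\n"
        f"LIMIT {limit}"
    )


# Hypothetical helper applied to the table and column used in this guide.
query = build_top_groups_query("SampleData", "countryOfOrigin", limit=5)
print(query)

# In a Fabric notebook attached to the lakehouse, the string would be run with:
# display(spark.sql(query))
```

The helper only builds a string, so it runs anywhere Python does. The spark session and the display function are provided by the Fabric notebook environment, which is why that call is left as a comment.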