Quantcast
Channel: Simple Talk
Viewing all articles
Browse latest Browse all 266

High Concurrency Data Pipelines in Fabric

$
0
0

Data Pipelines can orchestrate many activities, creating a flow for data ingestion. One of these activities is the notebook execution activity.

However, every time a data pipelines executes a notebook, it creates a completely new session and spark pool.

This makes the Data Pipeline very slow and expensive.

How bad it can be

Imagine your pipeline will run a notebook inside a loop. The loop executes the notebook many times.

Each execution means a completely new spark pool. This is expensive.

A screenshot of a computer

Description automatically generated

Besides being expensive, the default configurations for a spark session and a capacity will not support this running in parallel. You will need to limit the number of parallel notebook executions, using the ForEach activity, like in the image below

A screenshot of a computer

Description automatically generated

High Concurrency to the Rescue

The solution is to enable High Concurrency for Data Pipelines running notebooks. This can be done in two steps:

  • Enable this configuration in the workspace settings
  • Configure the session tag in the notebook activity

In the workspace settings, you find this option to be enabled in Spark Settings, like in the image below:

A screenshot of a computer

Description automatically generated

After that, the Session Tag configuration defines which notebook activities will use this feature or not. You can create groups of notebook activities running each group in a different session. You can use any string as “Session Tag”

A screenshot of a computer

Description automatically generated

The High Concurrency Results

The image below shows a comparison between the execution without high concurrency and with high concurrency.

The execution time dropped from almost 13 minutes to less than 3.

A screenshot of a computer

Description automatically generated

References

Fabric Monday 55: Pipelines High Concurrency to Save Yout Time and Money

Summary

If you plan to orchestrate notebooks using Data Pipelines, the High Concurrency configuration is essential for you

The post High Concurrency Data Pipelines in Fabric appeared first on Simple Talk.


Viewing all articles
Browse latest Browse all 266

Trending Articles