
To deploy DLT tables I am using YAML files that define a Delta Live Tables pipeline. Here is an example configuration.

resources:
  pipelines:
    bronze:
      name: ${var.stage_name}_bronze
      clusters:
        - label: default
          autoscale: ${var.default_dlt_cluster.autoscale}
          spark_conf: ${var.default_dlt_cluster.spark_conf}
      libraries:
        - notebook:
            path: ${workspace.file_path}/bronze
      target: ${var.schema_suffix}bronze
      development: false
      catalog: ${var.default_catalog}

Given the API documentation and the Databricks docs, I can't find a clear way to define external PyPI dependencies directly in the pipeline definition. The suggested approach seems to be adding dependencies with %pip install xyz at the top of the notebook, but that feels suboptimal for managing requirements, especially for ensuring consistent versions and reproducibility across environments. Am I missing a better way to manage external dependencies for DLT pipelines? If so, what is the recommended approach?

3 Answers


The answer is quite simple: you are not missing anything; the official way to do it is via %pip install.
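Even with %pip you can keep things reasonably reproducible by pinning exact versions, or by pointing the install at a shared, version-pinned requirements file stored in the workspace (the file path below is just an example):

%pip install openpyxl==3.1.2

or

%pip install -r /Workspace/Shared/dlt/requirements.txt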

Having said that, I once played around with cluster policies in that regard. The idea was to define the external dependencies in a cluster policy and then use that policy in DLT pipelines.

That seemed to work in principle, BUT it also caused a new issue in my case: it led to the DLT cluster being freshly provisioned/started on every run, which negates the whole "development mode" feature of DLT.
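For reference, the pipeline side of that experiment looked roughly like this: the policy is created separately (with the libraries attached to it) and then referenced from the DLT cluster definition by its ID. This is only a sketch and the policy ID is a placeholder:

resources:
  pipelines:
    bronze:
      clusters:
        - label: default
          policy_id: <policy-id-with-libraries>
[...]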



We are using task libraries for this purpose: https://docs.databricks.com/api/workspace/jobs/create#tasks-libraries

We have this defined in JSON, but I suppose with YAML it would look something like this:

tasks:
    - task_key: my_task
      notebook_task:
        notebook_path: path
      libraries:
        - pypi:
            package: openpyxl==3.1.2

Yes, we have cluster libraries defined in our project as well, but that field is probably not part of the schema and just gets ignored; I could not find it in the documentation.

2 Comments

Thanks for the answer, but this only works for workflows, not DLT pipelines.
Oh, okay, @mizzlosis, I haven't worked with DLT pipelines; I didn't know they are something different.

As a workaround for now, I went with installing the dependencies via an init script on the cluster.

resources:
  pipelines:
    bronze:
      name: ${var.stage_name}_bronze
      clusters:
        - label: default
          autoscale: ${var.default_dlt_cluster.autoscale}
          spark_conf: ${var.default_dlt_cluster.spark_conf}
          init_scripts: 
            - workspace:
                destination: ${workspace.file_path}/resources/init_scripts/cluster_dependencies.sh
[...]
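
The init script itself is just a shell script that pip-installs the pinned packages into the cluster's Python environment. A minimal sketch (the package list is illustrative):

#!/bin/bash
# cluster_dependencies.sh - runs on each cluster node at startup
set -e

# install pinned versions into the cluster's Python environment
/databricks/python/bin/pip install openpyxl==3.1.2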

