
To deploy DLT tables I am using YAML files that define a Delta Live Tables pipeline. Here is an example configuration.

resources:
  pipelines:
    bronze:
      name: ${var.stage_name}_bronze
      clusters:
        - label: default
          autoscale: ${var.default_dlt_cluster.autoscale}
          spark_conf: ${var.default_dlt_cluster.spark_conf}
      libraries:
        - notebook:
            path: ${workspace.file_path}/bronze
      target: ${var.schema_suffix}bronze
      development: false
      catalog: ${var.default_catalog}

Given the API documentation and the Databricks docs, I can't find a clear way to define external PyPI dependencies directly in the pipeline definition. The suggested approach seems to be adding dependencies with %pip install xyz at the top of the notebook, but that feels suboptimal for managing requirements, especially for ensuring consistent versions and reproducibility across environments. Am I missing a better way to manage external dependencies for DLT pipelines? If so, what is the recommended approach?

3 Answers


The answer is quite simple: you are not missing anything; the official way to do it is via %pip install.
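Even with %pip you can keep things reasonably reproducible by pinning exact versions, or by pointing the install at a shared, version-pinned requirements file stored in the workspace (the file path below is just an example):

%pip install openpyxl==3.1.2

or

%pip install -r /Workspace/Shared/dlt/requirements.txt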

Having said that, I once played around with cluster policies in that regard. The idea was to define the external dependencies in a cluster policy and then use that policy in DLT pipelines.

That seemed to work in principle, BUT it also caused a new issue in my case: it led to the DLT cluster being freshly provisioned/started on every run, which negates the whole "development mode" feature of DLT.
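For reference, the pipeline side of that experiment looked roughly like this: the policy is created separately (with the libraries attached to it) and then referenced from the DLT cluster definition by its ID. This is only a sketch and the policy ID is a placeholder:

resources:
  pipelines:
    bronze:
      clusters:
        - label: default
          policy_id: <policy-id-with-libraries>
[...]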



We are using task libraries for this purpose: https://docs.databricks.com/api/workspace/jobs/create#tasks-libraries

We have this defined in JSON, but I suppose with YAML it would look something like this:

tasks:
    - task_key: my_task
      notebook_task:
        notebook_path: path
      libraries:
        - pypi:
            package: openpyxl==3.1.2

Yes, we have cluster libraries defined in our project as well, but that field is probably not part of the schema and just gets ignored; I could not find it in the documentation.

2 Comments

Thanks for the answer, but this only works for workflows, not DLT pipelines.
Oh, okay, @mizzlosis, I haven't worked with DLT pipelines; I didn't know they are something different.

As a workaround for now, I went with installing the dependencies via an init script on the cluster.

resources:
  pipelines:
    bronze:
      name: ${var.stage_name}_bronze
      clusters:
        - label: default
          autoscale: ${var.default_dlt_cluster.autoscale}
          spark_conf: ${var.default_dlt_cluster.spark_conf}
          init_scripts: 
            - workspace:
                destination: ${workspace.file_path}/resources/init_scripts/cluster_dependencies.sh
[...]
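
The init script itself is just a shell script that pip-installs the pinned packages into the cluster's Python environment. A minimal sketch (the package list is illustrative):

#!/bin/bash
# cluster_dependencies.sh - runs on each cluster node at startup
set -e

# install pinned versions into the cluster's Python environment
/databricks/python/bin/pip install openpyxl==3.1.2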

