
I'm trying to typecheck notebooks exported from Databricks. The notebooks are plain *.py files with a special comment format to mark where cells begin and end (see the sketch after this list). There's no reason mypy shouldn't be able to typecheck these files, except for a handful of names that are never defined in the file itself:

  • spark
  • sc
  • dbutils
  • display
  • displayHTML
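
For reference, an exported notebook looks roughly like this (these are the cell markers from my own exports; details may vary by Databricks version):

# Databricks notebook source
df = spark.read.parquet("/mnt/events")  # `spark` is never defined in the file

# COMMAND ----------

display(df)  # neither is `display`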

I know that the python command will run a file specified by the PYTHONSTARTUP environment variable before dumping you into interactive mode. This is how these names get defined to begin with.
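
For illustration, a minimal PYTHONSTARTUP file might look like the sketch below (my own guess; the real Databricks bootstrap is more involved):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext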

Is there a hook in mypy that lets you define names like these outside the code?

2 Answers


Here is what I did:

  1. Create utils/spark_utils.py with the following content:
from typing import Any, Optional

from pyspark.sql import SparkSession


def spark() -> SparkSession:
    # Reuses the active session if one exists; otherwise creates one.
    return SparkSession.builder.getOrCreate()


def dbutils(spark_session: Optional[SparkSession] = None) -> Any:
    spark_session = spark_session or spark()
    conf = spark_session.conf.get("spark.master")
    if "local" in conf:
        # Local run via databricks-connect, which ships its own DBUtils.
        from pyspark.dbutils import DBUtils  # type: ignore

        return DBUtils(spark_session)
    else:
        # On a Databricks cluster, reuse the dbutils instance that the
        # notebook runtime injects into the IPython user namespace.
        import IPython  # type: ignore

        return IPython.get_ipython().user_ns["dbutils"]
  2. Use these utility functions anywhere in your Databricks notebooks:
import utils.spark_utils as spark_utils

spark = spark_utils.spark()
dbutils = spark_utils.dbutils(spark)

s3_input_path = dbutils.widgets.get("s3_input_path")
df = spark.read.parquet(s3_input_path)
df.show()

Please note that pyspark.dbutils is part of the databricks-connect package; the standard pyspark distribution doesn't include it.
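
The same pattern could cover the question's remaining names. Here is a hedged sketch of no-op stand-ins for display and displayHTML (the signatures are my guess, not an official API):

from typing import Any


def display(expr: Any) -> None:
    # Outside a notebook there is nothing to render; print as a fallback.
    print(expr)


def displayHTML(html: str) -> None:
    print(html)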

1

Here's the answer I came up with. It's dirty, but it works; I'd still love to see a cleaner approach.

The strategy is to use a shell script to prepend the "PYTHONSTARTUP" file to each notebook, run mypy on the concatenation, and subtract the prelude's length from the line numbers in the output. For example, with a 40-line prelude and one separator blank line, an error that mypy reports at line 55 actually belongs to line 55 - 40 - 1 = 14 of the notebook.

typecheck.sh:

#!/bin/bash

TARGET=$1

# Define the contents of "PYTHONSTARTUP" file inline. This just
# makes it easier to copy & paste this script elsewhere. You could also 
# make it a separate *.py file.
PRELUDE="$(cat <<EOF
import typing

import pyspark
import pyspark.sql

spark = None  # type: pyspark.sql.SparkSession
sc = None  # type: pyspark.SparkContext

def display(expr):
    pass

def displayHTML(expr):
    pass

class dbutils:
    class fs:
        def help(): pass
        def cp(from_: str, to: str, recurse: bool = False) -> bool: pass
        def head(file: str, maxBytes: int) -> str: pass
        def ls(dir: str) -> typing.List[str]: pass
        def mkdirs(dir: str) -> bool: pass
        def put(file: str, contents: str, overwrite: bool = False) -> bool: pass
        def rm(dir: str, recurse: bool) -> bool: pass
        def mount(source: str, mountPoint: str, encryptionType: str = "", owner: str = "", extraConfigs: typing.Dict[str, str] = {}) -> bool: pass
        def mounts() -> typing.List[str]: pass
        def refreshMounts() -> bool: pass
        def unmount(mountPoint: str) -> bool: pass
    class notebook:
        def exit(value: str): pass
        def run(path: str, timeout: int, arguments: typing.Dict[str, str]) -> str: pass
    class widgets:
        def combobox(name: str, defaultValue: str, choices: typing.List[str], label: str = ""): pass
        def dropdown(name: str, defaultValue: str, choices: typing.List[str], label: str = ""): pass
        def get(name: str) -> str: pass
        def multiselect(name: str, defaultValue: str, choices: typing.List[str], label: str = ""): pass
        def remove(name: str): pass
        def removeAll(): pass
        def text(name: str, defaultValue: str, label: str = ""): pass

def getArgument(name: str) -> str: pass
EOF
)"

# Remember the length of $PRELUDE so that we can subtract the line number
LEN="$(echo "$PRELUDE" | wc -l | awk '{ print $1 }')"

for file in $(find "$TARGET" -name '*.py'); do
  # run mypy for the two files concatenated together (with a blank line 
  # for good measure)
  OUTPUT=$(mypy -c "$(cat <<EOF
$PRELUDE

$(cat $file)
EOF
)")
  # awk: Take only output where the line number is after the PRELUDE. Also, fix the file name and line number
  FILE_OUTPUT="$(echo "$OUTPUT" | awk -F: '$2 > '$LEN' { line=($2-'$LEN')-1; $1=""; $2=""; print "'$file':" line ":" $0 }')"

  # Remove blank lines from output before printing
  if [[ $(echo "$FILE_OUTPUT" | sed '/^$/d' | wc -l) -gt 0 ]]; then
    echo "$FILE_OUTPUT"

    # Keep track of all real output, so we can decide the exit code
    ALL_OUTPUT+="$FILE_OUTPUT"
  fi
done

# Propagate errors to the exit code (errors in the prelude were already
# filtered out by the awk step above). This makes the script usable in a
# CI pipeline.
if [[ -n "$ALL_OUTPUT" ]]; then
  exit 1
else
  exit 0
fi

Usage:

./typecheck.sh notebooks/
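
If you'd rather stay in Python, roughly the same approach can be written against mypy's programmatic API. Here's a sketch under the assumption that the prelude above lives in a separate prelude.py; names like typecheck_notebook are mine:

import re
from pathlib import Path
from typing import List

from mypy import api

PRELUDE = Path("prelude.py").read_text().rstrip("\n")
PRELUDE_LEN = len(PRELUDE.splitlines())


def typecheck_notebook(path: str) -> List[str]:
    # Prelude + blank separator line + notebook, as in the shell script.
    source = PRELUDE + "\n\n" + Path(path).read_text()
    stdout, _stderr, _status = api.run(["-c", source])
    errors = []
    for line in stdout.splitlines():
        match = re.match(r"<string>:(\d+):(.*)", line)
        # Keep only messages past the prelude and shift line numbers back
        # (the extra 1 accounts for the separator line).
        if match and int(match.group(1)) > PRELUDE_LEN:
            shifted = int(match.group(1)) - PRELUDE_LEN - 1
            errors.append(f"{path}:{shifted}:{match.group(2)}")
    return errors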
