I have an Azure Data Factory ingestion pipeline that uses a Databricks notebook to clean illegal characters out of column names before saving the data. The result is eventually written to a separate database.table. The code I settled on works, but it is really inefficient: about five hours for an Excel sheet with over 350 columns. I need a more efficient approach to cut the run time down.

# Replace illegal characters in column names
for column in df.columns:
    df = df.withColumnRenamed(column, column.lstrip())
for column in df.columns:
    df = df.withColumnRenamed(column, column.rstrip())
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace(">", "greaterthan"))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("?", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("!", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("#", "number"))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("&", "and"))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("$", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("/", "_"))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("-", "_"))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace(",", "_"))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("(", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace(")", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("{", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("}", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("=", "equals"))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("\n", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("\t", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("'", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace(".", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("+", "plus"))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace(":", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("...", ""))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace(" ", "_"))
for column in df.columns:
    df = df.withColumnRenamed(column, column.replace("__", ""))

1 Answer

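Your approach is inefficient because it makes a lot of unnecessary passes: every rule loops over all the columns and calls withColumnRenamed on each one, even when the name doesn't change. You can instead apply all the rules to each column name in a single iteration, along these lines: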

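import re

# Characters that are simply dropped from the name.
# Existing single underscores are kept, as in the original code.
re_blank = re.compile(r'''[?!$(){}\n\t':.]''')
# Characters that become an underscore
re_underscore = re.compile(r'''[-/, ]''')

for column in df.columns:
    # Trim surrounding whitespace, then spell out the symbols that map to words
    renamed = (column
               .strip()
               .replace(">", "greaterthan")
               .replace("#", "number")
               .replace("&", "and")
               .replace("=", "equals")
               .replace("+", "plus"))

    renamed = re_blank.sub('', renamed)
    renamed = re_underscore.sub('_', renamed)
    # Same as the original final step: drop doubled underscores
    renamed = renamed.replace("__", "")
    df = df.withColumnRenamed(column, renamed)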
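
The loop above still calls withColumnRenamed once per column. A possible further step, sketched here under assumptions rather than measured, is to compute every new name in plain Python first and then rename all columns in one call with `DataFrame.toDF(*names)`. The helper names `clean_column_name` and `rename_all` are made up for this illustration, and `df` is assumed to be a PySpark DataFrame.

```python
import re

# Symbol-to-word substitutions, taken from the rules in the question.
_WORDS = {">": "greaterthan", "#": "number", "&": "and", "=": "equals", "+": "plus"}
# Characters that are dropped entirely.
_RE_DROP = re.compile(r"[?!$(){}\n\t':.]")
# Characters that become an underscore.
_RE_UNDERSCORE = re.compile(r"[-/, ]")


def clean_column_name(name: str) -> str:
    """Apply all renaming rules to one column name (hypothetical helper)."""
    cleaned = name.strip()
    for symbol, word in _WORDS.items():
        cleaned = cleaned.replace(symbol, word)
    cleaned = _RE_DROP.sub("", cleaned)
    cleaned = _RE_UNDERSCORE.sub("_", cleaned)
    # Mirror the question's last step: remove doubled underscores.
    return cleaned.replace("__", "")


def rename_all(df):
    """Rename every column in a single toDF call.

    Assumes `df` is a PySpark DataFrame (anything exposing `.columns`
    and `.toDF(*names)` works).
    """
    return df.toDF(*[clean_column_name(c) for c in df.columns])
```

Usage would be `df = rename_all(df)`. For example, `clean_column_name("Sales > 100?")` gives `"Sales_greaterthan_100"`. Two different source columns can end up with the same cleaned name, so it may be worth checking the new names for duplicates before writing the table.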