
I have a PySpark DataFrame that contains a single row but multiple columns (in the context of a SQL WHERE clause). For example, a column start_date with the value >date("2025-01-01") should become the string start_date > date("2025-01-01").

Here’s an example of the DataFrame:
foo bar
baz bim

What I want to achieve:
I want to transform this into a new DataFrame where each column name and its corresponding value are combined into a single string, and each such combination becomes a new row.

The desired output should look like this:
new_column1 new_column2
foo = baz    foo is baz
bar: bim     bar is null

Additional Requirements:
The number of columns is dynamic and can vary.
Note that this example has only one row, but real use cases have more.
I prefer a pure PySpark solution, without using pandas.

What I have tried:
I explored using selectExpr manually on the column names.
I tried to use explode, but I don't know how to first create an array that combines column name + value dynamically.

1 Answer


It is not clear why there is a : in bar: bim. Should it be bar = bim instead, following the pattern of foo = baz?

For the dataset below, I have added an extra column, since you mentioned the columns are dynamic.

+---+---+-------------+
|foo|bar|another_extra|
+---+---+-------------+
|baz|bim|        extra|
+---+---+-------------+

You can create a custom function that iterates through the columns and prepares the values of the new columns.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def combine_col_row(df):
    # Fetch the single row once instead of running one Spark job per column.
    row = df.first()
    rows = []
    for column in df.columns:
        value = row[column]
        if value is None:
            value = "null"  # render missing values as the literal text "null"
        rows.append((f"{column} = {value}", f"{column} is {value}"))
    # One output row per input column.
    return spark.createDataFrame(rows, ["new_column1", "new_column2"])

result_df = combine_col_row(df)
result_df.show(truncate=False)

Output

+---------------------+----------------------+
|new_column1          |new_column2           |
+---------------------+----------------------+
|foo = baz            |foo is baz            |
|bar = bim            |bar is bim            |
|another_extra = extra|another_extra is extra|
+---------------------+----------------------+

1 Comment

Thanks for your response. As stated in my title, I actually have a DataFrame with two columns: input and condition. input is a filter field, and condition is the filtering condition, similar to a WHERE clause in SQL. Therefore, it's not just a simple matter of concatenating the two columns with "is", "=", or ":"; there are many more cases to handle.
