0

i have a dataframe with multiple columns, and I need to select 2 of them and dump them to a list, and i've tried the following :

df.show()
+------------------------------------+---------------+---------------+
|email_address                       |topic          |user_id        |
+------------------------------------+---------------+---------------+
|[email protected]                        |hello_world    |xyz123         |
+------------------------------------+---------------+---------------+
|[email protected]                        |hello_kitty    |lmn456         |
+------------------------------------+---------------+---------------+

the result I need is a list of tuples:

[([email protected], xyz123), ([email protected], lmn456)]

the way I tried:

tuples = df.select(col('email_address'), col('topic')).rdd.flatMap(lambda x, y: list(x, y)).collect()

and it throw errors:

Py4JJavaError  Traceback (most recent call last)
<command-4050677552755250> in <module>()

--> 114 tuples = df.select(col('email_address'), col('topic')).rdd.flatMap(lambda x, y: list(x, y)).collect()
    115 
    116 

/databricks/spark/python/pyspark/rdd.py in collect(self)
    829         # Default path used in OSS Spark / for non-credential passthrough clusters:
    830         with SCCallSiteSync(self.context) as css:
--> 831             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    832         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    833 

/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)

how to fix it?

2
  • function in flatMap takes only one argument. for your task, use a list comprehension should be enough. tuples = [ (r.email_address, r.user_id) for r in df.select('email_address', 'user_id').collect() ] Commented Dec 19, 2019 at 18:50
  • or tuples = [*map(tuple, df.select('email_address', 'user_id').collect())] Commented Dec 19, 2019 at 18:50

1 Answer 1

1

You should be using map for that:

tuples = df.select(col('email_address'), col('topic')) \
           .rdd \
           .map(lambda x: (x[0], x[1])) \
           .collect()

print(tuples)

# output
[('[email protected]', 'hello_world'), ('[email protected]', 'hello_kitty')]

Another way is to collect rows for DataFrame and then loop to get values:

rows = df.select(col('email_address'), col('topic')).collect()

tuples = [(r.email_address, r.topic) for r in rows]
print(tuples)

# output
[('[email protected]', 'hello_world'), ('[email protected]', 'hello_kitty')]
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.