
I read data from Databricks:

import pandas as pd
import joblib

# run the query in Databricks and pull the result down as a pandas DataFrame
query = 'select * from table a'
df = spark.sql(query)
df = df.toPandas()

# save the DataFrame in two formats
df.to_pickle('df.pickle')
joblib.dump(df, 'df.joblib')

Then, on my local PC, I try to read what I saved (with either pickle or joblib).

import joblib
import pandas as pd
df = joblib.load('df.joblib')

and end up with the following error:

ModuleNotFoundError: No module named 'pyspark.sql.metrics'
Cell In[8], line 1
----> 1 df = joblib.load('Data/df.joblib')
File ~\myenv\Lib\site-packages\joblib\numpy_pickle.py:658, in load(filename, mmap_mode)
    652             if isinstance(fobj, str):
    653                 # if the returned file object is a string, this means we
    654                 # try to load a pickle file generated with an old version of
    655                 # Joblib so we load it with joblib compatibility function.
    656                 return load_compatibility(fobj)
--> 658             obj = _unpickle(fobj, filename, mmap_mode)
    659 return obj

File ~\myenv\Lib\site-packages\joblib\numpy_pickle.py:577, in _unpickle(fobj, filename, mmap_mode)
    575 obj = None
    576 try:
--> 577     obj = unpickler.load()
    578     if unpickler.compat_mode:
    579         warnings.warn("The file '%s' has been generated with a "
    580                       "joblib version less than 0.10. "
    581                       "Please regenerate this pickle file."
    582                       % filename,
    583                       DeprecationWarning, stacklevel=3)

File ~\AppData\Local\Programs\Python\Python311\Lib\pickle.py:1213, in _Unpickler.load(self)
   1211             raise EOFError
   1212         assert isinstance(key, bytes_types)
-> 1213         dispatch[key[0]](self)
   1214 except _Stop as stopinst:
   1215     return stopinst.value

File ~\AppData\Local\Programs\Python\Python311\Lib\pickle.py:1538, in _Unpickler.load_stack_global(self)
   1536 if type(name) is not str or type(module) is not str:
   1537     raise UnpicklingError("STACK_GLOBAL requires str")
-> 1538 self.append(self.find_class(module, name))

File ~\AppData\Local\Programs\Python\Python311\Lib\pickle.py:1580, in _Unpickler.find_class(self, module, name)
   1578     elif module in _compat_pickle.IMPORT_MAPPING:
   1579         module = _compat_pickle.IMPORT_MAPPING[module]
-> 1580 __import__(module, level=0)
   1581 if self.proto >= 4:
   1582     return _getattribute(sys.modules[module], name)[0]

Is there a way to fix it?

  • Unrelated to the question, but your line joblib.dump('df.joblib') is wrong. The documentation notes two required arguments. You are passing the filename, but you also need to pass the value. So you want joblib.dump(df, 'df.joblib'). Commented Nov 13 at 20:35
  • Well, it's saying it needs pyspark.sql.metrics. Did you install pyspark on your local machine? Commented Nov 13 at 22:08
  • @jqurious, I guess I have to install it. I feel it is a bug that Databricks embedded some unnecessary object into the Pandas data frame during serialization. Commented Nov 14 at 15:06
  • @jqurious, I just checked, pyspark is installed on my local machine. I even added import pyspark to my code, but still got the same error. Commented Nov 14 at 15:08
  • Strange. I tried using metrics locally and serializing the frame and cannot replicate pyspark.sql.metrics ending up in the binary file. Commented Nov 14 at 16:35
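
Following up on the last comment, one way to see whether pyspark is really referenced inside the serialized file is to inspect the pickle opcode stream with the standard-library pickletools module. This is only a diagnostic sketch: it assumes the plain pickle written by df.to_pickle in the question (the joblib file is framed slightly differently, but the same idea applies to the pickle data it contains).

import pickletools

with open('df.pickle', 'rb') as f:
    data = f.read()

# Quick check: does the byte stream mention pyspark at all?
print(b'pyspark' in data)

# Full opcode dump: the strings pushed just before each STACK_GLOBAL
# (or the argument of a GLOBAL opcode) name the modules and attributes
# the unpickler will try to import on load.
pickletools.dis(data)

If 'pyspark.sql.metrics' shows up in that dump, the pickle cannot be loaded on any machine where that module is missing, regardless of how the DataFrame itself looks.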

1 Answer

  1. Get a list of all installed packages on your local machine (How do I get a list of locally installed Python modules?).

  2. Install the probably missing package (pyspark.sql.metrics is not a standalone package but part of pyspark) the way you normally install packages.

If this does not work, try this older question: How to load a joblib file with custom class previously saved using a notebook?
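
Since the comments say pyspark is already installed locally but the error persists, it may also be worth checking whether the locally installed pyspark actually exposes the pyspark.sql.metrics submodule, which may not be present in the version you have installed. A minimal check, assuming a standard pip environment:

import importlib.util

try:
    spec = importlib.util.find_spec('pyspark.sql.metrics')
except ModuleNotFoundError:
    spec = None  # pyspark itself is not importable in this environment

if spec is None:
    print('pyspark.sql.metrics is not available - consider upgrading:')
    print('    pip install --upgrade pyspark')
else:
    print('found:', spec.origin)

If upgrading does not help, one way to sidestep the problem entirely (an alternative approach, not part of the answer above) is to avoid pickling the DataFrame in Databricks and instead write a neutral file format such as Parquet, which carries no references to pyspark objects. This sketch assumes pyarrow (or fastparquet) is installed on both machines, and the file name is only illustrative:

# In Databricks
df.to_parquet('df.parquet')

# On the local PC
import pandas as pd
df = pd.read_parquet('df.parquet')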
