I read data from Databricks:
import pandas as pd
import joblib
query = 'select * from table a'
df = spark.sql(query)
df = df.toPandas()
df.to_pickle('df.pickle')
joblib.dump(df, 'df.joblib')
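For completeness, a data-only format such as Parquet sidesteps pickling entirely, since no Python class references end up in the file; a minimal alternative, assuming pyarrow is installed on both machines:

# Parquet keeps plain columnar data only, so there is nothing to unpickle later
df.to_parquet('df.parquet')
# and on the local machine:
# df = pd.read_parquet('df.parquet')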
Then, on my local PC, I try to read what I saved (with either pickle or joblib).
import joblib
import pandas as pd
df = joblib.load('df.joblib')
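# the pickle file saved above fails the same way:
# df = pd.read_pickle('df.pickle')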
and end up with the following error:
ModuleNotFoundError: No module named 'pyspark.sql.metrics'
Cell In[8], line 1
----> 1 df = joblib.load('Data/df.joblib')
File ~\myenv\Lib\site-packages\joblib\numpy_pickle.py:658, in load(filename, mmap_mode)
652 if isinstance(fobj, str):
653 # if the returned file object is a string, this means we
654 # try to load a pickle file generated with an version of
655 # Joblib so we load it with joblib compatibility function.
656 return load_compatibility(fobj)
--> 658 obj = _unpickle(fobj, filename, mmap_mode)
659 return obj
File ~\myenv\Lib\site-packages\joblib\numpy_pickle.py:577, in _unpickle(fobj, filename, mmap_mode)
575 obj = None
576 try:
--> 577 obj = unpickler.load()
578 if unpickler.compat_mode:
579 warnings.warn("The file '%s' has been generated with a "
580 "joblib version less than 0.10. "
581 "Please regenerate this pickle file."
582 % filename,
583 DeprecationWarning, stacklevel=3)
File ~\AppData\Local\Programs\Python\Python311\Lib\pickle.py:1213, in _Unpickler.load(self)
1211 raise EOFError
1212 assert isinstance(key, bytes_types)
-> 1213 dispatch[key[0]](self)
1214 except _Stop as stopinst:
1215 return stopinst.value
File ~\AppData\Local\Programs\Python\Python311\Lib\pickle.py:1538, in _Unpickler.load_stack_global(self)
1536 if type(name) is not str or type(module) is not str:
1537 raise UnpicklingError("STACK_GLOBAL requires str")
-> 1538 self.append(self.find_class(module, name))
File ~\AppData\Local\Programs\Python\Python311\Lib\pickle.py:1580, in _Unpickler.find_class(self, module, name)
1578 elif module in _compat_pickle.IMPORT_MAPPING:
1579 module = _compat_pickle.IMPORT_MAPPING[module]
-> 1580 __import__(module, level=0)
1581 if self.proto >= 4:
1582 return _getattribute(sys.modules[module], name)[0]
Is there a way to fix it?
joblib.dump('df.joblib') is wrong. The documentation notes two required arguments: you are passing the filename, but you also need to pass the value. So you want joblib.dump(df, 'df.joblib').

The missing module is pyspark.sql.metrics. Did you install pyspark on your local machine?

I added import pyspark to my code, but still got the same error.

That points to a reference to pyspark.sql.metrics ending up in the binary file.
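Following up on that last comment: one hypothesis consistent with the traceback is that toPandas() attached pyspark objects to the DataFrame's metadata (df.attrs), and those were pickled along with the data. A minimal sketch of a workaround under that assumption, run on the Databricks side before saving:

import pandas as pd
import joblib

# Rebuild the frame from its own data and clear the metadata, so nothing
# pyspark-specific is serialized (assumes df.attrs held the pyspark objects).
clean = pd.DataFrame(df)
clean.attrs = {}

clean.to_pickle('df.pickle')
joblib.dump(clean, 'df.joblib')   # value first, then filename

If that doesn't help: since import pyspark succeeds locally but pyspark.sql.metrics cannot be imported, the local pyspark is likely older than the one on Databricks, so upgrading it to match the cluster's version is another option.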