
I've seen several ways to read a formatted binary file into Pandas in Python. Specifically, I'm using the following code, which reads the file with NumPy's fromfile, using a structure described by a dtype.

import numpy as np
import pandas as pd

input_file_name = 'test.hst'

input_file = open(input_file_name, 'rb')
header = input_file.read(96)  # the header is a fixed 96 bytes (4+64+12+4+4+4+4)

dt_header = np.dtype([('version', 'i4'),
                      ('copyright', 'S64'),
                      ('symbol', 'S12'),
                      ('period', 'i4'),
                      ('digits', 'i4'),
                      ('timesign', 'i4'),
                      ('last_sync', 'i4')])

header = np.frombuffer(header, dt_header)  # parse the raw header bytes (np.fromstring is deprecated)

dt_records = np.dtype([('ctm', 'i4'),
                       ('open', 'f8'),
                       ('low', 'f8'),
                       ('high', 'f8'),
                       ('close', 'f8'),
                       ('volume', 'f8')])
records = np.fromfile(input_file, dt_records)  # read all remaining records

input_file.close()

df_records = pd.DataFrame(records)
# Now, do some changes in the individual values of df_records
# and then write it back to a binary file

Now, my issue is how to write this back to a new file. I can't find any function in NumPy (nor in Pandas) that lets me specify exactly how many bytes to use for each field when writing.

1 Comment

  • What's wrong with using the records.tofile method? It writes the data in the same format as it is stored in memory. Commented Oct 14, 2014 at 6:40

2 Answers


Pandas now offers a wide variety of formats:

Format Type Data Description     Reader         Writer
text        CSV                  read_csv       to_csv
text        JSON                 read_json      to_json
text        HTML                 read_html      to_html
text        Local clipboard      read_clipboard to_clipboard
binary      MS Excel             read_excel     to_excel
binary      HDF5 Format          read_hdf       to_hdf
binary      Feather Format       read_feather   to_feather
binary      Parquet Format       read_parquet   to_parquet
binary      Msgpack              read_msgpack   to_msgpack
binary      Stata                read_stata     to_stata
binary      SAS                  read_sas    
binary      Python Pickle Format read_pickle    to_pickle
SQL         SQL                  read_sql       to_sql
SQL         Google Big Query     read_gbq       to_gbq

For small to medium-sized files, I prefer CSV, as properly formatted CSV can store arbitrary string data, is human readable, and is as dirt-simple as any format can be while achieving the previous two goals.
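
For example, a minimal CSV round trip (the file name is just illustrative; note that CSV stores text, so dtypes are re-inferred on read rather than preserved byte-for-byte):

df_records.to_csv('test.csv', index=False)
df_roundtrip = pd.read_csv('test.csv')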

Unfortunately, for more complex problems, the choice is harder.

If I were on Amazon AWS, I would consider using parquet. However, I do not have any experience with this format.
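
If you want to try it, the pandas calls are a one-liner each (a sketch only, since I have not used the format myself; it assumes a parquet engine such as pyarrow or fastparquet is installed, and 'test.parquet' is an illustrative name):

df_records.to_parquet('test.parquet')
df_back = pd.read_parquet('test.parquet')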

I can no longer favor the pickle format. Although the pickle format claims long-term stability, it allows arbitrary code execution. And even if no code is deliberately stored in the pickle, ALL pickles execute code just to unpickle them, which is what led the folks at Hugging Face to recommend pickle-scanning tools first, and then safetensors as an alternative to pickles. But for saving and loading pandas tables, safetensors is not an option: it is designed for large multidimensional arrays of floating-point values ("tensors"), not for tabular data.

I do not recommend using tofile(). It is best for quick file storage when you do not expect the file to be read on a machine with a different endianness (big-/little-endian), since it writes raw memory with no byte-order metadata.
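
One way to mitigate that portability issue (my own sketch, not part of the original workflow): pin the byte order explicitly in the dtype, so the on-disk layout no longer depends on the writing machine. The '<' prefix below means little-endian, and 'test_out.hst' is an illustrative name:

# explicit little-endian codes fix the on-disk byte order
dt_portable = np.dtype([('ctm', '<i4'),
                        ('open', '<f8'),
                        ('low', '<f8'),
                        ('high', '<f8'),
                        ('close', '<f8'),
                        ('volume', '<f8')])
records.astype(dt_portable).tofile('test_out.hst')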

I no longer favor the HDF5 format. It poses serious risks for long-term archival because it is fairly complex: the specification runs to about 150 pages, and there is only one C implementation, of roughly 300,000 lines.

If you have advice on stable, secure, binary formats for saving your own pandas data, please share it! In the meantime, I think CSV for small data, and zipped CSVs where disk space or network bandwidth matter, may be the best way to go. Where possible, I personally avoid any pandas format that requires anything more than plain text for storage.
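
Compressing the CSV is a one-argument change; pandas can also infer the compression from a .gz extension (the file name is illustrative):

df_records.to_csv('test.csv.gz', index=False, compression='gzip')
df_back = pd.read_csv('test.csv.gz')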


1 Comment

I've added a question that surveys formats for Python data science in general: stackoverflow.com/questions/63583264/…

It isn't clear to me if the DataFrame is a view or a copy, but assuming it is a copy, you can use the to_records method of the DataFrame.

This gives you back a record array that you can then put to disk using tofile.

e.g.

df_records = pd.DataFrame(records)
# do some stuff
new_recarray = df_records.to_records(index=False)  # index=False keeps the index out of the binary layout
new_recarray.tofile("myfile.npy")  # writes raw bytes, not actual .npy format (use np.save for that)

The data will reside in memory as packed bytes with the format described by the recarray dtype.
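
Putting it together with the dtypes from the question, a round-trip sketch could look like this ('test_out.hst' is an illustrative name; the astype call restores the exact field types in case the DataFrame operations changed them):

with open('test_out.hst', 'wb') as output_file:
    header.tofile(output_file)  # write the 96-byte header back unchanged
    df_records.to_records(index=False).astype(dt_records).tofile(output_file)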

1 Comment

You are right. I wasn't aware that this would write a binary file using the same data structure as the NumPy array. I did have to do some extra work, since the data types were changed by the operations I did on the Pandas DataFrame. But after casting everything back to the needed types, it worked correctly.
