192

How can I read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.

8 Comments
  • 3
    Do you happen to have the data openly available? My branch of python-parquet, github.com/martindurant/parquet-python/tree/py3, has a pandas reader in parquet.rparquet; you could try it. There are many parquet constructs it cannot handle. Commented Nov 19, 2015 at 21:21
  • 4
    Wait for the Apache Arrow project that the Pandas author Wes McKinney is part of. wesmckinney.com/blog/pandas-and-apache-arrow Once it is done, users should be able to read Parquet files directly from Pandas. Commented Apr 9, 2016 at 0:36
  • 4
    Since the question is closed as off-topic (but still the first result on Google) I have to answer in a comment. You can now use pyarrow to read a parquet file and convert it to a pandas DataFrame: import pyarrow.parquet as pq; df = pq.read_table('dataset.parq').to_pandas() Commented May 27, 2017 at 11:34
  • 4
    Kinda annoyed that this question was closed. Spark and parquet are (still) relatively poorly documented. Am also looking for the answer to this. Commented Jul 6, 2017 at 16:40
  • 2
    Both the fastparquet and pyarrow libraries make it possible to read a parquet file into a pandas dataframe: github.com/dask/fastparquet and arrow.apache.org/docs/python/parquet.html Commented Oct 11, 2017 at 9:07

8 Answers

252

pandas 0.21 introduces new functions for Parquet:

import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
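
Since the question also mentions data in S3: pd.read_parquet can read straight from an S3 URL when the s3fs package is installed; the bucket and key below are placeholders.

import pandas as pd

# requires s3fs; 'my-bucket' and the key are hypothetical placeholders
df = pd.read_parquet('s3://my-bucket/path/to/example.parquet', engine='pyarrow')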


7 Comments

For most of my data, 'fastparquet' is a bit faster. In case pd.read_parquet() fails with a Snappy error, run conda install python-snappy to install snappy.
I found pyarrow to be too difficult to install (both on my local Windows machine and on a cloud Linux machine). Even after the python-snappy fix, there were additional issues with the compiler as well as the error module 'pyarrow' has no attribute 'compat'. fastparquet had no issues at all.
@Catbuilts You can use gzip if you don't have snappy.
Can 'fastparquet' read a '.snappy.parquet' file?
I had the opposite experience vs. @Seb. fastparquet had a bunch of issues; pyarrow was a simple pip install and off I went.
23

Update: since I answered this, there has been a lot of work in this area. Look at Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/

There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It creates Python objects which then have to be moved into a Pandas DataFrame, so the process is slower than pd.read_csv, for example.
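
To make that extra conversion step concrete, here is a rough sketch; read_rows is a hypothetical stand-in for the library's row iterator, not its actual API.

import pandas as pd

# read_rows is a hypothetical placeholder for whatever row iterator the parquet reader exposes
rows = list(read_rows('data.parquet'))
df = pd.DataFrame.from_records(rows)  # this extra pass is what makes it slower than pd.read_csv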

4 Comments

Actually there is pyarrow which allows both reads / writes: pyarrow.readthedocs.io/en/latest/parquet.html
I get a permission denied error when I try to follow your link, @bluszcz -- do you have an alternate?
parquet-python is much slower than alternatives such as fastparquet and pyarrow: arrow.apache.org/docs/python/parquet.html
pd.read_parquet is now part of pandas. The other answer should be marked as valid.
19

Aside from pandas, Apache pyarrow also provides a way to read a Parquet file into a DataFrame.

The code is simple, just type:

import pyarrow.parquet as pq

df = pq.read_table(source=your_file_path).to_pandas()

For more information, see the Apache pyarrow documentation, Reading and Writing Single Files.
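
If only a subset of columns is needed, pq.read_table also accepts a columns argument so the rest of the file is not loaded; the column names below are placeholders.

import pyarrow.parquet as pq

# only the listed columns are read; 'col_a' and 'col_b' are hypothetical names
df = pq.read_table(source=your_file_path, columns=['col_a', 'col_b']).to_pandas()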

Comments

15

Parquet

Step 1: Data to play with

import pandas as pd

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})

Step 2: Save as Parquet

df.to_parquet('sample.parquet')

Step 3: Read from Parquet

df = pd.read_parquet('sample.parquet')
print(df.head())
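
If only part of the file is needed, pd.read_parquet also accepts a columns argument, for example just the marks column from the sample above:

df_marks = pd.read_parquet('sample.parquet', columns=['marks'])
print(df_marks.head())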

Comments

4

When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8 GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle. Although pickle can handle tuples whereas parquet cannot.

df.to_parquet('df.parquet.brotli',compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
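
A quick way to check the trade-off on your own data is to write the same frame with each codec and compare file sizes; a minimal sketch, assuming the installed engine supports all three codecs:

import os
import pandas as pd

df = pd.DataFrame({'x': range(1_000_000)})
for comp in ('snappy', 'gzip', 'brotli'):
    path = f'df.parquet.{comp}'
    df.to_parquet(path, compression=comp)
    print(comp, os.path.getsize(path))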

Comments

2

Parquet datasets can be large, so consider reading them with dask.

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    # read a single parquet file into a pandas DataFrame
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])

pdf = df.compute()  # materialize everything as one in-memory pandas DataFrame
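
As a side note, dask can also read the files directly without the delayed wrapper; a sketch assuming the same data/ directory layout as above:

import dask.dataframe as dd

# dask builds the partitioned dataframe itself from the glob pattern
ddf = dd.read_parquet('data/*.parquet', engine='fastparquet')
pdf = ddf.compute()  # one in-memory pandas DataFrame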

Comments

1

Considering a .parquet file named data.parquet

parquet_file = '../data.parquet'

There is no need to open() or pre-create the file; to_parquet below creates it.

Convert to Parquet

Assuming one has a DataFrame parquet_df that one wants to save to the parquet file above, one can use DataFrame.to_parquet() (this function requires either the fastparquet or pyarrow library) as follows:

parquet_df.to_parquet(parquet_file)

Read from Parquet

In order to read the parquet file into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows

new_parquet_df = pd.read_parquet(parquet_file)
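
If the DataFrame index does not need to be stored in the file, to_parquet also accepts index=False:

parquet_df.to_parquet(parquet_file, index=False)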

Comments

0

You can use plain Python with pandas and pyarrow to read Parquet data:

  1. install package
    pip install pandas pyarrow

  2. read file

import pandas as pd


def read_parquet(file):
    result = []
    data = pd.read_parquet(file)
    for index in data.index:
        # keep every column except the last one for each row
        res = data.loc[index].values[0:-1]
        result.append(res)
    print(len(result))
    return result


file = "./data.parquet"
rows = read_parquet(file)
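
If the intent of the loop is simply to drop the last column, a vectorized alternative (assuming that is the goal) is:

data = pd.read_parquet(file)
result = data.iloc[:, :-1].to_numpy()  # all rows, every column except the last
print(len(result))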

2 Comments

that is "pip install", not "pin install". I would fix it, but small changes are not allowed, despite that fact that a one letter change means the difference between a program not running with lots of confusion, and everything working.
Ok, my typo; fixed, thank you!
