192

How can I read a modestly sized Parquet dataset into an in-memory Pandas DataFrame without setting up cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in memory with a simple Python script on a laptop. The data does not reside on HDFS; it is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.

8 Comments
  • 3
    Do you happen to have the data openly available? My branch of python-parquet, github.com/martindurant/parquet-python/tree/py3, has a pandas reader in parquet.rparquet; you could try it. There are many parquet constructs it cannot handle. Commented Nov 19, 2015 at 21:21
  • 4
    Wait for the Apache Arrow project that the Pandas author Wes McKinney is part of. wesmckinney.com/blog/pandas-and-apache-arrow Once it is done, users should be able to read Parquet files directly from Pandas. Commented Apr 9, 2016 at 0:36
  • 4
    Since the question is closed as off-topic (but still the first result on Google) I have to answer in a comment. You can now use pyarrow to read a parquet file and convert it to a pandas DataFrame: import pyarrow.parquet as pq; df = pq.read_table('dataset.parq').to_pandas() Commented May 27, 2017 at 11:34
  • 4
    Kinda annoyed that this question was closed. Spark and parquet are (still) relatively poorly documented. Am also looking for the answer to this. Commented Jul 6, 2017 at 16:40
  • 2
    Both the fastparquet and pyarrow libraries make it possible to read a parquet file into a pandas dataframe: github.com/dask/fastparquet and arrow.apache.org/docs/python/parquet.html Commented Oct 11, 2017 at 9:07

8 Answers

252

pandas 0.21 introduces new functions for Parquet:

import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')

The pandas documentation explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
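
Since the question also mentions data in S3: pd.read_parquet can read straight from an S3 URL when the s3fs package is installed; the bucket and key below are placeholders.

import pandas as pd

# requires s3fs; 'my-bucket' and the key are hypothetical placeholders
df = pd.read_parquet('s3://my-bucket/path/to/example.parquet', engine='pyarrow')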


7 Comments

For most of my data, 'fastparquet' is a bit faster. In case pd.read_parquet() fails with a Snappy error, run conda install python-snappy to install snappy.
I found pyarrow to be too difficult to install (both on my local Windows machine and on a cloud Linux machine). Even after the python-snappy fix, there were additional issues with the compiler as well as the error module 'pyarrow' has no attribute 'compat'. fastparquet had no issues at all.
@Catbuilts You can use gzip if you don't have snappy.
Can 'fastparquet' read a '.snappy.parquet' file?
I had the opposite experience vs. @Seb. fastparquet had a bunch of issues; pyarrow was a simple pip install and off I went.
23

Update: since I answered this, there has been a lot of work in this area. Look at Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/

There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python

It creates Python objects which then have to be moved into a Pandas DataFrame, so the process is slower than pd.read_csv, for example.
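
To make that extra conversion step concrete, here is a rough sketch; read_rows is a hypothetical stand-in for the library's row iterator, not its actual API.

import pandas as pd

# read_rows is a hypothetical placeholder for whatever row iterator the parquet reader exposes
rows = list(read_rows('data.parquet'))
df = pd.DataFrame.from_records(rows)  # this extra pass is what makes it slower than pd.read_csv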

4 Comments

Actually there is pyarrow which allows both reads / writes: pyarrow.readthedocs.io/en/latest/parquet.html
I get a permission denied error when I try to follow your link, @bluszcz -- do you have an alternate?
parquet-python is much slower than alternatives such as fastparquet and pyarrow: arrow.apache.org/docs/python/parquet.html
pd.read_parquet is now part of pandas. The other answer should be marked as valid.
19

Aside from pandas, Apache pyarrow also provides a way to read a Parquet file into a DataFrame.

The code is simple, just type:

import pyarrow.parquet as pq

df = pq.read_table(source=your_file_path).to_pandas()

For more information, see the Apache pyarrow documentation, Reading and Writing Single Files.
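
If only a subset of columns is needed, pq.read_table also accepts a columns argument so the rest of the file is not loaded; the column names below are placeholders.

import pyarrow.parquet as pq

# only the listed columns are read; 'col_a' and 'col_b' are hypothetical names
df = pq.read_table(source=your_file_path, columns=['col_a', 'col_b']).to_pandas()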

Comments

15

Parquet

Step 1: Data to play with

import pandas as pd

df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})

Step 2: Save as Parquet

df.to_parquet('sample.parquet')

Step 3: Read from Parquet

df = pd.read_parquet('sample.parquet')
print(df.head())
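
If only part of the file is needed, pd.read_parquet also accepts a columns argument, for example just the marks column from the sample above:

df_marks = pd.read_parquet('sample.parquet', columns=['marks'])
print(df_marks.head())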

Comments

4

When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8 GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle. Although pickle can handle tuples whereas parquet cannot.

df.to_parquet('df.parquet.brotli',compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
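
A quick way to check the trade-off on your own data is to write the same frame with each codec and compare file sizes; a minimal sketch, assuming the installed engine supports all three codecs:

import os
import pandas as pd

df = pd.DataFrame({'x': range(1_000_000)})
for comp in ('snappy', 'gzip', 'brotli'):
    path = f'df.parquet.{comp}'
    df.to_parquet(path, compression=comp)
    print(comp, os.path.getsize(path))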

Comments

2

Parquet datasets can be large, so consider reading them with dask.

import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    # read a single parquet file into a pandas DataFrame
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])

pdf = df.compute()  # materialize everything as one in-memory pandas DataFrame
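
As a side note, dask can also read the files directly without the delayed wrapper; a sketch assuming the same data/ directory layout as above:

import dask.dataframe as dd

# dask builds the partitioned dataframe itself from the glob pattern
ddf = dd.read_parquet('data/*.parquet', engine='fastparquet')
pdf = ddf.compute()  # one in-memory pandas DataFrame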

Comments

1

Considering a .parquet file named data.parquet

parquet_file = '../data.parquet'

There is no need to open() or pre-create the file; to_parquet below creates it.

Convert to Parquet

Assuming one has a DataFrame parquet_df that one wants to save to the parquet file above, one can use DataFrame.to_parquet() (this function requires either the fastparquet or pyarrow library) as follows:

parquet_df.to_parquet(parquet_file)

Read from Parquet

In order to read the parquet file into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows

new_parquet_df = pd.read_parquet(parquet_file)
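
If the DataFrame index does not need to be stored in the file, to_parquet also accepts index=False:

parquet_df.to_parquet(parquet_file, index=False)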

Comments

0

You can use plain Python with pandas and pyarrow to read Parquet data:

  1. install package
    pip install pandas pyarrow

  2. read file

import pandas as pd


def read_parquet(file):
    result = []
    data = pd.read_parquet(file)
    for index in data.index:
        # keep every column except the last one for each row
        res = data.loc[index].values[0:-1]
        result.append(res)
    print(len(result))
    return result


file = "./data.parquet"
rows = read_parquet(file)
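
If the intent of the loop is simply to drop the last column, a vectorized alternative (assuming that is the goal) is:

data = pd.read_parquet(file)
result = data.iloc[:, :-1].to_numpy()  # all rows, every column except the last
print(len(result))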

2 Comments

that is "pip install", not "pin install". I would fix it, but small changes are not allowed, despite that fact that a one letter change means the difference between a program not running with lots of confusion, and everything working.
Ok, my typo; fixed, thank you!
