Pandas convert dataframe to array of tuples

Question

I have manipulated some data using pandas and now I want to carry out a batch save back to the database. This requires me to convert the dataframe into an array of tuples, with each tuple corresponding to a "row" of the dataframe.

My DataFrame looks something like:

In [182]: data_set
Out[182]: 
  index data_date   data_1  data_2
0  14303 2012-02-17  24.75   25.03 
1  12009 2012-02-16  25.00   25.07 
2  11830 2012-02-15  24.99   25.15 
3  6274  2012-02-14  24.68   25.05 
4  2302  2012-02-13  24.62   24.77 
5  14085 2012-02-10  24.38   24.61

I want to convert it to an array of tuples like:

[(datetime.date(2012,2,17),24.75,25.03),
(datetime.date(2012,2,16),25.00,25.07),
...etc. ]

Any suggestion on how I can efficiently do this?

For those coming to this answer in 2017+, there is a new idiomatic solution below. You can just use list(df.itertuples(index=False, name=None)) — Ted Petrou
– Ted Petrou, Commented Nov 6, 2017 at 16:45
The two things I'm looking for when I come to this question: A list of tuples - df.to_records(index=False) and a list of dicts: df.to_dict('records') — Martin Thoma
– Martin Thoma, Commented Sep 3, 2019 at 12:36
@MartinThoma both to_records and to_dict('records') screw my data-types. Known bug but makes this solutions worthless... — Jochen
– Jochen, Commented Jul 3, 2020 at 15:41

T.C. Proctor · Accepted Answer · 2019-04-03 20:31:14Z

380

list(data_set.itertuples(index=False))

As of 17.1, the above will return a list of namedtuples.

If you want a list of ordinary tuples, pass name=None as an argument:

list(data_set.itertuples(index=False, name=None))

edited Apr 3, 2019 at 20:31

T.C. Proctor

6,5966 gold badges31 silver badges40 bronze badges

answered Dec 31, 2015 at 21:57

Kamil Sindi

23k19 gold badges101 silver badges122 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

Axel Over a year ago

This should be the accepted answer IMHO (now that a dedicated feature exists). BTW, if you want normal tuples in your zip iterator (instead of namedtuples), then call: data_set.itertuples(index=False, name=None)

cs95 Over a year ago

Actually, it shouldn't. itertuples is slow. Avoid if possible. For loops (as shown the the accepted answer) is usually faster in these cases.

T.C. Proctor Over a year ago

@coldspeed The lesson I got from the linked question is that itertuples is slow because converting to tuples is usually slower than vectorized/cython operations. Given that the question is asking to convert to tuples, is there any reason that we'd think that the accepted answer is faster? The quick test I did indicates that the itertuples version is faster.

T.C. Proctor Over a year ago

I posted my speed test results in this answer

AMC Over a year ago

@cs95 itertuples is slow. The performance of .iterrows() is notoriously bad compared to that of .itertuples() though. In some simple benchmarks I ran the former was around 1000 times slower than the latter, they can hardly be compared.

|

cs95 · Accepted Answer · 2019-12-09 22:46:25Z

261

How about:

subset = data_set[['data_date', 'data_1', 'data_2']]
tuples = [tuple(x) for x in subset.to_numpy()]

for pandas < 0.24 use

tuples = [tuple(x) for x in subset.values]

edited Dec 9, 2019 at 22:46

cs95

406k106 gold badges744 silver badges797 bronze badges

answered Mar 18, 2012 at 20:39

Wes McKinney

106k32 gold badges146 silver badges109 bronze badges

3 Comments

vy32 Over a year ago

Please see @ksindi's answer below for using .itertuples, which will be more efficient than getting the values as an array and t urning them into a tuple.

RufusVS Over a year ago

slightly cleaner is: tuples=map(tuple,subset.values)

AMC Over a year ago

This can cast values to a different type though, right?

piRSquared · Accepted Answer · 2019-11-04 17:35:20Z

Motivation
Many data sets are large enough that we need to concern ourselves with speed/efficiency. So I offer this solution in that spirit. It happens to also be succinct.

For the sake of comparison, let's drop the index column

df = data_set.drop('index', 1)

Solution
I'll propose the use of zip and map

list(zip(*map(df.get, df)))

[('2012-02-17', 24.75, 25.03),
 ('2012-02-16', 25.0, 25.07),
 ('2012-02-15', 24.99, 25.15),
 ('2012-02-14', 24.68, 25.05),
 ('2012-02-13', 24.62, 24.77),
 ('2012-02-10', 24.38, 24.61)]

It happens to also be flexible if we wanted to deal with a specific subset of columns. We'll assume the columns we've already displayed are the subset we want.

list(zip(*map(df.get, ['data_date', 'data_1', 'data_2'])))

[('2012-02-17', 24.75, 25.03),
 ('2012-02-16', 25.0, 25.07),
 ('2012-02-15', 24.99, 25.15),
 ('2012-02-14', 24.68, 25.05),
 ('2012-02-13', 24.62, 24.77),
 ('2012-02-10', 24.38, 24.61)]

What is Quicker?

Turn's out records is quickest followed by asymptotically converging zipmap and iter_tuples

I'll use a library simple_benchmarks that I got from this post

from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()

import pandas as pd
import numpy as np

def tuple_comp(df): return [tuple(x) for x in df.to_numpy()]
def iter_namedtuples(df): return list(df.itertuples(index=False))
def iter_tuples(df): return list(df.itertuples(index=False, name=None))
def records(df): return df.to_records(index=False).tolist()
def zipmap(df): return list(zip(*map(df.get, df)))

funcs = [tuple_comp, iter_namedtuples, iter_tuples, records, zipmap]
for func in funcs:
    b.add_function()(func)

def creator(n):
    return pd.DataFrame({"A": random.randint(n, size=n), "B": random.randint(n, size=n)})

@b.add_arguments('Rows in DataFrame')
def argument_provider():
    for n in (10 ** (np.arange(4, 11) / 2)).astype(int):
        yield n, creator(n)

r = b.run()

Check the results

r.to_pandas_dataframe().pipe(lambda d: d.div(d.min(1), 0))

        tuple_comp  iter_namedtuples  iter_tuples   records    zipmap
100       2.905662          6.626308     3.450741  1.469471  1.000000
316       4.612692          4.814433     2.375874  1.096352  1.000000
1000      6.513121          4.106426     1.958293  1.000000  1.316303
3162      8.446138          4.082161     1.808339  1.000000  1.533605
10000     8.424483          3.621461     1.651831  1.000000  1.558592
31622     7.813803          3.386592     1.586483  1.000000  1.515478
100000    7.050572          3.162426     1.499977  1.000000  1.480131

r.plot()

Ramón J Romero y Vigil · Accepted Answer · 2012-12-05 19:42:50Z

50

A generic way:

[tuple(x) for x in data_set.to_records(index=False)]

answered Dec 5, 2012 at 19:42

Ramón J Romero y Vigil

18k8 gold badges84 silver badges140 bronze badges

2 Comments

Amir Shabani Over a year ago

Isn't data_set.to_records(index=False).tolist() better?

Mahery Ranaivoson Over a year ago

@AmirShabani cast to List is preferred over this one to avoid losing the data type and format. For eg. when a column is of type pandas DateTime the date format won't be kept as it is.

Gustavo Gonçalves · Accepted Answer · 2019-04-12 03:22:01Z

17

The most efficient and easy way:

list(data_set.to_records())

You can filter the columns you need before this call.

answered Apr 12, 2019 at 3:22

Gustavo Gonçalves

4937 silver badges11 bronze badges

2 Comments

user3415167 Over a year ago

I think 'index=False' should be given as an argument to to_records(). Thus, list(data_set.to_records(index=False))

Gabriele Giuseppini Over a year ago

This is not what the OP asked: the result of to_records is not a tuple but rather a numpy record array. When you need an actual tuple - e.g. because you need to hash it - this solution won't cut it.

Nickil Maveli · Accepted Answer · 2016-12-26 13:47:19Z

Here's a vectorized approach (assuming the dataframe, data_set to be defined as df instead) that returns a list of tuples as shown:

>>> df.set_index(['data_date'])[['data_1', 'data_2']].to_records().tolist()

produces:

[(datetime.datetime(2012, 2, 17, 0, 0), 24.75, 25.03),
 (datetime.datetime(2012, 2, 16, 0, 0), 25.0, 25.07),
 (datetime.datetime(2012, 2, 15, 0, 0), 24.99, 25.15),
 (datetime.datetime(2012, 2, 14, 0, 0), 24.68, 25.05),
 (datetime.datetime(2012, 2, 13, 0, 0), 24.62, 24.77),
 (datetime.datetime(2012, 2, 10, 0, 0), 24.38, 24.61)]

The idea of setting datetime column as the index axis is to aid in the conversion of the Timestamp value to it's corresponding datetime.datetime format equivalent by making use of the convert_datetime64 argument in DF.to_records which does so for a DateTimeIndex dataframe.

This returns a recarray which could be then made to return a list using .tolist

More generalized solution depending on the use case would be:

df.to_records().tolist()                              # Supply index=False to exclude index

T.C. Proctor · Accepted Answer · 2019-03-05 18:56:41Z

This answer doesn't add any answers that aren't already discussed, but here are some speed results. I think this should resolve questions that came up in the comments. All of these look like they are O(n), based on these three values.

TL;DR: tuples = list(df.itertuples(index=False, name=None)) and tuples = list(zip(*[df[c].values.tolist() for c in df])) are tied for the fastest.

I did a quick speed test on results for three suggestions here:

The zip answer from @pirsquared: tuples = list(zip(*[df[c].values.tolist() for c in df]))
The accepted answer from @wes-mckinney: tuples = [tuple(x) for x in df.values]
The itertuples answer from @ksindi with the name=None suggestion from @Axel: tuples = list(df.itertuples(index=False, name=None))

from numpy import random
import pandas as pd


def create_random_df(n):
    return pd.DataFrame({"A": random.randint(n, size=n), "B": random.randint(n, size=n)})

Small size:

df = create_random_df(10000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))

Gives:

1.66 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
15.5 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.74 ms ± 75.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Larger:

df = create_random_df(1000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))

Gives:

202 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.52 s ± 98.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
209 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

As much patience as I have:

df = create_random_df(10000000)
%timeit tuples = list(zip(*[df[c].values.tolist() for c in df]))
%timeit tuples = [tuple(x) for x in df.values]
%timeit tuples = list(df.itertuples(index=False, name=None))

Gives:

1.78 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.4 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.68 s ± 96.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The zip version and the itertuples version are within the confidence intervals each other. I suspect that they are doing the same thing under the hood.

These speed tests are probably irrelevant though. Pushing the limits of my computer's memory doesn't take a huge amount of time, and you really shouldn't be doing this on a large data set. Working with those tuples after doing this will end up being really inefficient. It's unlikely to be a major bottleneck in your code, so just stick with the version you think is most readable.

I updated my stale post. I'd been using [*zip(*map(df.get, df))] for sometime now. Anyway, thought you'd find it interesting.
@piRSquared Oooh. I like the pretty plot. I guess that looks about like it's O(n).

g0x11 · Accepted Answer · 2020-05-04 11:41:37Z

13

Changing the data frames list into a list of tuples.

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
print(df)
OUTPUT
   col1  col2
0     1     4
1     2     5
2     3     6

records = df.to_records(index=False)
result = list(records)
print(result)
OUTPUT
[(1, 4), (2, 5), (3, 6)]

edited May 4, 2020 at 11:41

answered Apr 27, 2020 at 7:47

g0x11

76210 silver badges22 bronze badges

2 Comments

Mark Rotteveel Over a year ago

Please don't post only code as an answer, but also provide an explanation what your code does and how it solves the problem of the question. Answers with an explanation are usually of higher quality, and are more likely to attract upvotes.

Bikash Gyawali Over a year ago

This gives you a list of numpy.record objects rather than a list of tuples.

Alsphere · Accepted Answer · 2017-10-09 19:11:29Z

4

#try this one:

tuples = list(zip(data_set["data_date"], data_set["data_1"],data_set["data_2"]))
print (tuples)

answered Oct 9, 2017 at 19:11

Alsphere

5431 gold badge9 silver badges24 bronze badges

Comments

Ankur Panwar · Accepted Answer · 2017-04-17 15:14:49Z

3

More pythonic way:

df = data_set[['data_date', 'data_1', 'data_2']]
map(tuple,df.values)

answered Apr 17, 2017 at 15:14

Ankur Panwar

954 bronze badges

1 Comment

AMC Over a year ago

More pythonic way: The exact opposite, actually. map() is notoriously unpythonic.

desertnaut · Accepted Answer · 2023-05-04 17:27:47Z

2

To Convert Dataframe Column into Arrays of Tuple without Index:

import pandas as pd
my_converted_array_tuples = pd.Dataframe(mydf['my_column_name']).to_records(index=False)

Note: If you need Index remove index line

edited May 4, 2023 at 17:27

desertnaut

60.8k32 gold badges155 silver badges183 bronze badges

answered May 4, 2023 at 4:01

Saiyam Jain

1918 bronze badges

Collectives™ on Stack Overflow

Pandas convert dataframe to array of tuples

11 Answers 11

8 Comments

3 Comments

What is Quicker?

Comments

2 Comments

2 Comments

Comments

2 Comments

2 Comments

Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

11 Answers 11

8 Comments

3 Comments

What is Quicker?

Comments

2 Comments

2 Comments

Comments

2 Comments

2 Comments

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related