Converting a set of dataframes into a multidimensional array in Julia

Question

[A Julia noob question.]

Let's say I have a vector of dataframes, as below:

using DataFrames
A = DataFrame([8.1 9.2 7.5 6.6; 6.9 8.1 6.8 5.8], :auto)
B = DataFrame([9.0 2.1 5.2 5.3; 1.2 4.9 9.8 7.7], :auto)
dfs = [A, B]

Of course, I actually have many more dataframes in dfs than in this MWE, but they all have the same dimensions and only numeric columns.

I would like to transform dfs into a multidimensional (here, 2x4x2) array arr, with arr[:, :, 1] equal to A, and arr[:, :, 2] equal to B. How can I perform this transformation? (Of course, a for loop might do the trick, but I guess that there is a more elegant way to proceed.)

Thanks!

Colin T Bowers · Accepted Answer · 2021-06-17 12:45:47Z

I suppose that

f1(dfs) = cat(Matrix.(dfs)..., dims=3)

is a reasonably elegant one-liner, but it allocates temporaries.
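If you are on Julia 1.9 or later, Base also provides stack, which avoids splatting a long vector of matrices into cat. The sketch below is only an alternative to f1, not something from the benchmarks in this answer; the name f1_stack is made up for illustration, and the :auto argument assumes DataFrames 1.x.

```julia
using DataFrames

A = DataFrame([8.1 9.2 7.5 6.6; 6.9 8.1 6.8 5.8], :auto)
B = DataFrame([9.0 2.1 5.2 5.3; 1.2 4.9 9.8 7.7], :auto)
dfs = [A, B]

# stack puts each matrix along a new trailing dimension (requires Julia 1.9+).
# Matrix.(dfs) still allocates one temporary matrix per dataframe.
f1_stack(dfs) = stack(Matrix.(dfs))

arr = f1_stack(dfs)
```

Here arr[:, :, 1] should match Matrix(A) and arr[:, :, 2] should match Matrix(B), the same layout that the cat version produces.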

From a speed perspective you can probably beat it easily with the following one-liner

f2(dfs) = [ dfs[k][n,m] for n = 1:size(dfs[1],1), m = 1:size(dfs[1],2), k = 1:length(dfs) ]

Having said that, if you're willing to be a little more verbose, you can probably do better again by iterating over columns with eachcol, the column iterator that DataFrames provides, and copying each column into a preallocated array.

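function f3(dfs)
    # Preallocate the output; assumes every column holds Float64 values
    y = Array{Float64,3}(undef, size(dfs[1],1), size(dfs[1],2), length(dfs))
    for k in 1:length(dfs)
        for (n, col) in enumerate(eachcol(dfs[k]))
            y[:, n, k] = col   # copy column n of dataframe k into the output
        end
    end
    return y
end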

As a general rule, if you want speed in Julia, loops are often the best approach. Let's do a quick comparison of the three approaches:

julia> using BenchmarkTools

julia> @btime f1($dfs);
  182.454 μs (132 allocations: 7.89 KiB)

julia> @btime f2($dfs);
  935.217 ns (21 allocations: 672 bytes)

julia> @btime f3($dfs);
  338.664 ns (11 allocations: 368 bytes)

So f3 is roughly 2.8x faster than f2, and more than 500x faster than f1. You could add @inbounds to f2 and f3 for further optimization, although I suspect it won't gain you much.
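For what it's worth, here is a sketch of what the @inbounds variant of the loop version could look like. The name f3_inbounds is made up, and I have added a size check up front, because @inbounds is only safe if every dataframe really has the same dimensions. I have not benchmarked this, so treat any gain as unverified.

```julia
using DataFrames

# Hypothetical variant of the loop version, with an explicit size check
# so that skipping bounds checks is safe.
function f3_inbounds(dfs)
    nr, nc = size(dfs[1])
    all(df -> size(df) == (nr, nc), dfs) ||
        throw(DimensionMismatch("all dataframes must have the same size"))
    y = Array{Float64,3}(undef, nr, nc, length(dfs))
    for (k, df) in enumerate(dfs)
        for (n, col) in enumerate(eachcol(df))
            # Whole-column assignment is a single call per column, so the
            # dynamic dispatch on the column type happens once, not per element.
            @inbounds y[:, n, k] = col
        end
    end
    return y
end

A = DataFrame([8.1 9.2 7.5 6.6; 6.9 8.1 6.8 5.8], :auto)
B = DataFrame([9.0 2.1 5.2 5.3; 1.2 4.9 9.8 7.7], :auto)
arr = f3_inbounds([A, B])
```

I kept the whole-column assignment rather than an element-by-element loop on purpose: the column type is not known to the compiler inside the loop, so indexing col[i] one element at a time would likely be slower, not faster.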

Now, to be fair, I just assumed everything was Float64 here. However, with a quick type check up front, you can generalise this to any element type, as long as every column shares that one type (which presumably they do, given that you want to convert to a single array).
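A minimal sketch of that generalisation might look like the following. The name f3_typed is made up, and the choice to throw an ArgumentError when the column types differ is my assumption about what a "quick type check" should do; you could promote to a common type instead.

```julia
using DataFrames

# Hypothetical helper: the loop version with the element type taken from the data.
function f3_typed(dfs)
    T = eltype(dfs[1][!, 1])   # element type of the first column
    # Type check up front: every column of every dataframe must have eltype T
    for df in dfs, col in eachcol(df)
        eltype(col) === T ||
            throw(ArgumentError("all columns must have element type $T"))
    end
    y = Array{T,3}(undef, size(dfs[1], 1), size(dfs[1], 2), length(dfs))
    for (k, df) in enumerate(dfs)
        for (n, col) in enumerate(eachcol(df))
            y[:, n, k] = col   # copy column n of dataframe k
        end
    end
    return y
end

# Example with integer columns instead of Float64
ints = [DataFrame([1 2; 3 4], :auto), DataFrame([5 6; 7 8], :auto)]
arr = f3_typed(ints)
```

With the integer input above, arr should come back as a 2x2x2 integer array rather than a Float64 one, and mixing integer and floating-point columns should raise the ArgumentError.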

edited Jun 17, 2021 at 12:45

answered Jun 17, 2021 at 12:21

2 Comments

Philopolis Over a year ago

Many thanks, it's perfect! (Btw: "As a general rule, if you want speed in Julia, loops are often the best approach", thanks for this insight. As an R user, this is a huge shift in coding practices. ;-))

Colin T Bowers Over a year ago

@Philopolis I was just messing around and managed to get another significant boost in performance by using iteration protocols specific to dataframes. See updated answer...
