[A Julia noob question.]

Let's say I have a vector of dataframes, as below:

using DataFrames
# DataFrames 1.0 and later need column names (or :auto) when building from a matrix
A = DataFrame([8.1 9.2 7.5 6.6; 6.9 8.1 6.8 5.8], :auto)
B = DataFrame([9.0 2.1 5.2 5.3; 1.2 4.9 9.8 7.7], :auto)
dfs = [A, B]

Of course, I actually have many more dataframes in dfs than in this MWE, but all of them have the same dimensions and contain only numeric columns.

I would like to transform dfs into a multidimensional (here, 2x4x2) array arr, with arr[:, :, 1] equal to A, and arr[:, :, 2] equal to B. How can I perform this transformation? (Of course, a for loop might do the trick, but I guess that there is a more elegant way to proceed.)

Thanks!

1 Answer

I suppose that

f1(dfs) = cat(Matrix.(dfs)..., dims=3)

is a reasonably elegant one-liner, but it allocates temporaries.
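If you are on Julia 1.9 or later, Base also provides stack, which may be a tidier way to write the same idea because it avoids splatting a long vector into cat. The sketch below uses plain matrices as stand-ins for Matrix.(dfs) so that it runs without any packages; the name to_array3 is made up for illustration, and I have not benchmarked it against the functions in this answer.

```julia
# Requires Julia 1.9 or later, where `stack` was added to Base.
# Plain matrices stand in for Matrix.(dfs) here.
mats = [[8.1 9.2 7.5 6.6; 6.9 8.1 6.8 5.8],
        [9.0 2.1 5.2 5.3; 1.2 4.9 9.8 7.7]]

# `stack` keeps the axes of each element first and adds a new trailing axis,
# so two 2x4 matrices become one 2x4x2 array.
arr = stack(mats)

# Hypothetical helper: convert each table to a Matrix, then stack the results.
# With DataFrames loaded, this would be called as to_array3(dfs).
to_array3(tables) = stack(Matrix, tables)
```

Like f1, this still builds one Matrix per dataframe, so it is a readability option rather than a speed claim.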

From a speed perspective, you can probably beat f1 easily with the following one-liner:

f2(dfs) = [ dfs[k][n,m] for n = 1:size(dfs[1],1), m = 1:size(dfs[1],2), k = 1:length(dfs) ]

Having said that, if you're willing to be a little more verbose, you can probably do better again using the iteration protocols specifically designed for use with DataFrame.

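function f3(dfs)
    # Preallocate the rows x columns x dataframes result (assumes Float64 columns)
    y = Array{Float64,3}(undef, size(dfs[1],1), size(dfs[1],2), length(dfs))
    for k = 1:length(dfs)
        # eachcol iterates over the columns of a DataFrame without copying them
        for (n,col) in enumerate(eachcol(dfs[k]))
            y[:,n,k] = col
        end
    end
    return y
end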

As a general rule, if you want speed in Julia, loops are often the best approach. Let's do a quick comparison of the three approaches:

julia> using BenchmarkTools

julia> @btime f1($dfs);
  182.454 μs (132 allocations: 7.89 KiB)

julia> @btime f2($dfs);
  935.217 ns (21 allocations: 672 bytes)

julia> @btime f3($dfs);
  338.664 ns (11 allocations: 368 bytes)

So f3 is roughly 540x faster than f1 and close to 3x faster than f2. You could add @inbounds to f2 and f3 for further optimization, although I suspect it won't gain you much.

Now, to be fair, I just assumed everything was Float64 here. However, with a quick type check up front, you can generalise this to any type (as long as it is all one type - which presumably it is given that you're wanting to convert to a single array).
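A minimal sketch of that generalisation might look like the following. The name stack_tables is made up for illustration, and it relies only on size and eachcol, which f3 already uses on a DataFrame and which also work on a plain Matrix. Note one assumption that goes beyond the text above: instead of insisting on a single type, it promotes all column element types to a common one, so swap in an equality check if you would rather get an error for mixed types. Plain matrices are used in the usage lines so the snippet runs without any packages.

```julia
# Sketch: f3 with the element type worked out up front instead of hard-coded.
# Works for any table type that supports `size` and `eachcol`.
function stack_tables(tables)
    isempty(tables) && throw(ArgumentError("need at least one table"))
    nr, nc = size(first(tables), 1), size(first(tables), 2)
    # Type check up front: promote over every column of every table.
    # (Assumes each table has at least one column.)
    T = mapreduce(t -> mapreduce(eltype, promote_type, eachcol(t)),
                  promote_type, tables)
    y = Array{T,3}(undef, nr, nc, length(tables))
    for (k, t) in enumerate(tables)
        # Fail early if a table does not match the shape of the first one
        size(t) == (nr, nc) ||
            throw(DimensionMismatch("table $k has size $(size(t))"))
        for (n, col) in enumerate(eachcol(t))
            y[:, n, k] = col
        end
    end
    return y
end

# Usage with plain matrices as stand-ins; with DataFrames loaded,
# the call would be stack_tables(dfs).
A = [8.1 9.2 7.5 6.6; 6.9 8.1 6.8 5.8]
B = [9.0 2.1 5.2 5.3; 1.2 4.9 9.8 7.7]
arr = stack_tables([A, B])
```

The extra pass over the columns to find T is cheap compared with copying the data, and it keeps the result array concretely typed.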

2 Comments

Many thanks, it's perfect! (Btw: "As a general rule, if you want speed in Julia, loops are often the best approach", thanks for this insight. As an R user, this is a huge shift in coding practices. ;-))
@Philopolis I was just messing around and managed to get another significant boost in performance by using iteration protocols specific to dataframes. See updated answer...