2

I need to populate a pandas dataframe in the following format

ID.  vector1              vector2        scalar
11  [0.34, 0.56, 0.99]   [0.24, 0.44]   9
23  [0.12, 0.66, 0.21]   [0.21, 0.24]   4
...

My input data is a numpy array of the following format

[
[11 0.34 0.56 0.99 0.24 0.44 9]
[23 0.12 0.66 0.21 0.21 0.24 4]
]

I tried playing with hsplit but couldn't get this to work.

2 Answers 2

2
import numpy as np
import pandas as pd

input = np.array([
    [11, 0.34, 0.56, 0.99, 0.24, 0.44, 9],
    [23, 0.12, 0.66, 0.21, 0.21, 0.24, 4]
])

df = pd.DataFrame(input)

If df is your starting data frame...

       0       1       2       3       4       5      6
    11.0    0.34    0.56    0.99    0.24    0.44    9.0
    23.0    0.12    0.66    0.21    0.21    0.24    4.0

You can combine columns as follows to get the output you're looking for.

The main thing that will help you is the lambda function - just put the columns you want in square brackets and make sure you set axis=1 to pass the function to the dataframe's columns.

df2 = df[0].to_frame()
df2.columns = ['ID.']
df2['vector1'] = df.apply(lambda x: [x[1], x[2], x[3]], axis=1)
df2['vector2'] = df.apply(lambda x: [x[4], x[5]], axis=1)
df2['scalar'] = df[6]

Output

    ID.     vector1             vector2         scalar
    11.0    [0.34, 0.56, 0.99]  [0.24, 0.44]    9.0
    23.0    [0.12, 0.66, 0.21]  [0.21, 0.24]    4.0
Sign up to request clarification or add additional context in comments.

Comments

1

You can convert numpy slices to lists to add them to the dataframe.
Set up the dataframe

import numpy as np
import pandas as pd

input_data = np.array([
    [11, 0.34, 0.56, 0.99, 0.24, 0.44, 9],
    [23, 0.12, 0.66, 0.21, 0.21, 0.24, 4]
])
df = pd.DataFrame({
    'ID.': input_data[:,0].astype(np.int),
    'scalar':input_data[:,-1].astype(np.int)
})

Add input_data as lists to the dataframe

df['vector1'] = input_data[:,1:4].tolist()
df['vector2'] = input_data[:,4:6].tolist()
df[['ID.', 'vector1', 'vector2', 'scalar']]

Output

   ID.             vector1       vector2  scalar
0   11  [0.34, 0.56, 0.99]  [0.24, 0.44]       9
1   23  [0.12, 0.66, 0.21]  [0.21, 0.24]       4

Micro-Benchmark

As both solutions are not vectorized, I was wondering which performance was impacted more.
Results

benchmark results

Code for the benchmark

import numpy as np
import pandas as pd

def create_data(n):
    input_data = np.array([
        [11, 0.34, 0.56, 0.99, 0.24, 0.44, 9],
        [23, 0.12, 0.66, 0.21, 0.21, 0.24, 4]
    ])
    return np.repeat(input_data, n, 0)

def add_with_list(input_data):
    df = pd.DataFrame({
        'ID.': input_data[:,0].astype(np.int),
        'scalar':input_data[:,-1].astype(np.int)
    })
    df['vector1'] = input_data[:,1:4].tolist()
    df['vector2'] = input_data[:,4:6].tolist()
    return df[['ID.', 'vector1', 'vector2', 'scalar']]

def add_with_apply(input_data):
    df = pd.DataFrame(input_data)
    df2 = df[0].to_frame()
    df2.columns = ['ID.']
    df2['vector1'] = df.apply(lambda x: [x[1], x[2], x[3]], axis=1)
    df2['vector2'] = df.apply(lambda x: [x[4], x[5]], axis=1)
    df2['scalar'] = df[6]
    return df2

import perfplot

perfplot.show(
    setup=create_data,
    kernels=[add_with_list, add_with_apply],
    n_range=[2**k for k in range(2,20)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.