Break Pandas series into multiple DataFrame columns based on string position

Question

Given a pandas Series of strings, I'd like to create a DataFrame with one column per fixed-width section of each string, split by position.

For example, given this input:

import pandas as pd

s = pd.Series(['abcdef', '123456'])
ind = [2, 3, 1]  # width of each section, in characters

Ideally I'd get this:

target_df = pd.DataFrame({
  'col1': ['ab', '12'],
  'col2': ['cde', '345'],
  'col3': ['f', '6']
})

One way is creating them one-by-one, e.g.:

df = pd.DataFrame()
df['col1'] = s.str[:2]
df['col2'] = s.str[2:5]
df['col3'] = s.str[5]

But I'm guessing this is slower than a single split.

I tried a regex, but I'm not sure how to parse the result:

pd.DataFrame(s.str.split(r"(^(\w{2})(\w{3})(\w{1}))"))
#                          0
# 0 [, abcdef, ab, cde, f, ]
# 1 [, 123456, 12, 345, 6, ]

Max Ghenis · Accepted Answer · 2018-09-21 00:23:34Z


Your regex is almost there. Use Series.str.extract instead of split; with expand=True it returns a DataFrame with one column per capture group:

df = s.str.extract(r"^(\w{2})(\w{3})(\w{1})", expand=True)
df.columns = ['col1', 'col2', 'col3']
#   col1    col2    col3
# 0 ab      cde     f
# 1 12      345     6
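As a possible variation on the snippet above, named capture groups let extract label the columns itself, so the separate df.columns assignment is not needed. This is a minimal sketch; the variable name named_df is only illustrative, and the group names are taken from the column names in the question.

```python
import pandas as pd

s = pd.Series(['abcdef', '123456'])

# Each named group (?P<name>...) becomes a column with that name.
named_df = s.str.extract(
    r"^(?P<col1>\w{2})(?P<col2>\w{3})(?P<col3>\w{1})",
    expand=True,
)
print(named_df)
```

This only helps when the widths and names are known up front; for widths held in a list, the pattern still has to be built programmatically.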

Here's a function to generalize this:

def split_series_by_position(s, ind, cols):
  # Build a pattern such as ^(\w{2})(\w{3})(\w{1}) from the widths in ind.
  regex = r"^(\w{" + r"})(\w{".join(map(str, ind)) + "})"
  df = s.str.extract(regex, expand=True)
  df.columns = cols
  return df

# Example which will produce the result above.
split_series_by_position(s, ind, ['col1', 'col2', 'col3'])
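If the strings may contain characters that \w does not match (spaces or punctuation, for example), a regex-free variant of the same idea is to turn the widths into cumulative offsets and slice. The following is a sketch under that assumption, not part of the original answer; the helper name split_series_by_slices is hypothetical.

```python
from itertools import accumulate

import pandas as pd


def split_series_by_slices(s, ind, cols):
    """Hypothetical helper: split each string in s into fixed-width columns."""
    # Widths [2, 3, 1] become offsets [0, 2, 5, 6].
    offsets = [0, *accumulate(ind)]
    # Pair each column name with its (start, stop) slice bounds.
    return pd.DataFrame({
        col: s.str[start:stop]
        for col, start, stop in zip(cols, offsets[:-1], offsets[1:])
    })


s = pd.Series(['abcdef', '123456'])
ind = [2, 3, 1]
result = split_series_by_slices(s, ind, ['col1', 'col2', 'col3'])
print(result)
```

One behavioral difference to keep in mind: when a string does not match the pattern, extract returns NaN for that row, whereas slicing a string that is too short yields shorter or empty strings.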

edited Sep 21, 2018 at 0:23

Max Ghenis


answered Sep 20, 2018 at 19:40

Vaishali

