Given a pandas Series of strings, I'd like to create a DataFrame with one column per fixed-width section of each string, split by position.

For example, given this input:

import pandas as pd

s = pd.Series(['abcdef', '123456'])
ind = [2, 3, 1]  # width of each section, in characters

Ideally I'd get this:

target_df = pd.DataFrame({
  'col1': ['ab', '12'],
  'col2': ['cde', '345'],
  'col3': ['f', '6']
})

One way is to create the columns one by one, e.g.:

df = pd.DataFrame()
df['col1'] = s.str[:2]
df['col2'] = s.str[2:5]
df['col3'] = s.str[5]

But I'm guessing this is slower than a single split.

I tried a regex, but I'm not sure how to parse the result:

pd.DataFrame(s.str.split(r"(^(\w{2})(\w{3})(\w{1}))"))
#                          0
# 0 [, abcdef, ab, cde, f, ]
# 1 [, 123456, 12, 345, 6, ]

1 Answer

Your regex is almost there (note Series.str.extract(expand=True) returns a DataFrame):

df = s.str.extract(r"^(\w{2})(\w{3})(\w{1})", expand=True)
df.columns = ['col1', 'col2', 'col3']
#   col1    col2    col3
# 0 ab      cde     f
# 1 12      345     6
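
If you'd rather not assign the column names in a separate step, named capture groups should work too, since `str.extract` uses group names as column names. A minimal sketch, reusing the `col1`..`col3` names from the question (the variable name `named` is just for illustration):

```python
import pandas as pd

s = pd.Series(['abcdef', '123456'])

# Each (?P<name>...) group becomes a column called <name> in the result.
named = s.str.extract(r"^(?P<col1>\w{2})(?P<col2>\w{3})(?P<col3>\w{1})")
print(named)
```

This keeps the widths and the column names side by side in the pattern, at the cost of a longer regex.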

Here's a function to generalize this:

def split_series_by_position(s, ind, cols):
  # Build one capture group per width, e.g. [2, 3, 1] -> ^(\w{2})(\w{3})(\w{1})
  regex = r"^(\w{" + r"})(\w{".join(map(str, ind)) + "})"
  df = s.str.extract(regex, expand=True)
  df.columns = cols
  return df

# Example which will produce the result above.
split_series_by_position(s, ind, ['col1', 'col2', 'col3'])
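
One caveat: `\w` only matches letters, digits and underscores, so a row containing spaces or punctuation will not match and comes back as NaN. If your data can contain such characters, a slice-based variant may be safer. The following is only a sketch, not a benchmarked replacement; `split_series_by_slices` is a hypothetical helper name, and it assumes `ind` holds the section widths in order:

```python
from itertools import accumulate

import pandas as pd


def split_series_by_slices(s, ind, cols):
    # Cumulative offsets: widths [2, 3, 1] -> boundaries [0, 2, 5, 6].
    bounds = [0, *accumulate(ind)]
    # One vectorized string slice per column, whatever characters it contains.
    return pd.DataFrame(
        {col: s.str[start:stop] for col, start, stop in zip(cols, bounds, bounds[1:])}
    )


# Sample data with a space and a hyphen, which \w would not match.
s = pd.Series(['ab def', '12-456'])
result = split_series_by_slices(s, [2, 3, 1], ['col1', 'col2', 'col3'])
print(result)
```

Note the different behavior on short rows: slicing past the end of a string yields an empty string, whereas the regex version yields NaN for any row that does not match the full pattern.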