Extract multiple values from a string via regex

Question

I have a number of strings from a third-party data source that vary in length and contain both underscores and spaces. Each portion of the string is important, and I am trying to break it apart into separate fields in Python. The strings do not contain special characters (\n, \t, etc.); only spaces, underscores, and parentheses separate the data parts.

String	Year	State	ID	Sub ID	Extra1	Extra2
2022_UT_T1000_100 (Classification1 Classification2)	2022	UT	T1000	100	Classification1	Classification2
2021_TX_V999_005 (Classification1)	2021	TX	V999	005	Classification1
1999_GA_123456_7890	1999	GA	123456	7890

I could split the string on the underscores and then split the last field on a space, but that seems error-prone. Regex is likely the best approach.

I can match the year using this: ^[1-9]\d{3,}$. However, when I try to add an OR operator, it only matches the underscore.

Is there a way to extract this data when I know a pattern exists?

@TimBiegeleisen Python, and I updated the question. Actually didn't know language mattered with regex.
– mikebmassey, Commented Feb 7, 2022 at 2:34
@TimBiegeleisen It is a dataframe - trying to blow the string out to other columns.
– mikebmassey, Commented Feb 7, 2022 at 2:36
Is the first string in your example '2022_UT_T1000_100 (Classification1 Classification2)' or '2022_UT_T1000_100\n(Classification1\n Classification2)' or something else?
– Cary Swoveland, Commented Feb 7, 2022 at 2:37

Tim Biegeleisen · Accepted Answer · 2022-02-07 02:52:32Z


You could try using str.extract with the regex pattern:

^(\d{4})_([^_]+)_([^_]+)_([^_ ]+)(?: \((\S+)(?: (\S+))?\))?$

Note that this pattern assumes the string column has only three variants: no extras, one extra, or two extras. An arbitrary number of words in parentheses would need a different approach.
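To see how the three variants come out of the capture groups, here is a minimal sketch using only the standard `re` module on the sample strings from the question. The helper name `parse_record` is made up for illustration; optional groups that do not take part in a match come back as `None`.

```python
import re

# Pattern from the answer: four underscore-separated fields, then an
# optional parenthesized group holding one or two space-separated words.
PATTERN = re.compile(
    r'^(\d{4})_([^_]+)_([^_]+)_([^_ ]+)(?: \((\S+)(?: (\S+))?\))?$'
)

def parse_record(s):
    """Return the six captured fields as a tuple, or None if s does not match."""
    m = PATTERN.match(s)
    return m.groups() if m else None

samples = [
    "2022_UT_T1000_100 (Classification1 Classification2)",
    "2021_TX_V999_005 (Classification1)",
    "1999_GA_123456_7890",
]

for s in samples:
    # Missing extras show up as None in the last one or two positions.
    print(parse_record(s))
```

With pandas `str.extract`, the same unmatched groups should appear as missing values (NaN) in the Extra1 and Extra2 columns.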

Python script:

df[["Year", "State", "ID", "Sub ID", "Extra1", "Extra2"]] = df["String"].str.extract(r'^(\d{4})_([^_]+)_([^_]+)_([^_ ]+)(?: \((\S+)(?: (\S+))?\))?$')

Here is a regex demo showing that the pattern is working for all variants of your string column.
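For the arbitrary-length case mentioned in the note, one possible approach (a sketch, not part of the original answer) is to capture everything inside the parentheses as a single group and split it on whitespace afterwards. The names `HEAD` and `split_record` are hypothetical, and the sketch assumes the extras never contain a closing parenthesis.

```python
import re

# Same four leading fields as in the answer, but the parenthesized part is
# captured whole so it can hold any number of space-separated words.
HEAD = re.compile(r'^(\d{4})_([^_]+)_([^_]+)_([^_ ]+)(?: \(([^)]*)\))?$')

def split_record(s):
    """Return [year, state, id, sub_id, *extras], or None if s does not match."""
    m = HEAD.match(s)
    if m is None:
        return None
    year, state, id_, sub_id, extras = m.groups()
    # extras is None when there is no parenthesized part at all.
    return [year, state, id_, sub_id] + (extras.split() if extras else [])

print(split_record("2022_UT_T1000_100 (Classification1 Classification2 Classification3)"))
print(split_record("1999_GA_123456_7890"))
```

The resulting lists have different lengths per row, so they would need padding before being assigned to a fixed set of dataframe columns.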

edited Feb 7, 2022 at 2:52

answered Feb 7, 2022 at 2:46

