I have a number of strings from a 3rd party data source that are in various lengths that contain both underscore and spaces. Each portion of the string is important and I am trying to break it apart into various fields via python. The string does not have special characters (\n, \t, etc.) - should just be spaces, underscores, and parentheses are used to break the data parts.
| String | Year | State | ID | Sub ID | Extra1 | Extra2 |
|---|---|---|---|---|---|---|
| 2022_UT_T1000_100 (Classification1 Classification2) | 2022 | UT | T1000 | 100 | Classification1 | Classification2 |
| 2021_TX_V999_005 (Classification1) | 2021 | TX | V999 | 005 | Classification1 | |
| 1999_GA_123456_7890 | 1999 | GA | 123456 | 7890 |
I could split the string by the underscore, then split the last field by a space but that seems error-prone. Regex is likely the best approach.
I can match the year using this: ^[1-9]\d{3,}$. However, when trying to add an OR operator, it will only find the underscore.
Is there a way to extract this data when I know a pattern exists?
'2022_UT_T1000_100 (Classification1 Classification2)'or2022_UT_T1000_100\n(Classification1\n Classification2)' or something else?