0

I have a string of account information with multiple accounts in the string (the example shows one line, where I actually have a text file with multiple lines of account data, so there is another loop going through each line in the text file in my code). I need to pull out each account into its own line. The code below works, but I assume there is a more efficient or better way to do it. I am just starting to learn Regex.

import re
import pandas as pd

allAccounts = []
example = '02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'  

rex = '[0-9]{1,2}-[0-9]{1,7}-[0-9]{1,2}'
accounts = re.findall(rex, example)
for account in accounts:       
    example= example.replace(account, f'||{account}')
       
example = [account.replace('  ', '|').split('|') for account in example.split('||')][1:]
allAccounts += example
df = pd.DataFrame(allAccounts)
df

from the regex portion of the code, I want it to return:

['02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42', ' 02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'] 
# or 
'||02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  ||02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'

The code returns the following df which is what ultimately I want:

                0       1             2     3          4         5      6      7       8
0   02-0015800-00   NAME1   100 SOME ST Active  3/8/2021    139.23  139.81  0.42    
1   02-0023901-01   NAME2   101 SOME ST Active  3/8/2021    512.33  482.96  -5.73   None

But I feel like there is a better way to use the Regex than what I am doing. Reading the docs it seems like re.sub should do it, but it is only replacing the first account number it comes upon, and it only want to replace the account number not add the '||' separator to the beginning.


update:

Using the following it gets close to what I want but not sure why the first item in the list is ''.

example = '02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'  
rex = re.compile('(?=[0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9])')
re.split(rex, example)

outputs:

['',
 '02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  ',
 '02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73']
7
  • Do you actually need a regular expression that matches the entire account info, or do you just need to be able to find the boundaries? If the latter, what is the actual rule that tells you where one account's info ends and the next begins? While looking through the re module documentation, did you see re.split? 90% of programming is precisely understanding the problem that needs to be solved. Commented May 10, 2022 at 14:58
  • I cannot get re.split to work. what I want the regex to return is ['02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42', ' 02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73'] or '||02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42 ||02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73'. Where in the latter I can use the '||' to split each account into a list. Commented May 10, 2022 at 15:52
  • Another approach: there is a set number of fields in each line (seems to be eight of them), right? What if you just split the data into fields (separated by the double space), and then reshaped that data into 8-by-N? Commented May 10, 2022 at 15:54
  • @Karl Knechtel I wish it were that easy, but they are not all 8, some have 7 and some 9 Commented May 10, 2022 at 15:57
  • Where did this data come from? In the cases where there are 9 values, what do they all represent? In the cases with fewer than 9, is it always the same ones that are missing? Because ragged data like that is a much bigger problem that should have been mentioned up front - the point of a DataFrame is that it represents a rectangular, consistent table of data. Commented May 10, 2022 at 16:03

1 Answer 1

1

Instead of using split, you can match the values:

\b\d\d-\d{7}-\d\d\b.*?(?=\s*\b\d\d-\d{7}-\d\d\b.*?|$)

Explanation

  • \b\d\d-\d{7}-\d\d\b Match the pattern with 2 digits - 7 digits - 2 digits using a quantifier
  • .*? Match any character as least as possible
  • (?=\s*\b\d\d-\d{7}-\d\d\b.*?|$) Positive lookahead to assert either the digits pattern to the right or the end of the string to also match the last occurrence

Regex demo

Example

import re

pattern = r"\b\d\d-\d{7}-\d\d\b.*?(?=\s*\b\d\d-\d{7}-\d\d\b.*?|$)"

s = "02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73"

print(re.findall(pattern, s))

Output

['02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42', '02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73']

If you must use split:

import re

pattern = r"\b(?=\d\d-\d{7}-\d\d\b)"

s = "02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73"

result = [m.strip() for m in re.split(pattern, s) if m]
print(result)

See a Python demo

Sign up to request clarification or add additional context in comments.

1 Comment

forth bird I like the split version of your answer, at least for me, it is easier to see what the code is doing. Thanks for explanation, it helps a lot. What is purpose of the \b at the beginning and end of the pattern? Is one version faster then the other?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.