No luck finding regex pattern python

Question

I am having no luck getting anything from this regex search.
I have a text file that looks like this:

REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~

I want to extract the lines that begin with "REF*23*" and end with "~".

txtfile = open(i + fileName, "r")
for line in txtfile:
    line = line.rstrip()
    p = re.findall(r'^REF*23*.+~', line)
    print(p)

But this gives me nothing. As much as I'd like to dig deep into regex with Python, I need a quick solution to this. What I eventually want is just the digits between the last "*" and the "~". Thanks.

@Pete's solution will work, but is less elegant because the OP wants to form a group of the digits NOT including the star after 23.
– charmoniumQ, Commented Nov 5, 2019 at 19:54
Ben, please check my answer, I think I covered all your possible requirements.
– Wiktor Stribiżew, Commented Nov 6, 2019 at 8:35

Wiktor Stribiżew · Accepted Answer · 2019-11-08 18:09:08Z

4

You do not really need a regex if the only task is to extract the lines that begin with "REF*23*" and end with "~":

results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line)

print(results)

If you need to get the digit chunks:

results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line[7:-1]) # Just grab the slice

See non-regex approach demo.
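As a minimal sketch of the same non-regex idea, the slice bounds can be derived from the delimiters instead of hardcoding 7 and -1. The function name extract_values, its prefix and suffix parameters, and the sample list are made up for illustration and are not part of the original answer:

```python
def extract_values(lines, prefix="REF*23*", suffix="~"):
    """Return the text between prefix and suffix for each matching line."""
    values = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(prefix) and line.endswith(suffix):
            # Slice off the delimiters using their lengths, not fixed offsets
            values.append(line[len(prefix):len(line) - len(suffix)])
    return values


# Hypothetical sample, taken from the lines shown in the question
sample = ["REF*0F*452574437~\n", "REF*23*526344060~\n", "DTP*336*D8*20140623~\n"]
print(extract_values(sample))  # ['526344060']
```

Because any iterable of strings works, an open file object could presumably be passed in place of the sample list.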

NOTES

In a regex, * must be escaped (\*) to match a literal asterisk.
You read the file line by line, so re.findall(r'^REF*23*.+~', line) makes little sense: re.findall is meant for getting multiple matches, while you expect one per line.
Your regex is not anchored on the right; you need $ or \Z to match ~ at the end of the line. So, if you want to use a regex, it would look like

m = re.search(r'^REF\*23\*(.*)~$', line)
if m:
    results.append(m.group(1))  # To grab just the contents between delimiters
    # or results.append(line)   # To get the whole line

See this Python demo.
In your case, you search for lines that start and end with fixed text, so there is no need for a regex.
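To make the first note concrete, here is a small sketch of why the pattern from the question finds nothing. Unescaped, F* means "zero or more F" and 3* means "zero or more 3", so the pattern never looks for a literal asterisk. The variable names and the "RE2x~" test string are made up for illustration:

```python
import re

line = "REF*23*526344060~"

broken = r'^REF*23*.+~'       # F* and 3* are quantifiers here, not literal stars
fixed = r'^REF\*23\*(.*)~$'   # \* matches a literal asterisk

print(re.findall(broken, line))     # [] because a "2" is expected right after "REF"
print(re.findall(broken, "RE2x~"))  # ['RE2x~'], which is what the pattern really describes

m = re.search(fixed, line)
if m:
    print(m.group(1))               # 526344060
```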

Edit as an answer to the comment

Another text file is a very long unbroken line with hardly any spaces. I need to find where a section begins with REF*0F* and ends with ~, with the number I want in between.

You may read the file line by line and grab all occurrences of 1+ digits between REF*0F* and ~:

import re

results = []
with open(fileName, "r") as txtfile:
    for line in txtfile:
        # Collect every run of digits found between REF*0F* and ~ on this line
        results.extend(re.findall(r'REF\*0F\*(\d+)~', line))

print(results)
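For a quick check without a file, the same pattern can be tried on an in-memory string. This is only a sketch: the data string below is a made-up stand-in for the unbroken file content, assembled from the segment values shown in the question.

```python
import re

# Hypothetical one-line content: segments run together with no newlines
data = "REF*0F*452574437~REF*1L*627783972~REF*23*526344060~REF*0F*1024817112~"

# With one capturing group, findall returns just the captured digits
matches = re.findall(r'REF\*0F\*(\d+)~', data)
print(matches)  # ['452574437', '1024817112']
```

Since the pattern has no ^ or $ anchors, it picks up every REF*0F* section wherever it occurs in the line.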

edited Nov 8, 2019 at 18:09

answered Nov 5, 2019 at 19:46

Wiktor Stribiżew


7 Comments

charmoniumQ Over a year ago

Nothing you said is objectively wrong, but allow me to present the case FOR using a regex. Changing the findall to a match eliminates your second point. The OP wants the digits between the star and the tilde, so end-of-line anchoring is not really necessary, and appending the line is capturing too much. Group-matching in regex is more elegant than hardcoding offsets within the line or getting them from .find.

Ben Smith Over a year ago

Thanks Wiktor. Which method is faster? I'm guessing not regex

Wiktor Stribiżew Over a year ago

@BenSmith I'd suggest using string methods.

Wiktor Stribiżew Over a year ago

@charmoniumQ As for the re.match, that is true, but then I'd even go for re.fullmatch and throw away ^ and $. As for the second point, you are wrong because OP clearly stated I want to extract the lines that begin with REF*23* and ending with the ~. Anyway, I covered the regex approach, too.

Ben Smith Over a year ago

@Wiktor thank you sir! Saved me so much anguish! Lol


Jan · Answer · 2019-11-05 20:09:01Z

1

You can use string functions alone to get only the digits (though a simple regex might really be easier to understand):

raw = """
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
"""

result = [digits[:-1]
          for line in raw.split("\n") if line.startswith("REF*23*") and line.endswith("~")
          for splitted in [line.split("*")]
          for digits in [splitted[-1]]]
print(result)

This yields

['526344060']

answered Nov 5, 2019 at 20:09

Jan


1 Comment

Ben Smith Over a year ago

Yeah, that is much easier. RE is a pain.

charmoniumQ · Answer · 2019-11-05 21:48:38Z

1

* is a special character in regex, so you have to escape it, as @The Fourth Bird points out. You are using a raw string, which means you don't have to escape characters for Python's string parsing, but you still have to escape them for the regex engine.

r'^REF\*23\*.+~'

or

'^REF\\*23\\*.+~'
# '\\*' -> '\*' by Python string
# '\*' matches '*' literally by regex engine

will work. Having to escape things twice leads to Leaning Toothpick Syndrome. Using a raw string means you only have to escape once, "saving some trees" in this regard.
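A short sketch to confirm that the two spellings above are the same pattern. It also shows re.escape, which is not mentioned in the original answer but is one way to avoid escaping by hand; the variable names are illustrative only:

```python
import re

raw_pattern = r'^REF\*23\*.+~'
escaped_pattern = '^REF\\*23\\*.+~'

# Both literals produce the identical string, so the regex engine sees the same pattern
print(raw_pattern == escaped_pattern)  # True

# re.escape adds the backslashes for the literal prefix
built = '^' + re.escape('REF*23*') + '(.+)~'
print(built)  # ^REF\*23\*(.+)~

match = re.match(built, 'REF*23*526344060~')
print(match.group(1))  # 526344060
```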

Additional changes

You might also want to put parentheses around .+ to capture it as a group, if you want to extract the digits. Also change findall to match, unless you expect multiple matches per line.

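results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:  # read the file line by line
        line = line.rstrip()
        p = re.match(r'^REF\*23\*(.+)~', line)
        if p:
            # group(1) is the text captured by the parentheses
            results.append(int(p.group(1)))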

Consider using a regex tester such as this one.

edited Nov 5, 2019 at 21:48

answered Nov 5, 2019 at 19:49

charmoniumQ


3 Comments

Ben Smith Over a year ago

Thanks. How can I get these outputs into a list? When I run this, they are all separate scalar string objects.

Ben Smith Over a year ago

perfect. Thanks

Ben Smith Over a year ago

One more issue before I close this case that I was hoping you could help with, @charmoniumQ. Another text file is a very long unbroken line with hardly any spaces. I need to find where a section begins with REF*0F* and ends with '~', with the number I want in between. The code that worked for REF*23* doesn't work here. How do I solve this?
