
I am having no luck getting anything from this regex search.
I have a text file that looks like this:

REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~

I want to extract the lines that begin with "REF*23*" and end with "~".

txtfile = open(i + fileName, "r")
for line in txtfile:
    line = line.rstrip()
    p = re.findall(r'^REF*23*.+~', line)
    print(p)

But this gives me nothing. As much as I'd like to dig deep into regex with Python, I need a quick solution to this. What I eventually want is just the digits between the last "*" and the "~". Thanks

  • Try escaping \* like ^REF\*23\*.+~ Commented Nov 5, 2019 at 19:45
  • I was successful with ^REF\*23.*~$ Commented Nov 5, 2019 at 19:51
  • @Pete's solution will work, but is less elegant because the OP wants to form a group of the digits NOT including the star after 23. Commented Nov 5, 2019 at 19:54
  • Ben, please check my answer, I think I covered all your possible requirements. Commented Nov 6, 2019 at 8:35

3 Answers

4

You do not really need a regex if the only task is to extract the lines that begin with "REF*23*" and end with "~":

results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line)

print(results)

If you need to get the digit chunks:

results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line[7:-1]) # Just grab the slice

See non-regex approach demo.
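The slice line[7:-1] above hardcodes the prefix length. A minimal sketch of the same idea that derives the offsets from the delimiters instead; the names PREFIX, SUFFIX and extract_value are made up for illustration and are not part of the original code:

```python
PREFIX = "REF*23*"  # hypothetical constant: the fixed text the line must start with
SUFFIX = "~"        # hypothetical constant: the fixed text the line must end with

def extract_value(line):
    """Return the text between PREFIX and SUFFIX, or None if the line does not fit."""
    line = line.rstrip()
    if line.startswith(PREFIX) and line.endswith(SUFFIX):
        # Slice offsets come from the delimiter lengths, not magic numbers.
        return line[len(PREFIX):-len(SUFFIX)]
    return None

print(extract_value("REF*23*526344060~\n"))
print(extract_value("REF*0F*452574437~"))
```

If the prefix changes later (say, to REF*0F*), only the constant needs updating.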

NOTES

  • In a regex, * must be escaped to match a literal asterisk.
  • You read the file line by line and expect at most one match per line, so re.findall(r'^REF*23*.+~', line) makes little sense: re.findall is meant for collecting multiple matches, and re.search fits better here.
  • Your regex is not anchored on the right: you need $ or \Z to match ~ at the end of the line. So, if you want to use a regex, it would look like

    m = re.search(r'^REF\*23\*(.*)~$', line)
    if m:
        results.append(m.group(1))  # To grab just the contents between delimiters
        # or results.append(line)   # To get the whole line

    See this Python demo

  • In your case, you search for lines that start and end with fixed text, so there is no need to use a regex.
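To make the notes concrete, here is a small sketch that runs the original unescaped pattern and the escaped, anchored one against a single sample line taken from the question. The variable names are only for illustration:

```python
import re

line = "REF*23*526344060~"

# Unescaped: "F*" means "zero or more F" and "3*" means "zero or more 3",
# so the literal asterisks in the data are never matched.
unescaped = re.findall(r'^REF*23*.+~', line)

# Escaped and anchored on both sides; the group captures the digits.
m = re.search(r'^REF\*23\*(.*)~$', line)
digits = m.group(1) if m else None

print(unescaped)  # []
print(digits)     # 526344060
```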

Edit as an answer to the comment

Another text file is a very long unbroken line with hardly any spaces. I need to find where a section begins with REF*0F* and ends with ~, with the number I want in between.

You may read the file line by line and grab all occurrences of 1+ digits between REF*0F* and ~:

results = []
with open(fileName, "r") as txtfile:
    for line in txtfile:
        res = re.findall(r'REF\*0F\*(\d+)~', line)
        if len(res):
            results.extend(res)

print(results)
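
Because the pattern has no ^ or $ anchors, it also works when every segment sits on one unbroken line. A quick check on a made-up sample string (the data below is hypothetical, assembled from the values in the question):

```python
import re

# Hypothetical sample: several segments on one unbroken line.
data = "REF*0F*452574437~REF*1L*627783972~REF*0F*1024817112~DTP*336*D8*20140623~"

# findall returns only the captured group, i.e. the digits of each REF*0F* segment.
numbers = re.findall(r'REF\*0F\*(\d+)~', data)
print(numbers)  # ['452574437', '1024817112']
```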

7 Comments

Nothing you said is objectively wrong, but allow me to present the case FOR using a regex. Changing the findall to a match eliminates your second point. The OP wants the digits between the star and the tilde, so end-of-line anchoring is not really necessary, and appending the line is capturing too much. Group-matching in regex is more elegant than hardcoding offsets within the line or getting them from .find.
Thanks Wiktor. Which method is faster? I'm guessing not regex
@BenSmith I'd suggest using string methods.
@charmoniumQ As for the re.match, that is true, but then I'd even go for re.fullmatch and throw away ^ and $. As for the second point, you are wrong because OP clearly stated I want to extract the lines that begin with REF*23* and ending with the ~. Anyway, I covered the regex approach, too.
@Wiktor thank you sir! Saved me so much anguish! Lol
1

You can use string functions alone to get only the digits (though a simple regex might really be easier to understand):

raw = """
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
"""

result = [digits[:-1]
          for line in raw.split("\n") if line.startswith("REF*23*") and line.endswith("~")
          for splitted in [line.split("*")]
          for digits in [splitted[-1]]]
print(result)

This yields

['526344060']
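
The nested comprehension can be hard to read. A flatter sketch of the same string-only idea, using str.rpartition to take whatever follows the last "*"; the function name ref23_digits is invented for this example:

```python
def ref23_digits(text):
    """Return the value of every REF*23*...~ line found in text."""
    found = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("REF*23*") and line.endswith("~"):
            # Drop the trailing "~", then keep what follows the last "*".
            found.append(line[:-1].rpartition("*")[2])
    return found

sample = "REF*1L*627783972~\nREF*23*526344060~\nDTP*336*D8*20140623~"
print(ref23_digits(sample))  # ['526344060']
```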

1 Comment

Yeah that Is much easier. RE is a pain
1

* is a special character in regex, so you have to escape it, as @The Fourth Bird points out. You are using a raw string, which means you don't have to escape characters for Python's own string parsing, but you still have to escape them for the regex engine.

r'^REF\*23\*.+~'

or

'^REF\\*23\\*.+~'
# '\\*' -> '\*' by Python string
# '\*' matches '*' literally by regex engine

will work. Having to escape things twice leads to Leaning Toothpick Syndrome. Using a raw string means you only have to escape once, "saving some trees" in this regard.
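
A short sketch to confirm that the two spellings are the same pattern. It also shows re.escape, a standard-library helper that produces the escaped form of a literal string so you do not have to count backslashes yourself; the variable names are just for illustration:

```python
import re

raw_pattern = r'^REF\*23\*.+~'
doubled_pattern = '^REF\\*23\\*.+~'

# Both literals produce the exact same string for the regex engine.
same = raw_pattern == doubled_pattern

# re.escape builds the escaped form of a literal prefix for you.
prefix = re.escape('REF*23*')
matched = re.match(prefix + r'(\d+)~', 'REF*23*526344060~')

print(same)
print(prefix)
print(matched.group(1))
```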

Additional changes

You might also want to put parentheses around .+ to capture it as a group, if you want to extract just that part. Also change the findall to match, unless you expect multiple matches per line.

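results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:  # iterate over the file; this loop was missing
        line = line.rstrip()
        p = re.match(r'^REF\*23\*(.+)~', line)
        if p:
            # group(1) is the text between "REF*23*" and "~"
            results.append(int(p.group(1)))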

Consider using a regex tester such as this one.

3 Comments

Thanks. How can I get these outputs to within a list. When I run this they are all separate string objects being scalar
perfect. Thanks
One more issue before I close this case that I was hoping you could help with, @charmoniumQ: another text file is a very long unbroken line with hardly any spaces. I need to find where a section begins with REF*0F* and ends with '~', with the number I want in between. The code that worked for REF*23* doesn't work here. How do I solve this?
