Capturing repeated pattern in Python

Question

I'm trying to implement some kind of markdown-like behavior for a Python log formatter.

Let's take this string as example:

**This is a warning**: Virus manager __failed__

A few regexes later, the string has lost the markdown-like syntax and been turned into ANSI escape codes:

\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m

But that should be compressed to

\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m

I tried these, besides many other non-working solutions:

(\\033\[([\d]+)m){2,} => Capture: \033[33m\033[1m with g1 '\033[1m' and g2 '1' and \033[0m\033[0m with g1 '\033[0m' and g2 '0'

(\\033\[([\d]+)m)+ many results, not ok

(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok

and others..

My goal is to have as results:

Input \033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m

Output

Match 1 \033[33m\033[1m

Group1: 33

Group2: 1

Match 2 \033[0m\033[0m

Group1: 0

Group2: 0

In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.

there is a lot of unnecessary information, can you post a sample input and the expected output? — marcos
– marcos, Commented Mar 15, 2020 at 18:27
It's at the end... input: \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m output read the end of the message — Psychokiller1888
– Psychokiller1888, Commented Mar 15, 2020 at 18:29
The number of match groups created by an expression will always be a set value. For instance (...)+ will generate only one match group. — Todd
– Todd, Commented Mar 15, 2020 at 20:49
This is not 100% clear: what are the rules? Can you have \033[33m\033[1m\033[22m? If yes, what is the expected output? — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Mar 15, 2020 at 21:51
Try re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text) — Wiktor Stribiżew
– Wiktor Stribiżew, Commented Mar 15, 2020 at 22:04

Wiktor Stribiżew · Accepted Answer · 2020-03-15 22:49:32Z

1

You want to match consecutively repeating \033[\d+m chunks of text and join the numbers after [ with a semicolon.

You may use

re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text)

See the Python demo online

The (?:\\033\[\d+m){2,} pattern matches two or more consecutive chunks, each consisting of \033[, one or more digits, and m. Each match is passed to the lambda expression, which builds the replacement from: 1) \033[, 2) all the numbers after [, extracted with re.findall(r"\[(\d+)", m.group()), deduplicated with set, and joined with ;, and 3) m.
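As a runnable sketch of the approach described above, the snippet below wraps the substitution in a small helper. The name merge_codes is hypothetical and not part of the answer. It also swaps set for dict.fromkeys, which is a variation on the original one-liner: a set does not guarantee iteration order, so 33;1 could come out as 1;33, whereas dict.fromkeys drops duplicates while keeping first-seen order. The input is assumed to contain the literal characters \033 (backslash, zero, three, three), as in the question.

```python
import re

# Matches a run of two or more literal "\033[<digits>m" chunks.
RUN = re.compile(r'(?:\\033\[\d+m){2,}')


def merge_codes(text):
    """Collapse each run of consecutive escape chunks into a single chunk."""
    def fuse(match):
        # Extract every number that follows a "[" inside the matched run.
        codes = re.findall(r'\[(\d+)', match.group())
        # dict.fromkeys removes duplicates but keeps first-seen order,
        # unlike set(), whose iteration order is not guaranteed.
        unique = dict.fromkeys(codes)
        return r'\033[' + ';'.join(unique) + 'm'
    return RUN.sub(fuse, text)


text = r'\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m'
print(merge_codes(text))
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
```

Because the replacement is produced by a function, its return value is inserted as-is, so the backslash in the result needs no extra escaping.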

answered Mar 15, 2020 at 22:49

Wiktor Stribiżew


Cary Swoveland · Accepted Answer · 2020-03-16 20:12:31Z

The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex

r"^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)"

to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.

Demo

The regex performs the following operations:

^           # match beginning of line
(           # begin cap grp 1
  \\0       # match '\0'
  (\d+)     # match 1+ digits in cap grp 2
  \[        # match '['
  \2        # match contents of cap grp 2
)           # end cap grp 1
[a-z]       # match a lc letter
\\0         # match '\0'      
\2          # match contents of cap grp 2
\[          # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
            #   end of the line in cap grp 3

As you see, the portion of the string captured in group 1 is

\033[33

I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.

The next part of the string is to be replaced and therefore is not captured:

m\033[

I've assumed that m must be one lower case letter (or should it be a literal m?), that the backslash and zero are required, and that the following digits must again match the content of capture group 2.

The remainder of the string,

1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m

is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.

One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
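A minimal sketch of how this regex might be applied with re.sub, joining capture groups 1 and 3 with a semicolon. It relies on the assumptions stated above: the pattern is anchored at the start of the line, and the first code has to equal the digits after the leading zero (33 in this example). Only the leading pair is merged, so the trailing \033[0m\033[0m is left as it is. The variable names are illustrative only.

```python
import re

# The pattern from this answer; ^ anchors it to the start of the string.
pattern = re.compile(r'^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)')

text = r'\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m'

m = pattern.match(text)
print(m.group(1))  # \033[33
print(m.group(2))  # 33
print(m.group(3))  # the rest of the line, starting at "1mThis is a warning"

# Join capture groups 1 and 3 with a semicolon; the "m\033[" in between
# is matched but not captured, so it is dropped.
result = pattern.sub(r'\1;\3', text)
print(result)
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
```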

Thank you for the detailed explanation! I will test and return my results asap!

oriberu · Accepted Answer · 2020-03-15 19:48:06Z

0

This is pretty straightforward for the cases you describe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured values would be returned as a result.

\\033\[(\d+)m\\033\[(\d+)m
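To illustrate the point about repeated capturing groups, here is a short sketch comparing a quantified group with the written-out pattern above. The sample string is a shortened version of the one in the question.

```python
import re

text = r'\033[33m\033[1mThis is a warning'

# A quantified capturing group: the group is overwritten on each repetition,
# so only the value from the last repetition survives.
repeated = re.match(r'(?:\\033\[(\d+)m)+', text)
print(repeated.group(0))  # \033[33m\033[1m
print(repeated.groups())  # ('1',)

# Writing the two chunks out explicitly gives one group per chunk.
explicit = re.match(r'\\033\[(\d+)m\\033\[(\d+)m', text)
print(explicit.groups())  # ('33', '1')
```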

answered Mar 15, 2020 at 19:48

oriberu


2 Comments

Psychokiller1888 Over a year ago

Thanks, but this is only valid for 2 consecutive, not random N consecutives

oriberu Over a year ago

Sure. My point was, that you're not doing this with one expression and repeating capturing groups. You'd typically use one expression to grab the sequence as a whole, a second (or string functions) to grab a list of numbers from it and some logic to process the values -- as in Wiktor Stribiżew's answer.
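The two-step idea described in this comment could look roughly like the sketch below: one expression finds each run as a whole, and a second one lists the numbers inside it. The names run_pattern, code_pattern and runs are illustrative only, and the input is assumed to contain the literal characters \033 as in the question.

```python
import re

text = r'\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m'

# Step 1: find each run of two or more consecutive chunks as a whole.
run_pattern = re.compile(r'(?:\\033\[\d+m){2,}')
# Step 2: pull the individual numbers out of one run.
code_pattern = re.compile(r'\[(\d+)m')

runs = [(m.group(), code_pattern.findall(m.group()))
        for m in run_pattern.finditer(text)]

for number, (matched, codes) in enumerate(runs, start=1):
    print(f'Match {number}: {matched} -> {codes}')
# Match 1: \033[33m\033[1m -> ['33', '1']
# Match 2: \033[0m\033[0m -> ['0', '0']
```

This works for any number of consecutive chunks, since the list of codes is built by findall rather than by a fixed number of capture groups.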
