2

I'm trying to implement some kind of Markdown-like behavior for a Python log formatter.

Let's take this string as an example:

**This is a warning**: Virus manager __failed__

A few regexes later, the string has lost the Markdown-like syntax and been turned into ANSI escape codes for the terminal:

\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m

But that should be compressed to

\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m

I tried these, among many other non-working solutions:

(\\033\[([\d]+)m){2,} => captures \033[33m\033[1m with g1 '\033[1m' and g2 '1', and \033[0m\033[0m with g1 '\033[0m' and g2 '0'

(\\033\[([\d]+)m)+ many results, not ok

(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok

and others..

My goal is to have as results:

Input \033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m

Output

Match 1 \033[33m\033[1m

Group1: 33

Group2: 1

Match 2 \033[0m\033[0m

Group1: 0

Group2: 0

In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.

  • There is a lot of unnecessary information, can you post a sample input and the expected output? Commented Mar 15, 2020 at 18:27
  • It's at the end... input: \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m output read the end of the message Commented Mar 15, 2020 at 18:29
  • The number of match groups created by an expression will always be a set value. For instance (...)+ will generate only one match group. Commented Mar 15, 2020 at 20:49
  • This is not 100% clear: what are the rules? Can you have \033[33m\033[1m\033[22m? If yes, what is the expected output? Commented Mar 15, 2020 at 21:51
  • Try re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text) Commented Mar 15, 2020 at 22:04

3 Answers

1

You want to match consecutively repeating \033[\d+m chunks of text and join the numbers after [ with a semicolon.

You may use

re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text)

See the Python demo online

The (?:\\033\[\d+m){2,} pattern matches two or more consecutive chunks, each consisting of \033[, one or more digits, and m. Each match is passed to the lambda expression, which builds the replacement from: 1) \033[, 2) all the numbers after [, extracted with re.findall(r"\[(\d+)", m.group()), deduplicated with set and joined with ;, and 3) m.
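For reference, here is a minimal, self-contained sketch of the same idea. It assumes the text holds the literal characters \033 (backslash, zero, three, three) rather than a real escape byte, as in the question. It also swaps set for dict.fromkeys, because a set does not guarantee the order in which the codes are joined. The names text, merge_codes and merged are only illustrative.

```python
import re

# Literal backslash sequences, as in the question (not real ESC bytes).
text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"

def merge_codes(match):
    # Collect every number that follows a "[" inside the matched run.
    codes = re.findall(r"\[(\d+)", match.group())
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = dict.fromkeys(codes)
    return "\\033[" + ";".join(unique) + "m"

# Only runs of two or more consecutive codes are rewritten.
merged = re.sub(r"(?:\\033\[\d+m){2,}", merge_codes, text)
print(merged)
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
```

Single codes such as \033[4m are left alone because of the {2,} quantifier, and the doubled \033[0m\033[0m at the end collapses to one \033[0m.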

1

The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex

r"^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)"

to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.

Demo

The regex performs the following operations:

^           # match beginning of line
(           # begin cap grp 1
  \\0       # match '\0'
  (\d+)     # match 1+ digits in cap grp 2
  \[        # match '['
  \2        # match contents of cap grp 2
)           # end cap grp 1
[a-z]       # match a lc letter
\\0         # match '\0'      
\2          # match contents of cap grp 2
\[          # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
            #   end of the line in cap grp 3

As you see, the portion of the string captured in group 1 is

\033[33

I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.

The next part of the string is to be replaced and therefore is not captured:

m\033[

I've assumed that m must be one lower case letter (or should it be a literal m?), that the backslash and zero are required, and that the following digits must again match the content of capture group 2.

The remainder of the string,

1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m

is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.

One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
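As a rough illustration of how the three-group version could be applied, the sketch below keeps groups 1 and 3 and puts a semicolon where m\033[ used to be. It assumes the text contains literal backslashes, and the names text, pattern and joined are hypothetical.

```python
import re

# Literal backslash sequences, as in the question (not real ESC bytes).
text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"

pattern = re.compile(r"^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)")

# Group 1 is "\033[33", group 3 is everything from "1m" to the end of the line.
joined = pattern.sub(lambda m: m.group(1) + ";" + m.group(3), text)
print(joined)
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m

# The back-reference \2 requires the code to repeat the digits after the
# leading zero, so a different first code (here 31) does not match.
print(pattern.search(r"\033[31m\033[1mred"))  # None
```

Because the pattern is anchored with ^ and the rest of the line is swallowed by .+, only the pair at the start of the line is joined. The trailing \033[0m\033[0m is left as it is.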

1 Comment

Thank you for the detailed explanation! I will test and return my results asap!
0

This is pretty straightforward for the cases you describe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured values would be returned as a result.

\\033\[(\d+)m\\033\[(\d+)m
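A small sketch of the difference, assuming the text contains literal backslashes as in the question (the variable names are only illustrative): the written-out pattern reports both numbers for each pair, while a repeated capturing group keeps only its last iteration.

```python
import re

# Literal backslash sequences, as in the question (not real ESC bytes).
text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"

# Two explicit groups: every match yields both numbers.
pairs = re.findall(r"\\033\[(\d+)m\\033\[(\d+)m", text)
print(pairs)  # [('33', '1'), ('0', '0')]

# One repeated group: only the value from the last repetition survives.
m = re.search(r"(?:\\033\[(\d+)m){2,}", text)
print(m.group(0))  # \033[33m\033[1m
print(m.group(1))  # 1
```

The pairs list lines up with the Match 1 and Match 2 results the question asks for, but only for runs of exactly two codes.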

2 Comments

Thanks, but this is only valid for 2 consecutive, not random N consecutives
Sure. My point was, that you're not doing this with one expression and repeating capturing groups. You'd typically use one expression to grab the sequence as a whole, a second (or string functions) to grab a list of numbers from it and some logic to process the values -- as in Wiktor Stribiżew's answer.
