I am trying to parse the sample input TEST_STRING1 shown below:

import re
TEST_STRING1 = """Using definitions from (yyyy/mm/dd): 2016/6/8
The following files are collected:
  File: Test.exe
    Source: Google
    avping blob: 123123

Downloaded 3 Files
"""
def fun():
    # Each record is a File line, a Source line, then an avping blob line.
    regex_exp = re.compile(r"File:\s(?P<File>[^\n\r\t]+?)[\n\r\t\s]*?"
                           r"Source:\s(?P<Source>.*)[^\w\d]*?"
                           r"avping\sblob:\s(?P<Avping_blob>([A-F]|[a-f]|[0-9]){6})")
    result = {'Files': []}
    for m in regex_exp.finditer(TEST_STRING1):
        result['Files'].append(m.groupdict())
    print(result)  # the parenthesized form works on Python 2 and 3


if __name__ == "__main__":
    fun()

The output of the above code is:

{'Files': [{'Source': 'Google', 'File': 'Test.exe', 'Avping_blob': '123123'}]}

I want to make some fields in the input optional, such as avping blob. For example:

TEST_STRING1 = """Using definitions from (yyyy/mm/dd): 2016/6/8
The following files are collected:
  File: Test.exe
    Source: Google

Downloaded 3 Files
"""

In that case, the regex above returns no match.

I have updated the regex to:

regex_exp = re.compile(r"(File:\s(?P<File>[^\n\r\t]+?)[\n\r\t\s]*?"
                       r"Source:\s(?P<Source>.*)[^\w\d]*?"
                       r"|avping\sblob:\s(?P<Avping_blob>([A-F]|[a-f]|[0-9]){6}))")

by adding | before the last line. But then it gives two matches, one for each side of the alternation:

{'Files': [{'Source': 'Google', 'File': 'Test.exe', 'Avping_blob': None}, {'Source': None, 'File': None, 'Avping_blob': '123123'}]}

How should I write a regex that matches both input types (with and without the optional fields)? Thanks

1 Answer

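You can wrap the avping blob part in an optional non-capturing group, (?:...)?, and switch the preceding [^\w\d]*? to its greedy form, [^\w\d]*. The greedy form matters: a lazy [^\w\d]*? followed by an optional group can match nothing and still succeed, which leaves Avping_blob as None even when the line is present.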

(File:\s(?P<File>[^\n\r\t]+?)[\n\r\t\s]*?Source:\s(?P<Source>.*)[^\w\d]*(?:avping\sblob:\s(?P<Avping_blob>[A-Fa-f0-9]{6}))?)

See the regex demo
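To see why the quantifier has to be greedy, here is a minimal sketch that runs the tail of the pattern in both forms against a shortened sample. The names SAMPLE, OPTIONAL_BLOB, lazy and greedy are made up for this illustration and are not part of the original code.

```python
import re

# Shortened sample: just the Source line and the avping blob line.
SAMPLE = "Source: Google\n    avping blob: 123123\n"

# Hypothetical helper names; the two patterns differ only in the
# quantifier that sits between the Source field and the optional group.
OPTIONAL_BLOB = r"(?:avping\sblob:\s(?P<Avping_blob>[A-Fa-f0-9]{6}))?"
lazy = re.compile(r"Source:\s(?P<Source>.*)[^\w\d]*?" + OPTIONAL_BLOB)
greedy = re.compile(r"Source:\s(?P<Source>.*)[^\w\d]*" + OPTIONAL_BLOB)

# Lazy: [^\w\d]*? matches nothing, the optional group is skipped at the
# newline, and the match ends right after "Google".
lazy_blob = lazy.search(SAMPLE).group("Avping_blob")

# Greedy: [^\w\d]* consumes the newline and indentation first, so the
# optional group gets a chance to match the avping blob line.
greedy_blob = greedy.search(SAMPLE).group("Avping_blob")

print(lazy_blob)    # None
print(greedy_blob)  # 123123
```

Both patterns match, so the difference only shows up in the value of the Avping_blob group.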

In your code:

regex_exp = re.compile(r"(File:\s(?P<File>[^\n\r\t]+?)[\n\r\t\s]*?"
                       r"Source:\s(?P<Source>.*)[^\w\d]*"    # <- Here ? is removed
                       r"(?:avping\sblob:\s(?P<Avping_blob>[A-Fa-f0-9]{6}))?)")
                         ^^^                                               ^

Also, (?P<Avping_blob>([A-F]|[a-f]|[0-9]){6}) can be simplified to (?P<Avping_blob>[A-Fa-f0-9]{6}), which also drops the extra unnamed capturing group.
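As a quick check, the pattern can be run against both inputs from the question. This is only a sketch: RECORD_RE, WITH_BLOB, WITHOUT_BLOB and parse_files are hypothetical names introduced here, and the regex itself is the one from this answer.

```python
import re

# Pattern from the answer: greedy [^\w\d]* plus an optional
# non-capturing group around the avping blob line.
RECORD_RE = re.compile(r"(File:\s(?P<File>[^\n\r\t]+?)[\n\r\t\s]*?"
                       r"Source:\s(?P<Source>.*)[^\w\d]*"
                       r"(?:avping\sblob:\s(?P<Avping_blob>[A-Fa-f0-9]{6}))?)")

WITH_BLOB = """Using definitions from (yyyy/mm/dd): 2016/6/8
The following files are collected:
  File: Test.exe
    Source: Google
    avping blob: 123123

Downloaded 3 Files
"""

WITHOUT_BLOB = """Using definitions from (yyyy/mm/dd): 2016/6/8
The following files are collected:
  File: Test.exe
    Source: Google

Downloaded 3 Files
"""


def parse_files(text):
    """Return one dict per File record; Avping_blob is None when absent."""
    return {"Files": [m.groupdict() for m in RECORD_RE.finditer(text)]}


with_blob = parse_files(WITH_BLOB)
without_blob = parse_files(WITHOUT_BLOB)
print(with_blob)
print(without_blob)
```

With the first input, Avping_blob should come back as '123123'; with the second, the record still matches and Avping_blob is None.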

2 Comments

Thanks for your answer. I am getting {'Files': [{'Source': 'Google', 'File': 'Test.exe', 'Avping_blob': None}]} in both cases in my Python code; Avping_blob is None in both cases.
Try the updated version. So, the second one can be None, right?
