i am opening this question because it seems my original question needs a new direction.

i would like to create a regular expression that can extract STATIC MESSAGE and DYNAMIC MESSAGE from the following types of log-entries:

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 MODULE.NAME TYPE THREAD.OR.CONNECTION.INFORMATION Static Message;Dynamic Message

one log entry type has a simple structure:

file:date TYPE STATIC;DYNAMIC

the other is not so simple to parse with a regex:

file:date MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC

where the MODULE.NAME and CONNECTION.OR.THREAD are either both present or not present.

my regular expression so far which works on the first type of log entry is:

(?:.*?):(?:\w{3} \d{1,2} \d{1,2}:\d{1,2}:\d{1,2})(?:\s+?)(?:[\S|\.]*?(?:\s*?))?(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+?(?:\s+?))?(.+){1}(?:;(.+)){1}

but whenever i get to the second type of entry, i am also getting the CONNECTION.OR.THREAD as part of my first capturing group.

i am hoping for a way to use the lookahead or lookbehind feature so that i can capture STATIC and DYNAMIC and ignore the CONNECTION.OR.THREAD part if there is a MODULE.NAME ?

i hope this question is clear; please refer to my original question if anything seems unclear. thank you.

EDIT: for clarification: every line of the log is different from the others. each line starts with a file path, then a : and then the date in the format MMM DD HH:MM:SS. after that it gets tricky: there is either a MODULE.NAME (which varies), followed by the TYPE (which also varies), followed by CONNECTION.OR.THREAD (which varies), or just the TYPE. after that comes the STATIC MESSAGE, then a ; and then the DYNAMIC MESSAGE. both the static and the dynamic message vary. i only use the term STATIC because an error can be, for instance, "unable to connect to server; server1.com", where the static part is "unable to connect to server" and the dynamic part is "server1.com".

  • You need to be explicit what parts of your log lines are placeholders for content that varies from one line to the next, and what parts are truly static. For the parts that can vary, stating that there is a limited list of possibilities will also help. Commented Sep 3, 2012 at 13:23
  • @MartijnPieters i have made an edit, hopefully it clears things up. Commented Sep 3, 2012 at 13:49
  • what does "static" mean? Is it a fixed string/a list of fixed strings? Commented Sep 3, 2012 at 13:50
  • @J.F.Sebastian i explained it in my edit: the "static" message is a bit confusing, just pretend it says "part1;part2" Commented Sep 3, 2012 at 13:51
  • @InbarRose: Being able to use a tool without shooting myself in the foot every time does not equal being great with them... Commented Sep 3, 2012 at 14:35

1 Answer

at the moment i have made this massive regex:

(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)(?:(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))|(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1})))

i will split it into parts:

FILE/DATE + SPACE:

(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)

and then EITHER:

SIMPLE: (TYPE STATIC;DYNAMIC)

(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))

OR COMPLEX: (MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC)

(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1}))

it does the trick, but it's huge and i think it can be improved. so if anyone can improve it, please do.
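
One possible way to shorten the pattern, offered as a sketch rather than a tested replacement: Python's re module supports a conditional group, (?(name)...), which matches its sub-pattern only if the named group took part in the match. That ties CONNECTION.OR.THREAD to the presence of MODULE.NAME and leaves only two capture groups. The group names, the parse_line helper, the assumption that MODULE.NAME always contains a dot while TYPE does not, and the assumption that the static part contains no ; are all illustrative and not taken from the real logs.

```python
import re

# TYPE1..TYPE3 are placeholders for the real type names.
LOG_RE = re.compile(
    r"""
    ^.*?:                                 # file path, up to the colon before the date
    \w{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2}\s+  # date in MMM DD HH:MM:SS form
    (?:(?P<module>\S+\.\S+)\s+)?          # optional MODULE.NAME (assumed to contain a dot)
    (?:TYPE1|TYPE2|TYPE3)\s+              # TYPE
    (?(module)\S+\s+)                     # CONNECTION.OR.THREAD, only if module matched
    (?P<static>[^;]+);(?P<dynamic>.+)$    # STATIC;DYNAMIC, split at the first semicolon
    """,
    re.VERBOSE,
)


def parse_line(line):
    """Return (static, dynamic) for a log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    return m.group("static"), m.group("dynamic")
```

Because the conditional only consumes the extra token when the module group matched, the same two named groups serve both line formats.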

EDIT:

there is a problem though: there are now 4 capturing groups, so i cannot know ahead of time whether my results are in captured[0:2] or captured[2:4]. does anyone have a way to do this without checking each time whether something is there? perhaps a way to eliminate empty capturing groups from the results, or to get only the non-empty results from the list? my brain is fried.

EDIT2:

as @martijn pieters suggested, i removed the extraneous grouping. this is my current regex:

.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3)\s+?(.+){1};(.+){1})|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+\s+?(.+){1};(.+){1}))

which works fine. i am concerned about (?:TYPE1|TYPE2|TYPE3) being misinterpreted as TYPE(1|T)YPE(2|T)YPE3. any insight would be appreciated.
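
On that concern: the | operator has the lowest precedence in a regular expression, so inside the group it separates the whole alternatives TYPE1, TYPE2 and TYPE3. A quick check along these lines (the test strings are made up for illustration) shows the difference:

```python
import re

type_re = re.compile(r"(?:TYPE1|TYPE2|TYPE3)")

# Each complete alternative matches on its own.
matches = [bool(type_re.fullmatch(t)) for t in ("TYPE1", "TYPE2", "TYPE3")]

# The feared reading TYPE(1|T)YPE(2|T)YPE3 would accept this string;
# the actual pattern does not.
misread = type_re.fullmatch("TYPE1YPE2YPE3")
```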

also, how best to go about parsing my results - seeing as i will get a list of 4 items with either the first 2 or the second 2 being empty and the other having my static/dynamic results.
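
A minimal sketch of one way to handle the four groups, assuming the pattern is applied with re.match one line at a time: groups from the branch that did not match come back as None, so they can be filtered out. The static_dynamic helper name is hypothetical, and TYPE1 to TYPE3 remain placeholders.

```python
import re

# The EDIT2 pattern, split across lines only for readability.
PATTERN = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?"
    r"(?:(?:(?:TYPE1|TYPE2|TYPE3)\s+?(.+){1};(.+){1})"
    r"|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+\s+?(.+){1};(.+){1}))"
)


def static_dynamic(line):
    """Return (static, dynamic) from whichever branch matched, or None."""
    m = PATTERN.match(line)
    if m is None:
        return None
    # Exactly one branch matches, so exactly two groups are not None.
    static, dynamic = [g for g in m.groups() if g is not None]
    return static, dynamic
```

With re.findall the unused groups come back as empty strings instead of None, so the filter there would be `if g` rather than `if g is not None`.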

EDIT3:

okay, i have done a hybrid solution. i have remade my regular expression:

.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3))|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+))\s+(.*)

i now only have 1 capture group, which is the STATIC;DYNAMIC part. once i get this i do what i was doing before (see my previous question)

for item in captured:
    parts = item.split(";")
    static = parts[0]
    dynamic = ";".join(parts[1:])

that is my solution. thank you @Martijn Pieters especially for your help. i hope this can help someone in the future.
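
For reference, the hybrid approach can be put together as a runnable sketch. It uses the EDIT3 pattern as written, with TYPE1 to TYPE3 as placeholders, and swaps the split/join pair for str.partition, which splits at the first ; and gives the same result. The split_message name is hypothetical.

```python
import re

# The EDIT3 pattern: one capture group holding "STATIC;DYNAMIC".
PAYLOAD_RE = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?"
    r"(?:(?:(?:TYPE1|TYPE2|TYPE3))"
    r"|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+))"
    r"\s+(.*)"
)


def split_message(line):
    """Return (static, dynamic) for a log line, or None if it does not match."""
    m = PAYLOAD_RE.match(line)
    if m is None:
        return None
    # partition splits at the first ';' only, so any later ';'
    # characters stay inside the dynamic part.
    static, _, dynamic = m.group(1).partition(";")
    return static, dynamic
```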

9 Comments

First; try and reduce the groups a little, you seem to have redundant groups where you don't need them ((?:.*?) doesn't need a group for example). Second, use (?P<groupname>..) namings so you get a dictionary result, much easier to work with. See pastie.org/private/iyfwswt3stmvur25v9vu5w for an example we use to parse log lines from a web server.
@MartijnPieters if i were to give 2 capture groups the same name, what would happen? for instance, if i give the <static> name to both groups capturing static, knowing they are mutually exclusive since they are on either sides of a | what will happen?
That would fail, you cannot give the same name to two groups.
@MartijnPieters when things are in groups its easier for me to look at the regex, does it improve the efficiency if there are less groups? even if they are non-capture groups? is there a better way to remove empty results from my captured list other than simply iterating over them and checking their length?
The example I linked to shows you a different way to improve readability; I find your patterns to be extremely unreadable at the moment. (?:..) groups do carry meaning for the regular expression, I don't know if the engine recognizes these groups as otherwise meaningless for the match and leave them out. You should only use them when you do need grouping (such as for specifying repeating compound groups or | selection groupings) to be on the safe side.