I have this block of sample log text:

20190122 09:00,000 ###PERFORMANCE string1 string2 string3
20190122 09:10,500 number1 string1 string2 string3
20190122 09:24,670 number2 string1 string2 string3
20190122 10:05,000 number3 string1 string2 string3
20190122 10:33,960 number4 string1 string2 string3
20190122 11:00,321 number5 string1 string2 string3
20190122 11:40,256 ###PERFORMANCE string1 string2 string3
20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2
20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2
20190123 08:00,000 ###PERFORMANCE string1 string2 string3
20190123 08:10,500 number1 string1 string2 string3
20190123 08:24,670 number2 string1 string2 string3
20190123 09:05,000 number3 string1 string2 string3
20190123 10:33,960 number4 string1 string2 string3
20190123 10:00,321 number5 string1 string2 string3
20190123 13:40,256 ###PERFORMANCE string1 string2 string3
20190124 10:00,000 ###PERFORMANCE string1 string2 string3
20190124 10:10,500 number1 string1 string2 string3
20190124 10:24,670 number2 string1 string2 string3
20190124 11:05,000 number3 string1 string2 string3
20190124 12:33,960 number4 string1 string2 string3
20190124 13:00,321 number5 string1 string2 string3
20190124 13:40,256 ###PERFORMANCE string1 string2 string3

What I would like to do with Python is detect each ###PERFORMANCE block of text in the sample above.

There are 3 blocks of interest, each delimited by lines containing the text ###PERFORMANCE. The first starts at line 1 and ends at line 7. What lies between lines 7 and 10 must not be treated as a block of interest. The number of lines in each block can also vary, so relying on line numbers would not be a good idea.

So far I have only read the text file line by line:

logFile = "testLog.txt"

with open(logFile) as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]

for line in content:
    print(line)

How could I approach this task? Would NLTK be a good idea, and would it even work here? Any general suggestions?

  • Is number1 string1 string2 string3 actual data, or does it mean each line will have a number and three different strings? Commented Jan 24, 2019 at 9:49
  • It means the line can contain different data types and is not limited to one number and 3 strings. It is not actual data, just an example for SO. Commented Jan 24, 2019 at 9:52

2 Answers

As you are simply matching on the PERFORMANCE delimiter, using NLTK seems like overkill. A simpler approach is to check whether the expected string appears on each line and toggle a capture flag whenever it does. For instance:

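logfile = "testLog.txt"  # same file name as in the question
IDENTIFIER = 'PERFORMANCE'
in_block = False
with open(logfile) as f:
    for line in f:
        if IDENTIFIER in line:
            # Toggle the flag on every delimiter line
            in_block = not in_block
        if in_block:
            # line still ends with '\n', so suppress print's own newline
            print(line, end='')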
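
The toggle above prints the opening ###PERFORMANCE line but not the closing one, and it does not separate one block from the next. If each block is wanted as its own unit with both delimiter lines included, one possible variant is sketched below. It assumes delimiter lines always come in open/close pairs; the helper name iter_blocks and the shortened sample lines are made up for illustration.

```python
def iter_blocks(lines, identifier="###PERFORMANCE"):
    """Yield each delimited block as a list of lines, delimiters included."""
    block = None  # None means we are currently outside a block
    for raw in lines:
        line = raw.rstrip("\n")
        if identifier in line:
            if block is None:
                block = [line]       # opening delimiter starts a new block
            else:
                block.append(line)   # closing delimiter ends the block
                yield block
                block = None
        elif block is not None:
            block.append(line)       # ordinary line inside a block
    # Assumption: a block left open at the end of the input is discarded.


# Shortened, made-up lines in the same shape as the question's log
sample = [
    "20190122 09:00,000 ###PERFORMANCE a\n",
    "20190122 09:10,500 1 a\n",
    "20190122 11:40,256 ###PERFORMANCE a\n",
    "20190123 10:24,670 1 a b\n",
    "20190123 08:00,000 ###PERFORMANCE a\n",
    "20190123 13:40,256 ###PERFORMANCE a\n",
]
blocks = list(iter_blocks(sample))
print(len(blocks))
```

Because iter_blocks only iterates over its argument, an open file object can be passed in place of the list, so a large log does not need to be read into memory first.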

1 Comment

That's amazing. It seems to work on the (simple) example posted here. I will test it as soon as possible on a real log to see its behaviour. Thanks for now.

I think what you need can be done with a simple check, if I understood correctly. You can keep a flag (a True/False value) that records whether you are inside an interesting block. Whenever you find '###PERFORMANCE', you flip this flag. Then you can save the interesting and non-interesting lines in two lists, or whatever structure you prefer.

Below is a snippet of the code:

logFile = "logfile.txt"

with open(logFile) as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]

# flag: True while we are between two ###PERFORMANCE lines
are_we_in_the_interesting_block = False

# two lists to save the lines
interesting_block = []
non_interesting_block = []

for line in content:
    # str.find returns the index of the match, or -1 if the text is absent
    is_there_performance = line.find('###PERFORMANCE')

    # use >= 0 so a match at index 0 (start of the line) is also detected
    if is_there_performance >= 0:
        are_we_in_the_interesting_block = not are_we_in_the_interesting_block
    else:
        if are_we_in_the_interesting_block:
            # here I append to a list, but you can do your processing
            interesting_block.append(line)
        else:
            # here processing of the non interesting parts
            non_interesting_block.append(line)

print('Interesting blocks')
print(interesting_block)

print('\n')
print('Non interesting blocks')
print(non_interesting_block)

The produced output would be:

Interesting blocks
['20190122 09:10,500 number1 string1 string2 string3', '20190122 09:24,670 number2 string1 string2 string3', '20190122 10:05,000 number3 string1 string2 string3', '20190122 10:33,960 number4 string1 string2 string3', '20190122 11:00,321 number5 string1 string2 string3', '20190123 08:10,500 number1 string1 string2 string3', '20190123 08:24,670 number2 string1 string2 string3', '20190123 09:05,000 number3 string1 string2 string3', '20190123 10:33,960 number4 string1 string2 string3', '20190123 10:00,321 number5 string1 string2 string3', '20190124 10:10,500 number1 string1 string2 string3', '20190124 10:24,670 number2 string1 string2 string3', '20190124 11:05,000 number3 string1 string2 string3', '20190124 12:33,960 number4 string1 string2 string3', '20190124 13:00,321 number5 string1 string2 string3']


Non interesting blocks
['20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2', '20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2']

Then you could access interesting_block[n] to get the n-th line if needed.
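
Note that interesting_block is one flat list, so indexing it gives a line counted across all blocks, not a whole block. If per-block access is wanted, a minimal sketch is to collect a list of lists instead. The function name split_blocks and the tiny sample log are made up for illustration, and the sketch assumes every opening delimiter has a matching closing one.

```python
def split_blocks(lines, marker="###PERFORMANCE"):
    """Return (blocks, outside); blocks is a list of lists of inner lines."""
    blocks, outside = [], []
    current = None  # list being filled while inside a block, else None
    for line in lines:
        if marker in line:
            if current is None:
                current = []             # opening marker: start a new block
            else:
                blocks.append(current)   # closing marker: store the block
                current = None
        elif current is not None:
            current.append(line)         # line inside a block
        else:
            outside.append(line)         # line between blocks
    return blocks, outside


# Made-up miniature log with two blocks and one line between them
log = """\
t0 ###PERFORMANCE
A 1
A 2
t1 ###PERFORMANCE
X 1
t2 ###PERFORMANCE
B 1
t3 ###PERFORMANCE
""".splitlines()

blocks, outside = split_blocks(log)
print(blocks[1])  # lines of the second block only
```

With this layout, blocks[n] is the n-th block and blocks[n][k] is the k-th line inside it, while the delimiter lines themselves are left out, as in the snippet above.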
