Detect semantic blocks of text with Python

Question

I have this block of sample log text:

20190122 09:00,000 ###PERFORMANCE string1 string2 string3
20190122 09:10,500 number1 string1 string2 string3
20190122 09:24,670 number2 string1 string2 string3
20190122 10:05,000 number3 string1 string2 string3
20190122 10:33,960 number4 string1 string2 string3
20190122 11:00,321 number5 string1 string2 string3
20190122 11:40,256 ###PERFORMANCE string1 string2 string3
20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2
20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2
20190123 08:00,000 ###PERFORMANCE string1 string2 string3
20190123 08:10,500 number1 string1 string2 string3
20190123 08:24,670 number2 string1 string2 string3
20190123 09:05,000 number3 string1 string2 string3
20190123 10:33,960 number4 string1 string2 string3
20190123 10:00,321 number5 string1 string2 string3
20190123 13:40,256 ###PERFORMANCE string1 string2 string3
20190124 10:00,000 ###PERFORMANCE string1 string2 string3
20190124 10:10,500 number1 string1 string2 string3
20190124 10:24,670 number2 string1 string2 string3
20190124 11:05,000 number3 string1 string2 string3
20190124 12:33,960 number4 string1 string2 string3
20190124 13:00,321 number5 string1 string2 string3
20190124 13:40,256 ###PERFORMANCE string1 string2 string3

What I would like to do with Python is to detect each ###PERFORMANCE block of text like in this example:

As you can see, there are 3 blocks of interest, each one delimited by the text ###PERFORMANCE in the string. The first starts at line 1 and ends at line 7. What is between lines 7 and 10 must not be treated as a block of interest. The number of lines in each block can also vary (so going by line numbers would not be a good idea).

What I have done so far is just read the text file line by line:

logFile = "testLog.txt"

with open(logFile) as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]

for line in content:
    print(line)

How could I approach this task? Would using NLTK be a good idea? Would it even work for this task? Any general suggestions?

Is number1 string1 string2 string3 actual data, or does it mean each line will have a number and three different strings?
– DirtyBit, Commented Jan 24, 2019 at 9:49
It means that the line could contain different data types, not limited to one number and 3 strings. It is not actual data, just an example for SO.
– lucians, Commented Jan 24, 2019 at 9:52

daramcq · Accepted Answer · 2019-01-24 09:56:57Z

As you are simply matching on the PERFORMANCE delimiter, using NLTK seems like overkill. A simpler approach is to check whether the expected string appears on the line and toggle a capture flag each time it does. For instance:

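in_block = False
IDENTIFIER = 'PERFORMANCE'
logfile = "testLog.txt"  # file name taken from the question

with open(logfile) as f:
    for line in f:  # iterate lazily instead of loading the whole file
        if IDENTIFIER in line:
            # Toggle the boolean on every delimiter line
            in_block = not in_block
        if in_block:
            print(line, end='')  # line already ends with a newline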

1 Comment

lucians Over a year ago

That's amazing. It seems to work on the (simple) example posted here. I will test it on a real log as soon as possible to see its behaviour. Thanks for now.

freerafiki · Accepted Answer · 2019-01-24 10:02:50Z

I think what you need can be done with a simple check. If I understood correctly, you can keep a flag (a True/False value) that records whether you are inside an interesting block. Whenever you find '###PERFORMANCE', you flip this flag. Then you can save the interesting and non-interesting lines in two lists, or whatever structure you prefer.

Below is a snippet of the code:

logFile = "logfile.txt"

with open(logFile) as f:
    content = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]

# flag: True while we are between two ###PERFORMANCE delimiter lines
are_we_in_the_interesting_block = False

# two lists to save the lines
interesting_block = []
non_interesting_block = []

for line in content:
    # str.find returns the index of the match, or -1 if the text is absent
    is_there_performance = line.find('###PERFORMANCE')

    # >= 0 so that a delimiter at the very start of a line (index 0) also counts
    if is_there_performance >= 0:
        are_we_in_the_interesting_block = not are_we_in_the_interesting_block
    else:
        if are_we_in_the_interesting_block:
            # here I append to a list, but you can do your processing
            interesting_block.append(line)
        else:
            # here processing of the non interesting parts
            non_interesting_block.append(line)

print('Interesting blocks')
print(interesting_block)

print('\n')
print('Non interesting blocks')
print(non_interesting_block)

And the produced output would be

Interesting blocks
['20190122 09:10,500 number1 string1 string2 string3', '20190122 09:24,670 number2 string1 string2 string3', '20190122 10:05,000 number3 string1 string2 string3', '20190122 10:33,960 number4 string1 string2 string3', '20190122 11:00,321 number5 string1 string2 string3', '20190123 08:10,500 number1 string1 string2 string3', '20190123 08:24,670 number2 string1 string2 string3', '20190123 09:05,000 number3 string1 string2 string3', '20190123 10:33,960 number4 string1 string2 string3', '20190123 10:00,321 number5 string1 string2 string3', '20190124 10:10,500 number1 string1 string2 string3', '20190124 10:24,670 number2 string1 string2 string3', '20190124 11:05,000 number3 string1 string2 string3', '20190124 12:33,960 number4 string1 string2 string3', '20190124 13:00,321 number5 string1 string2 string3']


Non interesting blocks
['20190123 10:24,670 number1 string1 string2 string3 string4 date1 number2', '20190123 10:32,130 number1 string1 string2 string3 string4 date1 number2']

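Then you can access interesting_block[n] to get an individual line (indices are 0-based) if needed. Note that the list is flat: it holds the lines of all interesting blocks one after another.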
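
Since the snippet above collects every interesting line into a single list, the boundaries between blocks are lost. If each block is needed separately, one possible variation is to start a new sub-list whenever an opening delimiter is seen. This is only a sketch: the function name `split_blocks` and the shortened sample lines are hypothetical, and it assumes the delimiters always come in open/close pairs.

```python
def split_blocks(lines, identifier="###PERFORMANCE"):
    """Return a list of blocks; each block holds the lines between two delimiters."""
    blocks = []
    current = None  # None means we are currently outside any block
    for line in lines:
        if identifier in line:
            if current is None:
                current = []            # opening delimiter: start a new block
            else:
                blocks.append(current)  # closing delimiter: store the finished block
                current = None
        elif current is not None:
            current.append(line)
    return blocks


# Hypothetical, shortened sample in the spirit of the question's log.
sample = [
    "09:00 ###PERFORMANCE a",
    "09:10 1 a",
    "09:40 ###PERFORMANCE a",
    "10:24 outside",
    "11:00 ###PERFORMANCE a",
    "11:10 2 a",
    "11:40 ###PERFORMANCE a",
]
blocks = split_blocks(sample)
# blocks == [["09:10 1 a"], ["11:10 2 a"]]
```

With this layout, `blocks[n]` is the n-th block and `blocks[n][m]` is a line inside it. A block whose closing delimiter is missing is silently dropped in this sketch, which may or may not be what a real log needs.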
