I am doing some research where I have more than 25,000 reports in one large text file. Each report is delimited by "TEXTSTART[UNIQUE-ID]" and "TEXTEND".
So far I have succeeded in reading a single report (that is, the text between the identifiers) from the text file with this code:
with open("samples_combined_incomplete.txt", "r") as f:
    report = f.read()

rstart = "TEXTSTART"
rend = "TEXTEND"
# Take the text after the first TEXTSTART, up to the first TEXTEND that follows it.
a = report.split(rstart)[1].split(rend)[0]
print(a)
My question is this: how can I divide the text document into uniquely identifiable substrings, based on TEXTSTART[UNIQUE-ID]? And how should the ID be returned?
I am just starting, so any advice on documentation, useful functions, etc. would be much appreciated.
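To make the goal concrete, here is a minimal sketch of one way the split could work, assuming each ID sits inside square brackets directly after TEXTSTART. The sample string, the `REPORT_PATTERN` name and the `split_reports` helper are made up for illustration. The real file would be read with `open()` instead.

```python
import re

# Hypothetical sample standing in for the contents of the large text file.
sample = (
    "TEXTSTART[A1b2]\nFirst report body.\nTEXTEND\n"
    "TEXTSTART[C3d4]\nSecond report body.\nTEXTEND\n"
)

# Group 1 captures the ID inside the brackets, group 2 the text up to the
# next TEXTEND. re.DOTALL lets "." match newlines, so multi-line reports work.
REPORT_PATTERN = re.compile(r"TEXTSTART\[(.*?)\](.*?)TEXTEND", re.DOTALL)


def split_reports(text):
    """Return a dict mapping each report ID to its text (whitespace stripped)."""
    return {rid: body.strip() for rid, body in REPORT_PATTERN.findall(text)}


reports_by_id = split_reports(sample)
print(reports_by_id["A1b2"])
```

Returning a dictionary keyed by ID means a single report can be looked up directly. It assumes the IDs really are unique, since a repeated ID would overwrite the earlier entry.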
Thank you, works like a charm! FYI, the IDs are a combination of digits and letters.
import re

with open("samples_combined_incomplete.txt", "r") as f:
    report = f.read()

# Each match is an (ID, text) tuple; re.DOTALL lets "." span newlines.
reports = re.findall(r"TEXTSTART\[(.*?)\](.*?)TEXTEND", report, re.DOTALL)

# Print the first ten reports (fewer if the file holds fewer).
for entry in reports[:10]:
    print(entry)
If I want to search within the containers for a specific keyword and have the keys returned, how could I do that?
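A minimal sketch of what I have in mind, assuming the (ID, text) tuples produced by `re.findall` as above. The sample data and the `ids_containing` helper are hypothetical, and the match is a plain case-insensitive substring test rather than a whole-word search.

```python
import re

# Hypothetical sample standing in for the contents of the large text file.
sample = (
    "TEXTSTART[A1b2]\nRevenue grew this quarter.\nTEXTEND\n"
    "TEXTSTART[C3d4]\nNo change in headcount.\nTEXTEND\n"
    "TEXTSTART[E5f6]\nQuarterly revenue fell.\nTEXTEND\n"
)

# List of (ID, text) tuples, one per report.
pairs = re.findall(r"TEXTSTART\[(.*?)\](.*?)TEXTEND", sample, re.DOTALL)


def ids_containing(pairs, keyword):
    """Return the IDs of reports whose text contains keyword, ignoring case."""
    needle = keyword.lower()
    return [rid for rid, body in pairs if needle in body.lower()]


matching_ids = ids_containing(pairs, "revenue")
print(matching_ids)  # ['A1b2', 'E5f6']
```

Because this is a substring test, "revenue" would also match inside a longer word. A regex with `\b` word boundaries might be a better fit if only whole words should count.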