
I want to extract a JSON/dictionary from log text.

Sample log text:

2018-06-21 19:42:58 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'locations', 'CLOSESPIDER_TIMEOUT': '14400', 'FEED_FORMAT': 'geojson', 'LOG_FILE': '/geojson_dumps/21_Jun_2018_07_42_54/logs/coastalfarm.log', 'LOG_LEVEL': 'INFO', 'NEWSPIDER_MODULE': 'locations.spiders', 'SPIDER_MODULES': ['locations.spiders'], 'TELNETCONSOLE_ENABLED': '0', 'USER_AGENT': 'Mozilla/5.0'}

2018-06-21 19:43:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1718,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
 'item_scraped_count': 4,
 'log_count/INFO': 8,
 'memusage/max': 56856576,
 'memusage/startup': 56856576,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}

2018-06-21 19:43:00 [scrapy.core.engine] INFO: Spider closed (finished)

I have tried (\{.+$\}) as the regex expression, but it only gives me the dict that sits on a single line, {'BOT_NAME': 'locations', ..., 'USER_AGENT': 'Mozilla/5.0'}, which is not what I expect.
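
Roughly what I am running (a minimal sketch with a simplified version of the pattern; the file name is a placeholder):

import re

with open('coastalfarm.log') as f:  # placeholder path
    log_text = f.read()

# Without re.DOTALL, . never crosses a newline, so only the
# single-line 'Overridden settings' dict ever matches.
print(re.findall(r'\{.+\}', log_text))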

The JSON/dictionary I want to extract (note: the dictionary will not always have the same keys; they can differ):

{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1718,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 6, 21, 14, 13, 0, 841666),
 'item_scraped_count': 4,
 'log_count/INFO': 8,
 'memusage/max': 56856576,
 'memusage/startup': 56856576,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 6, 21, 14, 12, 58, 499385)}
  • If you can extract the correct string from the log, then just parse it using the json module (see stackoverflow.com/questions/4917006/…) and you will get a dictionary object. Commented Jul 1, 2018 at 16:05
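
(A quick sketch of that parsing step; note that for this particular dump plain json won't work, because the stats are a Python repr with single quotes and datetime.datetime(...) values:)

import ast

s = "{'finish_reason': 'finished', 'item_scraped_count': 4}"
# json.loads(s) would raise JSONDecodeError here, because JSON requires
# double-quoted strings. ast.literal_eval handles Python-literal dicts:
print(ast.literal_eval(s))  # {'finish_reason': 'finished', 'item_scraped_count': 4}
# It would still fail on datetime.datetime(...) values, though,
# which are function calls rather than literals.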

2 Answers


Edit: The JSON spans multiple lines, so here's what will do it:

import re

# Raw string, with \d{4} for the year so it matches dates like 2018-06-21;
# \s consumes the newline after the colon, since the dict starts on the next line.
re_str = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[scrapy\.statscollectors\] INFO: Dumping Scrapy stats:\s(\{.+?\})'
stats_re = re.compile(re_str, re.MULTILINE | re.DOTALL)

# log holds the full contents of the log file
for match in stats_re.findall(log):
    print(match)
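
To turn a matched string into an actual dict, note that the dump is a Python repr rather than JSON (single quotes, datetime.datetime(...) calls), so json.loads and ast.literal_eval won't handle it. A sketch that evaluates it with only the datetime module in scope, which is reasonable here since you are parsing your own logs:

import datetime

for match in stats_re.findall(log):
    # Empty builtins plus the datetime module, so the
    # datetime.datetime(...) calls in the dump can resolve.
    stats = eval(match, {'__builtins__': {}, 'datetime': datetime})
    print(stats['item_scraped_count'])  # e.g. 4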

If you are after only the line from the statscollectors, then this should get you there (assuming that the dict is all on one line too):

^.*?\[scrapy.statscollectors] INFO: Dumping Scrapy stats: (\{.+$\}).*?$

4 Comments

Put several lines of the log in pastebin somewhere and I'll take a look.
Remove the $ within the group, and the ? is not needed. The correct expression would be: ^.*\[scrapy\.statscollectors] INFO: Dumping Scrapy stats: (\{.+\}).*$
I tried mul_line_json = re.compile('^.*[scrapy\.statscollectors] INFO: Dumping Scrapy stats: (\{.+\}).*$', re.MULTILINE) and re.findall(mul_line_json, data); still no output.
I've edited my answer with code that works with the pastebin.

Using a JSON tokenizer makes this a very simple and efficient task, as long as you have an anchor to search for in the original document that allows you to at least identify the beginning of the JSON blob. This uses json-five to extract JSON from HTML:

import json5.tokenizer

with open('5f32d5b4e2c432f660e1df44.html') as f:
    document = f.read()

search_for = "window.__INITIAL_STATE__="
i = document.index(search_for)
j = i + len(search_for)
extract_from = document[j:]

tokens = json5.tokenizer.tokenize(extract_from)
stack = []
collected = []
for token in tokens:
    collected.append(token.value)

    # Track bracket nesting; when the stack empties, one complete
    # JSON value (or a single scalar token) has been consumed.
    if token.type in ('LBRACE', 'LBRACKET'):
        stack.append(token)
    elif token.type in ('RBRACE', 'RBRACKET'):
        stack.pop()

    if not stack:
        break

json_blob = ''.join(collected)

Note that this accounts for the JSON being either a complex type (object, list) or a scalar.
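
Once the blob has been sliced out you can parse it; a sketch, assuming json-five's loads() (which accepts JSON5, a superset of JSON):

import json5

data = json5.loads(json_blob)  # dict, list, or scalar depending on the document
print(type(data))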
