-4

I am trying to extract the colorName values from the following strings, which sit inside a <script> element of an HTML page.

\\"colorName\\":\\"GLOSS REDSKY SHDWSIL WHT IMPASTO\\"
\\"colorName\\":\\"GLOSS PREMIUM FJORD METALLIC / WHITE METALLIC SILVER\\"

The HTML is returned in response.text using Python Scrapy. I want to extract GLOSS REDSKY SHDWSIL WHT IMPASTO and GLOSS PREMIUM FJORD METALLIC / WHITE METALLIC SILVER from the code snippet using regex.

re.findall('\\\\"colorName\\\\":\\\\"(.*?)\\\\"', response.text)

This line of code works fine, but when I tried to put the regex in a JSON string like this:

{
    "selector": "\\\\"colorName\\\\":\\\\"(.*?)\\\\""
}

I got the following errors:

Error: Parse error on line 4:
...  "selector": "\\\\"colorName\\\\":\\\\"
-----------------------^
Expecting 'EOF', '}', ':', ',', ']', got 'undefined'

PyCharm suggested the following edit to the JSON string, which didn't throw any error:

{
    "selector": "\\\\\\\\\"colorName\\\\\\\\\":\\\\\\\\\"(.*?)\\\\\\\\\""
}

I cannot figure out why I need to add so many extra backslashes into the JSON string to make it right.

2
  • 3
    Because backslashes have special meaning in Python string literals, in JSON strings, and in regular expressions, and they also appear several times in the text you want to match. Read the opening paragraphs of docs.python.org/3/library/re.html. Commented Nov 15 at 10:25
  • Working backwards: {"selector": "\\\\\\\\\"colorName\\\\\\\\\":\\\\\\\\\"(.*?)\\\\\\\\\""} is a valid JSON object containing a single key/value pair whose string value is "\\\\\\\\\"colorName\\\\\\\\\":\\\\\\\\\"(.*?)\\\\\\\\\"". Unescaping that double-quoted string gives \\\\"colorName\\\\":\\\\"(.*?)\\\\" which, as a regex, matches the target string: regex101.com/r/8r9nwN/1. So what exactly is the question? Commented Nov 17 at 18:14

1 Answer

4

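The need to double-escape backslashes in a regex context likely comes from the fact that the backslash is itself a regex metacharacter (i.e. it has semantic meaning in a regular expression and must therefore be escaped to match a literal backslash). JSON strings add their own layer on top: inside a JSON string, every backslash and every double quote must itself be preceded by a backslash, so each layer multiplies the backslashes you have to write.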
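To make the layering concrete, here is a minimal sketch. It assumes the page text looks exactly as displayed in the question (two literal backslashes before each quote); the names text, pattern, config_json and unescaped_json are only illustrative. Rather than counting backslashes by hand, it lets json.dumps produce the JSON-escaped form of the regex:

```python
import json
import re

# Assumed sample, copied as displayed in the question: two literal backslashes
# before each quote. Raw strings keep Python from unescaping anything.
text = r'\\"colorName\\":\\"GLOSS REDSKY SHDWSIL WHT IMPASTO\\"'

# Regex layer: every literal backslash in the text is written as \\ in the pattern.
pattern = r'\\\\"colorName\\\\":\\\\"(.*?)\\\\"'

# JSON layer: json.dumps puts one more backslash before every backslash and quote.
config_json = json.dumps({"selector": pattern})
print(config_json)

# Pasting the pattern between JSON quotes without that layer fails, because the
# first unescaped " inside the value terminates the JSON string early.
unescaped_json = '{"selector": "' + pattern + '"}'
try:
    json.loads(unescaped_json)
    unescaped_is_valid = True
except json.JSONDecodeError:
    unescaped_is_valid = False

# Loading the properly escaped config strips only the JSON layer,
# so the regex comes back unchanged and still matches the text.
selector = json.loads(config_json)["selector"]
matches = re.findall(selector, text)
print(matches)
```

The printed config has nine backslashes before each inner quote (eight for the four regex backslashes, one for the quote), which is the same shape as PyCharm's suggestion. If your real text has a different number of backslashes, adjust text and pattern accordingly; the json.dumps step stays the same.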

That said, don't use regex to parse JSON content. Instead, use the json library to parse your JSON string:

import json

# Load the JSON document and read the value by key instead of matching it with a regex.
with open('input.json', 'r') as file:
    data = json.load(file)

print(data["colorName"])

2 Comments

JSON libraries might not contain the tools needed for the sophisticated manipulation that regex can do. So please don't rule out using regex to parse JSON. stackoverflow.com/a/79785886/15577665
I partially agree. I can definitely see the use of regex on data which has been extracted from JSON. But the json library should not have issues with mapping out the keys and values themselves.
