I'm parsing a text file and sometimes I get the following:

{"name":"John","last" : Doe", "Food":"Fries","Coffee" : "Need}

I'm dealing with someone else's data here so I just have to deal with it.

Is there a way to use regular expressions (or anything else, for that matter) to read through the file and, whenever I find unmatched quotation marks, fix the file by adding the missing ones?

So I can end up with

{"name":"John","last" : "Doe", "Food":"Fries","Coffee" : "Need"}
  • is the unmatched quote always the last thing before a closing bracket? Commented Jun 14, 2018 at 21:14
  • @MoxieBall Nope, Doe" is missing one too. I presume it can be anywhere. Commented Jun 14, 2018 at 21:15
  • @MoxieBall It can be anywhere Commented Jun 14, 2018 at 21:17
  • Do your real-life strings contain only letters, or at least not contain any characters with special meaning to JSON like []{}"\:,? Commented Jun 14, 2018 at 21:26
  • What you're asking for is basically impossible in general, because it's ambiguous, but it may be possible, or even dead easy, for your particular data set. For example, if none of those special characters ever appear in your JSON strings, you know that an unclosed quote was supposed to end at the next one of ,:]}, and an opened quote is only a little more complicated. But if you have to handle strings like "spam:\"eggs\"}" that may be missing quotes, that's a different story. Commented Jun 14, 2018 at 21:28

1 Answer

If missing quotation marks are the only problem with the text and there are no escaped quotation marks within the fields, then you can repair the text by looking for the four types of irregularities.

import re

s = '{name":"John","last" : Doe", "Food:"Fries","Coffee" : "Need}'

A missing quotation mark after a colon:

s = re.sub(r'"\s*:\s*(?=[^\s"])', '":"', s)

A missing quotation mark before a colon:

s = re.sub(r'(?<=[^\s"])\s*:\s*"', '":"', s)

A missing quotation mark before the closing brace:

s = re.sub(r'(?<=[^\s"])\s*\}', '"}', s)

A missing quotation mark after the opening brace:

s = re.sub(r'\{\s*(?=[^\s"])', '{"', s)

Apply all four transformations one after another, and hopefully the problem is gone:

print(s)
# {"name":"John","last":"Doe", "Food":"Fries","Coffee" : "Need"}
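Since the repair is only a best guess, it may be worth checking that the result actually parses. Here is a minimal sketch that bundles the four substitutions into one function and validates the output with json.loads. The names repair_quotes and parse_or_none are hypothetical, chosen for illustration. The sketch assumes every value is a string: a bare number such as 1 would get quotes added around it. It also assumes no field contains escaped quotes or the characters : { }.

```python
import json
import re

# The four substitutions from the answer, applied in order.
_REPAIRS = [
    (r'"\s*:\s*(?=[^\s"])', '":"'),   # opening quote missing after a colon
    (r'(?<=[^\s"])\s*:\s*"', '":"'),  # closing quote missing before a colon
    (r'(?<=[^\s"])\s*\}', '"}'),      # closing quote missing before }
    (r'\{\s*(?=[^\s"])', '{"'),       # opening quote missing after {
]


def repair_quotes(line):
    """Return the text with the four substitutions applied (hypothetical helper)."""
    for pattern, replacement in _REPAIRS:
        line = re.sub(pattern, replacement, line)
    return line


def parse_or_none(line):
    """Return the parsed object, or None if the repaired text is still not valid JSON."""
    try:
        return json.loads(repair_quotes(line))
    except json.JSONDecodeError:
        return None


broken = '{name":"John","last" : Doe", "Food:"Fries","Coffee" : "Need}'
print(parse_or_none(broken))
# {'name': 'John', 'last': 'Doe', 'Food': 'Fries', 'Coffee': 'Need'}
```

Lines that come back as None can be logged and fixed by hand. One case these four patterns do not cover is a quote missing next to a comma, as in {"name":"John,"last":"Doe"}; that input is left unchanged and fails to parse.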