Conversion of String to JSON using Regex

Question

I am working on a file that has data in the following format in one of its columns. I need to add double quotes around each key and value so that I can parse the string as JSON.

Input Data: Street:StreetName,Address:ABC Road, CityName,PinCode:00000

I tried the regex below, but it garbles the output because the value of the Address key contains a comma.

The regex I tried is ([a-zA-Z0-9-]+):([a-zA-Z0-9-,]+) with the substitution \"$1\":\"$2\"

Current Output (see the value of Key 'Address'): "Street":"StreetName","Address":"ABC" Road, CityName,"PinCode":"00000"

Expected Output: "Street":"StreetName","Address":"ABC Road, CityName","PinCode":"00000"

However, this works if there are no commas in the values:

Input: Street:StreetName,Address:ABC Road CityName,PinCode:00000

Output: "Street":"StreetName","Address":"ABC Road CityName","PinCode":"00000"

I know something is missing in my regex. Any ideas on this, please?

Thanks in advance.

You have ; as delimiter between the first fields, and , between the last two - is that correct?
– SamWhan, Commented Jul 5, 2021 at 12:56
@SamWhan My bad, it was a typo. The delimiter is only a comma (,). Edited now.
– Sri Bharath, Commented Jul 5, 2021 at 12:59
If it's not JSON, why do you explicitly need to turn it into JSON and parse it as JSON? Can't you directly parse it as is on its own terms without going through JSON? E.g. split by comma, then by colon?
– deceze ♦, Commented Jul 5, 2021 at 13:04
@deceze If the field data has commas that won't work. You'll need some kind of validation of the format to extract it.
– SamWhan, Commented Jul 5, 2021 at 13:10
@Sam Absolutely, yes. If the data contains quotes, this JSON "conversion" will fail too. So since both approaches will fail in specific circumstances, a) we'd need to know more about said circumstances and b) you should go with the simpler option either way (which isn't regexen + JSON).
– deceze ♦, Commented Jul 5, 2021 at 13:14
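
For illustration, the direct-parsing route discussed in these comments could look roughly like the Python sketch below. It is not from the thread: the function name parse_pairs is hypothetical, and it assumes that a comma only starts a new field when it is followed by a key made of word characters or hyphens and then a colon.

```python
import re

def parse_pairs(line):
    # Split only at commas that are directly followed by "key:" (optionally
    # after whitespace), so commas inside a value are left alone.
    fields = re.split(r",(?=\s*[\w-]+:)", line)
    result = {}
    for field in fields:
        # partition() splits at the first colon only, so later colons
        # stay in the value.
        key, _, value = field.partition(":")
        result[key.strip()] = value
    return result

print(parse_pairs("Street:StreetName,Address:ABC Road, CityName,PinCode:00000"))
# {'Street': 'StreetName', 'Address': 'ABC Road, CityName', 'PinCode': '00000'}
```

A value that itself contains text shaped like ",word:" would still be split in the wrong place, so this only holds under the assumption stated above.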

SamWhan · Accepted Answer · 2021-07-05 13:20:23Z

2

You could try something like

(.*?):(.*?)(?:$|(,)(?=\w+:))

It matches and captures the key up to the colon, matches the colon itself, and then matches and captures the value up to either the end of the string or the start of a new key. The delimiter preceding a new key is captured as well, so that it can be reused in the replacement below.

Replace that with the captured groups in the correct format string (e.g. "\1" : "\2"\3\r\n) and you're home ;)

See it here at regex101.

Note! The test for a following key assumes that keys consist of word characters only (a-z, A-Z, 0-9 and _). That may have to be adjusted :/
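
As a rough check, the pattern above can be tried in Python with re.sub. This is only a sketch: the helper name quote_pairs is made up for illustration, and the replacement is a compact variant of the format string above (no spaces around the colon and no trailing line break).

```python
import re

# The pattern from this answer. Group 3 holds the delimiting comma and is
# unset when the value runs to the end of the string.
PATTERN = re.compile(r"(.*?):(.*?)(?:$|(,)(?=\w+:))")

def quote_pairs(line):
    # Python writes group references as \1, \2, \3; an unset group is
    # replaced by an empty string (Python 3.5 and later).
    return PATTERN.sub(r'"\1":"\2"\3', line)

print(quote_pairs("Street:StreetName,Address:ABC Road, CityName,PinCode:00000"))
# "Street":"StreetName","Address":"ABC Road, CityName","PinCode":"00000"
```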

edited Jul 5, 2021 at 13:20

answered Jul 5, 2021 at 13:02


Wiktor Stribiżew · Answer · 2021-07-05 19:48:06Z

0

You can use

(\w[\w-]*):(.*?)(?=,\s*[\w-]+:|$)

Replace with "$1":"$2" as you have been doing. See the regex demo.

Details:

(\w[\w-]*) - Group 1: a word char and then zero or more word or - chars
: - a colon
(.*?) - Group 2: any zero or more chars other than line break chars, as few as possible
(?=,\s*[\w-]+:|$) - a positive lookahead that requires, immediately to the right, either a comma, zero or more whitespaces, one or more word/hyphen chars and a colon, or the end of the string.
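
To see the pieces above working together, here is a minimal plain-Python sketch run against the two inputs from the question. The helper name add_quotes is hypothetical, and Python's re module writes the group references as \1 and \2 rather than $1 and $2.

```python
import re

# Key, colon, lazy value, then a lookahead for the next ",key:" or the end.
KEY_VALUE = re.compile(r"(\w[\w-]*):(.*?)(?=,\s*[\w-]+:|$)")

def add_quotes(line):
    # The lookahead does not consume the delimiting comma, so the comma is
    # carried over to the output unchanged.
    return KEY_VALUE.sub(r'"\1":"\2"', line)

print(add_quotes("Street:StreetName,Address:ABC Road, CityName,PinCode:00000"))
# "Street":"StreetName","Address":"ABC Road, CityName","PinCode":"00000"
print(add_quotes("Street:StreetName,Address:ABC Road CityName,PinCode:00000"))
# "Street":"StreetName","Address":"ABC Road CityName","PinCode":"00000"
```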

Sample code snippet:

import pyspark.sql.functions as F
# Assumes df is an existing DataFrame with a string column named 'str'.
df.select(F.regexp_replace('str', r'(\w[\w-]*):(.*?)(?=,\s*[\w-]+:|$)', '"$1":"$2"').alias('d')).show()
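
Outside Spark, the quoted string still has to be wrapped in braces before a JSON parser will accept it. The sketch below is one possible way to do that in plain Python, not part of the original answer: the helper name to_dict is hypothetical, and it passes each captured piece through json.dumps so that a double quote inside a value gets escaped, which is the failure case raised in the comments on the question.

```python
import json
import re

KEY_VALUE = re.compile(r"(\w[\w-]*):(.*?)(?=,\s*[\w-]+:|$)")

def to_dict(line):
    def quote(match):
        # json.dumps adds the surrounding quotes and escapes any quote or
        # backslash characters inside the key or value.
        return json.dumps(match.group(1)) + ":" + json.dumps(match.group(2))
    # Wrap the quoted pairs in braces to form a JSON object, then parse it.
    return json.loads("{" + KEY_VALUE.sub(quote, line) + "}")

print(to_dict("Street:StreetName,Address:ABC Road, CityName,PinCode:00000"))
# {'Street': 'StreetName', 'Address': 'ABC Road, CityName', 'PinCode': '00000'}
```

All values come back as strings, so PinCode keeps its leading zeros.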

answered Jul 5, 2021 at 19:48
