
I am working on a file that has data in the following format in one of its columns. I need to add double quotes around each key and value so that I can parse the string as JSON.

Input Data: Street:StreetName,Address:ABC Road, CityName,PinCode:00000

I tried the regex below, but it garbles the output because the value of the Address key contains a comma.

The regex I tried is ([a-zA-Z0-9-]+):([a-zA-Z0-9-,]+), with the substitution \"$1\":\"$2\"

Current Output (see the value of Key 'Address'): "Street":"StreetName","Address":"ABC" Road, CityName,"PinCode":"00000"

Expected Output: "Street":"StreetName","Address":"ABC Road, CityName","PinCode":"00000"

However, this works if there are no commas in the values:

Input: Street:StreetName,Address:ABC Road CityName,PinCode:00000

Output: "Street":"StreetName","Address":"ABC Road CityName","PinCode":"00000"

I know something is missing in my regex. Any ideas?

Thanks in advance.

  • You have ; as delimiter between the first fields, and , between the last two - is that correct? Commented Jul 5, 2021 at 12:56
  • @SamWhan My bad, it was a typo. The only delimiter is , . Edited now. Commented Jul 5, 2021 at 12:59
  • If it's not JSON, why do you explicitly need to turn it into JSON and parse it as JSON? Can't you directly parse it as is on its own terms without going through JSON? E.g. split by comma, then by colon? Commented Jul 5, 2021 at 13:04
  • @deceze If the field data has commas that won't work. You'll need some kind of validation of the format to extract it. Commented Jul 5, 2021 at 13:10
  • @Sam Absolutely, yes. If the data contains quotes, this JSON "conversion" will fail too. So since both approaches will fail in specific circumstances, a) we'd need to know more about said circumstances and b) you should go with the simpler option either way (which isn't regexen + JSON). Commented Jul 5, 2021 at 13:14
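
The direct-parsing route discussed in these comments could look roughly like the sketch below. It assumes that a comma only separates pairs when it is followed by a key and a colon, and that keys consist of word characters or hyphens; the helper name parse_pairs is made up for illustration. Because json.dumps does the escaping, values that contain double quotes do not break the result.

```python
import json
import re

# Assumption: a comma is a pair separator only when "key:" follows it.
PAIR_SEPARATOR = re.compile(r",(?=\s*[\w-]+:)")

def parse_pairs(text):
    """Hypothetical helper: turn 'Key:Value,Key:Value' into a dict."""
    result = {}
    for pair in PAIR_SEPARATOR.split(text):
        # Split on the first colon only, so colons inside values survive.
        key, _, value = pair.partition(":")
        result[key.strip()] = value
    return result

row = "Street:StreetName,Address:ABC Road, CityName,PinCode:00000"
parsed = parse_pairs(row)

# json.dumps escapes quotes and other special characters in the values.
as_json = json.dumps(parsed)
print(as_json)
```

Whether this is simpler than the regex substitution depends on where the parsing has to run; inside a Spark SQL expression the substitution may still be the more practical choice.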

2 Answers


You could try something like

(.*?):(.*?)(?:$|(,)(?=\w+:))

It matches and captures the key up to the colon, matches the colon itself without capturing it, and finally matches and captures the value up to either the end of the string or the start of a new key. The comma preceding a new key is captured as well, so that it can be reused in the replacement below.

Replace that with the captured groups and the correct format string (e.g. "\1" : "\2"\3\r\n) and you're home ;)

See it here at regex101.

Note! The test for a following key assumes that keys consist of word characters only (a-z, A-Z, 0-9 and _). That may have to be adjusted :/
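
As a rough way to try this outside regex101, here is a minimal sketch using Python's re module. The function name quote_pairs is made up for illustration, the replacement drops the spaces and \r\n from the format string above to match the output the question asks for, and it relies on Python 3.5+ replacing the unmatched group 3 with an empty string for the last pair.

```python
import re

# Pattern from the answer above; group 3 captures the comma before the next key.
PATTERN = re.compile(r'(.*?):(.*?)(?:$|(,)(?=\w+:))')

def quote_pairs(text):
    """Hypothetical helper: wrap every key and value in double quotes."""
    # For the last pair group 3 did not participate in the match,
    # so nothing is appended after the closing quote.
    return PATTERN.sub(r'"\1":"\2"\3', text)

sample = "Street:StreetName,Address:ABC Road, CityName,PinCode:00000"
quoted = quote_pairs(sample)
print(quoted)
```

For the sample input this should print the expected output from the question, with the comma inside the Address value left untouched.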



You can use

(\w[\w-]*):(.*?)(?=,\s*[\w-]+:|$)

Replace with "$1":"$2" as you have been doing. See the regex demo.

Details:

  • (\w[\w-]*) - Group 1: a word char and then zero or more word or - chars
  • : - a colon
  • (.*?) - Group 2: any zero or more chars other than line break chars, as few as possible
  • (?=,\s*[\w-]+:|$) - a positive lookahead that stops the value right before either a comma, zero or more whitespaces, one or more word/hyphen chars and then a colon, or the end of string.

Sample code snippet:

import pyspark.sql.functions as F
# df is assumed to be an existing DataFrame with a string column named 'str'
df.select(F.regexp_replace('str', r'(\w[\w-]*):(.*?)(?=,\s*[\w-]+:|$)', '"$1":"$2"').alias('d')).show()
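
To check the pattern without a Spark session, the same substitution can be sketched with Python's re module. Note that Python writes backreferences as \1 and \2 rather than $1 and $2. Wrapping the result in braces before calling json.loads is an assumption about how the quoted string would be parsed afterwards, and to_json_object is a name made up for this example.

```python
import json
import re

# Same pattern as in the Spark snippet above.
PATTERN = re.compile(r'(\w[\w-]*):(.*?)(?=,\s*[\w-]+:|$)')

def to_json_object(text):
    """Hypothetical helper: quote keys and values, then parse as a JSON object."""
    quoted = PATTERN.sub(r'"\1":"\2"', text)
    # Assumption: the pairs form the body of one JSON object, so add braces.
    return json.loads("{" + quoted + "}")

sample = "Street:StreetName,Address:ABC Road, CityName,PinCode:00000"
record = to_json_object(sample)
print(record["Address"])
```

As pointed out in the comments on the question, this will still fail if a value itself contains a double quote, since the substitution does no escaping.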
