Parse string as JSON with Snowflake SQL

Question

I have a field in one of our database tables that acts as an event-like payload, gathering all changes to different entities. Below is an example of a single value of that field:

'---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\nanother_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\nutc: this_utc\nanother_date: 2022-11-30\nutc: another_utc'

Since accessing this field with pure SQL is a pain, I was thinking of parsing it as a JSON so that it would look like this:

{
  "field_one":"1", 
  "field_two": "20", 
  "field_three": "4", 
  "id": "1234",
  "another_id": "5678",
  "some_text": "Hey you",
  "a_date": "2022-11-29",
  "utc": "2022-11-29 15:29:28.159296000 Z",
  "another_date": "2022-11-30",
  "utc": "2022-11-30 13:34:59.000000000 Z"
}

And then just use a Snowflake-native approach to access the values I need.

As you can see, though, there are two fields that are called utc, since one is referring to the first date (a_date), and the second one is referring to the second date (another_date). I believe these are nested in the object, but it's difficult to assess with the format of the field.

This is a problem since I can't differentiate between one utc and another when giving the string the format I need and running a parse_json() function (due to both keys using the same name).

My SQL so far looks like the following:

select
    object,
    replace(object, '---\n', '{"') || '"}' as first,
    replace(first, '\n', '","') as second_,
    replace(second_, ': ', '":"') as third,
    replace(third, '    ', '') as fourth,
    replace(fourth, '  ', '') as last
from my_table

(The fourth and last steps are needed because some fields contain extra spaces)

And this actually gives me the format I need, but due to what I mentioned around the utc keys, I cannot parse the string as a JSON.
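To reproduce the duplicate-key issue outside Snowflake, here is a minimal Python sketch that mirrors the replace() chain above on the sample payload. The helper names (to_json_text, duplicate_keys) are made up for illustration. Note that Python's json module silently keeps the last value of a repeated key, which differs from the failure described above, so the sketch uses object_pairs_hook to surface the duplicates explicitly.

```python
import json

# Sample payload from the question (placeholder values for the utc fields).
raw = (
    "---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\n"
    "another_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\n"
    "utc: this_utc\nanother_date: 2022-11-30\nutc: another_utc"
)

def to_json_text(payload):
    """Mirror the SQL replace() chain, step by step."""
    first = payload.replace("---\n", '{"') + '"}'
    second = first.replace("\n", '","')
    third = second.replace(": ", '":"')
    fourth = third.replace("    ", "")
    return fourth.replace("  ", "")

def duplicate_keys(json_text):
    """Return the keys that appear more than once (flat objects only)."""
    seen, dupes = set(), []

    def hook(pairs):
        # pairs holds every (key, value) in document order, duplicates included.
        for key, _ in pairs:
            if key in seen:
                dupes.append(key)
            seen.add(key)
        return dict(pairs)

    json.loads(json_text, object_pairs_hook=hook)
    return dupes

json_text = to_json_text(raw)
parsed = json.loads(json_text)

print(duplicate_keys(json_text))  # ['utc']
print(parsed["utc"])              # another_utc (the first utc value is lost)
```

Running a check like this on a few rows is a quick way to find out which keys repeat before deciding how to rename them.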

Also note that the structure of the string might change from row to row, meaning that some rows might gather two utc keys, while others might have one, and others even five.

Any ideas on how to overcome that?

Do the number and order in which different entities appear in the string stay the same?
– Rajat, Commented Nov 30, 2022 at 21:06

Felipe Hoffa · Accepted Answer · 2022-12-01 06:14:51Z

1

Replace only one occurrence with regexp_replace():

with data as (
    select '---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\nanother_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\nutc: this_utc\nanother_date: 2022-11-30\nutc: another_utc' o
)

select parse_json(last2)
from (
    select o,
        replace(o, '---\n', '{"') || '"}' as first,
        replace(first, '\n', '","') as second_,
        replace(second_, ': ', '":"') as third,
        replace(third, '    ', '') as fourth,
        replace(fourth, '  ', '') as last,
        -- position 1, occurrence 2: rename only the second "utc" key
        regexp_replace(last, '"utc"', '"utc2"', 1, 2) as last2
    from data
)
;

answered Dec 1, 2022 at 6:14

Felipe Hoffa


2 Comments

Aleix CC Over a year ago

Did not know that, thanks a lot! What if I have more than two keys with the same name, though? Since the number of utc occurrences might not be fixed for every row

Felipe Hoffa Over a year ago

Well, you can chain the same regex for appearance #3, #4, #5 etc and encapsulate it in a SQL UDF if the max is not a crazy number. Just change the "2". Please upvote and accept this answer if it answers the question you asked.
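For rows with an unknown number of repeated keys, the chaining idea can be generalized. Below is a rough Python sketch of the same renaming logic (first occurrence untouched, then utc2, utc3, and so on). The function name number_duplicate_keys and the third_date key in the sample are hypothetical, and this is plain Python rather than a Snowflake UDF definition.

```python
import json
import re

def number_duplicate_keys(json_text, key="utc"):
    """Keep the first "key" as is; rename later ones to key2, key3, ..."""
    count = 0

    def rename(match):
        nonlocal count
        count += 1
        return match.group(0) if count == 1 else f'"{key}{count}"'

    # Only match the quoted name when a colon follows, i.e. when it is a key.
    pattern = '"' + re.escape(key) + '"(?=:)'
    return re.sub(pattern, rename, json_text)

# Hypothetical row with three utc keys, already in the post-replace() format.
text = (
    '{"a_date":"2022-11-29","utc":"t1",'
    '"another_date":"2022-11-30","utc":"t2",'
    '"third_date":"2022-12-01","utc":"t3"}'
)

fixed = number_duplicate_keys(text)
print(list(json.loads(fixed)))
# ['a_date', 'utc', 'another_date', 'utc2', 'third_date', 'utc3']
```

The numbering follows the order of appearance, so it matches what chaining regexp_replace() with occurrence 2, 3, 4... would produce only if each step accounts for the keys already renamed by the previous one.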

Rajat · Accepted Answer · 2022-12-01 18:39:50Z

0

This may not be what you want, but it seems to me that your problem could be solved by replacing each date with the UTC timestamp that follows it, so that no key is duplicated. You can always derive the dates from the timestamps. If this makes sense, see if you can apply your parse_json solution to this output instead:

set str='---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\nanother_id: 5678\nsome_text: Hey you\na_date: 2022-11-29\nutc: 2022-11-29 15:29:28.159296000 Z\nanother_date: 2022-11-30\nutc: 2022-11-30 13:34:59.000000000 Z';

-- Remove each date plus the "utc:" key that follows it, so the timestamp becomes the value of the date key
select regexp_replace($str,'[0-9]{4}-[0-9]{2}-[0-9]{2}\nutc:');
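To see what this substitution leaves behind, here is the same regex applied with Python's re module as a quick local check. The function name fold_utc_into_dates is invented for this sketch; the payload is the string from the set statement above.

```python
import re

payload = (
    "---\nfield_one: 1\nfield_two: 20\nfield_three: 4\nid: 1234\n"
    "another_id: 5678\nsome_text: Hey you\n"
    "a_date: 2022-11-29\nutc: 2022-11-29 15:29:28.159296000 Z\n"
    "another_date: 2022-11-30\nutc: 2022-11-30 13:34:59.000000000 Z"
)

def fold_utc_into_dates(text):
    # Drop each date together with the newline and "utc:" that follow it,
    # leaving the timestamp as the value of the preceding date key.
    return re.sub(r"[0-9]{4}-[0-9]{2}-[0-9]{2}\nutc:", "", text)

folded = fold_utc_into_dates(payload)
print(folded)
```

One detail to watch: each folded line ends up with two spaces after the colon (for example "a_date:  2022-11-29 15:29:28.159296000 Z"), because the space after "a_date:" and the space after "utc:" both survive. Any later replace on ': ' needs to allow for that.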

answered Dec 1, 2022 at 18:39

Rajat


Aleix CC · Accepted Answer · 2023-03-07 09:59:36Z

0

In case anyone is looking for a cleaner approach to this problem, I came up with a Python UDF in Snowflake that uses the ruamel.yaml library to transform the YAML into a JSON string, without the need for ugly SQL:

create or replace function <your_target_schema>.yaml_to_json(S string)
returns string
language python
runtime_version = 3.8
handler = 'yaml_to_json_py'
packages = ('ruamel.yaml==0.17.21')
as $$

import json
from ruamel.yaml import YAML, parser

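def yaml_to_json_py(S):
  # NULL input yields NULL output
  if S is not None:
    try:
      input_stream = S
      # Round-trip loader, pure-Python implementation
      yaml = YAML(typ='rt', pure=True)
      loaded_yaml = yaml.load(input_stream)
      # default=str serializes values json cannot handle (e.g. dates) as strings
      json_str = json.dumps(loaded_yaml, default=str)
      return json_str
    except parser.ParserError:
      # Malformed YAML: return NULL instead of failing the query
      return None
  else:
    return None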

$$;

This UDF takes a field from a table as input (expected to contain YAML), converts it to JSON, and returns the resulting JSON string.
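The default=str argument in the UDF matters more than it looks. The stdlib-only sketch below shows why, under the assumption that the YAML loader turns unquoted dates such as 2022-11-29 into date objects rather than strings; the loaded dict here is a hand-written stand-in, not actual ruamel.yaml output.

```python
import datetime
import json

# Stand-in for a loaded YAML document (assumption: dates arrive as date objects).
loaded = {"id": 1234, "a_date": datetime.date(2022, 11, 29)}

try:
    json.dumps(loaded)
    failed = False
except TypeError as exc:
    # json cannot serialize date objects on its own.
    failed = True
    print("without default:", exc)

# default=str is called for every value json cannot serialize natively.
json_str = json.dumps(loaded, default=str)
print(json_str)  # {"id": 1234, "a_date": "2022-11-29"}
```

In this sketch, numeric values stay numbers while dates become strings, so the output types may differ from the all-strings JSON produced by the pure SQL approach.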

answered Mar 7, 2023 at 9:59

Aleix CC


1 Comment

Felipe Hoffa Over a year ago

FYI - I love Python in Snowflake, and UDFs make this way prettier than pure SQL -- but the performance is much better if you do SQL only. This depends of course on how much data you have
