2

I'm trying to convert this CSV

dir1/key1=val1/,MOVED_TO,123.pq
dir2/key1=val2/key2=val3/,MOVED_TO,456.pq

into this desired output (line-oriented json):

{"type":"quux","top"="dir1","key1":"val1","event":"MOVED_TO"}
{"type":"quux","top"="dir2","key1":"val2","key2":"val3","event":"MOVED_TO"}

So far I've gotten close, but I still have $key nested.

echo -e 'dir1/key1=val1/,MOVED_TO,123.pq
dir2/key1=val2/key2=val3/,MOVED_TO,456.pq' | \
  mlr --hi --records-per-batch 1 --csv --allow-ragged-csv-input --hi --no-jlistwrap --ojsonl \
  label dirname,event,filename \
  then put '$type = "quux"; $top = sub($dirname,"/.*",""); $key = splitkv(sub(sub($dirname, "^[^/]*/", ""), "/$", ""), "=", "/")' \
  then cut -f type,top,key,event

{"event": "MOVED_TO", "type": "quux", "top": "dir1", "key": {"key1": "val1"}}
{"event": "MOVED_TO", "type": "quux", "top": "dir2", "key": {"key1": "val2", "key2": "val3"}}

I've been trying to use concat(.) but that seems to work on arrays and not maps as I have there.

The "type":"quux" is a static string I need to assign during the process, it is not present in the raw CSV.

How can I unnest $key?

3
  • where is "type":"quux" in your input csv? Commented Nov 2 at 8:13
  • This process is done in various places where "quux" will change based on location. It is not in the raw data, ergo why I'm adding it as a string literal in the mlr code. Commented Nov 2 at 14:24
  • ok, I have a reply for you, without type field Commented Nov 3 at 10:00

1 Answer 1

1

I managed to solve this! The key was using Miller's nest --explode with the right combination of separators:

mlr --icsv --hi --ojsonl --from input.csv \
  cat \
  then put '$top=regextract($1,"dir\d+");$1=sub($1,"dir\d+/","");$1=sub($1,"/$","")' \
  then cat -n \
  then rename 1,keys,2,event \
  then nest --explode --pairs --across-fields --nested-fs "/" --nested-ps "=" -f keys \
  then cut -x -f 3

Output:

{"n": 1, "key1": "val1", "event": "MOVED_TO", "top": "dir1"}
{"n": 2, "key1": "val2", "key2": "val3", "event": "MOVED_TO", "top": "dir2"}

The key steps are:

  1. Extract the top directory prefix using regextract()

  2. Clean up the first field by removing the prefix and trailing slash

  3. Add a record number with cat -n

  4. Rename fields to prepare for nesting

  5. Use nest --explode --pairs --across-fields with --nested-fs "/" (field separator) and --nested-ps "=" (pair separator) to properly split and unnest the key-value pairs

  6. Remove the 3 column with cut -x -f 3

The crucial part was using --across-fields instead of --across-records, and specifying both --nested-fs for splitting the pairs and --nested-ps for splitting each key from its value. This properly flattens the nested structure into the parent record.

Sign up to request clarification or add additional context in comments.

5 Comments

adding then put '$type="quux" seals the deal, thanks aborruso!
Does adding a record number with cat -n enable anything or guard against anything? It isn't in the expected output and not something I need.
I prefer to add it to debug. You can remove it at the end
mlr is a little new to me, it's good to see good dev principles as well, and that makes a lot of sense. Thank you again!
I think it's a very great product. For other questions please write an issue in the official repo

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.