
I have about 800 transcripts from VODs in JSON format from openai/whisper and want to store them in Postgres, index the transcripts, and make them searchable as fast as possible with tsvector. This is part of a bigger archive storing hundreds of hours of transcripts.

This is an example json file I get from whisper (truncated segments, only text is important for this question):

{
  "segments": [
    {
      "id": 1,
      "start": 90.0,
      "end": 112.52,
      "text": "Ich habe heute endlich mal alles von dem iPhone, was ich von der Firma hatte umgestellt auf"
    },
    {
      "id": 2,
      "start": 112.52,
      "end": 117.24,
      "text": "mein neues iPhone, weil das andere muss ich abgeben \u00fcbermorgen."
    },
    {
      "id": 3,
      "start": 117.24,
      "end": 128.88,
      "text": "Es n\u00e4hert sich ja dem Ende alles. Mir ist wieder mal aufgefallen, wie abartig es ist,"
    },
    {
      "id": 4,
      "start": 128.88,
      "end": 136.16,
      "text": "Daten vom iPhone runterzubekommen. Also was ja in der Regel recht gut geht,"
    },
    {
      "id": 5,
      "start": 136.16,
      "end": 145.76,
      "text": "ist wenn man das gleich beim Setup macht, die einen Sachen vom alten aufs neue iPhone zu kopieren."
    }
  ]
}

I have a table called vods with multiple columns, but the two most important are:

  • Transcript - jsonb - Holding the raw json data
  • TranscriptVector - tsvector - indexed with gin

Now, I want to search the text value in jsonb using my gin index and only get back the matching array items in my object. What would my sql query look like to only return the matching segments?

So far, I'm using the following:

select * from vods where vods.transcript_vector @@ websearch_to_tsquery('german', 'iPhone')

but it gives me the whole jsonb. I only want the matching segments. I've tried using jsonb_array_elements(vods.transcript) but couldn't come up with a working solution.

I've created a dbfiddle here: https://dbfiddle.uk/F3XmSn-H


EDIT: I use Postgres 15.

1 Answer

I probably wouldn't store the data in this format in the first place, but...

Put the jsonb_array_elements in the FROM list, then just filter on the column derived from it (that column is named "value", unless you use an alias to change it):

SELECT *
FROM vods,
     jsonb_array_elements(jsonb_path_query_array(transcript, 'strict $.segments[*].text'))
WHERE vods.transcript_vector @@ websearch_to_tsquery('german', 'iPhone')
  AND to_tsvector('german', value) @@ websearch_to_tsquery('german', 'iPhone');
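Since you asked for the matching segments rather than just their text, a variant that expands the whole segment objects (so id, start and end come back too) might look like this. Note the seg alias is mine, and I'm assuming vods has some primary key to identify which VOD a segment belongs to; the question only names the two transcript columns:

```sql
SELECT seg.value AS segment          -- the full {"id", "start", "end", "text"} object
FROM vods,
     jsonb_array_elements(transcript -> 'segments') AS seg
WHERE vods.transcript_vector @@ websearch_to_tsquery('german', 'iPhone')
  AND to_tsvector('german', seg.value ->> 'text') @@ websearch_to_tsquery('german', 'iPhone');
```

The `->> 'text'` operator extracts the text field as plain text, so it can be fed straight into to_tsvector for the per-segment filter.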

The first @@ match is formally unnecessary, but it will quickly eliminate rows which have zero matching elements based on your gin index. (Doing that might be pointless on a table of only 800 rows, though.) This still has to dig through all the individual elements of every row which has at least one matching element. To avoid that, see my opening sentence.
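To spell out what the opening sentence hints at: a normalized layout with one row per segment lets the GIN index point at matching segments directly, with no per-row unnesting. A minimal sketch (all table and column names here are assumptions, not from the question; Postgres 15 supports the generated column):

```sql
-- Hypothetical normalized layout: one row per segment, each with its own
-- precomputed tsvector, so a GIN lookup returns matching segments directly.
CREATE TABLE vod_segments (
    vod_id      bigint  NOT NULL,   -- references vods
    segment_id  int     NOT NULL,
    start_time  numeric,
    end_time    numeric,
    text        text    NOT NULL,
    text_vector tsvector GENERATED ALWAYS AS (to_tsvector('german', text)) STORED
);

CREATE INDEX ON vod_segments USING gin (text_vector);

-- The search then needs only one index-backed predicate:
SELECT vod_id, segment_id, start_time, end_time, text
FROM vod_segments
WHERE text_vector @@ websearch_to_tsquery('german', 'iPhone');
```

Populating vod_segments would be a one-time jsonb_array_elements pass over the existing Transcript column; after that, searches never touch the jsonb at all.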
