I have about 800 transcripts of VODs in JSON format from openai/whisper and want to store them in Postgres, index the transcripts, and make them searchable as fast as possible with tsvector. This is part of a bigger archive storing hundreds of hours of transcripts.
This is an example JSON file I get from Whisper (segments truncated; only the text values matter for this question):
{
"segments": [
{
"id": 1,
"start": 90.0,
"end": 112.52,
"text": "Ich habe heute endlich mal alles von dem iPhone, was ich von der Firma hatte umgestellt auf"
},
{
"id": 2,
"start": 112.52,
"end": 117.24,
"text": "mein neues iPhone, weil das andere muss ich abgeben \u00fcbermorgen."
},
{
"id": 3,
"start": 117.24,
"end": 128.88,
"text": "Es n\u00e4hert sich ja dem Ende alles. Mir ist wieder mal aufgefallen, wie abartig es ist,"
},
{
"id": 4,
"start": 128.88,
"end": 136.16,
"text": "Daten vom iPhone runterzubekommen. Also was ja in der Regel recht gut geht,"
},
{
"id": 5,
"start": 136.16,
"end": 145.76,
"text": "ist wenn man das gleich beim Setup macht, die einen Sachen vom alten aufs neue iPhone zu kopieren."
}
]
}
I have a table called vods with multiple columns; the two most important are:
- transcript - jsonb - holding the raw JSON data
- transcript_vector - tsvector - indexed with GIN
Now, I want to search the text values inside the jsonb using my GIN index and get back only the matching array items from the object. What would the SQL query look like to return only the matching segments?
So far, I'm using the following:
select * from vods where vods.transcript_vector @@ websearch_to_tsquery('german', 'iPhone')
but it gives me the whole jsonb document. I only want the matching segments. I've tried using jsonb_array_elements(vods.transcript) but couldn't come up with a working solution.
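Roughly, the direction I've been experimenting with looks like the sketch below (column names transcript / transcript_vector as above; the seg alias is just illustrative). It unnests the segments array with a LATERAL call and re-checks each segment's text, but recomputing to_tsvector per segment seems to bypass the GIN index entirely, so I'm not sure this is the right approach:

```sql
-- Sketch of my attempt: filter rows via the indexed tsvector first,
-- then unnest the segments and re-test each one individually.
SELECT v.id,
       seg AS matching_segment
FROM vods v
CROSS JOIN LATERAL jsonb_array_elements(v.transcript -> 'segments') AS seg
WHERE v.transcript_vector @@ websearch_to_tsquery('german', 'iPhone')           -- uses the GIN index
  AND to_tsvector('german', seg ->> 'text') @@ websearch_to_tsquery('german', 'iPhone');  -- per-segment recheck, not indexed
```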
I've created a dbfiddle here: https://dbfiddle.uk/F3XmSn-H
EDIT: I'm using Postgres 15.