I need to query for the top few texts that are most similar to the input content. The table structure is as follows:
create table documents (
id bigserial primary key,
content text,
content_similarity_tokens text,
content_index_tokens text,
index_tokens_vector tsvector generated always as (to_tsvector('simple', content_index_tokens)) stored
);
create index search_index_tokens_vector on documents using gin (index_tokens_vector);
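For context, the similarity() calls below come from the pg_trgm extension; a minimal setup sketch (the sample row is hypothetical):

```sql
-- similarity() and the % operator are provided by pg_trgm
create extension if not exists pg_trgm;

-- hypothetical sample row for the examples below
insert into documents (content, content_similarity_tokens, content_index_tokens)
values ('I am 18 years old', 'I am 18 years old', 'I am 18 years old');
```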
-------------------------------------------------------------------------------------------
table documents comment:
content: the original content
content_similarity_tokens: content with punctuation removed
content_index_tokens: tokenized content
index_tokens_vector: tsvector built from content_index_tokens
This works very well for full-text search on user-entered text, for example:
select * from documents where index_tokens_vector @@ plainto_tsquery('user input text tokens');
But I ran into problems when querying for similar content. Querying directly with the similarity function works, but it is very slow once the table grows large:
select id, content, similarity(content, 'user input text') as sim
from documents
where similarity(content, 'user input text') > 0.7
order by sim desc;
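For reference, pg_trgm also offers an operator form of this query: the % operator compares against the threshold set by set_limit() (default 0.3), and unlike a similarity(...) > 0.7 predicate it can use a trigram index. A sketch of the equivalent query, assuming pg_trgm is installed:

```sql
-- set the similarity threshold that % compares against
select set_limit(0.7);

select id, content, similarity(content, 'user input text') as sim
from documents
where content % 'user input text'  -- true when similarity exceeds the limit
order by sim desc;
```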
So I tried another approach: first tokenize the user input, then pre-filter rows through the index_tokens_vector field before performing the similarity match, e.g.:
select id, content, similarity(content, 'user input text') as sim
from documents
where index_tokens_vector @@ plainto_tsquery('user input text tokens')
and similarity(content, 'user input text') > 0.7
order by sim desc;
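To see where the time goes, the plan can be inspected; as I understand it, the GIN index serves the @@ filter, but similarity() must still be computed for every row that survives it:

```sql
-- inspect the plan: an index scan for the @@ condition, then a
-- per-row similarity() evaluation on everything that passes
explain analyze
select id, content, similarity(content, 'user input text') as sim
from documents
where index_tokens_vector @@ plainto_tsquery('user input text tokens')
  and similarity(content, 'user input text') > 0.7
order by sim desc;
```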
It looks effective, but I still found some issues:
1. If the result set remains large after the where index_tokens_vector @@ plainto_tsquery('user input text tokens') filter, the query is still slow.
2. The pre-filter can drop highly similar rows. For example, text1 in the database, 'I am 18 years old', is tokenized as 18 / years / old, while the user input text2, 'I am 19 years old', is tokenized as 19 / years / old. Since the token 19 does not appear in text1's tokens, that record is filtered out, even though the two texts are very similar:
select similarity('I am 18 years old', 'I am 19 years old');
-- similarity: 0.8
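The mismatch between the two filters can be reproduced directly (assuming the 'simple' configuration used by the generated column):

```sql
-- the full-text pre-filter rejects the pair...
select to_tsvector('simple', '18 years old') @@ plainto_tsquery('simple', '19 years old');
-- false

-- ...even though their trigram similarity is high
select similarity('I am 18 years old', 'I am 19 years old');
-- 0.8
```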
So, how can I improve the query speed for similar content?