
I need to retrieve the top few texts that are most similar to a given input. The table structure is as follows:

create table documents (
    id                                 bigserial primary key,
    content                            text,
    content_similarity_tokens          text,
    content_index_tokens               text,
    index_tokens_vector                tsvector generated always as (to_tsvector('simple', content_index_tokens)) stored
);

create index search_index_tokens_vector on documents using gin (index_tokens_vector);
Column comments for table documents:
    content: the raw text
    content_similarity_tokens: content with punctuation removed
    content_index_tokens: the tokenized text
    index_tokens_vector: tsvector generated from content_index_tokens

Full-text search on user-entered text works very well, for example:

select * from documents where index_tokens_vector @@ plainto_tsquery('user input text tokens');

But I ran into problems when querying for similar content.

Using the similarity function directly works, but the query becomes very slow once the table holds a lot of data:

select id, content, similarity(content, 'user input text') as sim
from documents
where similarity(content, 'user input text') > 0.7
order by sim desc;
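
One of the comments below suggests using a proper index and operator instead of a bare function call. A sketch of that approach, assuming the pg_trgm extension is available: a trigram GIN index on content lets the % operator (similarity above the current threshold) use the index, instead of computing similarity() for every row.

create extension if not exists pg_trgm;

-- trigram index so the % operator can be index-assisted
create index documents_content_trgm on documents using gin (content gin_trgm_ops);

-- % compares against the pg_trgm similarity threshold (default 0.3)
select set_limit(0.7);

select id, content, similarity(content, 'user input text') as sim
from documents
where content % 'user input text'  -- index-assisted, replaces similarity(...) > 0.7
order by sim desc;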

So I tried another approach: tokenize the user input first, then filter rows through the index_tokens_vector field before computing similarity, e.g.:

select id, content, similarity(content, 'user input text') as sim
from documents
where index_tokens_vector @@ plainto_tsquery('user input text tokens')
  and similarity(content, 'user input text') > 0.7
order by sim desc;

This looks effective, but there are still some issues.

1. If the result set is still large after the where index_tokens_vector @@ plainto_tsquery('user input text tokens') filter, the query remains slow, because similarity() must still be computed for every remaining row.

2. The token filter can drop highly similar rows. For example, the stored text1 'I am 18 years old' is tokenized as '18 years old', while the user input text2 'I am 19 years old' is tokenized as '19 years old'. Since the token '19' does not appear in text1's tokens, that row is filtered out, even though the two texts are very similar:

select similarity('I am 18 years old', 'I am 19 years old');
-- similarity: 0.8
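
One way around this second issue is to skip the token pre-filter entirely and let a trigram index select the candidates. A sketch, again assuming pg_trgm: its <-> operator (1 minus similarity) supports index-assisted nearest-neighbour ordering with a GiST index, so a LIMIT query can return the top matches without scanning the whole table or discarding near matches like the one above.

-- GiST variant of the trigram index, needed for <-> ordering
create index documents_content_trgm_gist on documents using gist (content gist_trgm_ops);

select id, content, 1 - (content <-> 'I am 19 years old') as sim
from documents
order by content <-> 'I am 19 years old'  -- KNN ordering via the GiST index
limit 5;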

So, how can I improve the query speed for finding similar content?

  • There is no "magically do what I want". You have to define very precisely what you mean by "similar content". Note that full-text search is not usable for similarity search at all. Commented Jan 30, 2024 at 12:29
  • Use a proper index and operator (not function) for the similarity search: dba.stackexchange.com/questions/103821/… Commented Jan 30, 2024 at 14:00

