3

After solving the problem I asked about in this question, I am trying to optimize performance of the FTS using indexes. I issued on my db the command:

CREATE INDEX my_table_idx ON my_table USING gin(to_tsvector('italian', very_important_field), to_tsvector('italian', also_important_field), to_tsvector('italian', not_so_important_field), to_tsvector('italian', not_important_field), to_tsvector('italian', tags));

Then I edited my model's Meta class as follows:

class MyEntry(models.Model):
    very_important_field = models.TextField(blank=True, null=True)
    also_important_field = models.TextField(blank=True, null=True)
    not_so_important_field = models.TextField(blank=True, null=True)
    not_important_field = models.TextField(blank=True, null=True)
    tags = models.TextField(blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'my_table'
        indexes = [
            GinIndex(
                fields=['very_important_field', 'also_important_field', 'not_so_important_field', 'not_important_field', 'tags'],
                name='my_table_idx'
            )
        ]

But nothing seems to have changed. The lookup takes exactly the same amount of time as before.

This is the lookup script:

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

# other unrelated stuff here
vector = SearchVector("very_important_field", weight="A") + \
             SearchVector("tags", weight="A") + \
             SearchVector("also_important_field", weight="B") + \
             SearchVector("not_so_important_field", weight="C") + \
             SearchVector("not_important_field", weight="D")
query = SearchQuery(search_string, config="italian")
rank = SearchRank(vector, query, weights=[0.4, 0.6, 0.8, 1.0]). # D, C, B, A
full_text_search_qs = MyEntry.objects.annotate(rank=rank).filter(rank__gte=0.4).order_by("-rank")

What am I doing wrong?

Edit:

The above lookup is wrapped in a function I use a decorator on to time. The function actually returns a list, like this:

@timeit
def search(search_string):
    # the above code here
    qs = list(full_text_search_qs)
    return qs

Might this be the problem, maybe?

2
  • Can you add the informations of your model and your query instead adding link ? Commented Feb 12, 2018 at 19:03
  • As requested, I edited the post. Commented Feb 12, 2018 at 21:10

2 Answers 2

8

You need to add a SearchVectorField to your MyEntry, update it from your actual text fields and then perform the search on this field. However, the update can only be performed after the record has been saved to the database.

Essentially:

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField

class MyEntry(models.Model):
    # The fields that contain the raw data.
    very_important_field = models.TextField(blank=True, null=True)
    also_important_field = models.TextField(blank=True, null=True)
    not_so_important_field = models.TextField(blank=True, null=True)
    not_important_field = models.TextField(blank=True, null=True)
    tags = models.TextField(blank=True, null=True)

    # The field we actually going to search.
    # Must be null=True because we cannot set it immediately during create()
    search_vector = SearchVectorField(editable=False, null=True)  

    class Meta:
        # The search index pointing to our actual search field.
        indexes = [GinIndex(fields=["search_vector"])]

Then you can create the plain instance as usual, for example:

# Does not set MyEntry.search_vector yet.
my_entry = MyEntry.objects.create(
    very_important_field="something very important",  # Fake Italien text ;-)
    also_important_field="something different but equally important"
    not_so_important_field="this one matters less"
    not_important_field="we don't care are about that one at all"
    tags="things, stuff, whatever"

Now that the entry exists in the database, you can update the search_vector field using all kinds of options. For example weight to specify the importance and config to use one of the default language configurations. You can also completely omit fields you don't want to search:

# Update search vector on existing database record.
my_entry.search_vector = (
    SearchVector("very_important_field", weight="A", config="italien")
    + SearchVector("also_important_field", weight="A", config="italien")
    + SearchVector("not_so_important_field", weight="C", config="italien")
    + SearchVector("tags", weight="B", config="italien")
)
my_entry.save()

Manually updating the search_vector field every time some of the text fields change can be error prone, so you might consider adding an SQL trigger to do that for you using a Django migration. For an example on how to do that see for instance a blog article on Full-text Search with Django and PostgreSQL.

To actually search in MyEntry using the index you need to filter and rank by your search_vector field. The config for the SearchQuery should match the one of the SearchVector above (to use the same stopword, stemming etc).

For example:

from django.contrib.postgres.search import SearchQuery, SearchRank
from django.core.exceptions import ValidationError
from django.db.models import F, QuerySet

search_query = SearchQuery("important", search_type="websearch", config="italien")
search_rank = SearchRank(F("search_vector"), search_query)
my_entries_found = (
    MyEntry.objects.annotate(rank=search_rank)
    .filter(search_vector=search_query)  # Perform full text search on index.
    .order_by("-rank")  # Yield most relevant entries first.
)
Sign up to request clarification or add additional context in comments.

1 Comment

amazing, this worked right out of the box! thank you so much!
0

I'm not sure but according to postgresql documentation (https://www.postgresql.org/docs/9.5/static/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX):

Because the two-argument version of to_tsvector was used in the index above, only a query reference that uses the 2-argument version of to_tsvector with the same configuration name will use that index. That is, WHERE to_tsvector('english', body) @@ 'a & b' can use the index, but WHERE to_tsvector(body) @@ 'a & b' cannot. This ensures that an index will be used only with the same configuration used to create the index entries.

I don't know what configuration django uses but you can try to remove first argument

6 Comments

What do you mean by removing the first argument? Where should I remove it? If I remove the argument in to_tsvector() I won't be able to search in the correct language.
Do you use this language argument as a config parametr in searchquery?
Also, you can try to store ts_vector data as an additional field in your table, create index on that field and use it in your search queries. And how do you evaluate the amount of time?
I edited the question with some more info about the timing. I'll try to add a field as you suggested and come back with a feedback. If the problem is the fact that I turn the QuerySet into a list I'll have another major problem, because I'm using graphql and there's a bug for which I must return a list when annotating and using order_by("-rank")
And yes, the language argument is used both when I create the index and when I make the query. I tried playing with it but with no effect.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.