
I'm using the OpenAI text-embedding-3-small model to create embeddings for each product category in a file. There are about 6,000 product categories in total, and they look like this:

Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Off-Road & All-Terrain Vehicle Protective Gear
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Off-Road & All-Terrain Vehicle Protective Gear > ATV & UTV Bar Pads
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks > Automotive Alarm Accessories
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks > Automotive Alarm Systems
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks > Motorcycle Alarms & Locks

For each line in that file, I'm using the following code to generate an embedding:

from openai import OpenAI
client = OpenAI()

response = client.embeddings.create(
    input="Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks",
    model="text-embedding-3-small",
    encoding_format="float",
    dimensions=512
)

I'm storing the embeddings in a vector database (Cosmos DB for MongoDB) and running a vector similarity search on it to help customers find the best category for the product title they enter. The search usually works very well, but sometimes the results are bad. For example, when I search for "Pinus Sylvestris", which is the name of a plant, an entirely wrong product category is suggested.

My question: Is it OK to pass the product category to the model in that hierarchical representation (with the > character)? Is there a way to tell the model that this is a product category for an e-commerce website, so that it understands the input better?

Edit: Adding the query code:

from openai import OpenAI
from pymongo import MongoClient
import sys

MONGODB_CON_STR = "XXXXXX"  # connection string redacted

db = MongoClient(MONGODB_CON_STR)["shop"]
client = OpenAI()

def get_vector_for_text(text: str) -> list[float]:
    # Request a 512-dimensional embedding, matching the stored category vectors.
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small",
        encoding_format="float",
        dimensions=512
    )
    return response.data[0].embedding

 
for line in sys.stdin:
    # Strip the trailing newline so it is not embedded along with the title.
    title = line.strip()
    if not title:
        continue
    query_vector = get_vector_for_text(title)
    res = db["product_taxonomy"].aggregate([
        {
            "$search": {
                "cosmosSearch": {
                    "vector": query_vector,
                    "path": "vector",
                    "k": 2
                },
                "returnStoredSource": True
            }
        },
        {
            "$project": {
                "similarityScore": {"$meta": "searchScore"},
                "document": "$$ROOT"
            }
        }
    ])

    # Iterating the cursor drains it, so no outer loop is needed.
    for doc in res:
        print(f'\tsimilarityScore: {doc["similarityScore"]} {doc["document"]["text"]}')
    print('\n')
  • You could try replacing ">" with "/" and adding a short "role prefix" such as "E-commerce category path:" to give the model context. Commented Aug 17 at 3:22
  • For example, change "Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks > Motorcycle Alarms & Locks" to "E-commerce category path: Vehicles & Parts/Vehicle Parts & Accessories/Vehicle Safety & Security/Vehicle Alarms & Locks/Motorcycle Alarms & Locks". This simple hint might reduce the weird matches. Commented Aug 17 at 3:24
  • Could you please edit your question to include the relevant query code? Commented Aug 19 at 9:35
  • @meysam I've added the query code. Commented Aug 19 at 18:39
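
The formatting suggested in the comments (swap the ">" separators for "/" and prepend a short context prefix) could be sketched roughly as follows. This is only an illustration: `format_category_path` and `CATEGORY_PREFIX` are hypothetical names, not part of the OpenAI or pymongo APIs, and whether the change actually improves the matches has not been verified here.

```python
# Hypothetical helper illustrating the comment's suggestion; not a library function.
CATEGORY_PREFIX = "E-commerce category path: "  # prefix wording taken from the comment


def format_category_path(category: str, prefix: str = CATEGORY_PREFIX) -> str:
    """Replace the '>' separators with '/' and prepend a context prefix."""
    # Split on '>' and trim whitespace so 'A > B' and 'A>B' behave the same.
    parts = [part.strip() for part in category.split(">")]
    # Drop empty segments, e.g. those left by a trailing separator.
    parts = [part for part in parts if part]
    return prefix + "/".join(parts)


example = (
    "Vehicles & Parts > Vehicle Parts & Accessories > "
    "Vehicle Safety & Security > Vehicle Alarms & Locks"
)
formatted = format_category_path(example)
print(formatted)
```

The formatted string would then be passed as `input` to `client.embeddings.create` instead of the raw line. The stored category vectors would have to be regenerated from the reformatted text, since the existing ones were computed from the ">" form.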
