I'm using the OpenAI text-embedding-3-small model to create embeddings for each product category in a file. In total there are about 6,000 product categories, and they look like this:
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Off-Road & All-Terrain Vehicle Protective Gear
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Off-Road & All-Terrain Vehicle Protective Gear > ATV & UTV Bar Pads
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks > Automotive Alarm Accessories
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks > Automotive Alarm Systems
Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks > Motorcycle Alarms & Locks
For each line in that file, I'm using the following code to generate an embedding:
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    input="Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks",
    model="text-embedding-3-small",
    encoding_format="float",
    dimensions=512
)
I'm storing the embeddings in a vector database (Cosmos DB for MongoDB) and running a vector similarity search on it to help customers find the best possible category for the product title they enter. The similarity search works very well, but sometimes I get bad results. For example, when I search for "Pinus Sylvestris", which is the name of a plant, an entirely wrong product category is suggested.
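One reason queries like "Pinus Sylvestris" can go wrong is that a k-nearest-neighbor search always returns k results, however far away they are. A possible guard is to reject matches whose similarity score falls below a cutoff — a minimal plain-Python sketch, where the 0.5 threshold is an illustrative assumption (not a Cosmos DB API), to be tuned against real queries:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_match(query_vec, candidates, threshold=0.5):
    """candidates: list of (category_text, vector) pairs; must be non-empty.
    Returns the best-scoring category, or None when even the best
    score is below the threshold (i.e. 'no good category found')."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in candidates]
    score, text = max(scored)
    return text if score >= threshold else None
```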
My question: Is it OK to pass the product category in that hierarchical representation (with the > character) into the model? And is there a way to tell the model that the input is a product category for an e-commerce website, so that it understands it better?
Edit: Adding the query code:
from openai import OpenAI
from pymongo import MongoClient
import sys

MONGODB_CON_STR = "XXXXXX"  # redacted connection string
db = MongoClient(MONGODB_CON_STR)["shop"]
client = OpenAI()

def get_vector_for_text(input: str):
    response = client.embeddings.create(
        input=input,
        model="text-embedding-3-small",
        encoding_format="float",
        dimensions=512
    )
    return response.data[0].embedding

for line in sys.stdin:
    # strip the trailing newline so it doesn't end up in the embedded text
    query_vector = get_vector_for_text(line.strip())
    res = db["product_taxonomy"].aggregate([
        {
            "$search": {
                "cosmosSearch": {
                    "vector": query_vector,
                    "path": "vector",
                    "k": 2
                },
                "returnStoredSource": True
            }
        },
        {
            "$project": {
                "similarityScore": {"$meta": "searchScore"},
                "document": "$$ROOT"
            }
        }
    ])
    for doc in res:
        print(f'\tsimilarityScore: {doc["similarityScore"]} {doc["document"]["text"]}')
    print()
One suggestion: replace the > separator with / and add a tiny "role prefix" (it gives the model context). For example, turn

Vehicles & Parts > Vehicle Parts & Accessories > Vehicle Safety & Security > Vehicle Alarms & Locks > Motorcycle Alarms & Locks

into

E-commerce category path: Vehicles & Parts/Vehicle Parts & Accessories/Vehicle Safety & Security/Vehicle Alarms & Locks/Motorcycle Alarms & Locks

This simple hint might reduce the weird matches.
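That rewrite can be sketched as a small helper to run over each line of the category file before embedding — the prefix wording is just the example above, not anything the model requires:

```python
def to_embedding_text(category_path, prefix="E-commerce category path: "):
    """Convert 'A > B > C' into '<prefix>A/B/C'.

    Splits on the '>' separator, strips surrounding whitespace from each
    level, and rejoins with '/' under the given role prefix.
    """
    parts = [p.strip() for p in category_path.split(">")]
    return prefix + "/".join(parts)
```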