
I have vectors stored in a Pinecone vector store; each vector represents the content of a PDF file:

Metadata: hash_code: "d53d7ec8b0e66e9a83a97acda09edd3fe9867cadb42833f9bf5525cc3b89fe2d", id: "cc54ffbe-9cba-4de9-9f30-a114e4c3c3fb"

I added a new field to the metadata, hash_code, which is a hash of the PDF content, so the same file isn't added to the vector store again and again.

To do that, I compute the hash codes of the new documents I want to add, then scan the existing vectors to check whether any of those hashes already exist, and filter those documents out.
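For context, the hash_code is computed from the raw file bytes, roughly like this (a SHA-256 digest, which matches the 64-character hex value above; the file name is just an example):

    import hashlib

    def content_hash(pdf_bytes: bytes) -> str:
        # SHA-256 hex digest of the raw PDF bytes; this is the value
        # stored in the `hash_code` metadata field for every vector of that file
        return hashlib.sha256(pdf_bytes).hexdigest()

    with open("report.pdf", "rb") as f:  # example file name
        print(content_hash(f.read()))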

I'm using Python and tried the code below, but haven't managed to achieve my goal yet:

First method:

    def filter_existing_docs(index_name, docs):
        # Initialize the Pinecone index
        index = pinecone_client.Index(index_name)

        # Extract the hash_codes from the docs' metadata
        hash_codes = [doc.metadata['hash_code'] for doc in docs]
        print("Hash Codes:", hash_codes)

        # Fetch by the list of hash_codes (treating the hash_codes as vector ids)
        fetch_response = index.fetch(ids=hash_codes)
        print("Fetch Response:", fetch_response)

        # Get the hash_codes that are already in the Pinecone index
        existing_hash_codes = set(fetch_response.get('vectors', {}).keys())
        print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))

        # Filter out the docs that have already been added to Pinecone
        filtered_docs = [doc for doc in docs if doc.metadata['hash_code'] not in existing_hash_codes]
        print("2 -----------> Filtered Docs:", len(filtered_docs))

        return filtered_docs

Then I tried another approach:

    def filter_existing_docs(index_name, docs):
        # Initialize the Pinecone index
        index = pinecone_client.Index(index_name)

        # Extract the hash_codes from the docs' metadata
        hash_codes = [doc.metadata['hash_code'] for doc in docs]
        print("Hash Codes:", hash_codes)

        # Query Pinecone with `top_k` to scan through the index
        query_response = index.query(
            top_k=100,  # set a suitable `top_k` to return a reasonable number of documents
            include_metadata=True,
            # namespace=namespace
        )

        # Debug: print the query response to see its structure
        print("Query Response:", query_response)

        # Extract the hash_codes of the documents already in Pinecone
        existing_hash_codes = {item['metadata']['hash_code'] for item in query_response['matches']}
        print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))

        # Filter out the docs whose hash_code is already in Pinecone
        filtered_docs = [doc for doc in docs if str(doc.metadata['hash_code']) not in existing_hash_codes]
        print("2 -----------> Filtered Docs:", len(filtered_docs))

        return filtered_docs

2 Answers

  1. Iterate through the new document hash codes
  2. Query Pinecone using each hash as a metadata filter
  3. If there are 0 results, add the corresponding file. Else, the file is already present, so skip it. A sketch of this loop follows below.
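A minimal sketch, reusing the pinecone_client from the question (assumptions: dim=1536 is only an example and must match your index's dimension, and since query() requires a vector, a zero vector is passed purely to satisfy the API while the metadata filter does the real work):

    def filter_existing_docs(index_name, docs, dim=1536):
        index = pinecone_client.Index(index_name)
        dummy_vector = [0.0] * dim  # placeholder; only the metadata filter matters

        new_docs = []
        for doc in docs:
            # Look for any existing vector whose metadata carries this hash
            res = index.query(
                vector=dummy_vector,
                top_k=1,
                filter={"hash_code": {"$eq": doc.metadata["hash_code"]}},
            )
            if not res["matches"]:  # 0 results -> the file is not in the index yet
                new_docs.append(doc)
        return new_docs

This costs one query per new document; for large batches, a single query with an $in filter over all the new hashes would cut down the round trips.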



You can create a naming convention for each chunk, like "doc1#hash", "doc2#hash".
You can also filter records based on ID prefixes, e.g.:

    for ids in index.list(prefix='doc1#', namespace=''):
        print(ids)

You can use any prefix pattern you like, but make sure you use a consistent prefix pattern for all child records of a document.

For example:

    doc1#chunk1
    doc1_chunk1
    doc1___chunk1
    doc1:chunk1
    doc1chunk1
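
Tying this back to the question, a rough sketch: use the content hash itself as the prefix when upserting chunks (IDs like f"{hash_code}#chunk{i}" is a naming choice, not something Pinecone requires), and the duplicate check becomes a prefix listing (note that list() is supported on serverless indexes):

    def doc_exists(index, hash_code, namespace=""):
        # True if any vector ID starts with "<hash_code>#",
        # i.e. at least one chunk of this file was already upserted
        for ids in index.list(prefix=f"{hash_code}#", namespace=namespace):
            if ids:
                return True
        return False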

Reference: pinecone-docs/id-prefixes

