
I have vectors stored in a Pinecone vector store; each vector represents the content of a PDF file:

Metadata: hash_code: "d53d7ec8b0e66e9a83a97acda09edd3fe9867cadb42833f9bf5525cc3b89fe2d", id: "cc54ffbe-9cba-4de9-9f30-a114e4c3c3fb"

I added a new field to the metadata, hash_code, which is a hash of the PDF content, so the same file isn't added to the vector store again and again.

To do that, I compute the hash codes of the new documents I want to add, then scan the existing vectors to check whether any of those hashes already exist, and filter those documents out.
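For context, the hash_code is computed from the raw file bytes, roughly like this (a SHA-256 digest, which matches the 64-character hex value above; the file name is just an example):

    import hashlib

    def content_hash(pdf_bytes: bytes) -> str:
        # SHA-256 hex digest of the raw PDF bytes; this is the value
        # stored in the `hash_code` metadata field for every vector of that file
        return hashlib.sha256(pdf_bytes).hexdigest()

    with open("report.pdf", "rb") as f:  # example file name
        print(content_hash(f.read()))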

I'm using Python and tried the code below, but haven't managed to achieve my goal yet:

First method:

    def filter_existing_docs(index_name, docs):
        # Initialize the Pinecone index
        index = pinecone_client.Index(index_name)

        # Extract the hash_codes from the docs' metadata
        hash_codes = [doc.metadata['hash_code'] for doc in docs]
        print("Hash Codes:", hash_codes)

        # Fetch by the list of hash_codes (treating the hash_codes as vector ids)
        fetch_response = index.fetch(ids=hash_codes)
        print("Fetch Response:", fetch_response)

        # Get the hash_codes that are already in the Pinecone index
        existing_hash_codes = set(fetch_response.get('vectors', {}).keys())
        print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))

        # Filter out the docs that have already been added to Pinecone
        filtered_docs = [doc for doc in docs if doc.metadata['hash_code'] not in existing_hash_codes]
        print("2 -----------> Filtered Docs:", len(filtered_docs))

        return filtered_docs

Then I tried another approach:

    def filter_existing_docs(index_name, docs):
        # Initialize the Pinecone index
        index = pinecone_client.Index(index_name)

        # Extract the hash_codes from the docs' metadata
        hash_codes = [doc.metadata['hash_code'] for doc in docs]
        print("Hash Codes:", hash_codes)

        # Query Pinecone with `top_k` to scan through the index
        query_response = index.query(
            top_k=100,  # set a suitable `top_k` to return a reasonable number of documents
            include_metadata=True,
            # namespace=namespace
        )

        # Debug: print the query response to see its structure
        print("Query Response:", query_response)

        # Extract the hash_codes of the documents already in Pinecone
        existing_hash_codes = {item['metadata']['hash_code'] for item in query_response['matches']}
        print("1 -----------> Existing Hash Codes:", len(existing_hash_codes))

        # Filter out the docs whose hash_code is already in Pinecone
        filtered_docs = [doc for doc in docs if str(doc.metadata['hash_code']) not in existing_hash_codes]
        print("2 -----------> Filtered Docs:", len(filtered_docs))

        return filtered_docs

2 Answers

  1. Iterate through the new document hash codes
  2. Query Pinecone using each hash as a metadata filter
  3. If there are 0 results, add the corresponding file. Else, the file is already present, so skip it. A sketch of this loop follows below.
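A minimal sketch, reusing the pinecone_client from the question (assumptions: dim=1536 is only an example and must match your index's dimension, and since query() requires a vector, a zero vector is passed purely to satisfy the API while the metadata filter does the real work):

    def filter_existing_docs(index_name, docs, dim=1536):
        index = pinecone_client.Index(index_name)
        dummy_vector = [0.0] * dim  # placeholder; only the metadata filter matters

        new_docs = []
        for doc in docs:
            # Look for any existing vector whose metadata carries this hash
            res = index.query(
                vector=dummy_vector,
                top_k=1,
                filter={"hash_code": {"$eq": doc.metadata["hash_code"]}},
            )
            if not res["matches"]:  # 0 results -> the file is not in the index yet
                new_docs.append(doc)
        return new_docs

This costs one query per new document; for large batches, a single query with an $in filter over all the new hashes would cut down the round trips.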



You can create a naming convention for each chunk, like "doc1#hash", "doc2#hash".
You can also filter records based on ID prefixes, e.g.:

    for ids in index.list(prefix='doc1#', namespace=''):
        print(ids)

You can use any prefix pattern you like, but make sure you use a consistent prefix pattern for all child records of a document.

For example:

    doc1#chunk1
    doc1_chunk1
    doc1___chunk1
    doc1:chunk1
    doc1chunk1
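
Tying this back to the question, a rough sketch: use the content hash itself as the prefix when upserting chunks (IDs like f"{hash_code}#chunk{i}" is a naming choice, not something Pinecone requires), and the duplicate check becomes a prefix listing (note that list() is supported on serverless indexes):

    def doc_exists(index, hash_code, namespace=""):
        # True if any vector ID starts with "<hash_code>#",
        # i.e. at least one chunk of this file was already upserted
        for ids in index.list(prefix=f"{hash_code}#", namespace=namespace):
            if ids:
                return True
        return False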

Reference: pinecone-docs/id-prefixes

