4

I ingested all docs and created a collection / embeddings using Chroma. I have a local directory db. Within db there is chroma-collections.parquet and chroma-embeddings.parquet. These are not empty. Chroma-collections.parquet when opened returns a collection name, uuid, and null metadata.

When I load it later using langchain, the collection comes back empty.

from chromadb.config import Settings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
CHROMA_SETTINGS = Settings(
        chroma_db_impl='duckdb+parquet',
        persist_directory='db',
        anonymized_telemetry=False
)

db = Chroma(persist_directory='db', embedding_function=embeddings, client_settings=CHROMA_SETTINGS)

db.get() returns {'ids': [], 'embeddings': None, 'documents': [], 'metadatas': []}

I've tried lots of other alternate approaches online. E.g.

import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet",
                                    persist_directory='./db'))
coll = client.get_or_create_collection("langchain", embedding_function=embeddings)
coll.count() returns 0

I'm expecting all the docs and embeddings to be available. What am I missing?

2
  • Remove chroma_db_impl from the Chroma settings directly. But I still have the problem that the database files aren't created after db.persist(). Commented Oct 27, 2023 at 3:07
  • Another alternative is to downgrade langchain to 0.0.322 and chromadb to 0.3.29, and keep duckdb==0.71 installed. Commented Oct 27, 2023 at 4:16
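
Since the comments above suggest the installed chromadb version matters, it can help to check which storage layout the persist directory actually contains before loading it. This is only a rough sketch: detect_chroma_layout is a made-up helper, the parquet file names are the ones reported in the question, and "chroma.sqlite3" is my assumption about what newer chromadb releases write.

```python
from pathlib import Path

# File names are assumptions: the parquet pair is what the question reports,
# "chroma.sqlite3" is what newer chromadb releases are believed to write.
LEGACY_FILES = {"chroma-collections.parquet", "chroma-embeddings.parquet"}
SQLITE_FILE = "chroma.sqlite3"


def detect_chroma_layout(persist_directory):
    """Guess which storage layout a persist directory contains."""
    root = Path(persist_directory)
    if not root.is_dir():
        return "missing"
    # Only look at the top-level files in the directory.
    names = {p.name for p in root.iterdir() if p.is_file()}
    has_legacy = LEGACY_FILES <= names
    has_sqlite = SQLITE_FILE in names
    if has_legacy and has_sqlite:
        return "mixed"
    if has_sqlite:
        return "sqlite"
    if has_legacy:
        return "duckdb+parquet"
    return "unknown"


if __name__ == "__main__":
    print(detect_chroma_layout("db"))
```

If this reports "duckdb+parquet" while a newer chromadb is installed, the empty collection may simply mean the newer client is not reading the old files, which would fit the downgrade suggestion.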

5 Answers

3

You need to pass the same collection_name when saving and when loading the Chroma database.

save to disk

db2 = Chroma.from_documents(docs, embedding_function,  persist_directory="./chroma_db", collection_name='v_db')
db2.persist()
docs = db2.similarity_search(query)

load from disk

db3 = Chroma(collection_name='v_db', persist_directory="./chroma_db", embedding_function=embedding_function)
docs = db3.similarity_search(query)
print(docs[0].page_content)
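
To make it harder for the two calls to drift apart, the shared values can live in one place. A minimal sketch, not the library's own API: save_kwargs and load_kwargs are hypothetical helpers, and the only LangChain names used are the keyword arguments (embedding for from_documents, embedding_function for the constructor). If collection_name is left out, LangChain falls back to a default name ("langchain" in the versions I have seen), so omitting it on only one side gives the same mismatch.

```python
# Hypothetical helper names; only collection_name, persist_directory,
# embedding and embedding_function come from the LangChain Chroma wrapper.
COLLECTION_NAME = "v_db"
PERSIST_DIRECTORY = "./chroma_db"


def save_kwargs(embedding):
    """Keyword arguments for Chroma.from_documents(docs, **save_kwargs(emb))."""
    return {
        "embedding": embedding,
        "persist_directory": PERSIST_DIRECTORY,
        "collection_name": COLLECTION_NAME,
    }


def load_kwargs(embedding):
    """Keyword arguments for Chroma(**load_kwargs(emb))."""
    return {
        "embedding_function": embedding,
        "persist_directory": PERSIST_DIRECTORY,
        "collection_name": COLLECTION_NAME,
    }


# Usage (requires langchain and chromadb, so left as comments):
# db2 = Chroma.from_documents(docs, **save_kwargs(embedding_function))
# db3 = Chroma(**load_kwargs(embedding_function))
```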

1

It looks like the langchain documentation was wrong: https://github.com/langchain-ai/langchain/issues/19807

You can change

from langchain_community.vectorstores import Chroma

to

from langchain_community.vectorstores.chroma import Chroma

1 Comment

Thanks buddy. I was looking all over the internet for the RC. The moment I switched to this package it started working, and now it all makes sense.
0

I had this problem too and found it was because my program ran chromadb in Jupyter Lab (the same applies to Jupyter Notebook).

An example in the official chromadb Git repository says:

In a notebook, we should call persist() to ensure the embeddings are written to disk. This isn't necessary in a script - the database will be automatically persisted when the client object is destroyed.

So, if your program also runs in a Jupyter environment, the best way is to call client.persist() every time you need to save your modifications to chromadb's local persistence. Example code follows:

import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet",
                                    persist_directory='./db'))
coll = client.get_or_create_collection("langchain", embedding_function=embeddings)

... # any modifications on chromadb, including create, upsert, delete...

client.persist() # save modifications above to chroma's local persistence
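
If remembering the persist() call in every notebook cell gets tedious, it can be wrapped in a small context manager. This is just a sketch: persisted is a hypothetical helper, and it only assumes the client may expose a persist() method, as the duckdb+parquet client above does. Clients without that method are left alone.

```python
from contextlib import contextmanager


@contextmanager
def persisted(client):
    """Yield the client and call client.persist() afterwards, if it has one."""
    try:
        yield client
    finally:
        # Not every client exposes persist(), so look it up defensively.
        persist = getattr(client, "persist", None)
        if callable(persist):
            persist()


# Usage (requires chromadb, so left as comments):
# with persisted(client) as c:
#     coll = c.get_or_create_collection("langchain")
#     coll.add(ids=["1"], documents=["hello"])
```

Because persist() sits in the finally clause, it also runs when the body raises, so whatever was written before the error still gets flushed.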

0

If you use PersistentClient, the collection is automatically saved to the database on add, update, or upsert.

import chromadb

client = chromadb.PersistentClient("C:\\Users\\me\\python_files\\python-deep-learning-master")
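
Backslashes in Windows paths are easy to get wrong in a Python string literal, so it may be safer to build the path with pathlib and pass the result in. A small sketch: windows_path is a hypothetical helper, the folder is the one from the line above, and path= is the PersistentClient keyword as far as I know.

```python
from pathlib import PureWindowsPath


def windows_path(*parts):
    """Join path parts into a Windows-style path string with backslashes."""
    return str(PureWindowsPath(*parts))


db_path = windows_path("C:/Users/me", "python_files", "python-deep-learning-master")

# A raw string is another way to avoid doubling every backslash.
same_path = r"C:\Users\me\python_files\python-deep-learning-master"

# Usage (requires chromadb, so left as a comment):
# client = chromadb.PersistentClient(path=db_path)
```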

0

Your question was posted 18 months ago and I just ran into the same trouble today. Maybe you have already solved it, but I am still writing my solution down here.

When you create a Chroma database with something like this:

persist_folder = "D:\\collection"
vector_db2 = Chroma.from_documents(
    documents=split_docs,
    embedding=embeddings,
    persist_directory=persist_folder,
    collection_name="latest_knowledge",
)

And you load it with something like this:

vector_db = Chroma(persist_directory="D:\\collection", 
    embedding_function=embeddings,
    collection_name="latest_knowledge")

the load fails.

You have to use:

vector_db = Chroma(persist_directory=persist_folder , 
    embedding_function=embeddings,
    collection_name="latest_knowledge")

See? You have to reference the persist folder in exactly the same way (either through a string variable or as a hardcoded string) when you create the DB and when you load it.

I didn't read the langchain source code, but I guess the trouble comes from a bug in how it handles the "persist_directory" parameter.

Hope this may help you.
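
One way to make sure the folder is referenced identically in both places is to normalise it once and reuse the result. A minimal sketch, assuming the difference comes from something like a relative path or a changed working directory (I have not confirmed that this is the cause); normalized_persist_dir is a hypothetical helper.

```python
from pathlib import Path


def normalized_persist_dir(path):
    """Return an absolute, normalised form of the persist directory."""
    return str(Path(path).expanduser().resolve())


# Compute the folder once and pass the same value when creating and loading.
persist_folder = normalized_persist_dir("collection")

# Usage (requires langchain and chromadb, so left as comments):
# vector_db2 = Chroma.from_documents(documents=split_docs, embedding=embeddings,
#     persist_directory=persist_folder, collection_name="latest_knowledge")
# vector_db = Chroma(persist_directory=persist_folder,
#     embedding_function=embeddings, collection_name="latest_knowledge")
```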

1 Comment

I don't have enough reputation points to upvote but this saved me a ton of trouble! I would never guess there could be such a bug! Thank you very much for sharing this knowledge!
