VectorStore/Q&A, learn more¶
NOTE: this uses Cassandra's experimental "Vector Search" capability. Make sure you are connecting to a vector-enabled database for this demo.
In the previous Quickstart, you created the index and added the corpus of text to it in a single step.
In most cases, these two operations happen at different times; moreover, new documents often keep being ingested.
This notebook demonstrates further interactions you can have with a Cassandra Vector Store.
It is assumed you have already run the Quickstart notebook (so that the vector store is not empty).
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
The setup is similar to the one you saw:
from langchain.vectorstores.cassandra import Cassandra
from cqlsession import getCQLSession, getCQLKeyspace
cqlMode = 'astra_db' # 'astra_db'/'local'
session = getCQLSession(mode=cqlMode)
keyspace = getCQLKeyspace(mode=cqlMode)
Below is the logic to instantiate the LLM and embeddings of choice. We choose to leave it in the notebooks for clarity.
from llm_choice import suggestLLMProvider
llmProvider = suggestLLMProvider()
# (Alternatively set llmProvider to 'VertexAI', 'OpenAI' ... manually if you have credentials)
if llmProvider == 'VertexAI':
    from langchain.llms import VertexAI
    from langchain.embeddings import VertexAIEmbeddings
    llm = VertexAI()
    myEmbedding = VertexAIEmbeddings()
    print('LLM+embeddings from VertexAI')
elif llmProvider == 'OpenAI':
    from langchain.llms import OpenAI
    from langchain.embeddings import OpenAIEmbeddings
    llm = OpenAI(temperature=0)
    myEmbedding = OpenAIEmbeddings()
    print('LLM+embeddings from OpenAI')
else:
    raise ValueError('Unknown LLM provider.')
LLM+embeddings from OpenAI
Note: for the time being, you have to explicitly turn on this experimental flag on the cassio side:
import cassio
cassio.globals.enableExperimentalVectorSearch()
Re-use an existing Vector Store¶
Creating this Cassandra vector store re-connects it to the data already on the DB.
In practice, you are loading an existing, pre-populated vector store for further usage.
(Make sure you use the very same embedding function every time! In fact, this is why there is a separate table for each embedding function, i.e. for each llmProvider.)
myCassandraVStore = Cassandra(
embedding=myEmbedding,
session=session,
keyspace=keyspace,
table_name='vs_test1_' + llmProvider,
)
You can then re-create the index from the vector store with:
index = VectorStoreIndexWrapper(vectorstore=myCassandraVStore)
and use it as you saw in the quickstart:
query = "Who is Luchesi?"
index.query(query, llm=llm)
" Luchesi is a friend of Fortunato's who has a critical turn and is known for his taste in wine."
Further usage of the vector store¶
These are some of the ways you can query the store:
myCassandraVStore.similarity_search_with_score(
"Does anyone have a coughing fit?",
k=1,
)
[(Document(page_content='"Nitre," I replied. "How long have you had that cough?"\n\n"Ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh! ugh! ugh!--ugh!\nugh! ugh!"\n\nMy poor friend found it impossible to reply for many minutes.\n\n"It is nothing," he said, at last.', metadata={'source': 'texts/amontillado.txt'}), 0.9052027836057179)]
Adding new documents¶
Start with a very off-topic question, to demonstrate that no relevant documents are found (yet).
Note: depending on the embedding function, you might still see some results at this stage, even though they are off-topic in practice. In a full end-to-end Q&A session, however, the LLM would likely discard them and presumably end up saying, "I don't know".
SPIDER_QUESTION = 'Compare Agelenidae and Lycosidae'
myCassandraVStore.similarity_search_with_relevance_scores(
SPIDER_QUESTION,
k=1,
score_threshold=0.8,
)
[(Document(page_content='"A huge human foot d\'or, in a field azure; the foot crushes a serpent\nrampant whose fangs are imbedded in the heel."\n\n"And the motto?"\n\n"_Nemo me impune lacessit_."\n\n"Good!" he said.', metadata={'source': 'texts/amontillado.txt'}), 0.8635578679283822)]
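The effect of k and score_threshold can be illustrated with a toy, in-memory search. This is only a sketch: it assumes plain cosine similarity as the score, which is not necessarily how the Cassandra store computes its relevance scores, and the names cosine_similarity, search_with_threshold, toy_store and toy_query are hypothetical, made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search_with_threshold(query_vec, store, k, score_threshold):
    """Return up to k (text, score) pairs scoring >= score_threshold, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    kept = [pair for pair in scored if pair[1] >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hand-made two-dimensional "embeddings" (assumption: real ones have many more dimensions)
toy_store = [
    ("about spiders", [1.0, 0.0]),
    ("about wine", [0.0, 1.0]),
    ("spiders and wine", [1.0, 1.0]),
]
toy_query = [1.0, 0.1]

toy_hits = search_with_threshold(toy_query, toy_store, k=2, score_threshold=0.7)
for text, score in toy_hits:
    print(f"[{score:.3f}] {text}")
```

Raising the threshold shrinks the result list, possibly down to nothing, regardless of k; lowering it lets weakly related entries through, which is what happens with the off-topic result above.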
You can add a couple of relevant paragraphs to the index, using the add_texts primitive:
spiderFacts = [
"""
The Agelenidae are a large family of spiders in the suborder Araneomorphae.
The body length of the smallest Agelenidae spiders are about 4 mm (0.16 in), excluding the legs,
while the larger species grow to 20 mm (0.79 in) long. Some exceptionally large species,
such as Eratigena atrica, may reach 5 to 10 cm (2.0 to 3.9 in) in total leg span.
Agelenids have eight eyes in two horizontal rows of four. Their cephalothoraces narrow
somewhat towards the front where the eyes are. Their abdomens are more or less oval, usually
patterned with two rows of lines and spots. Some species have longitudinal lines on the dorsal
surface of the cephalothorax, whereas other species do not; for example, the hobo spider does not,
which assists in informally distinguishing it from similar-looking species.
""",
"""
Jumping spiders are a group of spiders that constitute the family Salticidae.
As of 2019, this family contained over 600 described genera and over 6,000 described species,
making it the largest family of spiders at 13% of all species.
Jumping spiders have some of the best vision among arthropods and use it
in courtship, hunting, and navigation.
Although they normally move unobtrusively and fairly slowly,
most species are capable of very agile jumps, notably when hunting,
but sometimes in response to sudden threats or crossing long gaps.
Both their book lungs and tracheal system are well-developed,
and they use both systems (bimodal breathing).
Jumping spiders are generally recognized by their eye pattern.
All jumping spiders have four pairs of eyes, with the anterior median pair
being particularly large.
""",
]
spiderMetadatas = [
{'source': 'wikipedia/Agelenidae'},
{'source': 'wikipedia/Salticidae'},
]
myCassandraVStore.add_texts(
spiderFacts,
spiderMetadatas,
)
['c35b450d84e94cef37de6a934da51860', '03dcc418d50ee4c61bebaa92f6ee8005']
Another way is to add a text through LangChain's Document abstraction.
Note that one of LangChain's splitters can turn long input documents into (possibly overlapping) digestible chunks without much boilerplate:
mySplitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=120)
lycoText = """
Wolf spiders are members of the family Lycosidae.
They are robust and agile hunters with excellent eyesight.
They live mostly in solitude, hunt alone, and usually do not spin webs.
Some are opportunistic hunters, pouncing upon prey as they
find it or chasing it over short distances;
others wait for passing prey in or near the mouth of a burrow.
Wolf spiders resemble nursery web spiders (family Pisauridae),
but wolf spiders carry their egg sacs by attaching them to their spinnerets,
while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.
Two of the wolf spider's eight eyes are large and prominent;
this distinguishes them from nursery web spiders,
whose eyes are all of roughly equal size.
This can also help distinguish them from the similar-looking grass spiders.
"""
lycoDocument = Document(
page_content=lycoText,
metadata={'source': 'wikipedia/Lycosidae'}
)
Use the splitter to "shred" the input document:
lycoDocs = mySplitter.transform_documents([lycoDocument])
lycoDocs
[Document(page_content='Wolf spiders are members of the family Lycosidae.\nThey are robust and agile hunters with excellent eyesight.\nThey live mostly in solitude, hunt alone, and usually do not spin webs.\nSome are opportunistic hunters, pouncing upon prey as they', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='Some are opportunistic hunters, pouncing upon prey as they\nfind it or chasing it over short distances;\nothers wait for passing prey in or near the mouth of a burrow.', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='Wolf spiders resemble nursery web spiders (family Pisauridae),\nbut wolf spiders carry their egg sacs by attaching them to their spinnerets,\nwhile the Pisauridae carry their egg sacs with their chelicerae and pedipalps.', metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content="while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.\nTwo of the wolf spider's eight eyes are large and prominent;\nthis distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.", metadata={'source': 'wikipedia/Lycosidae'}), Document(page_content='this distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.\nThis can also help distinguish them from the similar-looking grass spiders.', metadata={'source': 'wikipedia/Lycosidae'})]
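To get a feeling for what chunk_size and chunk_overlap mean, here is a deliberately simplified fixed-width chunker. It is not the algorithm used by RecursiveCharacterTextSplitter (which, as the output above suggests, prefers to break at natural separators such as newlines, so its chunks vary in length); the function name chunk_with_overlap is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

toy_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(toy_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means that a sentence falling across a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.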
These are ready to be added to the index:
myCassandraVStore.add_documents(lycoDocs)
['078fb9e67d9ed9415d9ef7d1779f7e5d', '32acf980292dac94d9e0cdab6a1f05b5', 'c0a279086e100f559b2fc59213312076', '4a6305de53b6adec0f5e164c1f3856a0', '902d340a12bfbb5756c15296d4a7bb49']
Querying the store again¶
Time to repeat the question:
myCassandraVStore.similarity_search_with_relevance_scores(
SPIDER_QUESTION,
k=3,
score_threshold=0.8,
)
[(Document(page_content='\n The Agelenidae are a large family of spiders in the suborder Araneomorphae.\n The body length of the smallest Agelenidae spiders are about 4 mm (0.16 in), excluding the legs,\n while the larger species grow to 20 mm (0.79 in) long. Some exceptionally large species,\n such as Eratigena atrica, may reach 5 to 10 cm (2.0 to 3.9 in) in total leg span.\n Agelenids have eight eyes in two horizontal rows of four. Their cephalothoraces narrow\n somewhat towards the front where the eyes are. Their abdomens are more or less oval, usually\n patterned with two rows of lines and spots. Some species have longitudinal lines on the dorsal\n surface of the cephalothorax, whereas other species do not; for example, the hobo spider does not,\n which assists in informally distinguishing it from similar-looking species.\n ', metadata={'source': 'wikipedia/Agelenidae'}), 0.9030032731922986), (Document(page_content="while the Pisauridae carry their egg sacs with their chelicerae and pedipalps.\nTwo of the wolf spider's eight eyes are large and prominent;\nthis distinguishes them from nursery web spiders,\nwhose eyes are all of roughly equal size.", metadata={'source': 'wikipedia/Lycosidae'}), 0.9006974734018747), (Document(page_content='Wolf spiders resemble nursery web spiders (family Pisauridae),\nbut wolf spiders carry their egg sacs by attaching them to their spinnerets,\nwhile the Pisauridae carry their egg sacs with their chelicerae and pedipalps.', metadata={'source': 'wikipedia/Lycosidae'}), 0.8930485206731311)]
Item removal and expiration¶
Time-To-Live (TTL)¶
If you provide a TTL value when creating the store, every entry will expire that many seconds after its insertion:
myCassandraVStoreWithTTL = Cassandra(
embedding=myEmbedding,
session=session,
keyspace=keyspace,
table_name='vs_test1_' + llmProvider,
ttl_seconds=10,
)
The following two documents will be available for ten seconds.
myCassandraVStoreWithTTL.add_documents(lycoDocs[0:2])
['078fb9e67d9ed9415d9ef7d1779f7e5d', '32acf980292dac94d9e0cdab6a1f05b5']
Alternatively, for finer control over the time-to-live, you can specify it at insertion time; this value takes precedence over the store-level setting. So, these documents will survive for twenty seconds:
myCassandraVStore.add_documents(lycoDocs[2:], ttl_seconds=20)
['c0a279086e100f559b2fc59213312076', '4a6305de53b6adec0f5e164c1f3856a0', '902d340a12bfbb5756c15296d4a7bb49']
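The precedence between a store-level TTL and a per-insertion TTL can be mimicked with a small in-memory stand-in. This is only a sketch of the semantics described above, not of how Cassandra or the LangChain store implement expiration; ToyTTLStore, its methods and the fake clock are all hypothetical names introduced for illustration.

```python
import time

class ToyTTLStore:
    """In-memory stand-in: each entry may carry an expiration time."""

    def __init__(self, ttl_seconds=None, clock=time.monotonic):
        self.default_ttl = ttl_seconds  # store-level TTL, may be None
        self.clock = clock              # injectable, so tests need not sleep
        self._rows = {}                 # doc_id -> (text, expires_at or None)

    def add(self, doc_id, text, ttl_seconds=None):
        # A TTL given at insertion time wins over the store-level one
        ttl = ttl_seconds if ttl_seconds is not None else self.default_ttl
        expires_at = None if ttl is None else self.clock() + ttl
        self._rows[doc_id] = (text, expires_at)

    def live_ids(self):
        """Ids of the entries that have not expired yet, sorted."""
        now = self.clock()
        return sorted(
            doc_id
            for doc_id, (_, expires_at) in self._rows.items()
            if expires_at is None or now < expires_at
        )

# A fake clock that is advanced by hand instead of waiting
fake_now = [0.0]
toy = ToyTTLStore(ttl_seconds=10, clock=lambda: fake_now[0])
toy.add("a", "uses the store-level TTL of 10 seconds")
toy.add("b", "uses its own TTL of 20 seconds", ttl_seconds=20)

fake_now[0] = 15.0
print(toy.live_ids())  # ['b']
```

After fifteen (simulated) seconds only the entry inserted with its own twenty-second TTL is still visible.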
Manual removal of entries¶
You can delete individual documents from the store.
However, you first need to retrieve their identifier with a similarity search. The following method returns a list of matching 3-tuples, whose last item is the id of the document:
spiderDocIds = []
for doc, score, docId in myCassandraVStore.similarity_search_with_score_id('Compare Agelenidae and Lycosidae'):
    print(f' * [{score:.3f}] "{doc.page_content[:32].strip()}..." ({docId})')
    spiderDocIds.append(docId)
 * [0.903] "The Agelenidae are a large..." (c35b450d84e94cef37de6a934da51860)
 * [0.901] "while the Pisauridae carry their..." (4a6305de53b6adec0f5e164c1f3856a0)
 * [0.893] "Wolf spiders resemble nursery we..." (c0a279086e100f559b2fc59213312076)
At this point you can perform the actual document deletion:
for spiderDocId in spiderDocIds:
    myCassandraVStore.delete_by_document_id(spiderDocId)
The last method to remove entries from the store is demonstrated next.
Cleanup¶
You're done.
To leave the index empty for the next demo run, you may want to clear it (i.e. empty the table on DB).
Just don't take this operation lightly in production!
myCassandraVStore.clear()