Talk to Your Data Using GPT-3.5
06 Oct 2023 · Demo: https://qmaruf-talk-to-data.hf.space
Have you ever contemplated the possibility of engaging in a conversation with your data? Imagine conversing with a chatbot that possesses comprehensive knowledge about your dataset. This intriguing problem is the focus of this note, where we explore how to achieve it using the ChatGPT API.
To illustrate this concept, we will employ the first book of the Harry Potter series, “Harry Potter and the Philosopher’s Stone,” as our dataset. Our goal is to engage in a conversation with the content of the book. To facilitate this, we will utilize LangChain, a powerful tool for parsing and interacting with data, in conjunction with the ChatGPT API.
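Before we start, we need the required packages and an OpenAI API key. Here is a minimal setup sketch; the exact package list is an assumption based on the imports used below:
# Assumed dependencies: pip install langchain openai chromadb tiktoken
import os

# Placeholder value; use your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = 'sk-...'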
Our first step is to load the relevant data from the book. We will use the TextLoader from LangChain to achieve this:
from langchain.document_loaders import TextLoader
book_txt = 'docs/potter1.txt'
loader = TextLoader(book_txt)
docs = loader.load()
Next, we need to break down the text into manageable chunks. This allows us to work with smaller sections of the book at a time. We define the chunk_size and chunk_overlap parameters for this purpose:
chunk_size = 1000
chunk_overlap = 250
In the next step, we will create a text splitter based on RecursiveCharacterTextSplitter. This text splitter is ideal for generic content: it takes a list of separator characters and applies them in order to divide the text into appropriately sized chunks, keeping paragraphs, sentences, and words together for maximum semantic coherence.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len
)
splits = text_splitter.split_documents(docs)
Here length_function measures the length of a given chunk. We will use the built-in len as our length_function for this example.
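To see how the splitter falls back through its separators, here is a small sketch on a made-up string (the sample text and sizes are illustrative):
sample = "First paragraph.\n\nSecond paragraph, which runs a little longer than the first."

demo_splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,
    chunk_overlap=0,
    length_function=len
)

# The splitter tries "\n\n" first, then "\n", then " ", then ""
# until each piece fits within chunk_size
print(demo_splitter.split_text(sample))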
The split_documents method returns a list of Document objects. For example, here is the first Document object from the list, printed with print(splits[0]):
Document(page_content="Harry Potter and the Sorcerer's Stone\n\n\nCHAPTER ONE\n\nTHE BOY WHO LIVED\n\nMr. and Mrs. Dursley, of number four, Privet Drive, were proud to say\nthat they were perfectly normal, thank you very much. They were the last\npeople you'd expect to be involved in anything strange or mysterious,\nbecause they just didn't hold with such nonsense.\n\nMr. Dursley was the director of a firm called Grunnings, which made\ndrills. He was a big, beefy man with hardly any neck, although he did\nhave a very large mustache. Mrs. Dursley was thin and blonde and had\nnearly twice the usual amount of neck, which came in very useful as she\nspent so much of her time craning over garden fences, spying on the\nneighbors. The Dursleys had a small son called Dudley and in their\nopinion there was no finer boy anywhere.", metadata={'source': 'docs/potter1.txt'})
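As a quick sanity check, we can count the chunks and look at the longest one:
# Number of chunks produced from the book
print(len(splits))

# Length of the longest chunk; it should not exceed chunk_size
print(max(len(s.page_content) for s in splits))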
In this step, we will create a vector database to store the embeddings of the chunks. For any query, we search the vector database and extract the chunks most similar to the query. We will use the Chroma vector database for this example.
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
persist_directory = 'docs/chroma'
vectordb = Chroma.from_documents(
documents=splits,
embedding=embedding,
persist_directory=persist_directory
)
Here embedding is the OpenAI embedding function, and persist_directory is the directory where the embeddings are stored on disk.
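Because the embeddings are persisted, we can reload the store in a later session instead of re-embedding the whole book. A sketch, assuming the same langchain version's Chroma API:
# Reload the persisted vector store without recomputing embeddings
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)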
We can now search the vector database using the vectordb.max_marginal_relevance_search function. Maximal marginal relevance (MMR) optimizes for similarity to the query as well as diversity among the selected documents. The function takes the query string and returns the k most similar chunks; we will use k=10 for this example.
query = "Write the names of all of Harry Potter's teachers."
answers = vectordb.max_marginal_relevance_search(query, k=10)
Here answers contains the k chunks most similar to the query.
We can check all the chunks using a for loop.
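A minimal sketch of that loop:
for i, answer in enumerate(answers):
    print(f'Chunk {i}:')
    print(answer.page_content)
Here is the first answer from the list: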
Professor Flitwick, the Charms teacher, was a tiny little wizard who had to stand on a pile of books to see over his desk. At the start of their first class, he took the roll call, and when he reached Harry's name, he gave an excited squeak and toppled out of sight. Professor McGonagall was again different. Harry had been quite right to think she wasn't a teacher to cross. Strict and clever, she gave them a talking-to the moment they sat down in her first class.
Up to this point, we have created the vector database and searched it with a query for relevant documents. Now we will use the ChatGPT API to chat with the content of the book, with answers serving as the chat context.
Now, using langchain, we will create a ChatOpenAI model to interact with the ChatGPT API.
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)
Here model_name is the name of the model; we will use gpt-3.5-turbo for this example. temperature controls the randomness of the chatbot's responses; setting it to 0 makes the output as deterministic as possible.
We will also define a retriever to extract the chunks most similar to the query.
retriever = vectordb.as_retriever(search_type="mmr", search_kwargs={'k': 10, 'fetch_k': 50})
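In search_kwargs, fetch_k documents are fetched first, and MMR then selects the k most relevant yet diverse chunks among them. We can test the retriever on its own before wiring it into a chain:
# Sanity-check the retriever directly
relevant_docs = retriever.get_relevant_documents(query)
print(len(relevant_docs))  # should equal k, i.e. 10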
Now we will create a RetrievalQA chain using the llm and retriever.
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever
)
Here qa_chain combines a retriever and a language model: it retrieves the documents relevant to a query and answers the question based on those documents.
Let’s check how qa_chain performs for a query. We will use the same query that we used earlier.
response = qa_chain({"query": query})
print(response['result'])
Professor Flitwick, Professor McGonagall, Professor Sprout, Professor Binns, Professor Snape, Madam Hooch
We can use a custom prompt to tell qa_chain how we want the answer. Here it is:
from langchain.prompts import PromptTemplate
template = """Use only the following context to answer the question at the end. Always say "thanks for asking!" at the end of the answer. Write a summary of the question and then give the answer.
Context: {context}
Question: {question}
Answer:"""
qa_chain_prompt = PromptTemplate.from_template(template)
We will now fit the template into the qa_chain and check the result.
qa_chain = RetrievalQA.from_chain_type(
llm,
retriever=retriever,
chain_type_kwargs={"prompt": qa_chain_prompt}
)
response = qa_chain({"query": query})
print(response['result'])
The names of all of Harry Potter's teachers are Professor Flitwick, Professor McGonagall, Professor Binns, Professor Snape, Madam Hooch, and Hagrid. Thanks for asking!
qa_chain is able to understand the context of the query and give a reasonable answer.
Until now, qa_chain has had no memory, which means we can’t ask a question that builds on a previous answer. We will use ConversationBufferMemory to create a new type of chain that can remember the previous conversation. Let’s define memory as an instance of ConversationBufferMemory and use it to create a new chain, a ConversationalRetrievalChain.
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
memory_key='chat_history',
return_messages=True
)
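To get a feel for what the buffer stores, here is a small sketch using the memory on its own (the sample exchange is made up, and a throwaway instance is used so we don't pollute the chain's actual history):
# Record one exchange manually and inspect the stored history
demo_memory = ConversationBufferMemory(
    memory_key='chat_history',
    return_messages=True
)
demo_memory.save_context(
    {'input': 'Who wrote Harry Potter?'},
    {'output': 'J. K. Rowling.'}
)
print(demo_memory.load_memory_variables({})['chat_history'])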
Here’s how to create a ConversationalRetrievalChain using memory, vectordb, and llm.
from langchain.chains import ConversationalRetrievalChain

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectordb.as_retriever(),
    memory=memory
)
We will ask qa_chain three related questions and check the result.
q1 = "Write the names of all of Harry Potter's teachers."
q2 = "sort the name of the teacher based on how frequently they are mentioned"
q3 = "tell me more about this professor"
for q in [q1, q2, q3]:
    response = qa_chain({'question': q})
    print(f'Q: {q}')
    print(f'A: {response["answer"]}')
    print('\n')
Q: Write the names of all of Harry Potter's teachers.
A: The names of Harry Potter's teachers mentioned in the given context are:
1. Professor Flitwick (Charms teacher)
2. Professor McGonagall (unknown subject)
3. Professor Sprout (Herbology teacher)
4. Professor Binns (History of Magic teacher)
5. Professor Snape (Potions teacher)
6. Madam Hooch (unknown subject)
7. Professor Quirrell (unknown subject)
Please note that there may be other teachers at Hogwarts that are not mentioned in
this context.
Q: sort the name of the teacher based on how frequently they are mentioned
A: Professor McGonagall is mentioned most frequently in the given context.
Q: tell me more about this professor
A: Professor McGonagall is described as strict and clever. She is a teacher at
Hogwarts School of Witchcraft and Wizardry and teaches Transfiguration, which
is described as complex and dangerous magic. She gives the students a talking-to
in her first class, emphasizing the importance of taking her class seriously.
She is also shown to be observant and recognizes Harry's talent as a Seeker in
Quidditch, recommending him to the Gryffindor team captain. Additionally, she
is a member of the staff and is seen interacting with other professors, such
as Professor Flitwick.
Well, it looks like qa_chain is able to remember the previous conversation and answer questions based on it.
This is the end of this note. We have seen how to use langchain to create a chatbot that can chat with the content of a book, and how to give that chatbot memory of the previous conversation.
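The demo linked at the top wraps a chain like this in a simple web UI. For reference, here is a minimal sketch of how qa_chain could be exposed with Gradio; this is an assumption about the wiring, not the actual demo code:
import gradio as gr

def chat(message, history):
    # Conversation history is managed by ConversationBufferMemory
    # inside the chain, so we only forward the new question
    response = qa_chain({'question': message})
    return response['answer']

gr.ChatInterface(chat).launch()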