### Multi-criteria Optimization
Consider factors beyond similarity when ranking results, such as document quality, recency, and authoritativeness.
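A minimal sketch of such a re-ranker, assuming each candidate already carries hypothetical `similarity`, `quality`, and `recency` scores normalized to [0, 1] (the weights are illustrative, not tuned):
```python
# Hypothetical multi-criteria re-ranking: blend similarity with other signals.
def rerank(candidates, w_sim=0.7, w_quality=0.2, w_recency=0.1):
    # Each candidate is a dict with scores already normalized to [0, 1].
    combined = lambda d: (w_sim * d["similarity"]
                          + w_quality * d["quality"]
                          + w_recency * d["recency"])
    return sorted(candidates, key=combined, reverse=True)
```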
### User Feedback
Incorporate user feedback into the retrieval process. For example, if a user clicks on a document, it can be re-ranked higher in future searches.
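One simple (hypothetical) way to fold clicks into ranking is an additive boost on top of the similarity score:
```python
# Hypothetical click log collected from past user sessions.
click_counts = {"doc_17": 5, "doc_42": 1}

def feedback_score(doc_id, similarity, boost=0.02):
    # Add a small bonus per recorded click on top of the similarity score.
    return similarity + boost * click_counts.get(doc_id, 0)

print(feedback_score("doc_17", similarity=0.80))  # 0.80 + 5 * 0.02 = 0.90
```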
### Diversification
Diversify the search results by ensuring that the retrieved documents cover a wide range of topics.
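One common way to do this is Maximal Marginal Relevance (MMR), which greedily picks documents that are similar to the query but dissimilar to documents already selected. A minimal sketch, assuming unit-normalized embedding vectors:
```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lam=0.7):
    # Greedy MMR: trade off similarity to the query (lam) against
    # redundancy with already-selected documents (1 - lam).
    # doc_vecs: (n, d) array; query_vec: (d,) array; both unit-normalized.
    sim_to_query = doc_vecs @ query_vec
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((doc_vecs[i] @ doc_vecs[j] for j in selected), default=0.0)
            return lam * sim_to_query[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of a relevant *and* diverse top-k
```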
### Query Expansion & Rephrasing
Expand or rephrase the query to include related terms. For example, if a user asks about "academic integrity", the system could add terms like "plagiarism" and "cheating", helping retrieve more relevant documents.
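A minimal sketch of term-based expansion using a hand-written synonym table (hypothetical; real systems often derive related terms from a thesaurus, embeddings, or an LLM):
```python
# Hypothetical synonym table mapping queries to related terms.
EXPANSIONS = {"academic integrity": ["plagiarism", "cheating"]}

def expand_query(query):
    terms = [query] + EXPANSIONS.get(query.lower(), [])
    return " OR ".join(terms)

print(expand_query("academic integrity"))
# academic integrity OR plagiarism OR cheating
```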
## Generate
Once we have retrieved the relevant chunks for a query, we can generate text using a large language model. Large language models can be used for many tasks, including text classification, text summarization, question answering, multi-modal tasks, and more.
There are many large language models available through platforms such as:
- [OpenAI GPT-4o](https://platform.openai.com/)
- [Google Gemini](https://ai.google.dev/gemini-api/docs)
- [Anthropic Claude](https://claude.ai/)
- [HuggingFace (Many)](https://huggingface.co/)
- ...
Raschka, 2023
## Generate | Transformers
Large language models (LLMs) are powerful tools for text generation because they produce coherent, contextually relevant text. Most are built on the Transformer architecture, which comes in three main variants:
**Encoder-Decoder Models**: T5, BART.
Encoder-decoder models encode the input text into a sequence of contextual representations and then decode those representations into output text. Used in machine translation and text summarization.
**Encoder-Only**: BERT
Encoder-only models encode the input text into contextual representations (often pooled into a single vector). These models are powerful for text classification tasks but are not typically used for text generation.
**Decoder-Only**: GPT-4, GPT-3, Gemini
Decoder-only models are autoregressive: they generate text one token at a time, conditioning each token on those that came before. Used in text generation, language modeling, and summarization. (A minimal generation sketch follows.)
Vaswani, 2017
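As a quick illustration of autoregressive (decoder-only) generation, here is a minimal sketch using the Hugging Face `transformers` pipeline with GPT-2, a small open decoder-only model; the prompt and parameters are illustrative:
```python
from transformers import pipeline

# A decoder-only model generates one token at a time, conditioning each
# new token on the prompt plus everything generated so far.
generator = pipeline("text-generation", model="gpt2")
output = generator("Retrieval-augmented generation combines", max_new_tokens=30)
print(output[0]["generated_text"])
```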
## Transformers | Self-Attention
Self-attention is a mechanism that allows the model to weigh the importance of different words in a sequence when generating text. It computes attention scores for each word based on its relationship with other words in the sequence. It does this by computing three vectors for each word: the query (q), key (k), and value (v) vectors.
**Query (q)**: Represents the word for which we are computing attention.
**Key (k)**: Represents the words we are attending to.
**Value (v)**: Represents the information we want to extract from the attended words.
Jung, 2021
## Transformers | Self-Attention
The attention score is computed as the dot product of the query and key vectors, followed by a softmax operation to normalize the scores. Value vectors are then weighted by these scores to produce the final output.
$$ \text{Attention}(q, k, v) = \text{softmax}\left(\frac{q \cdot k^T}{\sqrt{d_k}}\right) v $$
where:
- $q$ is the query vector
- $k$ is the key vector
- $v$ is the value vector
- $d_k$ is the dimension of the key vector
Foster, 2024
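The formula maps almost line-for-line onto code. A minimal NumPy sketch (shapes and values are illustrative), stacking the per-token q, k, and v vectors into matrices:
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of query, key, and value vectors.
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot products
    return softmax(scores) @ V               # weight the values by attention

x = np.random.rand(4, 8)          # 4 tokens, d_k = 8
print(attention(x, x, x).shape)   # self-attention: (4, 8)
```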
## Transformers | Multi-Head Attention
Multi-head attention is an extension of the self-attention mechanism that allows the model to focus on different parts of the input sequence simultaneously. It does this by using multiple sets of query, key, and value vectors, each with its own learned parameters.
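A minimal sketch of the idea, omitting the learned projection matrices (W_Q, W_K, W_V and the output projection) that a full implementation applies; it simply splits the embedding dimension into heads, attends per head, and concatenates:
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(Q, K, V, num_heads=2):
    # Each head attends over its own slice of the embedding dimension,
    # letting different heads focus on different relationships.
    heads = [attention(q, k, v)
             for q, k, v in zip(np.split(Q, num_heads, axis=-1),
                                np.split(K, num_heads, axis=-1),
                                np.split(V, num_heads, axis=-1))]
    return np.concatenate(heads, axis=-1)

x = np.random.rand(4, 8)                     # 4 tokens, model dim 8
print(multi_head_attention(x, x, x).shape)   # (4, 8)
```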
## Transformers | Architecture
Vaswani, 2017
## Generate | GPT-4 & OpenAI API
What really sets OpenAI apart is its useful, cost-effective API, which puts its LLMs in users' hands with minimal effort.
```python
import os
import openai

# Authenticate with an API key stored in an environment variable.
openai_client = openai.Client(api_key=os.environ['API_KEY'])

# Send a chat request to the model.
response = openai_client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi, GPT-4!"}
    ]
)

# The generated reply is in the first choice's message content.
print(response.choices[0].message.content)
```
## Putting it All Together
Now that we have discussed the components of Retrieval-Augmented Generation (RAG), let's use what we have learned to build an expert chatbot that can answer questions about Northwestern's policy on academic integrity.
NVIDIA, 2023
## Putting it All Together | Demo
```python[1-9 | 11-12 | 14-15 | 17-18 | 20-22 | 24-25 | 27-38 | 40-42 | 44-46 | 48]
import os
import chromadb
import openai
from chromadb.utils import embedding_functions
from nltk import sent_tokenize

# Initialize clients.
chroma_client = chromadb.Client()
openai_client = openai.Client(api_key=os.environ['API_KEY'])

# Create a new collection.
collection = chroma_client.get_or_create_collection('academic_integrity_nw')

# Load Academic Integrity document.
doc = open('/Users/joshua/Desktop/academic_integrity.md').read()

# Chunk the document into sentences.
chunked_data = sent_tokenize(doc)

# Embed the chunks.
embedding_function = embedding_functions.OpenAIEmbeddingFunction(model_name="text-embedding-ada-002", api_key=os.environ['API_KEY'])
embeddings = embedding_function(chunked_data)

# Store embeddings in ChromaDB.
collection.add(embeddings=embeddings, documents=chunked_data, ids=[f"id.{i}" for i in range(len(chunked_data))])

# Create a system prompt template.
SYSTEM_PROMPT = """
You are an expert in academic integrity at Northwestern University. You will provide a response
to a student query using exact language from the provided relevant chunks of text.

RELEVANT CHUNKS:
{relevant_chunks}
"""

# Get user query.
user_message = "Can a student appeal?"
print("User: " + user_message)

# Get relevant documents from chromadb.
relevant_chunks = collection.query(query_embeddings=embedding_function([user_message]), n_results=2)['documents'][0]
print("Retrieved Chunks: " + str(relevant_chunks))

# Send query and relevant documents to GPT-4.
system_prompt = SYSTEM_PROMPT.format(relevant_chunks="\n".join(relevant_chunks))
response = openai_client.chat.completions.create(model="gpt-4", messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_message}])

print("RAG-GPT Response: " + response.choices[0].message.content)
```
```text
User: Can a student appeal?
Retrieved Chunks: ['A student may appeal any finding or sanction as specified by the school holding jurisdiction.', '6. Review of any adverse initial determination, if requested, by an appeals committee to whom the student has access in person.']
RAG-GPT Response: Yes, a student may appeal any finding or sanction as specified by the school holding jurisdiction.
```
## Summary
Today we discussed Retrieval-Augmented Generation (RAG), a modern NLP approach that combines the strengths of information retrieval systems with large language models. Building a RAG system exposed us to many critical concepts in NLP, including:
1. **Tokenization**: Breaking text into smaller units called tokens.
2. **Chunking**: Creating windows of text that can be indexed and searched.
3. **Embedding**: Representing text as dense vectors in a continuous vector space.
4. **Storage & Retrieval**: Storing embeddings in a vector database and retrieving relevant documents based on their similarity to a query.
5. **Generation**: Generating text using a large language model.
# Exit Poll
## On a scale of 1-5, how confident are you with **text** methods such as:
- Regular Expressions
- Tokenization
- Bag of Words (BoW)
- Term Frequency-Inverse Document Frequency (TF-IDF)
- Markov Chains
- Evaluation of Text Generation Models
- Retrieval-Augmented Generation (RAG)