Lance Martin

TL;DR

I built an app for question-answering over the full history of Lex Fridman podcasts. It uses Whisper for audio-to-text, Langchain for dataset processing and embedding, and Pinecone to store the embeddings. Langchain vectorDB search finds the podcast clips relevant to a user question, and the UI borrows elements from Mckay Wrigley’s work. Code is here.

Workflow

I used Karpathy’s Whisper transcriptions (large model) of 325 Lex Fridman podcasts and generated the remainder (up to episode 365) myself with the medium model. I used Langchain to embed them and stored the embeddings in a Pinecone vectorDB. I built a UI inspired by Wait-But-Why GPT, which uses Langchain to wrap API calls for question-answering using ChatGPT.
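
For reference, the ingestion and query path looks roughly like the minimal sketch below, assuming OpenAI embeddings and the Langchain Pinecone wrapper; the file name, chunk size, and the RetrievalQA chain are illustrative assumptions rather than the exact code from the repo.

import whisper
import pinecone
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Transcribe an episode with Whisper (the medium model was used for episodes
# not covered by Karpathy's large-model transcripts); file name is hypothetical
model = whisper.load_model("medium")
transcript = model.transcribe("lex_episode_365.mp3")["text"]

# Split the transcript and embed the chunks into a Pinecone index
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)  # assumed sizes
chunks = splitter.split_text(transcript)
pinecone.init(api_key="...", environment="...")
vectorstore = Pinecone.from_texts(chunks, OpenAIEmbeddings(), index_name="lex-gpt")

# Question-answering: retrieve relevant clips and answer with ChatGPT
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)
print(qa.run("What does the Gato model do?"))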

Observations

Split size

Split size has a strong influence on performance. To quantify this, I take the Karpathy podcast episode and use the Langchain QAGenerationChain to generate an eval set. I then split the episode into VectorDB indexes using various chunk sizes and evaluate each with the Langchain QAEvalChain. Two example QA pairs are shown below.

[{'question': 'What is the transformer architecture in deep learning?',
'answer': 'The transformer architecture is a neural network architecture that is general purpose and can process different sensory modalities like vision, audio, text, and video. It is simultaneously expressive in the forward pass, optimizable via backpropagation, gradient descent, and efficient high parallelism compute graph.'},
{'question': 'What is a transformer and how is it designed?',
'answer': 'A transformer is a series of blocks with attention and a multilayer perceptron. It is designed to be very expressive in a forward pass, optimizable in a backward pass, and efficient in hardware. The residual connections support the ability to learn short algorithms fast and first, and then gradually extend them longer during training.'}]
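
The eval loop itself looks roughly like the sketch below, assuming a local FAISS store for the per-chunk-size indexes; the chunk-size sweep values, transcript file name, and grading logic are illustrative assumptions, not the exact setup.

from langchain.chains import QAGenerationChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.evaluation.qa import QAEvalChain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

llm = ChatOpenAI(temperature=0)

# Generate an eval set of question/answer pairs from the episode transcript
with open("karpathy_episode.txt") as f:  # hypothetical transcript file
    text = f.read()
eval_set = QAGenerationChain.from_llm(llm).run(text)  # list of {"question", "answer"} dicts

# For each chunk size, build an index, answer every eval question, and grade the answers
scores = {}
for chunk_size in [500, 1000, 1500, 2000]:  # assumed sweep values
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    vectorstore = FAISS.from_texts(splitter.split_text(text), OpenAIEmbeddings())
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever()
    )
    predictions = [
        {"question": ex["question"], "result": qa_chain.run(ex["question"])}
        for ex in eval_set
    ]

    # LLM-graded comparison of predicted vs. reference answers
    graded = QAEvalChain.from_llm(llm).evaluate(
        eval_set, predictions, question_key="question", prediction_key="result"
    )
    # the grade key and label format may differ across Langchain versions
    scores[chunk_size] = sum(g["text"].strip() == "CORRECT" for g in graded) / len(graded)

print(scores)  # fraction of eval questions graded correct at each chunk size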

The resulting performance is shown below, with a fairly large swing with respect to chunk size. Specific examples follow further below.

[Figure: QA eval performance vs. chunk size]

I also tested Llama-index (PR here) on this same eval set, with a max chunk size of 512 and with the default chunk size (far right). I will need to explore these results further.

[Figure: Llama-index results on the eval set]
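
For the Llama-index run, the setup was along the lines of the sketch below, written against the llama_index API of the time (it has since changed); the transcript directory is hypothetical, and only the 512 max chunk size comes from the text above.

from llama_index import GPTSimpleVectorIndex, ServiceContext, SimpleDirectoryReader

# Load the episode transcript (directory name is hypothetical)
documents = SimpleDirectoryReader("karpathy_transcript/").load_data()

# Cap chunks at 512 tokens; drop chunk_size_limit to use the library default
service_context = ServiceContext.from_defaults(chunk_size_limit=512)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

# Answer one of the eval questions against the index
print(index.query("What is the transformer architecture in deep learning?"))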

In the case below, we look at answer quality on the full lex-gpt index, split with chunk sizes between 1.5k and 2k, for the question: What does the Gato model do?