Lance Martin

TL;DR

I used LangChain to build an app that summarizes papers in ~100 lines of code, which shows it’s pretty easy to use LangChain as a general LLM programming framework.

Motivation

I’ve always wanted a tool to summarize papers, especially given the rapid pace at which new AI papers come out. I decided to build one as a testbed for a few LLM tools: @hwchase17 and @_Brian_Raymond have a nice write-up on the tech stack for LLMs, explaining that LangChain is emerging as an LLM programming framework. Below I’ll explain how I used it for this task.

Flow for a paper distillation app: pre-process PDFs, embed them, search them given a query, summarize


The code snippets below show how LangChain wraps the key components of this workflow. The central idea is to first embed the papers, then perform semantic search over the embeddings given a question, and finally pass the relevant chunks to the LLM to summarize into an answer.

Components

Pre-processing: This can be similar to paper-qa reader or LangChain PagedPDFSplitter.

chunks = split_pdf(pdf_path)  # returns a list of text chunks from the PDF
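The splitting step above can be sketched as a simple fixed-size chunker with overlap. This is an illustrative stand-in, not paper-qa's or LangChain's implementation, and it assumes the PDF has already been extracted to raw text (e.g., with a PDF library):

```python
def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split raw text into overlapping fixed-size chunks.

    Overlap helps avoid cutting a relevant passage in half at a
    chunk boundary, at the cost of some duplicated text.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```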

Embeddings / Vector Store: LangChain wraps access to various vector stores.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

faiss_ix = FAISS.from_texts(chunks, OpenAIEmbeddings())

Similarity search: Similarity search is trivial (e.g., using FAISS index).

query = "What is the main innovation of the paper?"
relevant_chunks = faiss_ix.similarity_search(query, k=2)
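Under the hood, this search amounts to ranking the stored chunk embeddings by similarity to the query embedding. A minimal pure-Python sketch of the idea (the toy vectors in the usage below stand in for real embeddings; FAISS does this far more efficiently at scale):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return indices of the k chunk vectors most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```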

LLM endpoint: LangChain wraps this (creates prompt, asks LLM for the summary).

from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
answer = chain.run(input_documents=relevant_chunks, question=query)
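The "stuff" chain type simply stuffs all of the retrieved chunks into a single prompt alongside the question. A rough sketch of that prompt construction (the template wording here is illustrative, not LangChain's exact prompt):

```python
def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Concatenate all context chunks into one prompt for the LLM."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

This works well when the retrieved chunks fit in the model's context window; for longer inputs, LangChain offers other chain types (e.g., map-reduce style summarization).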

Results

I use a simple Streamlit app to put a UI over this. It’s ~100 lines of code (see here).

Demo of the paper distiller app (distiller.gif)

Caveats