Build a Chatbot with Local LLM (Falcon 7B) and LangChain

Can you achieve ChatGPT-like performance with a local LLM on a single GPU?

Venelin Valkov
8 min read · Jul 16

Mostly, yes! In this tutorial, we’ll use Falcon 7B with LangChain to build a chatbot that retains conversation memory. Running on a single T4 GPU with the model loaded in 8-bit, we get decent performance (~6 tokens/second). We’ll also explore techniques to improve the output quality and speed, such as:

  • Stopping criteria: detect when the LLM starts “rambling” and stop generation early (see the sketch after this list)
  • Cleaning output: LLMs sometimes emit strange or extra tokens; I’ll show you how to strip those from the output (also sketched below)
  • Storing chat history: we’ll use memory to make sure the LLM remembers the conversation history
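
To preview the first technique, here is a minimal sketch of a stopping criterion built on the StoppingCriteria API from transformers. The class name and the idea of matching stop-token sequences against the tail of the generated text are illustrative assumptions; the tutorial’s exact implementation follows later:

import torch
from transformers import StoppingCriteria

class StopGenerationCriteria(StoppingCriteria):
    """Stop generation as soon as the model emits one of the given token sequences."""

    def __init__(self, tokens: list, tokenizer, device):
        # Convert each stop sequence (a list of token strings) to token ids.
        stop_token_ids = [tokenizer.convert_tokens_to_ids(t) for t in tokens]
        self.stop_token_ids = [
            torch.tensor(ids, dtype=torch.long, device=device) for ids in stop_token_ids
        ]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Check whether the generated sequence now ends with any stop sequence.
        for stop_ids in self.stop_token_ids:
            if input_ids.shape[1] >= len(stop_ids) and torch.eq(
                input_ids[0, -len(stop_ids):], stop_ids
            ).all():
                return True
        return False

Wrapped in a StoppingCriteriaList, an instance of this class can be passed to the generation pipeline through its stopping_criteria argument.

For the second technique, a similarly hedged sketch of output cleaning as a LangChain output parser. The regex targets role markers (“Human:”, “AI:”) that conversational models often append; the exact patterns you need depend on your prompt format:

import re
from langchain.schema import BaseOutputParser

class CleanupOutputParser(BaseOutputParser):
    """Strip role markers the model sometimes appends after its answer."""

    def parse(self, text: str) -> str:
        # Remove stray "Human:" / "AI:" / "User" turns the model may hallucinate.
        return re.sub(r"\n(Human|AI|User):?", "", text).strip()

    @property
    def _type(self) -> str:
        return "output_parser"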

In this part, we’ll use a Jupyter notebook to run the code. If you want to follow along, you can find the notebook on GitHub: GitHub Repository

Read the full tutorial on MLExpert.io

Setup

Let’s start by installing the required dependencies:

!pip install -Uqqq pip --progress-bar off
!pip install -qqq bitsandbytes==0.40.0 --progress-bar off
!pip install -qqq torch==2.0.1 --progress-bar off
!pip install -qqq transformers==4.30.0 --progress-bar off
!pip install -qqq accelerate==0.21.0 --progress-bar off
!pip install -qqq xformers==0.0.20 --progress-bar off
!pip install -qqq einops==0.6.1 --progress-bar off
!pip install -qqq langchain==0.0.233 --progress-bar off

Here’s the list of required imports:

import re
import warnings
from typing import List

import torch
from langchain import PromptTemplate
from langchain.chains import ConversationChain
# The imports below complete the truncated list; they are inferred from the
# techniques covered in this tutorial (stopping criteria, output cleanup, memory).
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.llms import HuggingFacePipeline
from langchain.schema import BaseOutputParser
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, pipeline
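
As a preview of how the pieces fit together, here is a minimal sketch of loading Falcon 7B in 8-bit and wrapping it in a LangChain ConversationChain with windowed memory. The model id (tiiuae/falcon-7b-instruct), the generation settings, and the window size are assumptions for illustration; the tutorial’s exact configuration comes later:

from langchain.chains import ConversationChain
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_NAME = "tiiuae/falcon-7b-instruct"  # assumed instruct-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,  # Falcon ships custom modeling code
    load_in_8bit=True,       # fits on a single T4 (~16 GB) in 8-bit
    device_map="auto",
)

generation_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,  # assumed value, tune to taste
    do_sample=True,
    temperature=0.7,     # assumed value
)

llm = HuggingFacePipeline(pipeline=generation_pipeline)

# Windowed memory keeps only the last k exchanges in the prompt,
# which bounds prompt length and generation time.
memory = ConversationBufferWindowMemory(k=6)
chain = ConversationChain(llm=llm, memory=memory)

print(chain.predict(input="What is the capital of France?"))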
