Large language models (LLMs) are trained on massive, publicly available text datasets comprising trillions of tokens, enabling them to excel at general language tasks like next-token prediction. However, LLMs often struggle with domain-specific prompts, exhibiting reduced accuracy or generating inaccurate information (hallucinations), because they lack sufficient subject matter expertise. Two primary approaches exist to augment an LLM's knowledge: Retrieval-Augmented Generation (RAG) and fine-tuning. This presentation focuses on fine-tuning smaller LLMs with domain-specific instruct datasets using the LoRA (Low-Rank Adaptation) technique on Gaudi hardware, leveraging publicly available LLMs and datasets from the Hugging Face Hub. Although instruct datasets are used here, it is also possible to fine-tune LLMs with plain text data sourced from documents, articles, and other materials.
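The core idea behind LoRA can be sketched in plain NumPy: the pretrained weight matrix W is frozen, and only a low-rank update B·A is trained. The layer size and rank below are illustrative choices for this sketch, not values from the session.

```python
# Minimal LoRA sketch: instead of updating a full d_out x d_in weight
# matrix W, freeze W and learn a low-rank update delta_W = B @ A,
# where B is d_out x r and A is r x d_in with r << min(d_out, d_in).
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 512, 512, 8  # illustrative layer size and LoRA rank

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable "down" projection
B = np.zeros((d_out, r))                   # trainable "up" projection, init 0

def forward(x):
    # Effective weight is W + B @ A; because B starts at zero, the
    # adapted layer initially reproduces the pretrained layer exactly.
    return (W + B @ A) @ x

full_params = W.size            # parameters updated by full fine-tuning
lora_params = A.size + B.size   # parameters updated by LoRA
print(f"full fine-tuning params: {full_params}")
print(f"LoRA trainable params:   {lora_params}")
```

With these illustrative sizes, LoRA trains 8,192 parameters instead of 262,144, a 32x reduction for this single layer, which is why the technique makes fine-tuning feasible on a single accelerator.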
Fine Tuning Large Language Models (LLMs) with Domain Specific Datasets
Remote event
Instructor
Madhusudan Gujral
Bioinformatics Lead, SDSC
Madhusudan Gujral is currently a bioinformatics lead at SDSC. His background is in structural biology, but he transitioned to informatics over 20 years ago. He began by developing a client-based laboratory information management system (LIMS) for a distributed biological project with users across the US, followed by a large project creating complex pipelines for metagenomics research. He then spent a decade processing and analyzing whole-genome sequencing (WGS) data from thousands of samples collected from patients with psychiatric disorders. For the past two years, he has focused on learning and benchmarking the fine-tuning of large language models (LLMs) on Gaudi hardware.