SaulLM-7B: A pioneering Large Language Model for Law
A large language model (LLM) tailored for understanding and processing legal documents.
| Resource | Link(s) |
| --- | --- |
| Paper | arXiv |
| Models | Equall/Saul-Base, Equall/Saul-Instruct-v1 |
| Datasets | Equall/legalbench_instruct, Equall/perplexity_evaluation |
| Company | Equall.ai |
Summary
What’s the idea in a sentence or two?
- A 7-billion-parameter LLM based on the Mistral architecture, trained on an English legal corpus of over 30 billion tokens for legal text comprehension and generation.
What’s the motivation as framed by the authors?
- Lack of LLMs tailored to the legal domain.
- By pretraining an LLM on dedicated legal corpora, the model can not only comprehend the complexities of legal documents but also adapt to the evolving nature of legal discourse.
How do they attempt to solve it?
- Family of legal LLMs: SaulLM-7B & SaulLM-7B-Instruct
- Improved evaluation protocol for legal LLMs: LegalBench-Instruct
- Model, Evaluation Code & Licensing: models released under the MIT license (a loading sketch follows below)
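Both checkpoints are on the Hugging Face Hub, so a quick way to try the instruct model is with 🤗 Transformers. This is a minimal sketch assuming the repo ID listed above (Equall/Saul-Instruct-v1) and a GPU with enough memory; the prompt and generation settings are illustrative, not the authors' setup.

```python
# Minimal sketch: load SaulLM-7B-Instruct (repo ID as listed above) and run one query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Equall/Saul-Instruct-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "Explain the difference between a warranty and an indemnity in a contract."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```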
What is the main contribution of the paper?
- Continued pretraining of Mistral on a legal corpus of 30 billion tokens.
- Dataset Composition
- Replay Sources:
- Reduce the risk of catastrophic forgetting by incorporating data from the prior training distribution.
- Data from Wikipedia, StackExchange, and GitHub, comprising roughly 2% of the final training mix and sampled from SlimPajama, was included (see the data-mixing sketch after this list).
- Instruction Sources:
- The authors found the inclusion of conversational data during pretraining to be beneficial.
- Data from Super-Natural Instructions and FLAN was included.
- Support user requests and conversational interaction via instruction fine-tuning on a dataset of 600K instructions. The dataset has two key components:
- Generic (non-legal) instructions: SlimOrca, MetaMathQA, UltraChat and Glaive Code Assistant v2
- Legal instructions: multi-turn conversations generated using the Mistral-7B-Instruct model (see the conversation sketch after this list), i.e.,
[User] Legal text 🡆 [Assistant] Response 🡆 [User] Provide reasoning 🡆 [Assistant] ...
- An additional step of aligning the model with human preferences was not performed, as it did not yield meaningful performance improvement.
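To make the replay idea above concrete, here is a hedged sketch of assembling a pretraining stream with roughly 2% replay data using the 🤗 `datasets` library. The file paths, the assumption that every record has a `text` field, and the exact 98/2 split are illustrative stand-ins, not the paper's actual pipeline.

```python
# Sketch: build a continued-pretraining stream that is mostly legal text with ~2% replay data.
# File paths are placeholders; in the paper the replay data is sampled from SlimPajama
# (Wikipedia, StackExchange, GitHub). Each JSONL record is assumed to have a "text" field.
from datasets import load_dataset, interleave_datasets

legal = load_dataset("json", data_files="legal_corpus/*.jsonl", split="train", streaming=True)
replay = load_dataset("json", data_files="slimpajama_replay/*.jsonl", split="train", streaming=True)

# Interleave so that roughly 2% of sampled documents come from the replay source.
mix = interleave_datasets([legal, replay], probabilities=[0.98, 0.02], seed=42)

# Peek at the first few documents of the mixed stream.
for i, doc in enumerate(mix):
    print(doc["text"][:80])
    if i == 2:
        break
```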
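The multi-turn pattern above can be pictured as a loop that alternates scripted user turns with model completions. The sketch below is a hypothetical reconstruction of that turn structure, not the authors' generation code; `generate_reply` is a made-up stand-in for a call to Mistral-7B-Instruct.

```python
# Hypothetical sketch of the [User] -> [Assistant] -> [User] -> ... turn pattern above.
# `generate_reply` stands in for a call to Mistral-7B-Instruct (transformers, vLLM, an API, ...).
from typing import Callable, Dict, List

def build_legal_conversation(
    legal_text: str, generate_reply: Callable[[List[Dict[str, str]]], str]
) -> List[Dict[str, str]]:
    """Create one synthetic multi-turn conversation grounded in a piece of legal text."""
    conversation = [{"role": "user", "content": legal_text}]
    conversation.append({"role": "assistant", "content": generate_reply(conversation)})

    # Scripted follow-up asking the model to justify its previous answer.
    conversation.append({"role": "user", "content": "Provide the reasoning behind your response."})
    conversation.append({"role": "assistant", "content": generate_reply(conversation)})
    return conversation
```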
How do they measure success?
- Benchmark
- Perplexity measurement on four held-out legal datasets (a computation sketch follows this list):
- Contracts dataset - EDGAR (Q1 of 2024)
- Legal decisions dataset - ICSID court decisions (after October 2023)
- Legislation focused dataset - US bills submitted before the House or Senate (after October 2023)
- Party submissions dataset - Texas briefs submitted (after October 2023)
- LegalBench-Instruct
- Reformatting the LegalBench dataset by:
- removing distracting few-shot examples
- concluding with a specific instruction for the model to generate tags (a reformatting sketch follows this list)
- MMLU
- International law (120 examples)
- Professional law (1500 examples)
- Jurisprudence (110 examples)
- Metrics
- Balanced accuracy (mean of per-class recall; a quick example follows this list)
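For the perplexity benchmark, a hedged sketch of how the metric can be computed for a causal LM on a held-out legal document is shown below; the 2048-token chunking and the file path are illustrative choices, not necessarily the paper's exact protocol.

```python
# Sketch: perplexity of a causal LM on a held-out legal document.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Equall/Saul-Base"  # base-model repo ID as listed above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def perplexity(text: str, max_len: int = 2048) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_nll, n_tokens = 0.0, 0
    for start in range(0, len(ids), max_len):
        chunk = ids[start:start + max_len].unsqueeze(0).to(model.device)
        if chunk.shape[1] < 2:  # need at least one predicted token
            continue
        with torch.no_grad():
            out = model(chunk, labels=chunk)  # out.loss = mean NLL over shifted positions
        n = chunk.shape[1] - 1
        total_nll += out.loss.item() * n
        n_tokens += n
    return math.exp(total_nll / n_tokens)

print(perplexity(open("sample_contract.txt").read()))  # path is a placeholder
```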
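The LegalBench-Instruct reformatting can be illustrated with a small, hypothetical helper that drops the few-shot demonstrations and appends a closing instruction. The assumed prompt layout (blank-line-separated blocks) and the instruction wording are guesses for illustration, not the released dataset's exact recipe.

```python
# Hypothetical sketch of turning a LegalBench prompt into a LegalBench-Instruct prompt.
# Assumes the original prompt looks like "<task description>\n\n<demo 1>\n\n...\n\n<query>";
# both that layout and the wording of the final instruction are assumptions.
def to_instruct_prompt(legalbench_prompt: str) -> str:
    blocks = legalbench_prompt.strip().split("\n\n")
    task_description, query = blocks[0], blocks[-1]  # drop the few-shot demonstrations in between
    return (
        f"{task_description}\n\n"
        f"{query}\n\n"
        "Answer with only the label, e.g. Yes or No."
    )
```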
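Balanced accuracy is the mean of per-class recall, so a model that always predicts the majority label on an imbalanced yes/no task no longer scores well. A quick check with scikit-learn:

```python
# Balanced accuracy vs. plain accuracy on an imbalanced toy example.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["yes"] * 90 + ["no"] * 10
y_pred = ["yes"] * 100  # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.9
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```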
Were they successful?
This post is licensed under CC BY 4.0 by the author.