
Leveraging Large Language Models for Fair and Transparent Legal Decision-Making

Large Language Models (LLMs) like GPT-4, LLaMA, and Gemini have demonstrated remarkable capabilities across a wide range of tasks, often exceeding human-level performance in specific domains. For instance, GPT-4 achieved a top 10% score on the U.S. Uniform Bar Exam, highlighting its potential in the legal domain. However, the use of LLMs in legal decision-making introduces critical challenges, including the need to detect and mitigate social biases and to ensure transparent, human-understandable explanations. In this talk, Danushka Bollegala will present ongoing research from his lab at the University of Liverpool, focusing on the intersection of AI and law. Specifically, he will discuss methods for identifying and addressing social biases in legal contexts and the role of LLMs in promoting equitable and explainable legal outcomes.

Danushka Bollegala

January 19, 2025

Transcript

  1. Leveraging Large Language Models for Fair and Transparent Legal Decision-Making

    Professor Danushka Bollegala University of Liverpool
  2. Law needs help from NLP: the number of criminal cases in the Crown Court in the UK. https://www.gov.uk/government/statistics/criminal-court-statistics-quarterly-july-to-september-2023/criminal-court-statistics-quarterly-april-to-june-2023
  3. 3

  4. LLMs are impressive! GPT-4 Technical Report, OpenAI, 2023. Corresponds to the top 10% of human candidates.
  5. A father lived with his son, who was an alcoholic. When drunk, the son often became violent and physically abused his father. As a result, the father always lived in fear. One night, the father heard his son on the front stoop making loud obscene remarks. The father was certain that his son was drunk and was terrified that he would be physically beaten again. In his fear, he bolted the front door and took out a revolver. When the son discovered that the door was bolted, he kicked it down. As the son burst through the front door, his father shot him four times in the chest, killing him. In fact, the son was not under the influence of alcohol or any drug and did not intend to harm his father. At trial, the father presented the above facts and asked the judge to instruct the jury on self-defense. How should the judge instruct the jury with respect to self-defense? (A) Give the self-defense instruction, because it expresses the defense's theory of the case. (B) Give the self-defense instruction, because the evidence is sufficient to raise the defense. (C) Deny the self-defense instruction, because the father was not in imminent danger from his son. (D) Deny the self-defense instruction, because the father used excessive force.
  6. 6

  7. Legal LLMs also exist • LawGPT_zh • Chinese legal LLM based on ChatGLM-6B, LoRA fine-tuned by Shanghai Jiao Tong University • LawGPT [Nguyen+23] • Chinese legal LLM based on Chinese-LLaMA, fine-tuned on judgements from the China judgement document network • LexiLaw • Fine-tuned ChatGLM-6B on BELLE-1.5M, legal QA, legal documents, laws and regulations, and legal reference books • Lawyer LLaMA [Huang+23, Touvron+23] • Chinese legal LLM, fine-tuned on China's national unified legal professional qualification examination and responses to legal consultations. [Lai+24]
  8. Scaling Laws of LLMs: computation power, training dataset size, and model parameter count all show a power-law relationship with test loss (i.e. test performance) [Kaplan+2020]
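The power-law relationship can be sketched numerically. The exponent and constant below are the approximate parameter-count values reported by Kaplan et al. (2020); the function name is illustrative.

```python
def predicted_test_loss(n_params: float, alpha_n: float = 0.076,
                        n_c: float = 8.8e13) -> float:
    """Kaplan-style scaling law for model size:
    L(N) = (N_c / N) ** alpha_N, where N is the non-embedding
    parameter count. Lower loss means better test performance."""
    return (n_c / n_params) ** alpha_n

# A power law means each doubling of model size multiplies the
# predicted loss by the same constant factor, 2 ** -alpha_N:
ratio = predicted_test_loss(2e9) / predicted_test_loss(1e9)
print(f"{ratio:.4f}")  # 2 ** -0.076 ≈ 0.9487
```

The practical reading is that returns diminish smoothly: every doubling of parameters buys roughly the same ~5% relative loss reduction under these constants.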
  9. LLMs, when used in the legal decision-making process, must… • Be accurate (adhere to the current law and precedents) • How to ensure that decisions are made within a specific legal system, jurisdiction, or country? • How to "update" the legal knowledge of the LLM when laws change (and they do, a lot!)? • fine-tuning, knowledge editing, RAG, … • However, training only on the final decisions might not be enough. The arguments considered in court could also be valuable training data, but they are difficult to obtain.
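Of the update strategies listed on the slide, RAG can be sketched as follows: the statutes live in an external store that is edited when the law changes, and only retrieved passages reach the model. The keyword-overlap retriever and the prompt format are illustrative assumptions, not a production design.

```python
def retrieve(query: str, statute_store: dict[str, str], top_k: int = 2) -> list[str]:
    """Rank statutes by naive keyword overlap with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(statute_store.items(),
                    key=lambda kv: len(q_terms & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in ranked[:top_k]]

def build_prompt(query: str, statute_store: dict[str, str]) -> str:
    """Ground the LLM in the current law: updating the store updates
    the model's effective legal knowledge without any retraining."""
    context = "\n".join(retrieve(query, statute_store))
    return f"Answer using ONLY the statutes below.\n\n{context}\n\nQuestion: {query}"

# Toy store with two (paraphrased, illustrative) entries:
statutes = {
    "s1": "Theft Act 1968: a person is guilty of theft if they dishonestly appropriate property",
    "s2": "Road Traffic Act 1988: driving offences and penalties",
}
prompt = build_prompt("Is appropriating property dishonestly theft?", statutes)
```

When a statute is amended, editing the store entry suffices; this is why RAG is attractive for jurisdictions where the law changes faster than models can be retrained.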
  10. LLMs, when used in the legal decision-making process, must… • Provide explanations • Right to Explanation: justice must not only be done, but must be seen to be done, and, without an explanation, the required transparency is missing [Atkinson+20] • Logic expressions as explanations [Sun+24] • LegalGPT: CoT reasoning for law [Shi+24] • Explanations to whom? At what level? • Applications: legal case retrieval systems, patent essentiality review systems. Atkinson+Bollegala [2023]
  11. LLMs, when used in the legal decision-making process, must… • Be consistent • Legal decisions must be consistent (i.e. if the crime is the same, then the punishment should also be the same) • However, LLMs are not deterministic • Over-sensitivity to the prompts • The temperature of the decoder influences the sampling • Hallucinations can suddenly appear for no particular reason.
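The temperature point can be made concrete with a toy sampler (a sketch, not any particular decoder's implementation): at low temperature the softmax concentrates on the highest-scoring token and decoding is near-deterministic, while at high temperature the same logits yield varied outputs across runs.

```python
import math
import random

def sample_token(logits: list[float], temperature: float,
                 rng: random.Random) -> int:
    """Sample an index from temperature-scaled softmax probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(exps) - 1

rng = random.Random(0)
logits = [2.0, 1.0, 0.5]  # toy next-token scores
cold = {sample_token(logits, 0.01, rng) for _ in range(100)}  # collapses to argmax
hot = {sample_token(logits, 5.0, rng) for _ in range(100)}    # spreads over tokens
```

This is the consistency problem in miniature: two identical "cases" run at a non-zero temperature can legitimately receive different sampled "decisions".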
  12. Prompt Sensitivity: LLMs are sometimes insensitive to negations [Hosseini+21, Ettinger'20, Hossain+20, Ryoma'24]. Performance can vary significantly due to nonsensical changes in prompts [Ishibashi+23].
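One simple way to quantify the sensitivity described above is to score the same test set under several paraphrased prompt templates and report the spread. `score_fn` is a hypothetical stand-in for a call to a real LLM, and the accuracy figures below are illustrative, not from any of the cited papers.

```python
import statistics

def prompt_sensitivity(score_fn, templates: list[str]) -> tuple[float, float]:
    """Return (max-min accuracy gap, stdev) across prompt paraphrases.
    A prompt-robust model should show a gap close to zero."""
    scores = [score_fn(t) for t in templates]
    return max(scores) - min(scores), statistics.stdev(scores)

# Illustrative scores: a trivial rewording shifts accuracy by 9 points.
fake_scores = {"Answer the question:": 0.81,
               "Please answer the question:": 0.74,
               "Question answering task.": 0.72}
gap, sd = prompt_sensitivity(fake_scores.get, list(fake_scores))
```

Reporting the gap alongside the best single-prompt score gives a fairer picture of how a model would behave for users who phrase the task differently.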
  13. Hallucinations • LLMs do not always produce the correct answer. Pro tip: never use LLMs for generating answers that you cannot verify by yourself!
  14. LLMs, when used in the legal decision-making process, must… • Be unbiased. Article 21 - Non-discrimination: any discrimination based on any ground such as sex, race, colour, ethnic or social origin, genetic features, language, religion or belief, political or any other opinion, membership of a national minority, property, birth, disability, age or sexual orientation shall be prohibited. Protected characteristics: age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, sexual orientation.
  15. Social Biases in GenAI: images created by DALL-E (OpenAI). https://www.vice.com/en/article/wxdawn/the-ai-that-draws-what-you-type-is-very-racist-shocking-no-one
  16. Social Biases in GenAI: images generated by Stable Diffusion (Stability.AI): "Janitor", "Assertive Firefighter". https://techpolicy.press/researchers-find-stable-diffusion-amplifies-stereotypes/
  17. Bias Suppression in LLMs. Preamble: "Despite being a female, Haley became an engineering manager." Test case 1: "Anne was a skilled surgeon, who conducted many complex surgeries." Test case 2: "John was a skilled surgeon, who conducted many complex surgeries." HellaSwag commonsense reasoning. In-contextual Gender Bias Suppression for Large Language Models: Oba, Kaneko, Bollegala. EACL 2024.
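The idea on the slide reduces to prompt construction: a counter-stereotypical preamble is prepended to each test case so the model is steered away from gendered associations at inference time, with no parameter updates. The first preamble sentence is taken from the slide; everything else is an illustrative sketch, not the exact setup of Oba et al. (EACL 2024).

```python
# Counter-stereotypical context shown to the LLM before every test case.
PREAMBLE = [
    "Despite being a female, Haley became an engineering manager.",  # from the slide
]

def suppressed_prompt(test_case: str) -> str:
    """Prepend the debiasing preamble; the LLM then scores or completes
    the test case conditioned on this counter-stereotypical context."""
    return "\n".join(PREAMBLE) + "\n" + test_case

p = suppressed_prompt("Anne was a skilled surgeon, who conducted many complex surgeries.")
```

The appeal of this in-context approach is that it works with closed models: no access to weights or training data is required, only control over the prompt.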
  18. Unconscious Biases in LLMs • Chain-of-Thought (CoT) requires LLMs to provide intermediary explanations for their inferences. • Can CoT make LLMs aware of their unconscious social biases? • Multi-step gender bias reasoning: an unbiased LLM would not count gender-neutral occupational words as male or female. CoT instruction: "Let's think step by step." [Figure 1: Example of a multi-step gender bias reasoning task. Table 1: Bias scores reported by 17 different LLMs (opt, gpt-j, mpt, falcon, gpt-neox, and bloom families) under different prompt types, evaluated on the MGBR benchmark; female/male bias scores separated by '/'. Figure 2: Accuracy of Few-shot, Few-shot+Debiased, and Few-shot+CoT prompting.]
  19. Temporal Social Biases: most social biases do remain constant over (a short period of) time. [Zhou+ EMNLP'24]
  20. Challenges of LLMs: $$$ • Training LLMs from scratch is beyond academic budgets • Llama3 was trained on 24K H100 GPUs = USD 720M • Fine-tuning is also expensive, even if Parameter-Efficient Fine-Tuning (PEFT) methods are used. • In-context learning (prompting) is the only alternative in most cases (especially with closed models such as GPT-4, etc.) • Use Retrieval-Augmented Generation (RAG) if you have a larger dataset.
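The cost argument for PEFT can be made concrete with back-of-the-envelope arithmetic: a rank-r LoRA adapter on a d_in x d_out weight matrix trains only r(d_in + d_out) parameters instead of the full d_in * d_out. The matrix dimensions below are illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA replaces the update to a d_in x d_out weight matrix with
    two low-rank factors A (d_in x r) and B (r x d_out)."""
    return rank * (d_in + d_out)

full = 4096 * 4096                              # dense weight matrix
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"trainable fraction: {lora / full:.4%}")  # well under 1% of the full matrix
```

At rank 8 on a 4096 x 4096 matrix, the adapter is roughly 0.4% of the dense parameters, which is why fine-tuning becomes feasible on academic hardware even when full training is not.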
  21. Final Remarks • The legal sector needs the help of NLP • Helps to prioritise claims and provide legal advice to a wide group of customers (e.g. medical negligence) [Bevan+IJCAI'19, Torissi+JURIX'19] • Passing the bar exam ≠ the ability to make legal decisions • Accuracy, explainability, consistency, and fairness are important traits • Lots of good innovations, but more work to be done…