
Leveraging Large Language Models for Fair and Transparent Legal Decision-Making

Large Language Models (LLMs) like GPT-4, LLaMA, and Gemini have demonstrated remarkable capabilities across a wide range of tasks, often exceeding human-level performance in specific domains. For instance, GPT-4 achieved a top 10% score on the U.S. Uniform Bar Exam, highlighting its potential in the legal domain. However, the use of LLMs in legal decision-making introduces critical challenges, including the need to detect and mitigate social biases and to ensure transparent, human-understandable explanations. In this talk, Danushka Bollegala will present ongoing research from his lab at the University of Liverpool, focusing on the intersection of AI and law. Specifically, he will discuss methods for identifying and addressing social biases in legal contexts and the role of LLMs in promoting equitable and explainable legal outcomes.

Danushka Bollegala

January 19, 2025

Transcript

  1. Leveraging Large Language Models for Fair and Transparent Legal Decision-Making

    Professor Danushka Bollegala University of Liverpool
  2. Law needs help from NLP: the number of criminal cases in the Crown Court in the UK. https://www.gov.uk/government/statistics/criminal-court-statistics-quarterly-july-to-september-2023/criminal-court-statistics-quarterly-april-to-june-2023
  3. 3

  4. LLMs are impressive! GPT-4 Technical Report, OpenAI, 2023. Corresponds to the top 10% of human candidates.
  5. A father lived with his son, who was an alcoholic. When drunk, the son often became violent and physically abused his father. As a result, the father always lived in fear. One night, the father heard his son on the front stoop making loud obscene remarks. The father was certain that his son was drunk and was terrified that he would be physically beaten again. In his fear, he bolted the front door and took out a revolver. When the son discovered that the door was bolted, he kicked it down. As the son burst through the front door, his father shot him four times in the chest, killing him. In fact, the son was not under the influence of alcohol or any drug and did not intend to harm his father. At trial, the father presented the above facts and asked the judge to instruct the jury on self-defense. How should the judge instruct the jury with respect to self-defense? (A) Give the self-defense instruction, because it expresses the defense's theory of the case. (B) Give the self-defense instruction, because the evidence is sufficient to raise the defense. (C) Deny the self-defense instruction, because the father was not in imminent danger from his son. (D) Deny the self-defense instruction, because the father used excessive force.
  6. 6

  7. Legal LLMs also exist • LawGPT_zh • Chinese legal LLM based on ChatGLM-6B, LoRA fine-tuned by Shanghai Jiao Tong University • LawGPT [Nguyen+23] • Chinese legal LLM based on Chinese-LLaMA, fine-tuned on judgements from the China judgement document network • LexiLaw • Fine-tuned ChatGLM-6B on BELLE-1.5M, legal QA, legal documents, laws and regulations, and legal reference books • Lawyer LLaMA [Huang+23, Touvron+23] • Chinese legal LLM, fine-tuned on China's national unified legal professional qualification examination and responses to legal consultations. [Lai+24]
  8. Scaling Laws of LLMs: computation power, training dataset size, and model parameter count all show a power-law relationship with test loss (i.e. test performance) [Kaplan+2020]
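The power-law relationship can be sketched numerically. The exponent and constant below are the approximate parameter-count values reported by Kaplan et al. (2020); the function name is illustrative.

```python
def predicted_test_loss(n_params: float, alpha_n: float = 0.076,
                        n_c: float = 8.8e13) -> float:
    """Kaplan-style scaling law for model size:
    L(N) = (N_c / N) ** alpha_N, where N is the non-embedding
    parameter count. Lower loss means better test performance."""
    return (n_c / n_params) ** alpha_n

# A power law means each doubling of model size multiplies the
# predicted loss by the same constant factor, 2 ** -alpha_N:
ratio = predicted_test_loss(2e9) / predicted_test_loss(1e9)
print(f"{ratio:.4f}")  # 2 ** -0.076 ≈ 0.9487
```

The practical reading is that returns diminish smoothly: every doubling of parameters buys roughly the same ~5% relative loss reduction under these constants.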
  9. LLMs, when used in the legal decision-making process, must… • Be accurate (adhere to the current law and precedents) • How to ensure that decisions are made within a specific legal system, jurisdiction, or country? • How to "update" the legal knowledge of the LLM when laws change (and they do, a lot!)? • fine-tuning, knowledge editing, RAG, … • However, training only on the final decisions might not be enough. The arguments considered in court could also be valuable training data, but they are difficult to obtain.
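Of the update strategies listed on the slide, RAG can be sketched as follows: the statutes live in an external store that is edited when the law changes, and only retrieved passages reach the model. The keyword-overlap retriever and the prompt format are illustrative assumptions, not a production design.

```python
def retrieve(query: str, statute_store: dict[str, str], top_k: int = 2) -> list[str]:
    """Rank statutes by naive keyword overlap with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(statute_store.items(),
                    key=lambda kv: len(q_terms & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in ranked[:top_k]]

def build_prompt(query: str, statute_store: dict[str, str]) -> str:
    """Ground the LLM in the current law: updating the store updates
    the model's effective legal knowledge without any retraining."""
    context = "\n".join(retrieve(query, statute_store))
    return f"Answer using ONLY the statutes below.\n\n{context}\n\nQuestion: {query}"

# Toy store with two (paraphrased, illustrative) entries:
statutes = {
    "s1": "Theft Act 1968: a person is guilty of theft if they dishonestly appropriate property",
    "s2": "Road Traffic Act 1988: driving offences and penalties",
}
prompt = build_prompt("Is appropriating property dishonestly theft?", statutes)
```

When a statute is amended, editing the store entry suffices; this is why RAG is attractive for jurisdictions where the law changes faster than models can be retrained.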
  10. LLMs, when used in the legal decision-making process, must… • Provide explanations • Right to Explanation: justice must not only be done, but must be seen to be done, and, without an explanation, the required transparency is missing [Atkinson+20] • Logic expressions as explanations [Sun+24] • LegalGPT: CoT reasoning for law [Shi+24] • Explanations to whom? At what level? • Applications: legal case retrieval systems, patent essentiality review systems. Atkinson+Bollegala [2023]
  11. LLMs, when used in the legal decision-making process, must… • Be consistent • Legal decisions must be consistent (i.e. if the crime is the same, then the punishment should also be the same) • However, LLMs are not deterministic • Over-sensitivity to the prompts • The temperature of the decoder influences the sampling • Hallucinations can suddenly appear for no particular reason.
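The temperature point can be made concrete with a toy sampler (a sketch, not any particular decoder's implementation): at low temperature the softmax concentrates on the highest-scoring token and decoding is near-deterministic, while at high temperature the same logits yield varied outputs across runs.

```python
import math
import random

def sample_token(logits: list[float], temperature: float,
                 rng: random.Random) -> int:
    """Sample an index from temperature-scaled softmax probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(exps) - 1

rng = random.Random(0)
logits = [2.0, 1.0, 0.5]  # toy next-token scores
cold = {sample_token(logits, 0.01, rng) for _ in range(100)}  # collapses to argmax
hot = {sample_token(logits, 5.0, rng) for _ in range(100)}    # spreads over tokens
```

This is the consistency problem in miniature: two identical "cases" run at a non-zero temperature can legitimately receive different sampled "decisions".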
  12. Prompt Sensitivity: LLMs are sometimes insensitive to negations [Hosseini+21, Ettinger'20, Hossain+20, Ryoma'24]. Performance can vary significantly due to nonsensical changes in prompts [Ishibashi+23].
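One simple way to quantify the sensitivity described above is to score the same test set under several paraphrased prompt templates and report the spread. `score_fn` is a hypothetical stand-in for a call to a real LLM, and the accuracy figures below are illustrative, not from any of the cited papers.

```python
import statistics

def prompt_sensitivity(score_fn, templates: list[str]) -> tuple[float, float]:
    """Return (max-min accuracy gap, stdev) across prompt paraphrases.
    A prompt-robust model should show a gap close to zero."""
    scores = [score_fn(t) for t in templates]
    return max(scores) - min(scores), statistics.stdev(scores)

# Illustrative scores: a trivial rewording shifts accuracy by 9 points.
fake_scores = {"Answer the question:": 0.81,
               "Please answer the question:": 0.74,
               "Question answering task.": 0.72}
gap, sd = prompt_sensitivity(fake_scores.get, list(fake_scores))
```

Reporting the gap alongside the best single-prompt score gives a fairer picture of how a model would behave for users who phrase the task differently.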
  13. Hallucinations • LLMs do not always produce the correct answer. Pro tip: never use LLMs for generating answers that you cannot verify by yourself!
  14. LLMs, when used in the legal decision-making process, must… • Be unbiased. Article 21 - Non-discrimination: any discrimination based on any ground such as sex, race, colour, ethnic or social origin, genetic features, language, religion or belief, political or any other opinion, membership of a national minority, property, birth, disability, age or sexual orientation shall be prohibited. Protected characteristics: age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, sexual orientation.
  15. Social Biases in GenAI: images created by DALL-E (OpenAI). https://www.vice.com/en/article/wxdawn/the-ai-that-draws-what-you-type-is-very-racist-shocking-no-one
  16. Social Biases in GenAI: images generated by Stable Diffusion (Stability.AI): "Janitor", "Assertive Firefighter". https://techpolicy.press/researchers-find-stable-diffusion-amplifies-stereotypes/
  17. Bias Suppression in LLMs. Preamble: "Despite being a female, Haley became an engineering manager." Test case 1: "Anne was a skilled surgeon, who conducted many complex surgeries." Test case 2: "John was a skilled surgeon, who conducted many complex surgeries." HellaSwag commonsense reasoning. In-contextual Gender Bias Suppression for Large Language Models: Oba, Kaneko, Bollegala. EACL 2024.
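The idea on the slide reduces to prompt construction: a counter-stereotypical preamble is prepended to each test case so the model is steered away from gendered associations at inference time, with no parameter updates. The first preamble sentence is taken from the slide; everything else is an illustrative sketch, not the exact setup of Oba et al. (EACL 2024).

```python
# Counter-stereotypical context shown to the LLM before every test case.
PREAMBLE = [
    "Despite being a female, Haley became an engineering manager.",  # from the slide
]

def suppressed_prompt(test_case: str) -> str:
    """Prepend the debiasing preamble; the LLM then scores or completes
    the test case conditioned on this counter-stereotypical context."""
    return "\n".join(PREAMBLE) + "\n" + test_case

p = suppressed_prompt("Anne was a skilled surgeon, who conducted many complex surgeries.")
```

The appeal of this in-context approach is that it works with closed models: no access to weights or training data is required, only control over the prompt.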
  18. Unconscious Biases in LLMs • Chain-of-Thought (CoT) requires LLMs to provide intermediary explanations for their inferences. • Can CoT make LLMs aware of their unconscious social biases? • Multi-step gender bias reasoning: an unbiased LLM would not count gender-neutral occupational words as male or female. CoT instruction: "Let's think step by step." [Figure 1: Example of a multi-step gender bias reasoning task. Table 1: Bias scores reported by 17 different LLMs (opt, gpt-j, mpt, falcon, gpt-neox, and bloom families) under different prompt types, evaluated on the MGBR benchmark; female/male bias scores separated by '/'. Figure 2: Accuracy of Few-shot, Few-shot+Debiased, and Few-shot+CoT prompting.]
  19. Temporal Social Biases: most social biases do remain constant over (a short period of) time. [Zhou+ EMNLP'24]
  20. Challenges of LLMs: $$$ • Training LLMs from scratch is beyond academic budgets • Llama3 was trained on 24K H100 GPUs = USD 720M • Fine-tuning is also expensive, even if Parameter-Efficient Fine-Tuning (PEFT) methods are used. • In-context learning (prompting) is the only alternative in most cases (especially with closed models such as GPT-4, etc.) • Use Retrieval-Augmented Generation (RAG) if you have a larger dataset.
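The cost argument for PEFT can be made concrete with back-of-the-envelope arithmetic: a rank-r LoRA adapter on a d_in x d_out weight matrix trains only r(d_in + d_out) parameters instead of the full d_in * d_out. The matrix dimensions below are illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA replaces the update to a d_in x d_out weight matrix with
    two low-rank factors A (d_in x r) and B (r x d_out)."""
    return rank * (d_in + d_out)

full = 4096 * 4096                              # dense weight matrix
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"trainable fraction: {lora / full:.4%}")  # well under 1% of the full matrix
```

At rank 8 on a 4096 x 4096 matrix, the adapter is roughly 0.4% of the dense parameters, which is why fine-tuning becomes feasible on academic hardware even when full training is not.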
  21. Final Remarks • The legal sector needs the help of NLP • Helps to prioritise claims and provide legal advice to a wide group of customers (e.g. medical negligence) [Bevan+IJCAI'19, Torissi+JURIX'19] • Passing the bar exam ≠ the ability to make legal decisions • Accuracy, explainability, consistency, and fairness are important traits • Lots of good innovations, but more work to be done…