Running RAG on Audio Data with Distil-Whisper, 6x Faster than Whisper

1 Running RAG on Audio Data with Distil-Whisper, 6x Faster than Whisper. Hyerim Baek (백혜림), LangChainKR 2023

2 Speaker introduction. Hyerim Baek (백혜림), [email protected]. Previous focus: speech enhancement and source separation. Current interests since leaving the company: LLM, speech, multimodal, MLOps, plus teaching.
LinkedIn: https://www.linkedin.com/in/hyerimbaek-227489185/
TechBlog: https://rimiyeyo.tistory.com

3 Contents: 1. Overview 2. STT Model 3. FlashAttention 4. TextSplit & Embedding 5. RAG 6. Demo

4 Project Overview

5 Overview. Pipeline diagram labels: YouTube URL input, audio extraction (crawling), Distil-Whisper (released on Hugging Face on 2023-11-02), text extracted from audio, Text Splitters, Embeddings (Hugging Face Embedding), Vector Database, Relevancy, LLM (ChatGPT API), question and answer.

6 STT Model

7 Whisper. A speech recognition (ASR, speech-to-text) model released in September 2022. It was trained on many languages; of the 680,000 hours of training data, about 8,000 hours are Korean. Source: https://arxiv.org/abs/2212.04356

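8 Distil-Whisper. Distil-Whisper is 6x faster than Whisper while staying within 1% WER of it. The encoder is kept as-is and only 2 decoder layers are kept. It is trained with a weighted distillation loss on about 22,000 hours of audio from 9 open-source datasets with permissive licenses. The training labels are pseudo-labels generated by Whisper. A WER filter is applied so that only pseudo-labels with a WER of 10% or less are kept. Source: https://arxiv.org/abs/2311.00430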
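The WER filter on pseudo-labels can be sketched in a few lines. This is a minimal illustration, assuming WER is measured between a dataset's reference transcript and Whisper's pseudo-label at the word level; the function names and the absence of text normalization are assumptions for illustration, not the authors' training code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def filter_pseudo_labels(pairs, max_wer=0.10):
    """Keep (reference, pseudo_label) pairs whose WER is at most max_wer."""
    return [(ref, pl) for ref, pl in pairs if word_error_rate(ref, pl) <= max_wer]
```

Pairs whose pseudo-label disagrees with the reference by more than the threshold are dropped, so the student model is not trained on likely hallucinations or mis-transcriptions.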

Take note of Flash Attention 2!

11 Flash Attention 2. Distil-Whisper is 6x faster than Whisper while staying within 1% WER of it. Take note of Flash Attention 2! pip install flash-attn --no-build-isolation

12 Flash Attention 2. It does not run on an ordinary computer! RuntimeError: FlashAttention only supports Ampere GPUs or newer.

13 Flash Attention 2. Flash Attention 2 requires a recent NVIDIA GPU such as the RTX 3090, RTX 4090, A100, or H100.
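To avoid hitting that RuntimeError at load time, one option is to check the GPU's compute capability first and fall back to another attention backend. The sketch below is hedged: the helper names are hypothetical, and it assumes "Ampere or newer" means compute capability 8.0 and above. The backend strings mirror values commonly passed as `attn_implementation` in Hugging Face Transformers, but how you use them is up to your loading code.

```python
AMPERE_MAJOR = 8  # assumption: compute capability 8.x and above is "Ampere or newer"


def supports_flash_attention_2(capability) -> bool:
    """capability is a (major, minor) tuple, e.g. from torch.cuda.get_device_capability()."""
    major, _minor = capability
    return major >= AMPERE_MAJOR


def pick_attention_backend(capability) -> str:
    """Hypothetical helper: choose a backend name before loading the model."""
    return "flash_attention_2" if supports_flash_attention_2(capability) else "sdpa"


# Well-known capabilities: T4 = (7, 5), A100 = (8, 0), RTX 3090 = (8, 6), H100 = (9, 0)
for name, cap in [("T4", (7, 5)), ("A100", (8, 0)), ("RTX 3090", (8, 6)), ("H100", (9, 0))]:
    print(name, pick_attention_backend(cap))
```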

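14 Flash Attention 2: GPU architecture. At runtime a GPU can access three kinds of memory.
- On-chip memory, located alongside the execution cores: limited in size (up to about 20 MB on an A100) but very fast.
- Off-chip but on-card memory (HBM): on the GPU board but not co-located with the cores. An A100 has 40 GB of HBM, but its bandwidth is limited to about 1.5 TB/s.
- Conventional CPU RAM: too slow.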
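A quick back-of-the-envelope calculation shows why this hierarchy matters for attention. The sketch assumes fp16 scores (2 bytes each), a single attention head, and reads the slide's "20 MB" as 20 MiB; all three are simplifying assumptions.

```python
SRAM_BYTES = 20 * 1024**2  # "up to 20 MB" from the slide, read here as MiB (assumption)


def attention_matrix_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Size of one seq_len x seq_len score matrix (fp16 by default)."""
    return seq_len * seq_len * bytes_per_element


def fits_in_sram(seq_len: int) -> bool:
    return attention_matrix_bytes(seq_len) <= SRAM_BYTES


for n in (1024, 4096, 16384):
    mib = attention_matrix_bytes(n) / 1024**2
    print(f"seq_len={n}: {mib:.0f} MiB, fits on-chip: {fits_in_sram(n)}")
```

Because the full score matrix grows quadratically with sequence length, it quickly outgrows on-chip memory and has to live in HBM unless the computation is restructured.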

15 FlashAttention. Source: https://www.marktechpost.com/2022/06/03/researchers-at-stanford-university-propose-flashattention-fast-and-memory-efficient-exact-attention-with-io-awareness/ FlashAttention keeps the key/value blocks it is working on in SRAM, so they do not have to be re-read from high-bandwidth memory (HBM) at every step. This reduces accesses to the slower HBM and improves throughput. Some inefficiency remains in GPU thread utilization and work partitioning.
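The block-wise idea can be illustrated on the CPU in plain Python. The sketch below handles a single query vector and processes keys and values one block at a time, keeping only a running maximum, a running softmax denominator, and an accumulator. It is a toy illustration of the arithmetic under those assumptions, not the CUDA kernel, and the function names are made up for this example.

```python
import math


def _score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scores = [_score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Process keys/values block by block with a running (online) softmax."""
    dim = len(values[0])
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_score(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same result up to floating-point error, but the tiled version never needs more than one block of scores at a time, which is what lets the real kernel keep its working set in fast on-chip memory.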

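16 Flash Attention 2. Source: https://crfm.stanford.edu/2023/07/17/flash2.html
- Fewer non-matmul FLOPs: the algorithm is adjusted to reduce non-matrix-multiplication operations, which are slower on GPUs, and to increase the share of fast matmul FLOPs.
- Better parallelism: parallelism over the sequence length is added on top of batch size and number of heads. This improves GPU utilization for long sequences.
- Smarter work partitioning: work is split across GPU threads so as to reduce unnecessary synchronization and shared-memory usage.
Models can be trained with a much longer context in the same amount of time!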

17 Whisper vs Distil-Whisper. Source: https://www.linkedin.com/posts/liorsinclair_a-team-just-made-openais-whisper-6x-faster-activity-7130263371920097280-cQsi?utm_source=share&utm_medium=member_desktop Try the demo: https://huggingface.co/spaces/Xenova/distil-whisper-web

18 Text Splitter

19 Overview. Pipeline diagram labels: YouTube URL input, audio extraction (crawling), Distil-Whisper (released on Hugging Face on 2023-11-02), text extracted from audio, Text Splitters, Embeddings (Hugging Face Embedding), Vector Database, Relevancy, LLM (ChatGPT API), question and answer.

20 Text Splitter. A tool that splits text into small chunks that remain semantically meaningful. Why the text splitter matters: each LLM has a different maximum token count.

21 Text Splitter. RecursiveCharacterTextSplitter measures chunk size in characters by default, not in tokens.

22 Text Splitter. RecursiveCharacterTextSplitter.from_tiktoken_encoder measures chunk size in tokens.
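The difference between character-based and token-based splitting comes down to which length function the splitter uses. The following is a simplified, dependency-free sketch of recursive splitting, not LangChain's implementation: `recursive_split` and `word_count` are hypothetical names, and whitespace word count stands in for a real tokenizer such as tiktoken.

```python
def recursive_split(text, chunk_size, length_fn=len, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse into pieces that are still too long."""
    if length_fn(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    pieces = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if length_fn(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while the chunk still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if length_fn(piece) <= chunk_size:
            current = piece
        elif rest:
            chunks.extend(recursive_split(piece, chunk_size, length_fn, rest))
        else:
            chunks.append(piece)  # no finer separator left; emit as-is
    if current:
        chunks.append(current)
    return chunks


def word_count(text):
    """Stand-in for a token counter (assumption: one whitespace-separated word = one token)."""
    return len(text.split())


text = "one two three four five six"
print(recursive_split(text, chunk_size=9))                        # size measured in characters
print(recursive_split(text, chunk_size=2, length_fn=word_count))  # size measured in "tokens"
```

Swapping the length function is the only change needed to move from a character budget to a token budget, which is why the token-based constructor matters when the limit you care about is the model's maximum token count.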

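23 Embedding. Embedding means converting unstructured data such as text or images into high-dimensional vectors. In natural language processing, converting words or sentences into high-dimensional vectors is called word embedding. The main goal of an embedding is to turn the data into a form that can be computed on while preserving as much of its original meaning and characteristics as possible. Embedding Projector: https://projector.tensorflow.org/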

24 Embedding. Sources: https://www.graphable.ai/blog/knowledge-graph-embeddings/ and https://pranay-dave9.medium.com/openai-embeddings-the-key-to-powerful-text-clustering-342706b22d12

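25 VectorStore. A special type of database that stores embeddings together with their source documents and lets you query them. It retrieves "semantically similar" items: if you search for documents similar to "a cat wearing a hat", you can get results related to cats, hats, or other animals wearing hats.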

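26 VectorStore. A special type of database that stores embeddings together with their source documents and lets you query them. To return the right search results, a vector store ultimately has to measure the similarity between embedding vectors. Vector stores are designed to run cosine-similarity search efficiently: when embedding vectors are stored, they are indexed in a way that greatly reduces search time.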
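The similarity measurement itself is simple; what a vector store adds is the indexing that avoids comparing against every vector. A brute-force sketch, with hand-made toy vectors standing in for real embeddings (the vectors and the `top_k` helper are illustrative assumptions):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """store is a list of (text, vector) pairs; returns the k most similar texts."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _vec in ranked[:k]]


# Toy 3-dimensional "embeddings" (made up for illustration).
store = [
    ("a cat wearing a hat", [0.9, 0.8, 0.1]),
    ("a dog wearing a hat", [0.2, 0.9, 0.1]),
    ("stock market report", [0.0, 0.1, 0.9]),
]
print(top_k([1.0, 0.7, 0.0], store, k=2))
```

This scan is linear in the number of stored vectors; production vector stores replace it with an index so that search time grows much more slowly.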

27 Load the LLM and run RAG. RAG (Retrieval Augmented Generation): how can we select the appropriate content for a given question? Source: https://opentutorials.org/module/6369
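Once the most relevant chunks have been retrieved, the remaining step is to place them in the prompt sent to the LLM. The sketch below shows one way this could look; the prompt wording, the `build_rag_prompt` name, and the character budget are all assumptions for illustration, not the prompt used in the talk.

```python
def build_rag_prompt(question, retrieved_chunks, max_context_chars=1000):
    """Concatenate retrieved chunks (most relevant first) until the budget is used up."""
    context_parts, used = [], 0
    for chunk in retrieved_chunks:
        if used + len(chunk) > max_context_chars:
            break  # stop before exceeding the context budget
        context_parts.append(chunk)
        used += len(chunk)
    context = "\n---\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


chunks = [
    "Distil-Whisper keeps the Whisper encoder and two decoder layers.",
    "Flash Attention 2 requires an Ampere or newer GPU.",
]
print(build_rag_prompt("How many decoder layers does Distil-Whisper keep?", chunks))
```

The resulting string would then be passed to the LLM; keeping the chunks ordered by relevance means the budget cut-off drops the least relevant material first.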

28 Thank you for listening ☺ Hyerim Baek
