[Figure: training-data configurations compared: Multi-speaker, Single-speaker with VC, and Single-speaker with PS&VC. Each shows neutral, happiness, and sadness data for the target speaker, with emotional data obtained via VC or w/ PS via VC.]
Experiments with less than half of the recorded speech data of the target speaker
[Figure: naturalness and emotional similarity (1 to 5 scale) for VC-TTS-PS vs. VC-TTS-PS (40%), happiness and sadness styles. Scores shown in the plot: 3.82, 3.87, 3.67, 3.63.]
Summary
combines PS-based and VC-based data augmentation.
Our method improved the performance of emotional TTS. In particular, the experiments showed that our method significantly enhanced the naturalness and emotional similarity of speech with a happiness style.
Audio samples: https://ryojerky.github.io/demo_vc-tts-ps/
⎯ How Coharis works
⎯ Introduce microservice architecture on Coharis
⎯ Benefits and difficulties after rebuilding
Controllable, High-quAlity, and expRessIve TTS
[Diagram: Coharis server. Raw text (e.g., “こんにちは”) goes in and raw speech comes out. Modules for text processing: Text segmentation / normalization, Grapheme-to-phoneme, Mix up linguistic features. Modules for TTS: Acoustic model, Vocoder.]
Input: raw text with acoustic parameters is given to the Coharis server through REST.
Text segmentation / normalization: split and normalize sentences, e.g., 10:00 am → 午前10時
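The segmentation and normalization step can be illustrated with a minimal sketch. The regular expressions and function names below are assumptions made for illustration only; they are not taken from Coharis, and a production normalizer would cover far more patterns than clock times.

```python
import re

# Hypothetical rule: rewrite "H:MM am/pm" into a Japanese time expression.
_TIME = re.compile(r"(\d{1,2}):(\d{2})\s*(am|pm)", re.IGNORECASE)


def normalize_time(text: str) -> str:
    """Replace clock times such as '10:00 am' with '午前10時'."""
    def repl(match: re.Match) -> str:
        hour = int(match.group(1))
        minute = int(match.group(2))
        prefix = "午前" if match.group(3).lower() == "am" else "午後"
        out = f"{prefix}{hour}時"
        if minute:  # only spell out minutes when they are non-zero
            out += f"{minute}分"
        return out
    return _TIME.sub(repl, text)


def split_sentences(text: str) -> list[str]:
    """Split after Japanese sentence-final punctuation (assumed rule)."""
    parts = re.split(r"(?<=[。！？])\s*", text)
    return [p for p in parts if p]


print(normalize_time("10:00 am"))
print(split_sentences("おはよう。元気？"))
```

Keeping normalization as small pure functions like these makes it easy to update rules without touching the rest of the pipeline.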
Grapheme-to-phoneme: convert raw text into the appropriate pronunciation, e.g., 午前10時 → ゴゼンジュージ
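As a rough illustration of grapheme-to-phoneme conversion, the sketch below uses a tiny hand-written lexicon with longest-match lookup. The lexicon contents and the `g2p` function are hypothetical; real Japanese G2P is context-dependent and the slides do not say how Coharis implements it.

```python
# Hypothetical three-entry lexicon, just enough for the slide's example.
LEXICON = {"午前": "ゴゼン", "10": "ジュー", "時": "ジ"}


def g2p(text: str, lexicon: dict[str, str] = LEXICON) -> str:
    """Convert text to katakana readings by greedy longest-match lookup."""
    out = []
    i = 0
    max_len = max(len(key) for key in lexicon)
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in lexicon:
                out.append(lexicon[chunk])
                i += size
                break
        else:
            # Unknown character: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(g2p("午前10時"))
```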
Predict appropriate pause positions, e.g., あのねあそこにね → あのね / あそこにね
Predict appropriate accents, e.g., 音声 → オ_H / ン_L / セ_L / ー_L
Mix up linguistic features: align and mix up all linguistic features for TTS
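One way to picture the alignment step is to zip the per-mora outputs of the earlier modules into a single sequence. The `LinguisticUnit` type, the `<pause>` token, and both functions are assumptions for illustration; the slides do not describe the actual feature format.

```python
from dataclasses import dataclass


@dataclass
class LinguisticUnit:
    mora: str          # pronunciation unit from G2P
    accent: str        # "H" or "L" from accent prediction
    pause_after: bool  # True if a pause was predicted after this mora


def mix_features(moras, accents, pause_after_index):
    """Align per-mora features into one list; lengths must match."""
    if len(moras) != len(accents):
        raise ValueError("moras and accents must have the same length")
    return [
        LinguisticUnit(m, a, i in pause_after_index)
        for i, (m, a) in enumerate(zip(moras, accents))
    ]


def to_labels(units):
    """Render units as 'mora_accent' tokens with a hypothetical pause token."""
    parts = []
    for unit in units:
        parts.append(f"{unit.mora}_{unit.accent}")
        if unit.pause_after:
            parts.append("<pause>")
    return " ".join(parts)


units = mix_features(["オ", "ン", "セ", "ー"], ["H", "L", "L", "L"], {3})
print(to_labels(units))
```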
Acoustic model: convert linguistic features into acoustic features
Vocoder: convert acoustic features into speech
specific APIs
Fast development
Required to utilize GPU resources efficiently since the number of available GPUs is limited
Let ML engineers focus only on developing ML modules (not server/infrastructure)
Less time to update models and dictionaries compared with a monolithic architecture
Scale up each module independently
[Architecture diagram: CLIENTS (CLOVA VOICE, AUDIOBOOK, APP1, APP2) → app services (CLOVA SERVICE, AUDIOBOOK SERVICE, APP SERVICE 1, APP SERVICE 2) → PROXY SERVICE → backend modules (Preprocessing, G2P, Phrase break, Accent, TTS components).]
Only the modules with high GPU usage need to be scaled up, independently of the others.
Aggregate data coming from each backend service to offer app-specific APIs.
Requests and responses differ, and so does the resource usage of each backend service.
Activated modules: generate speech from raw text
Activated modules: generate linguistic features from raw text (the diagram on this slide shows NAVER GATEWAY in place of CLOVA SERVICE)
Activated modules: generate speech from linguistic features (the diagram on this slide shows EXTERNAL SERVICE in place of AUDIOBOOK SERVICE)
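The three request patterns above each activate a different subset of backend modules. A minimal sketch of such routing follows; the request-type names, the `ROUTES` table, and the stub stages are all hypothetical, and it assumes that "TTS components" means the acoustic model plus the vocoder.

```python
from typing import Callable, Dict, List

Stage = Callable[[dict], dict]


def make_stage(name: str) -> Stage:
    """Build a stub stage that only records that it ran."""
    def stage(payload: dict) -> dict:
        return {**payload, "trace": payload.get("trace", []) + [name]}
    return stage


TEXT_STAGES = ["preprocessing", "g2p", "phrase_break", "accent"]
TTS_STAGES = ["acoustic_model", "vocoder"]  # assumed "TTS components"
STAGES: Dict[str, Stage] = {n: make_stage(n) for n in TEXT_STAGES + TTS_STAGES}

# Hypothetical request types mapped to the modules they activate.
ROUTES: Dict[str, List[str]] = {
    "text_to_speech": TEXT_STAGES + TTS_STAGES,
    "text_to_features": TEXT_STAGES,
    "features_to_speech": TTS_STAGES,
}


def handle(request_type: str, payload: dict) -> dict:
    """Run only the stages that the request type activates."""
    for name in ROUTES[request_type]:
        payload = STAGES[name](payload)
    return payload


print(handle("text_to_features", {"text": "こんにちは"})["trace"])
```

Because each stage is addressed by name, a stage can be redeployed or scaled without changing the routes that reference it.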
takes less time to update them since they are independent
gRPC protocol
Use CPU-based workers to handle sharp increases in requests
Requests arriving at short intervals are batched together in the proxy server
Each service is connected by gRPC to reduce network delay
Support CPU-based TTS workers alongside GPU-based ones
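Batching requests that arrive at short intervals could look roughly like the sketch below. The `MicroBatcher` class, its window length, and its batch-size limit are assumptions for illustration; the slides do not give the proxy's actual batching policy. Timestamps are passed in explicitly so the behavior is deterministic, whereas a real proxy would also flush on a timer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MicroBatcher:
    window_s: float = 0.010  # hypothetical batching window in seconds
    max_batch: int = 8       # hypothetical upper bound on batch size
    _pending: List[str] = field(default_factory=list, init=False)
    _opened_at: Optional[float] = field(default=None, init=False)

    def add(self, text: str, now: float) -> Optional[List[str]]:
        """Queue a request; return a batch if one became ready."""
        flushed = None
        # Close the current window if it has been open long enough.
        if self._pending and now - self._opened_at >= self.window_s:
            flushed = self.flush()
        if not self._pending:
            self._opened_at = now
        self._pending.append(text)
        # Also flush as soon as the batch is full.
        if flushed is None and len(self._pending) >= self.max_batch:
            flushed = self.flush()
        return flushed

    def flush(self) -> List[str]:
        """Return all pending requests and reset the window."""
        batch, self._pending, self._opened_at = self._pending, [], None
        return batch


batcher = MicroBatcher(window_s=0.010, max_batch=3)
batcher.add("a", now=0.000)
batcher.add("b", now=0.004)
print(batcher.add("c", now=0.020))  # the window for "a" and "b" has expired
```

Grouping requests this way lets a GPU-backed model process several inputs in one forward pass, which matters when the number of GPUs is limited.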
[Diagram: BACKEND SERVICES ([...] service, Normalization service, Segmentation service, Text formatter service, Acoustic model service, Vocoder service), Proxy service, App services, Redis, CPU-based workers, Model storage. Labels on the connections: REST, gRPC, Cache storage, Load pre-trained models, Queue TTS jobs.]
Backend services are wrapped by an ML serving tool, e.g., Triton, TorchServe, or Seldon Core.
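The diagram labels mention cache storage and queued TTS jobs handled by CPU-based workers. The sketch below shows one plausible interaction between them, using an in-memory dict and deque as stand-ins for Redis. Everything here is an assumption: the `TtsFrontDoor` class, the cache-key scheme, and the idea that synthesized results are what gets cached are not stated in the slides.

```python
import hashlib
import json
from collections import deque


def cache_key(text: str, params: dict) -> str:
    """Derive a stable key from the text and acoustic parameters."""
    blob = json.dumps({"text": text, "params": params},
                      sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class TtsFrontDoor:
    def __init__(self, synthesize):
        self.cache = {}        # stand-in for cache storage
        self.jobs = deque()    # stand-in for the TTS job queue
        self.synthesize = synthesize

    def submit(self, text: str, params: dict):
        """Return cached audio if present, otherwise queue a job."""
        key = cache_key(text, params)
        if key in self.cache:
            return self.cache[key]
        self.jobs.append((key, text, params))
        return None

    def run_worker_once(self) -> bool:
        """Process one queued job, as a CPU-based worker might."""
        if not self.jobs:
            return False
        key, text, params = self.jobs.popleft()
        self.cache[key] = self.synthesize(text, params)
        return True


door = TtsFrontDoor(lambda text, params: f"audio({text})".encode("utf-8"))
print(door.submit("こんにちは", {"speed": 1.0}))  # first call misses the cache
door.run_worker_once()
print(door.submit("こんにちは", {"speed": 1.0}))  # second call is served from cache
```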
Benefits
on backend services without affecting other services
Only backend services with high load need to be scaled up, so GPUs are used effectively
Effective development
Difficulties
Need to consider the release procedure carefully
Setting up a local environment for researchers: the previous Coharis is still used for research purposes since it is easier to set up