TTS

1. MindsLab's TTS: type in text and it generates speech

https://api.maum.ai/kr/tts/#none

Buying API access costs 99,000 won. Once purchased, the API could probably be used to generate a large amount of synthesized speech data.

2. Carpedm20's GitHub

https://github.com/sokcuri/multi-speaker-tacotron-tensorflow

3. Anchor Kim (김앵커) added

https://github.com/melonicedlatte/multi-speaker-tacotron-tensorflow

http://hellogohn.com/post_one298

4. Kyubyong Park (박규병), KSS Korean TTS dataset

https://www.kaggle.com/bryanpark/korean-single-speaker-speech-dataset

Professional female voice actor (12 hours, wav, 44,100 Hz, 12,853 clips, 3 GB)

Original audio: 44,100 Hz; Tacotron sampling rate: 24,000 Hz

Audio parameters in the Tacotron code

- Audio sampling rate

sample_rate = 24000,
- The frame shift can be specified by either hop_size (takes priority) or frame_shift_ms
hop_size = 300, # frame_shift_ms = 12.5ms
- n_fft. Usually set to 1024, but this Tacotron code uses 2048

fft_size = 2048

- Window length: 1200 samples / 24000 Hz = 50 ms

win_size = 1200

- Number of mel filterbank channels

num_mels = 80,

- Spectrogram pre-emphasis (lfilter: reduces spectrogram noise and helps model certitude levels; also allows for better Griffin-Lim (G&L) phase reconstruction)

    preemphasize = True, #whether to apply filter
    preemphasis  = 0.97,
    min_level_db = -100,
    ref_level_db = 20,
    signal_normalization = True, #Whether to normalize mel spectrograms to some predefined range (following below parameters)
    allow_clipping_in_normalization = True, #Only relevant if mel_normalization = True
    symmetric_mels = True, #Whether to scale the data to be symmetric around 0. (Also multiplies the output range by 2, faster and cleaner convergence)
    max_abs_value = 4., #max absolute value of data. If symmetric, data will be [-max, max] else [0, max] (Must not be too big to avoid gradient explosion, not too small for fast convergence)
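
The pre-emphasis and normalization settings above can be read as a small processing chain. The following is a minimal sketch of what such a chain typically looks like in Tacotron-style code, not a copy of the repository's implementation: the function names are hypothetical, the formulas are an assumption based on the parameter comments, and plain Python lists are used instead of numpy to keep it dependency-free.

```python
import math

# Values copied from the hyperparameters above.
PREEMPHASIS = 0.97
MIN_LEVEL_DB = -100.0
REF_LEVEL_DB = 20.0
MAX_ABS_VALUE = 4.0


def preemphasize(samples, coef=PREEMPHASIS):
    """First-order high-pass filter: y[n] = x[n] - coef * x[n-1]."""
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - coef * prev)
        prev = x
    return out


def amp_to_db(amplitude, floor=1e-5):
    """Convert a linear magnitude to decibels, guarding against log(0).

    The floor of 1e-5 corresponds to MIN_LEVEL_DB = -100 dB (assumption).
    """
    return 20.0 * math.log10(max(amplitude, floor))


def normalize_db(level_db):
    """Map [MIN_LEVEL_DB, 0] dB onto [-MAX_ABS_VALUE, MAX_ABS_VALUE] with clipping.

    Mirrors symmetric_mels = True and allow_clipping_in_normalization = True.
    """
    scaled = 2.0 * MAX_ABS_VALUE * (level_db - MIN_LEVEL_DB) / (-MIN_LEVEL_DB) - MAX_ABS_VALUE
    return max(-MAX_ABS_VALUE, min(MAX_ABS_VALUE, scaled))


def magnitude_to_feature(amplitude):
    """Magnitude -> dB relative to REF_LEVEL_DB -> normalized model target."""
    return normalize_db(amp_to_db(amplitude) - REF_LEVEL_DB)
```

Under these assumptions, -100 dB maps to -4, -50 dB to 0, and 0 dB to 4; anything louder is clipped to 4.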

Tacotron Model Architecture

During training

Model input : Text, Mel-spectrogram, (linear) spectrogram

Model output : Mel-spectrogram (predicted) -> (linear) spectrogram (predicted) -> Speech (Audio)

During inference

Model input : Text

Model output : Mel-spectrogram (predicted) -> (linear) spectrogram (predicted) -> Speech (Audio)

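Audio feature extraction

MFCC (Mel-frequency cepstral coefficients)

- A feature that captures timbre; it represents the harmonic (overtone) structure well

- Well suited to telling instruments and human voices apart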

- Tacotron uses log(Mel-spectrogram) rather than MFCC (the log compresses the wide dynamic range of the magnitudes)

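- Computing MFCCs discards much of the information in the sound; the Mel-spectrogram is used with reconstruction in mind

- Linear spectrograms, Mel-spectrograms, and MFCCs all lack phase information -> the phase is reconstructed with Griffin-Lim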
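
To make the information-loss point concrete: MFCCs are commonly obtained by taking a DCT of the log-mel frame and keeping only the first few coefficients. The sketch below is an illustration under assumptions, not the code used by Tacotron: the 80-value frame is synthetic, keeping 13 coefficients is a conventional but hypothetical choice, and the DCT is written out in plain Python.

```python
import math


def dct_ii(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(scale * acc)
    return out


def idct_ii(coeffs, n):
    """Inverse of dct_ii for a length-n signal.

    Coefficients beyond len(coeffs) are treated as zero, which is exactly
    what happens when an MFCC vector has been truncated.
    """
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two lists."""
    return max(abs(p - q) for p, q in zip(a, b))


NUM_MELS = 80
# Synthetic stand-in for one log-mel frame (not real speech data).
log_mel = [math.sin(0.9 * i) + 0.3 * ((i * 7) % 5) for i in range(NUM_MELS)]

all_coeffs = dct_ii(log_mel)
mfcc_13 = all_coeffs[:13]  # keep only the first 13 coefficients

exact = idct_ii(all_coeffs, NUM_MELS)   # all 80 coefficients: invertible
approx = idct_ii(mfcc_13, NUM_MELS)     # 13 coefficients: fine detail is gone

print("error with all coefficients:", max_abs_error(log_mel, exact))
print("error with 13 coefficients :", max_abs_error(log_mel, approx))
```

With every coefficient the frame comes back up to rounding error, while the truncated version only recovers a smoothed envelope, which is why a model that must regenerate audio is better served by the mel-spectrogram itself.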

STFT(Short Time Fourier Transform)

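1. Along the time axis, sampling_rate (24000) is used -> 24,000 samples per second (24,000 Hz)

Typical audio and CDs: 44,100 Hz or 22,050 Hz

Real recordings : 44,100 Hz

Synthesized audio : 24,000 Hz

Kyubyong Park's implementations

tacotron : sr=22050, n_fft = 2048

dc-tts : same

2. The 1-D waveform is split along the time axis into T frames (window size = 1200, i.e. window size / sampling_rate = 0.05 s per frame)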

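3. For a 4-second clip, T = 4 * sr / hop_size = 4 * 24000 / 300 = 320 frames. This does not contradict the 0.05 s window: consecutive windows overlap, because each window spans 1200 samples but advances by only hop_size = 300 samples.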

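4. STFT horizontal axis : T, vertical axis : fft_size/2+1 -> a (T, fft_size/2+1) array of amplitudes

- fft_size is fixed at 2048, so the vertical axis is fixed (1025 bins)

- T is determined by hop_size (=300): a larger hop gives a smaller T. Clips differ in length, so T differs from sample to sample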
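
The frame arithmetic in the steps above can be checked with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and the centered frame count (1 + n_samples // hop) follows the convention of `librosa.stft` with `center=True`, which the preprocessing code may or may not use.

```python
# Values copied from the hyperparameters in this post.
SAMPLE_RATE = 24000
HOP_SIZE = 300
WIN_SIZE = 1200
FFT_SIZE = 2048


def frame_shift_ms():
    """Time between the starts of two consecutive frames, in milliseconds."""
    return 1000.0 * HOP_SIZE / SAMPLE_RATE


def window_ms():
    """Duration covered by one analysis window, in milliseconds."""
    return 1000.0 * WIN_SIZE / SAMPLE_RATE


def overlap_ratio():
    """Fraction of each window shared with the next one."""
    return 1.0 - HOP_SIZE / WIN_SIZE


def stft_shape(n_samples, centered=True):
    """Return (T, frequency_bins) for a clip of n_samples samples.

    centered=True assumes the signal is padded so frames are centered on
    multiples of HOP_SIZE; centered=False assumes unpadded frames of
    FFT_SIZE samples.
    """
    if centered:
        n_frames = 1 + n_samples // HOP_SIZE
    else:
        if n_samples < FFT_SIZE:
            raise ValueError("clip is shorter than one frame")
        n_frames = 1 + (n_samples - FFT_SIZE) // HOP_SIZE
    return (n_frames, FFT_SIZE // 2 + 1)


four_seconds = 4 * SAMPLE_RATE
print(frame_shift_ms(), window_ms(), overlap_ratio())
print(stft_shape(four_seconds))
```

The hop alone sets T; the window length only controls how much neighbouring frames overlap (75% here), so a 4-second clip yields roughly 320 frames regardless of the 50 ms window.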

1) Run 1_preprocess

- wav -> npy: converts the segmented wav files into numpy arrays

- Creates data/son/"<audio file name>.npz"

- Creates data/son/train.txt : linear?

- Generates the ground-truth Mel-spectrograms and (linear) spectrograms

- Son Suk-hee (손석희), 19 hours of data, takes about 10 minutes

- Parameters

- sample_rate = 24000 : sampling rate
- hop_size = 300 : frame_shift_ms = 12.5ms
- fft_size = 2048 : n_fft. Usually set to 1024, but this Tacotron code uses 2048

- win_size = 1200 : 50ms

- num_mels = 80

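2) Run 2_train_tacotron2_3_multispeaker_complete

- Training

- Parameters

- self.data_paths = 'data/son,data/moon' # for multi-speaker training, join the paths with a comma and no spaces
- self.load_path = None # when starting from scratch

- self.load_path = 'logdir-tacotron2/son+moon_2019-04-05_18-10-57' # when resuming training

- Needs to run for more than 100,000 steps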

- batch = 32
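
The note about writing the paths without spaces is easiest to see with a small sketch. It assumes the training script splits `data_paths` on commas without stripping whitespace, which has not been verified against the repository; `parse_data_paths` and `examples_seen` are hypothetical helpers.

```python
def parse_data_paths(data_paths):
    """Split a comma-separated path string the naive way (assumed behaviour)."""
    return data_paths.split(",")


def examples_seen(steps, batch_size):
    """Number of training examples processed after a given number of steps."""
    return steps * batch_size


print(parse_data_paths("data/son,data/moon"))   # two clean directory names
print(parse_data_paths("data/son, data/moon"))  # second entry keeps a leading space
print(examples_seen(100000, 32))
```

Under that assumption a stray space would end up inside the second directory name, and 100,000 steps at batch = 32 correspond to 3.2 million utterances seen, counting repeats.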
