NGUYEN TIEN THANH
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------
Nguyen Tien Thanh
COMPUTER SCIENCE
VIETNAMESE SPEECH SYNTHESIS FOR
SOME ASSISTANT SERVICES ON MOBILE DEVICES
MASTER OF SCIENCE THESIS
COMPUTER SCIENCE
2014B
Hanoi 2016
Master of science thesis
2016
Nguyễn Tiến Thành - Vietnamese speech synthesis for assistant services on mobile devices
MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------------------
Nguyen Tien Thanh
VIETNAMESE SPEECH SYNTHESIS FOR
SOME ASSISTANT SERVICES ON MOBILE DEVICES
Department : International research institute MICA
MASTER THESIS OF SCIENCE
COMPUTER SCIENCE
SUPERVISOR:
Dr. Mac Dang Khoa
Hanoi 2016
COMMITMENT
I declare that I am the person responsible for conducting this study. All
referenced figures and data are cited with clear sources. The presented results
are truthful and have not been published in any other person's work.
Nguyễn Tiến Thành
ACKNOWLEDGEMENT
During my master's studies, many people gave me generous help and
inspiration.
I wish to thank all my professors and colleagues at MICA International
Research Institute, who have generously supported me. The advice and knowledge
they imparted to me are gratefully appreciated and inspired me to finish this
thesis.
Special thanks go to my supervisor, Dr. Mạc Đăng Khoa, and to my colleagues in
the Speech Communication Department, MICA Institute, for their advice and
encouragement, and especially to Assoc. Prof. Trần Đỗ Đạt for the thorough review
and invaluable suggestions.
I would like to thank Mr. Nguyễn Mạnh and Ms. Nguyễn Hằng Phương for their
guidance in recording the corpus. I would also like to thank the many MICA
members who spent much of their time on testing for my research.
I am grateful to Prof. Eric Castelli, Dr. Nguyễn Việt Sơn and MICA's
directorate for providing me with the best working conditions at MICA
International Research Institute.
Finally, I owe a great deal to my parents and my younger brother for their
encouragement and support. They have given me strength and motivation in my
work and in my life.
Nguyễn Tiến Thành
List of figures
Figure 1-1 Representation of sound (Huang et al. 2001)
Figure 1-2 A schematic diagram of the human speech production apparatus (Huang
et al. 2001)
Figure 1-3 Glottal airflow and the resulting sound pressure at the mouth (Rabiner
and Juang 1993)
Figure 1-4 Waveform plot of the beginning of the utterance “It's time” (Huang et al.
2001)
Figure 1-5 Signal of sound “my speech” and its spectrogram
Figure 1-6 Speech recognition and speech synthesis (Chandra and Akila 2012)
Figure 1-7 Schematic of text-to-speech synthesis
Figure 1-8 A schematic of the construction of an articulatory speech synthesizer and
how such a synthesizer may be considered to contain a model of information
encoding in the speech signal (Palo 2006)
Figure 1-9 Block diagram of a synthesis-by-rule system. Pitch and formants are
listed as the only parameters of the synthesizer for convenience. In practice, such
a system has about 40 parameters (Huang et al. 2001)
Figure 1-10 Core architecture of HMM-based speech synthesis system (Yoshimura
2002)
Figure 1-11 General HMM-based synthesis scheme (Zen et al. 2009)
Figure 1-12 A diagram of the Hunt and Black algorithm, showing one particular
sequence of units and how the target cost measures a distance between a unit and
the specification, and how the join cost measures a distance between the two
adjacent units (Taylor 2009)
Figure 2-1 Schematic diagram of Hanoi Vietnamese tones (Michaud 2004)
Figure 2-2 Base system of Vu Hui Quan consists of 2 parts: training part and
synthesis part (Quan and Nam 2009)
Figure 2-3 Vietnamese speech recognition system (Vu et al. 2006)
Figure 2-4 Non-uniform unit selection model (Van Do et al. 2011)
Figure 2-5 Parse tree to search (Van Do et al. 2011)
Figure 3-1 Target cost of target units and candidate units (Tran 2007)
Figure 3-2 Sentence splits into phrases and syllables
Figure 3-3 Average length of syllables in different positions (Tran 2007)
Figure 3-4 Average length of syllables (Tran 2007)
Figure 3-5 Signal of “giỏi” syllable in two different positions
Figure 3-6 Sub-cost based on the difference in position of phrase
Figure 3-7 Sub-cost based on the difference in context of preceding syllable and
following syllable
Figure 3-8 Syllable “Quanh” is composed of four phonemes
Figure 3-9 Sub-cost based on the difference in context of preceding phoneme and
following phoneme
Figure 3-10 Acoustic units network
Figure 3-11 The algorithm of separating sentence into as long as possible phrases
Figure 3-12 Finding the longest phrase in database
Figure 3-13 Search space before applying acoustic units network
Figure 3-14 Search space after applying acoustic units network
Figure 3-15 Finding candidates of word “chúng tôi”
Figure 4-1 Interface of Adobe Audition 3.0
Figure 4-2 Interface of Praat
Figure 4-3 Most test result by domain
Figure 4-4 Perception test
Figure 4-5 Result of the perception test
Figure 4-6 Speed of synthesis process of two systems
List of tables
Table 1-1 Types of using some popular units
Table 2-1 The concluded structure of Vietnamese syllables (Tran 2003)
Table 2-2 Symbol of Vietnamese tones
Table 2-3 Advantages and disadvantages between two synthesis systems of Quan
and Thao
Table 3-1 Position difference and cost value (min is better). Target unit is begin or
end of sentence
Table 3-2 Position difference and cost value (min is better). Target unit is both
begin and end or is middle of sentence
Table 3-3 Phoneme types in Vietnamese (Tran 2007)
Table 3-4 Direction and complexity of Vietnamese tones
Table 4-1 Number of sentences and distinct syllables in each domain
Table 4-2 Tags and Meaning of xml file
Contents
COMMITMENT
ACKNOWLEDGEMENT
List of figures
List of tables
Introduction
Chapter 1. Overview of speech processing and text-to-speech
1.1. Speech and speech processing
1.1.1. Sound
1.1.2. Human vocal mechanism
1.1.3. Speech representation in the time and frequency domains
1.1.4. Speech processing
1.2. Text-To-Speech
1.2.1. Introduction
1.2.2. Speech synthesis techniques
1.2.3. Articulatory synthesis
1.2.4. Formant synthesis
1.2.5. Concatenative synthesis
1.2.6. Statistical Parametric synthesis
1.3. From concatenative synthesis to unit selection synthesis
1.3.1. Extending concatenative synthesis
1.3.2. The algorithm of Hunt and Black
1.3.3. Speech synthesis based on non-uniform units selection
1.4. Conclusion
Chapter 2. Text-to-speech for Vietnamese
2.1. Overview Vietnamese language and phonology
2.1.1. Characteristics
2.1.2. Vietnamese syllable structure
2.2. Overview text-to-speech in Vietnamese
2.3. Discussion and proposal
2.4. Conclusion
Chapter 3. Improvement of Non-uniform unit selection technique for Vietnamese
Text-to-speech
3.1. Quality improvement: using target costs for unit selection
3.1.1. Target costs in Vietnamese synthesis
3.1.2. Separating sentence into phrases
3.1.3. Target cost computation
3.2. Performance improvement: using acoustic units network
3.2.1. Acoustic units network
3.2.2. Separating sentence into the longest phrases
3.2.3. Searching candidates
3.3. Conclusion
Chapter 4. Implementations and evaluation
4.1. System overview
4.2. Building database
4.2.1. Text database building
4.2.2. Speech corpus recording
4.2.3. Database processing
4.3. Evaluation
4.3.1. Quality of synthesized speech
4.3.2. Cost target improvement
4.3.3. Performance
4.4. Conclusion
Chapter 5. Conclusions and perspectives
References
Introduction
Context
Most people have heard a synthetic voice at some point in their lives. We
encounter such voices in a number of situations. For instance, some telephone
information systems give automated speech responses, and speech synthesis is often
used as an aid for people with disabilities.
Text-to-speech (TTS) systems have been integrated into many applications. One
useful application is a reading application for blind people, which can take any
text from a book and convert it into speech. An application of this kind, known as
TalkBack, has been developed by Google and integrated into Android OS. TalkBack
reads the text displayed on the screen of an Android device to help blind people
use the device easily.
The mainstream adoption of TTS was long limited by its quality. In recent
years, considerable advances in quality have made TTS systems more common.
Probably the main use of TTS today is in call-centre automation, where a user calls
to pay an electricity bill or book some travel and conducts the entire transaction
through an automatic dialogue system. Beyond this, TTS systems have been used for
reading news stories, weather reports, travel directions and a wide variety of
other applications.
In recent times, smart devices such as smartphones and tablets have become
increasingly popular and play an important role in our lives. They are used in
education, medicine, transport, communication, and so on. In Vietnam, several TTS
systems, such as vnSpeak and Viettel Speak, have been studied and developed for
mobile devices. At MICA International Research Institute, researchers have also
developed TTS systems that are integrated into a number of applications on Google
Play, such as VIVA, VIVAVU and VIQ. However, these systems still have limitations,
such as poor voice quality and long response time.
Our goal is to build a system that can speak from text, can be applied to smart
devices, and overcomes the weaknesses mentioned above. We hope that this system
will bring practical benefits to everyday life.
Objective of this thesis
This thesis was carried out at the Speech Communication Department of MICA
Institute. Its main goal is to build a high-quality Vietnamese speech synthesis
system that can be integrated into electronic devices running Android OS.
The basic theory of speech synthesis is studied first. Then, new methods are
proposed to improve the quality of the existing Vietnamese synthesis system, which
is designed to run on smartphones and other smart devices.
The first task is to build a speech corpus for synthesizing Vietnamese
utterances. With this corpus, we can synthesize almost all Vietnamese syllables
and can apply the text-to-speech system to any Vietnamese document.
After that, based on research on Vietnamese phonetics and Vietnamese
synthesis, some new costs are proposed for computing the optimal path in speech
synthesis using the unit selection technique. These costs are expected to help us
choose more suitable units for synthesizing an utterance.
Moreover, we also suggest using a phonetic units network to reduce the time
spent searching for and selecting candidate units.
Finally, all of these studies and suggestions are applied to a speech
synthesis system that can be embedded in assistant applications on smartphones.
Thesis structure
Chapter 1 presents the basic theory of speech, giving background on the speech
signal, speech signal processing and speech synthesis.
Chapter 2 focuses on the theory of speech synthesis using the unit selection
technique. It also introduces current Vietnamese speech synthesis systems and
gives suggestions.
Chapter 3 presents our research on the target cost used for selecting units in
speech synthesis. We also describe an acoustic unit network, which is used to
improve the performance of the TTS system.
Chapter 4 presents our work on building the Vietnamese speech corpus, together
with the experiments that evaluate the quality of the new TTS system.
The final part concludes the thesis and gives suggestions for further work.
Chapter 1. Overview of speech processing and text-to-speech
1.1. Speech and speech processing
In this section, we briefly review speech sound and the human speech production
system. We also show how the speech signal can be represented.
1.1.1. Sound
Sound is a longitudinal pressure wave formed of compressions and rarefactions
of air molecules, in a direction parallel to that of the application of energy.
Compressions are zones where air molecules have been forced by the application of
energy into a tighter-than-usual configuration, and rarefactions are zones where air
molecules are less tightly packed.
The alternating configurations of compression and rarefaction of air molecules
along the path of an energy source are sometimes described by the graph of a sine
wave, as shown in Figure 1-1.
Figure 1-1 Representation of sound (Huang et al. 2001)
In this representation, crests of the sine curve correspond to moments of
maximal compression and troughs to moments of maximal rarefaction. Two important
parameters, amplitude and wavelength, describe a sine wave. Frequency, the number
of cycles per second measured in Hertz (Hz), is also used to characterize the
waveform.
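As a minimal sketch of how these parameters relate, the Python fragment below samples a sine wave from its amplitude and frequency and derives the wavelength from the frequency. The function names, the 16 kHz sampling rate and the speed of sound of 343 m/s (dry air at about 20 °C) are assumptions made for this illustration only; they are not taken from the thesis.

```python
import math

# Assumed propagation speed of sound in dry air at about 20 °C (illustrative).
SPEED_OF_SOUND_M_S = 343.0


def wavelength_m(frequency_hz, speed_m_s=SPEED_OF_SOUND_M_S):
    """Wavelength is the propagation speed divided by the frequency."""
    return speed_m_s / frequency_hz


def sine_pressure(amplitude, frequency_hz, duration_s, sample_rate_hz=16000):
    """Sample A * sin(2*pi*f*t) at the discrete times t = n / sample_rate."""
    n_samples = int(round(duration_s * sample_rate_hz))
    return [
        amplitude * math.sin(2.0 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


# Two cycles of a 100 Hz tone: one cycle lasts 10 ms, i.e. 160 samples.
samples = sine_pressure(amplitude=1.0, frequency_hz=100.0, duration_s=0.02)
crest = samples[40]    # a quarter cycle in: maximal compression
trough = samples[120]  # three quarters of a cycle in: maximal rarefaction
```

Under these assumptions, a higher frequency gives a proportionally shorter wavelength, while the amplitude only scales the height of the crests and the depth of the troughs.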
1.1.2. Human vocal mechanism
A schematic diagram of the human vocal mechanism is shown in Figure 1-2.
The gross components of the speech production apparatus are the lungs, trachea,
larynx (organ of voice production), pharyngeal cavity (throat), oral and nasal cavity.
The pharyngeal and oral cavities are typically referred to as the vocal tract, and
the nasal cavity as the nasal tract. As illustrated in Figure 1-2, the human speech
production apparatus consists of:
- Lungs: source of air during speech.
- Vocal cords (larynx): when the vocal folds are held close together and
oscillate against one another during a speech sound, the sound is said to be voiced.
When the folds are too slack or tense to vibrate periodically, the sound is said to be
unvoiced. The place where the vocal folds come together is called the glottis.
- Velum (Soft Palate): operates as a valve, opening to allow passage of air (and
thus resonance) through the nasal cavity. Sounds produced with the flap open
include m and n.
- Hard palate: a long relatively hard surface at the roof inside the mouth, which,
when the tongue is placed against it, enables consonant articulation.
- Tongue: flexible articulator, shaped away from the palate for vowels, placed
close to or on the palate or other hard surfaces for consonant articulation.
- Teeth: another place of articulation used to brace the tongue for certain
consonants.
- Lips: can be rounded or spread to affect vowel quality, and closed completely
to stop the oral air flow in certain consonants (p, b, m).
Figure 1-2 A schematic diagram of the human speech production apparatus (Huang et al.
2001)
Air enters the lungs via the normal breathing mechanism. As air is expelled
from the lungs through the trachea (or windpipe), the tensed vocal cords within the
larynx are caused to vibrate (in the mode of a relaxation oscillator) by the air
flow. The air flow is chopped into quasi-periodic pulses, which are then modulated
in frequency in passing through the pharynx (the throat cavity), the mouth cavity,
and possibly the nasal cavity. Depending on the positions of the various
articulators (i.e. jaw, tongue, velum, lips, mouth), different sounds are produced.
The glottal air flow (volume velocity waveform) and the resulting sound
pressure at the mouth for a typical vowel sound are shown in Figure 1-3. The
glottal waveform shows a gradual build-up to a quasi-periodic pulse train of air,
taking about 15 ms to reach steady state. This build-up is also reflected in the
acoustic waveform shown at the bottom of the figure.
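The build-up described above can be imitated with a toy signal. The sketch below is only an illustration under stated assumptions, not a glottal model taken from the cited sources: each pulse is a half-sine (a hypothetical shape), the train is strictly periodic instead of quasi-periodic, the fundamental frequency is assumed to be 125 Hz, and the amplitude ramps up linearly over the first 15 ms. The function and parameter names are chosen for this example.

```python
import math


def glottal_pulse_train(f0_hz=125.0, duration_s=0.05, sample_rate_hz=16000,
                        buildup_s=0.015, open_fraction=0.6):
    """Toy pulse train: half-sine pulses scaled by a linear build-up envelope."""
    period = sample_rate_hz / f0_hz  # samples per glottal cycle
    n_samples = int(round(duration_s * sample_rate_hz))
    flow = []
    for n in range(n_samples):
        t = n / sample_rate_hz
        phase = (n % period) / period  # position inside the cycle, 0 <= phase < 1
        if phase < open_fraction:
            # Open phase: air flows, modelled here as one half-sine pulse.
            pulse = math.sin(math.pi * phase / open_fraction)
        else:
            # Closed phase: the folds are together and no air flows.
            pulse = 0.0
        # The envelope rises linearly and reaches steady state after buildup_s.
        envelope = min(1.0, t / buildup_s)
        flow.append(envelope * pulse)
    return flow


flow = glottal_pulse_train()
first_cycle_peak = max(flow[:128])     # first 8 ms cycle, still building up
steady_cycle_peak = max(flow[256:384])  # a cycle after the 15 ms build-up
```

With these assumed values, the pulses in the first cycles are visibly smaller than those after 15 ms, which mirrors the gradual build-up to a steady pulse train seen in Figure 1-3.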