\begin{enumerate}
% Question 3a
\item In an \emph{ASR} system, disfluencies can cause problems in several phases.

The first phase of an \emph{ASR} system is feature extraction. We do
not expect problems there: a disfluency merely produces slightly
different features for some part of the signal, which is nothing the
feature extraction cares about. It simply extracts features
objectively, and since disfluent speech is still human speech,
disfluencies pose no problem at this stage.

Several components are involved in transforming the cepstral features
into a sentence. First, phone likelihoods are computed. We may expect
slight problems here, since even the phones might be reduced, and the
\emph{editing phase} words could be rudimentary sounds rather than
proper phones, so the system might assign suboptimal likelihoods.
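
To sketch where these likelihoods come from: in a classical
\emph{HMM}-based recogniser with Gaussian mixture acoustic models, the
likelihood of the observation vector $o_t$ under phone state $j$ is
\[
b_j(o_t) = \sum_{m=1}^{M} c_{jm}\,\mathcal{N}(o_t;\,\mu_{jm},\Sigma_{jm}),
\]
and a reduced or rudimentary sound tends to fall into a low-density
region of every state's mixture, so no phone scores it well.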

When decoding the phone likelihoods into words, a lexicon is used.
This lexicon might contain neither the edit words nor the reduced
pronunciations of the \emph{reparanda}.

Finally, during \emph{Viterbi} decoding, \emph{N-gram} models come into
play. If they were not estimated from a dataset that also includes
disfluencies, the probabilities of disfluencies appearing may be so low
that the decoder fits similarly sounding real words instead of the
disfluency.
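
To make the language model's role explicit: the decoder searches for
\[
\hat{W} = \arg\max_{W} P(O \mid W)\,P(W),
\qquad
P(W) \approx \prod_{i} P(w_i \mid w_{i-N+1}^{i-1}),
\]
so if the \emph{N-gram} model assigns $P(\textit{uh} \mid w_{i-1})
\approx 0$, hardly any acoustic evidence can make the disfluency win.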

% Question 3b
\item The problem in the phone likelihood computation can be addressed
by shrinking the analysis window of the feature extraction so that
strongly reduced phones are also recognized correctly.
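
As a rough illustration, with an assumed standard front end of a 25~ms
analysis window and a 10~ms shift, a strongly reduced phone of only
30~ms is fully contained in about
\[
\left\lfloor \frac{30 - 25}{10} \right\rfloor + 1 = 1
\]
frame, whereas shrinking the window to 15~ms already yields
$\lfloor (30-15)/10 \rfloor + 1 = 2$ clean frames for that phone.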

The second problem can be solved by adding the disfluency words to the
lexicon, together with more strongly reduced pronunciation variants of
ordinary words.
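
Purely as an illustration, such additional lexicon entries (in
ARPAbet-style notation) could look like
\begin{center}
\begin{tabular}{ll}
uh & AH \\
um & AH M \\
because (reduced) & K AH Z \\
\end{tabular}
\end{center}
where the last line is a hypothetical reduced variant.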

Lastly, we can improve the decoding performance by estimating the
\emph{N-gram} probabilities specifically from data that also contains
disfluencies.
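
With such data, the maximum-likelihood bigram estimate
\[
P(\textit{uh} \mid w_{i-1}) = \frac{C(w_{i-1}\,\textit{uh})}{C(w_{i-1})}
\]
becomes nonzero wherever the filled pause actually occurs after
$w_{i-1}$ in the training corpus, instead of being (near) zero.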

% Question 3c
\item To add disfluencies to speech synthesis, one must know how they
arise. Some word categories attract more disfluencies than others, and
disfluencies may also be produced to give the speaker more time to
think about the rest of the sentence. Knowing such properties of
disfluencies, one can model them in the normalization phase of the
speech synthesis.

In the normalization phase, the system can add disfluencies at the
positions that often produce them. This is probably most effective
during tokenisation, where specific tokens can be selected to be
expanded into a disfluency, as in the example below.
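
As a hypothetical example, the normalizer could mark a position before
a content word and expand the marker into a filled pause:
\begin{center}
\texttt{I want to <FP> fly to Boston}
$\;\longrightarrow\;$
\texttt{I want to uh fly to Boston}
\end{center}
where \texttt{<FP>} is an assumed filled-pause placeholder token.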

Later in the pipeline, the system must also be adapted, namely in the
waveform synthesis. Depending on the technique applied, several
improvements are possible. When the synthesis technique is unit
selection, it might be helpful to have units for common disfluencies,
and at least units for the \emph{editing phase} words. It might also be
helpful to add units that represent a reduced pronunciation, to be used
in the \emph{reparandum}. When \emph{diphone synthesis} is used, no big
changes have to be applied, since the required diphone combinations
most likely already exist in the database.
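
For unit selection, the standard search minimises a sum of target and
join costs,
\[
\hat{u}_1^n = \arg\min_{u_1^n}
\sum_{i=1}^{n} C^{(t)}(u_i, s_i) + \sum_{i=2}^{n} C^{(j)}(u_{i-1}, u_i),
\]
so dedicated disfluency units are only ever selected if the target cost
$C^{(t)}$ can match them to disfluency specifications $s_i$, which is
exactly what the added units provide.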
\end{enumerate}