\begin{enumerate}
% Question 3a
\item In an \emph{ASR} system, disfluencies can cause problems in several phases.

The first phase of an \emph{ASR} system is feature extraction. We do
not expect problems there: a disfluency merely produces slightly
different features for some part of the signal, which is nothing the
feature extraction cares about. It simply extracts features
objectively, and since disfluent speech is still human speech,
disfluencies pose no problem at this stage.

Several components are involved in transforming the cepstral features
into a sentence. First, phone likelihoods are computed. We may expect
slight problems here, since even the phones might be reduced, and the
\emph{editing phase} words could be rudimentary sounds rather than
proper phones, so the system might assign suboptimal likelihoods.
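
To sketch where these likelihoods come from: in a classical
\emph{HMM}-based recogniser with Gaussian mixture acoustic models, the
likelihood of the observation vector $o_t$ under phone state $j$ is
\[
b_j(o_t) = \sum_{m=1}^{M} c_{jm}\,\mathcal{N}(o_t;\,\mu_{jm},\Sigma_{jm}),
\]
and a reduced or rudimentary sound tends to fall into a low-density
region of every state's mixture, so no phone scores it well.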

When decoding the phone likelihoods into words, a lexicon is used.
This lexicon might contain neither the edit words nor the reduced
pronunciations of the \emph{reparanda}.

Finally, during \emph{Viterbi} decoding, \emph{N-gram} models come into
play. If they were not estimated from a dataset that also includes
disfluencies, the probabilities of disfluencies appearing may be so low
that the decoder fits similarly sounding real words instead of the
disfluency.
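
To make the language model's role explicit: the decoder searches for
\[
\hat{W} = \arg\max_{W} P(O \mid W)\,P(W),
\qquad
P(W) \approx \prod_{i} P(w_i \mid w_{i-N+1}^{i-1}),
\]
so if the \emph{N-gram} model assigns $P(\textit{uh} \mid w_{i-1})
\approx 0$, hardly any acoustic evidence can make the disfluency win.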

% Question 3b
\item The problem in the phone likelihood computation can be addressed
by shrinking the analysis window of the feature extraction so that
strongly reduced phones are also recognized correctly.
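
As a rough illustration, with an assumed standard front end of a 25~ms
analysis window and a 10~ms shift, a strongly reduced phone of only
30~ms is fully contained in about
\[
\left\lfloor \frac{30 - 25}{10} \right\rfloor + 1 = 1
\]
frame, whereas shrinking the window to 15~ms already yields
$\lfloor (30-15)/10 \rfloor + 1 = 2$ clean frames for that phone.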

The second problem can be solved by adding the disfluency words to the
lexicon, together with more strongly reduced pronunciation variants of
ordinary words.
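
Purely as an illustration, such additional lexicon entries (in
ARPAbet-style notation) could look like
\begin{center}
\begin{tabular}{ll}
uh & AH \\
um & AH M \\
because (reduced) & K AH Z \\
\end{tabular}
\end{center}
where the last line is a hypothetical reduced variant.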

Lastly, we can improve the decoding performance by estimating the
\emph{N-gram} probabilities specifically from data that also contains
disfluencies.
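
With such data, the maximum-likelihood bigram estimate
\[
P(\textit{uh} \mid w_{i-1}) = \frac{C(w_{i-1}\,\textit{uh})}{C(w_{i-1})}
\]
becomes nonzero wherever the filled pause actually occurs after
$w_{i-1}$ in the training corpus, instead of being (near) zero.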

% Question 3c
\item To add disfluencies to speech synthesis, one must know how they
arise. Some word categories attract more disfluencies than others, and
disfluencies may also be produced to give the speaker more time to
think about the rest of the sentence. Knowing such properties of
disfluencies, one can model them in the normalization phase of the
speech synthesis.

In the normalization phase, the system can add disfluencies at the
positions that often produce them. This is probably most effective
during tokenisation, where specific tokens can be selected to be
expanded into a disfluency, as in the example below.
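
As a hypothetical example, the normalizer could mark a position before
a content word and expand the marker into a filled pause:
\begin{center}
\texttt{I want to <FP> fly to Boston}
$\;\longrightarrow\;$
\texttt{I want to uh fly to Boston}
\end{center}
where \texttt{<FP>} is an assumed filled-pause placeholder token.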

Later in the pipeline, the system must also be adapted, namely in the
waveform synthesis. Depending on the technique applied, several
improvements are possible. When the synthesis technique is unit
selection, it might be helpful to have units for common disfluencies,
and at least units for the \emph{editing phase} words. It might also be
helpful to add units that represent a reduced pronunciation, to be used
in the \emph{reparandum}. When \emph{diphone synthesis} is used, no big
changes have to be applied, since the required diphone combinations
most likely already exist in the database.
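
For unit selection, the standard search minimises a sum of target and
join costs,
\[
\hat{u}_1^n = \arg\min_{u_1^n}
\sum_{i=1}^{n} C^{(t)}(u_i, s_i) + \sum_{i=2}^{n} C^{(j)}(u_{i-1}, u_i),
\]
so dedicated disfluency units are only ever selected if the target cost
$C^{(t)}$ can match them to disfluency specifications $s_i$, which is
exactly what the added units provide.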
\end{enumerate}