   1 %&asr
   2 \usepackage[nonumberlist,acronyms]{glossaries}
   3 %\makeglossaries%
   4 \newacronym{ANN}{ANN}{Artificial Neural Network}
   5 \newacronym{HMM}{HMM}{Hidden Markov Model}
   6 \newacronym{GMM}{GMM}{Gaussian Mixture Model}
   7 \newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
   8 \newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
   9 \newacronym{FA}{FA}{Forced alignment}
  10 \newacronym{MFC}{MFC}{Mel-frequency cepstrum}
  11 \newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
  12 \newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
  13 \newglossaryentry{dm}{name={Death Metal},
  14         description={is an extreme heavy metal music style with growling vocals and
  15         pounding drums}}
  16 \newglossaryentry{dom}{name={Doom Metal},
  17         description={is an extreme heavy metal music style with growling vocals and
  18         pounding drums played very slowly}}
  19 \newglossaryentry{FT}{name={Fourier Transform},
  20         description={is a technique of converting a time representation signal to a
  21         frequency representation}}
  22 \newglossaryentry{MS}{name={Mel-Scale},
  23         description={is a scale for spectral signals inspired by the human ear}}
  24
  25 \begin{document}
  26 \frontmatter{}
  27
  28 \maketitleru[
  29         course={(Automatic) Speech Recognition},
  30         institute={Radboud University Nijmegen},
  31         authorstext={Author:},
  32         pagenr=1]
  33 \listoftodos[Todo]
  34
  35 \tableofcontents
  36
  37 %Glossaries
  38 %\glsaddall{}
  39 %\printglossaries
  40
  41 \mainmatter{}
  42 %Berenzweig and Ellis use acoustic classifiers from speech recognition as a
  43 %detector for singing lines.  They achive 80\% accuracy for forty 15 second
  44 %exerpts. They mention people that wrote signal features that discriminate
  45 %between speech and music. Neural net
  46 %\glspl{HMM}~\cite{berenzweig_locating_2001}.
  47 %
  48 %In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
  49 %polyphonic turkish music, this might be interesting to use for heavy metal.
  50 %They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
  51 %phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
  52 %detection, then melody extraction, then alignment. They compare results with
  53 %Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
  54 %specialize in long syllables in a capella. They use \glspl{DHMM} with
  55 %\glspl{GMM} and show that adding knowledge increases alignment (bejing opera
  56 %has long syllables)~\cite{dzhambazov_automatic_2016}.
  57 %
  58
  59
  60 %Introduction, leading to a clearly defined research question
  61 \chapter{Introduction}
  62 \section{Introduction}
  63 The primary medium for music distribution is rapidly changing from physical
  64 media to digital media. The \gls{IFPI} stated that about $43\%$ of music
  65 revenue comes from digital distribution. Another $39\%$ comes from
  66 physical sales and a further $16\%$ is made through performance and
  67 synchronisation revenues. Digital formats overtook physical formats
  68 somewhere in 2015. Moreover, for the first time in twenty years the music
  69 industry has seen significant growth
  70 again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
  71
  72 There has always been an interest in lyrics-to-music alignment, for example
  73 for use in karaoke. As early as the late 1980s karaoke machines were
  74 available to consumers. While the lyrics of a track are almost always
  75 available, an alignment is not, and creating one involves manual
  76 labour.
  77
  78 A lot of this music distribution goes via non-official channels such as
  79 YouTube\footnote{\url{https://youtube.com}}, on which fans of the performers
  80 often accompany the music with synchronised lyrics. This means that there is an
  81 enormous treasure of lyrics-annotated music available, but not within our reach,
  82 since the subtitles are almost always hardcoded into the video stream and thus
  83 not directly usable as data. Because of this interest it is very useful to
  84 devise automatic techniques for segmenting instrumental and vocal parts of a
  85 song and for applying forced alignment or even lyrics recognition to the audio file.
  86
  87 Such techniques are heavily researched and working systems have been created.
  88 However, these techniques are designed to detect a clean singing voice and have
  89 not been tested on so-called \emph{extended vocal techniques} such as grunting
  90 or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
  91 but it must be noted that grunting is not a technique used only in extreme
  92 metal styles. Similar or equal techniques have been used in \emph{Beijing
  93 opera}, Japanese \emph{Noh} and also in more Western styles like the jazz singing
  94 of Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back
  95 to Viking times. For example, an Arab merchant visiting a village in Denmark
  96 wrote in the tenth century~\cite{friis_vikings_2004}:
  97
  98 \begin{displayquote}
  99         Never before I have heard uglier songs than those of the Vikings in
 100         Slesvig. The growling sound coming from their throats reminds me of dogs
 101         howling, only more untamed.
 102 \end{displayquote}
 103
 104 \section{\gls{dm}}
 105
 106 %Literature overview / related work
 107 \section{Related work}
 108 The field of applying standard speech processing techniques to music started in
 109 the late 90s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
 110 was found that music has different discriminating features compared to normal
 111 speech.
 112
 113 Berenzweig and Ellis expanded on the aforementioned research by trying to
 114 separate singing from instrumental music~\cite{berenzweig_locating_2001}.
 115
 116 \todo{Incorporate this in literary framing}%
 117 ~\cite{fujihara_automatic_2006}%
 118 ~\cite{fujihara_lyricsynchronizer:_2011}%
 119 ~\cite{fujihara_three_2008}%
 120 ~\cite{mauch_integrating_2012}%
 121 ~\cite{mesaros_adaptation_2009}%
 122 ~\cite{mesaros_automatic_2008}%
 123 ~\cite{mesaros_automatic_2010}%
 124 ~%\cite{muller_multimodal_2012}%
 125 ~\cite{pedone_phoneme-level_2011}%
 126 ~\cite{yang_machine_2012}%
 127
 128
 129
 130 \section{Research question}
 131 It is debatable whether the aforementioned techniques work on growling vocals,
 132 because the spectral properties of a growling voice are different from the spectral
 133 properties of a clean singing voice. It has been found that growling voices
 134 have less prominent peaks in the frequency representation and are closer to
 135 noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
 136 research question:
 137
 138 \begin{center}\em%
 139         Are standard \gls{ANN}-based techniques for singing voice detection
 140         suitable for non-standard musical genres like \gls{dm}?
 141 \end{center}
 142
 143 \chapter{Methods}
 144 %Methodology
 145
 146 %Experiment(s) (set-up, data, results, discussion)
 147 \section{Data \& Preprocessing}
 148 To run the experiments, data has been collected from several \gls{dm} albums.
 149 The exact data used is listed in Appendix~\ref{app:data}. The albums are
 150 extracted from the audio CD and converted to a mono channel waveform with the
 151 correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
 152 Every file is annotated using
 153 Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
 154 the audio. Examples of utterances are shown in
 155 Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, where the
 156 waveform, the $1$--$8000$\,Hz spectrogram and the annotations are shown. It is clearly
 157 visible that different spectral patterns occur within the genre of death
 158 metal.
 159
 160 \begin{figure}[ht]
 161         \centering
 162         \includegraphics[width=.7\linewidth]{cement}
 163         \caption{A vocal segment of the \emph{Cannibal Corpse} song
 164                 \emph{Bloodstained Cement}}\label{fig:bloodstained}
 165 \end{figure}
 166
 167 \begin{figure}[ht]
 168         \centering
 169         \includegraphics[width=.7\linewidth]{abominations}
 170         \caption{A vocal segment of the \emph{Disgorge} song
 171                 \emph{Enthroned Abominations}}\label{fig:abominations}
 172 \end{figure}
 173
 174 The data is collected from three studio albums. The
 175 first band is called \emph{Cannibal Corpse}; it has been producing \gls{dm} for
 176 almost 25 years and has kept the same style on every album. The singer of
 177 \emph{Cannibal Corpse} has a very raspy growl and the lyrics are quite
 178 comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
 179 regular shouting.
 180
 181 The second band is called \emph{Disgorge} and makes even more violent-sounding
 182 music. The growls of the lead singer sound like a coffee grinder and are more
 183 shallow. In the spectrogram it is clearly visible that overtones are
 184 produced during some parts of the growling. The lyrics are completely
 185 incomprehensible and therefore some parts were not annotated with the actual
 186 lyrics because it was not possible to make out what was being sung.
 187
 188 Lastly, a band from Moscow is chosen, bearing the name \emph{Who Dies in
 189 Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
 190 bands because they create \gls{dom}. \gls{dom} is characterized by a very
 191 slow tempo and low-tuned guitars. The vocalist has a very characteristic growl
 192 and performs in several Moscow-based bands. This band also stands out because it
 193 uses pianos and synthesizers. The droning synthesizers often operate in the
 194 same frequency range as the vocals.
 195
 196 \section{\gls{MFCC} Features}
 197 The waveforms themselves are not very suitable as features due to their
 198 high dimensionality and correlation. Therefore we use the commonly used
 199 \gls{MFCC} feature vectors.\todo{cite which papers use this} The actual
 200 conversion is done using the \emph{python\_speech\_features}%
 201 \footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
 202
 203 \gls{MFCC} features are nature-inspired and built up incrementally in several
 204 steps.
 205 \begin{enumerate}
 206         \item The first step in the process is converting the time representation
 207                 of the signal to a spectral representation using a sliding window with
 208                 overlap. The width of the window and the step size are two important
 209                 parameters in the system. In classical phonetic analysis window sizes
 210                 of $25$\,ms with a step of $10$\,ms are often chosen because they are small
 211                 enough to only contain subphone entities. Sung sounds last much longer than
 212                 $25$\,ms, so it is arguable that this window size is very small for singing.
 213         \item The standard \gls{FT} gives a spectral representation that has
 214                 linearly scaled frequencies. This scale is converted to the \gls{MS}
 215                 using triangular overlapping windows.
 216         \item
 217 \end{enumerate}
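
     The first two steps can be made concrete with two standard formulas. As a
     sketch under assumptions not stated above: let $T$ be the duration of the
     signal, $w$ the window width and $s$ the step size, all in seconds, and
     count only windows that fit completely inside the signal. The number of
     frames is then
     \[
             N_{\mathrm{frames}} = \left\lfloor \frac{T - w}{s} \right\rfloor + 1,
             \qquad T \geq w,
     \]
     so one second of audio with $w = 0.025$ and $s = 0.010$ gives
     $\lfloor 97.5 \rfloor + 1 = 98$ frames. Implementations that pad the final
     window may produce a slightly different count. For the second step, one
     commonly used definition of the \gls{MS}, which is assumed here and is not
     necessarily the variant used in the experiments, maps a frequency $f$ in Hz to
     \[
             m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),
     \]
     which places $1000$\,Hz at approximately $1000$ mel.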
 218
 219
 220 \todo{Explain why MFCC and which parameters}
 221
 222 \section{\gls{ANN} Classifier}
 223 \todo{Spectrals might be enough, no decorrelation}
 224
 225 \section{Model training}
 226
 227 \section{Experiments}
 228
 229 \section{Results}
 230
 231
 232 \chapter{Conclusion \& Discussion}
 233 \section{Conclusion}
 234 %Discussion section
 235
 236 \section{Discussion}
 237
 238 \todo{Novelty}
 239 \todo{Weaknesses}
 240 \todo{Dataset is not very varied but\ldots}
 241
 242 \todo{Doom metal}
 243 %Conclusion section
 244 %Acknowledgements
 245 %Statement on authors' contributions
 246 %(Appendices)
 247 \appendix
 248 \chapter{Experimental data}\label{app:data}
 249 \begin{table}[h]
 250         \centering
 251         \begin{tabular}{cllll}
 252                 \toprule
 253                 Num. & Artist & Album & Song & Duration\\
 254                 \midrule
 255                 00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
 256                 01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
 257                 02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
 258                 03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
 259                 04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
 260                 05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
 261                 06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
 262                 07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
 263                 08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
 264                 09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
 265                 10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
 266                 11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
 267                 12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
 268                 13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
 269                 14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
 270                 15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
 271                 16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
 272                 17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
 273                 18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
 274                 19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
 275                 20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
 276                 21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
 277                 22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
 278                 23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
 279                 24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
 280                 25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
 281                 26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
 282                 27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
 283                 28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
 284                 \midrule
 285                 & & & Total: & 02:13:40\\
 286                 \bottomrule
 287         \end{tabular}
 288         \caption{Songs used in the experiments}
 289 \end{table}
 290
 291 \bibliographystyle{ieeetr}
 292 \bibliography{asr}
 293 \end{document}