%&asr
\usepackage[nonumberlist,acronyms]{glossaries}
%\makeglossaries%
\newacronym{ANN}{ANN}{Artificial Neural Network}
\newacronym{HMM}{HMM}{Hidden Markov Model}
\newacronym{GMM}{GMM}{Gaussian Mixture Model}
\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
\newacronym{FA}{FA}{Forced alignment}
\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
\newglossaryentry{dm}{name={Death Metal},
        description={is an extreme heavy metal music style with growling vocals and
        pounding drums}}
\newglossaryentry{dom}{name={Doom Metal},
        description={is an extreme heavy metal music style with growling vocals and
        pounding drums played very slowly}}

\begin{document}
\frontmatter{}

\maketitleru[
        course={(Automatic) Speech Recognition},
        institute={Radboud University Nijmegen},
        authorstext={Author:},
        pagenr=1]
\listoftodos[Todo]

\tableofcontents

%Glossaries
%\glsaddall{}
%\printglossaries

\mainmatter{}
%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
%detector for singing lines.  They achieve 80\% accuracy for forty 15 second
%excerpts. They mention people that wrote signal features that discriminate
%between speech and music. Neural net
%\glspl{HMM}~\cite{berenzweig_locating_2001}.
%
%In 2014 Dzhambazov et al.\ applied state-of-the-art segmentation methods to
%polyphonic Turkish music, this might be interesting to use for heavy metal.
%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
%detection, then melody extraction, then alignment. They compare results with
%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
%specialize in long syllables in a cappella. They use \glspl{DHMM} with
%\glspl{GMM} and show that adding knowledge increases alignment (Beijing opera
%has long syllables)~\cite{dzhambazov_automatic_2016}.
%


%Introduction, leading to a clearly defined research question
\chapter{Introduction}
\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
to digital media. The \gls{IFPI} reports that about $43\%$ of music revenue
comes from digital distribution, another $39\%$ from physical sales, and a
further $16\%$ from performance and synchronisation revenues. Digital formats
overtook physical formats in the course of 2015. Moreover, for the first time
in twenty years the music industry has seen significant growth
again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

There has long been an interest in lyrics-to-music alignment, for example for
use in karaoke. Karaoke machines have been available to consumers since the
late 1980s. While the lyrics of a track are almost always available, an
alignment is not, and creating one involves manual labour.

Much of this distribution goes through unofficial channels such as
YouTube\footnote{\url{https://youtube.com}}, where fans of the performers
often accompany the music with synchronized lyrics. This means that an
enormous treasure of lyrics-annotated music exists, but it is not within our
reach because the subtitles are almost always hardcoded into the video stream
and thus not directly usable as data. It is therefore very useful to devise
automatic techniques for segmenting a song into instrumental and vocal parts,
and for applying forced alignment or even lyrics recognition to the audio file.

Such techniques are heavily researched and working systems have been created.
However, these techniques are designed to detect a clean singing voice and have
not been tested on so-called \emph{extended vocal techniques} such as grunting
or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
but it must be noted that it is not unique to extreme metal styles. Similar or
identical techniques have been used in \emph{Beijing opera}, Japanese
\emph{Noh} and also in more Western styles such as the jazz singing of Louis
Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back to Viking
times. For example, an Arab merchant visiting a village in Denmark wrote in
the tenth century~\cite{friis_vikings_2004}:

\begin{displayquote}
        Never before I have heard uglier songs than those of the Vikings in
        Slesvig. The growling sound coming from their throats reminds me of dogs
        howling, only more untamed.
\end{displayquote}

\section{\gls{dm}}

%Literature overview / related work
\section{Related work}
Applying standard speech processing techniques to music started in the late
1990s~\cite{saunders_real-time_1996,scheirer_construction_1997}, and it was
found that music has different discriminating features than normal speech.

Berenzweig and Ellis expanded on this research by trying to separate singing
from instrumental music~\cite{berenzweig_locating_2001}.

\todo{Incorporate this in literary framing}%
~\cite{fujihara_automatic_2006}%
~\cite{fujihara_lyricsynchronizer:_2011}%
~\cite{fujihara_three_2008}%
~\cite{mauch_integrating_2012}%
~\cite{mesaros_adaptation_2009}%
~\cite{mesaros_automatic_2008}%
~\cite{mesaros_automatic_2010}%
~%\cite{muller_multimodal_2012}%
~\cite{pedone_phoneme-level_2011}%
~\cite{yang_machine_2012}%



\section{Research question}
It is debatable whether the aforementioned techniques work on growled vocals,
because the spectral properties of a growling voice differ from those of a
clean singing voice. It has been found that growling voices have less
prominent peaks in the frequency representation and are closer to noise than
clean singing~\cite{kato_acoustic_2013}. This leads us to the research
question:

\begin{center}\em%
        Are standard \gls{ANN}-based techniques for singing voice detection
        suitable for non-standard musical genres like \gls{dm}?
\end{center}

\chapter{Methods}
%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono waveform with the correct
sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
Every file is annotated using Praat~\cite{boersma_praat_2002}, where the
utterances are manually aligned to the audio. Examples of utterances are shown
in Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which
display the waveform, the $1$--$8000$\,Hz spectrogram and the annotations. It
is clearly visible that different spectral patterns occur within the genre of
death metal.
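
The conversion step can be sketched as follows. This is a minimal sketch and
not the script used for this work: the 16\,kHz target rate, the file names and
the helper names \texttt{sox\_command} and \texttt{convert} are assumptions,
since the text above only states that SoX produced mono files at the correct
sample rate.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: the source does not name the sample rate; 16 kHz is a placeholder.
TARGET_RATE = 16000


def sox_command(src, dst, rate=TARGET_RATE):
    """Build a SoX call that writes `dst` as a mono file at `rate` Hz."""
    # -r and -c placed before the output file apply to the output file.
    return ["sox", str(src), "-r", str(rate), "-c", "1", str(dst)]


def convert(src, dst, rate=TARGET_RATE):
    """Run SoX; raises if SoX is missing or exits with an error."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox executable not found on PATH")
    subprocess.run(sox_command(src, dst, rate), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed only to show the resulting command.
    print(" ".join(sox_command(Path("00.wav"), Path("00_mono.wav"))))
```

Keeping the command construction separate from the call makes it possible to
inspect the exact SoX invocation before running it over a whole album.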

\begin{figure}[ht]
        \centering
        \includegraphics[width=.7\linewidth]{cement}
        \caption{A vocal segment of the \emph{Cannibal Corpse} song
                \emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
        \centering
        \includegraphics[width=.7\linewidth]{abominations}
        \caption{A vocal segment of the \emph{Disgorge} song
                \emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band is called
\emph{Cannibal Corpse}, which has been producing \gls{dm} for almost 25 years
and has kept the same style on every album. The singer of \emph{Cannibal
Corpse} has a very raspy growl and the lyrics are quite comprehensible. The
vocals produced by \emph{Cannibal Corpse} border on regular shouting.

The second band is called \emph{Disgorge} and makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are more
shallow. In the spectrogram it is clearly visible that overtones are produced
during some parts of the growling. The lyrics are completely incomprehensible
and therefore some parts were not annotated with the actual lyrics, because it
was not possible to hear what was being sung.

Lastly, a band from Moscow is chosen, bearing the name \emph{Who Dies in
Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
bands because they create \gls{dom}. \gls{dom} is characterized by a very slow
tempo and low-tuned guitars. The vocalist has a very characteristic growl and
performs in several Muscovite bands. This band also stands out because it uses
pianos and synthesizers. The droning synthesizers often operate in the same
frequency range as the vocals.

\section{\gls{MFCC} Features}
The waveforms are converted to \gls{MFCC} feature vectors using the
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
All these steps combined result in thirteen tab-separated features per line in
a file for every source file. Technical details about the processing steps are
given in the following sections.
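
As an illustration, the extraction could look like the sketch below. It
assumes the \texttt{mfcc} function of \emph{python\_speech\_features} with its
default of thirteen cepstral coefficients and \texttt{scipy.io.wavfile} for
reading the waveform. All other parameters are left at the library defaults,
and the function names, the \texttt{.mfcc} output suffix and the six-decimal
formatting are hypothetical choices rather than the settings used here.

```python
from pathlib import Path


def extract_mfcc(wav_path):
    """Return one row of 13 MFCCs per analysis frame of a mono WAV file."""
    # Third-party imports are kept local so that write_features() below can
    # be used without these packages installed.
    from scipy.io import wavfile
    from python_speech_features import mfcc

    rate, signal = wavfile.read(str(wav_path))
    # numcep=13 is the library default; it is spelled out to match the text.
    return mfcc(signal, samplerate=rate, numcep=13)


def write_features(rows, out_path):
    """Write one frame per line with tab-separated coefficients."""
    with open(out_path, "w") as handle:
        for row in rows:
            handle.write("\t".join(f"{value:.6f}" for value in row) + "\n")


def convert_file(wav_path, out_dir):
    """Hypothetical driver: writes <out_dir>/<stem>.mfcc for one waveform."""
    out_path = Path(out_dir) / (Path(wav_path).stem + ".mfcc")
    write_features(extract_mfcc(wav_path), out_path)
    return out_path
```

Each output line then corresponds to one analysis frame, so a per-frame
vocal/non-vocal label can later be attached line by line.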

\todo{Explain why MFCC and which parameters}

\section{\gls{ANN} Classifier}
\todo{Spectrals might be enough, no decorrelation}

\section{Model training}

\section{Experiments}

\section{Results}


\chapter{Conclusion \& Discussion}
%Discussion section
\todo{Novelty}
\todo{Weaknesses}
\todo{Dataset is not very varied but\ldots}

\todo{Doom metal}
%Conclusion section
%Acknowledgements
%Statement on authors' contributions
%(Appendices)
\appendix
\chapter{Experimental data}\label{app:data}
\begin{table}[h]
        \centering
        \begin{tabular}{cllll}
                \toprule
                Num. & Artist & Album & Song & Duration\\
                \midrule
                00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
                01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
                02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
                03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
                04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
                05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
                06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
                07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
                08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
                09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
                10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
                11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
                12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
                13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
                14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
                15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
                16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
                17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
                18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
                19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
                20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
                21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
                22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
                23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
                24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
                25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
                26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
                27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
                28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
                \midrule
                & & & Total: & 02:13:40\\
                \bottomrule
        \end{tabular}
        \caption{Songs used in the experiments}
\end{table}

\bibliographystyle{ieeetr}
\bibliography{asr}
\end{document}