%&asr
\usepackage[nonumberlist,acronyms]{glossaries}
%\makeglossaries%
\newacronym{ANN}{ANN}{Artificial Neural Network}
\newacronym{HMM}{HMM}{Hidden Markov Model}
\newacronym{GMM}{GMM}{Gaussian Mixture Model}
\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
\newacronym{FA}{FA}{Forced alignment}
\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
\newglossaryentry{dm}{name={Death Metal},
        description={is an extreme heavy metal music style with growling vocals and
        pounding drums}}

\begin{document}
\frontmatter{}

\maketitleru[
        course={(Automatic) Speech Recognition},
        institute={Radboud University Nijmegen},
        authorstext={Author:},
        pagenr=1]
\listoftodos[Todo]

\tableofcontents

%Glossaries
%\glsaddall{}
%\printglossaries

\mainmatter{}
%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
%detector for singing lines.  They achieve 80\% accuracy for forty 15 second
%excerpts. They mention people that wrote signal features that discriminate
%between speech and music. Neural net
%\glspl{HMM}~\cite{berenzweig_locating_2001}.
%
%In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
%polyphonic Turkish music, this might be interesting to use for heavy metal.
%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
%detection, then melody extraction, then alignment. They compare results with
%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
%specialize in long syllables in a cappella. They use \glspl{DHMM} with
%\glspl{GMM} and show that adding knowledge increases alignment (Beijing opera
%has long syllables)~\cite{dzhambazov_automatic_2016}.
%


%Introduction, leading to a clearly defined research question
\chapter{Introduction}
\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
media to digital media. The \gls{IFPI} stated that about $43\%$ of music
revenue comes from digital distribution. Another $39\%$ comes from physical
sales and the remaining $16\%$ is made through performance and
synchronisation revenues. Digital formats overtook physical formats in 2015.
Moreover, for the first time in twenty years the music industry has seen
significant growth
again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

There has always been an interest in lyrics-to-music alignment, for example
for use in karaoke. As early as the late 1980s karaoke machines were
available to consumers. While the lyrics of a track are almost always
available, an alignment is not, and creating one involves manual labour.

A lot of music distribution goes via non-official channels such as
YouTube\footnote{\url{https://youtube.com}}, where fans of the performers
often accompany the music with synchronized lyrics. This means that there is an
enormous treasure of lyrics-annotated music available, but it is not within our
reach since the subtitles are almost always hardcoded into the video stream and
thus not directly usable as data. It is therefore very useful to devise
automatic techniques for segmenting a song into instrumental and vocal parts
and for applying forced alignment or even lyrics recognition to the audio file.

Such techniques are heavily researched and working systems have been created.
However, these techniques are designed to detect a clean singing voice and have
not been tested on so-called \emph{extended vocal techniques} such as grunting
or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
but it must be noted that grunting is not a technique used only in extreme
metal styles. Similar or equal techniques have been used in \emph{Beijing
opera}, Japanese \emph{Noh} and also in more western styles such as the jazz
singing of Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be
traced back to Viking times. For example, an Arab merchant visiting a village in
Denmark wrote in the tenth century~\cite{friis_vikings_2004}:

\begin{displayquote}
        Never before I have heard uglier songs than those of the Vikings in
        Slesvig. The growling sound coming from their throats reminds me of dogs
        howling, only more untamed.
\end{displayquote}

\section{\gls{dm}}

%Literature overview / related work
\section{Related work}
The application of standard speech processing techniques to music started in
the late 1990s~\cite{saunders_real-time_1996,scheirer_construction_1997}, when
it was found that music has different discriminating features than normal
speech.

Berenzweig and Ellis expanded on the aforementioned research by trying to
separate singing from instrumental music~\cite{berenzweig_locating_2001}.

\todo{Incorporate this in literary framing}%
~\cite{fujihara_automatic_2006}%
~\cite{fujihara_lyricsynchronizer:_2011}%
~\cite{fujihara_three_2008}%
~\cite{mauch_integrating_2012}%
~\cite{mesaros_adaptation_2009}%
~\cite{mesaros_automatic_2008}%
~\cite{mesaros_automatic_2010}%
~%\cite{muller_multimodal_2012}%
~\cite{pedone_phoneme-level_2011}%
~\cite{yang_machine_2012}%



\section{Research question}
It is debatable whether the aforementioned techniques work on growled vocals,
because the spectral properties of a growling voice are different from the
spectral properties of a clean singing voice. It has been found that growling
voices have less prominent peaks in the frequency representation and are closer
to noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
research question:

\begin{center}\em%
        Are standard \gls{ANN} based techniques for singing voice detection
        suitable for non-standard musical genres like \gls{dm}?
\end{center}

\chapter{Methods}
%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono channel waveform with the
correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
The resulting waveforms are converted to \gls{MFCC} vectors
using the \emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
All these steps combined result in one file per source file, with thirteen
tab separated features per line. Technical info about the processing steps is
given in the following sections. Every file is annotated using
Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, where the
waveform, the $1$--$8000$\,Hz spectrogram and the annotations are shown. It is
clearly visible that different spectral patterns occur within the genre of
death metal.

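The conversion from waveform to feature file described above could be sketched
as follows. This is a minimal sketch and not the exact script used for the
experiments: it assumes that the \emph{SoX} output is a mono WAV file, it reads
that file with \emph{SciPy} (an assumption, any WAV reader would do) and it
keeps the default window settings of \emph{python\_speech\_features}. The
function names in the sketch are hypothetical.

```python
def extract_mfcc(wav_path, numcep=13):
    """Return one MFCC vector of numcep values per frame of a mono WAV file."""
    # Imported here so that the formatting helpers below also work on a
    # machine without these packages. Reading with SciPy is an assumption.
    from scipy.io import wavfile
    from python_speech_features import mfcc

    samplerate, signal = wavfile.read(wav_path)
    # Window length and step are left at the library defaults (25 ms, 10 ms);
    # the parameters actually used in the experiments may differ.
    return mfcc(signal, samplerate=samplerate, numcep=numcep)


def format_frame(vector):
    """Format one feature vector as a single tab separated line."""
    return "\t".join(f"{value:.6f}" for value in vector)


def write_features(frames, out_path):
    """Write one line per frame, so one output file per source file."""
    with open(out_path, "w") as handle:
        for vector in frames:
            handle.write(format_frame(vector) + "\n")
```

With thirteen coefficients per frame, every line of the resulting file holds
thirteen values separated by twelve tabs.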
\begin{figure}[ht]
        \centering
        \includegraphics[width=.7\linewidth]{cement}
        \caption{A vocal segment of the \emph{Cannibal Corpse} song
                \emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
        \centering
        \includegraphics[width=.7\linewidth]{abominations}
        \caption{A vocal segment of the \emph{Disgorge} song
                \emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from two\todo{more in the future}\ studio albums. The first
band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
25 years, with the same style on every album. The singer of
\emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
comprehensible. The second band is called \emph{Disgorge} and makes even more
violent music. The growls of the lead singer sound more like a coffee grinder
and are more shallow. The lyrics are completely incomprehensible and therefore
some parts are not annotated with lyrics because it was too difficult to hear
what was being sung.

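For singing voice detection, the manual annotations have to be turned into one
label per feature vector. The following is a minimal sketch of one way to do
this. It assumes that the annotated utterances are already available as
(start, end) pairs in seconds and that feature vectors are 10\,ms apart, which
is the default step of \emph{python\_speech\_features}. Reading the Praat files
is left out and the function name is hypothetical.

```python
def frame_labels(intervals, n_frames, step_ms=10):
    """Mark frame i as vocal (1) when its start time lies in an utterance.

    intervals: (start, end) pairs in seconds, assumed to come from the
    manual annotation. step_ms: assumed distance between feature vectors.
    """
    labels = [0] * n_frames
    for start_s, end_s in intervals:
        # Work in whole milliseconds to avoid floating point edge cases.
        start_ms = round(start_s * 1000)
        end_ms = round(end_s * 1000)
        first = -(-start_ms // step_ms)  # first frame starting at or after start
        stop = -(-end_ms // step_ms)     # first frame starting at or after end
        for i in range(max(first, 0), min(stop, n_frames)):
            labels[i] = 1
    return labels
```

Parts without a lyrics annotation can still be marked as vocal in this way, as
long as their start and end times are known.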
\section{Methods}
\todo{To remove in final thesis}
The initial planning is still up to date. About one and a half albums have been
annotated and a framework for setting up experiments has been created.
Moreover, the first exploratory experiments have already been executed and look
promising. In April the experimental dataset will be expanded and I will try to
replicate some of the experiments from the literature to see whether they
perform similarly on Death Metal.
\begin{table}[ht]
        \centering
        \begin{tabular}{cl}
                \toprule
                Month & Description\\
                \midrule
                March
                        & Preparing the data\\
                        & Preparing an experiment platform\\
                        & Literature research\\
                April
                        & Running the experiments\\
                        & Fiddle with parameters\\
                        & Explore the possibilities for forced alignment\\
                May
                        & Write up the thesis\\
                        & Possibly do forced alignment\\
                June
                        & Finish up thesis\\
                        & Wrap up\\
                \bottomrule
        \end{tabular}
        \caption{Outline}
\end{table}

\section{Features}


\todo{Explain why MFCC and which parameters}
\todo{Spectrals might be enough, no decorrelation}

\section{Experiments}

\section{Results}


\chapter{Conclusion \& Discussion}
%Discussion section
\todo{Novelty}
\todo{Weaknesses}
\todo{Dataset is not very varied but\ldots}

\todo{Doom metal}
%Conclusion section
%Acknowledgements
%Statement on authors' contributions
%(Appendices)
\appendix
\chapter{Experimental data}\label{app:data}
\begin{table}[h]
        \centering
        \begin{tabular}{cllll}
                \toprule
                Num. & Artist & Album & Song & Duration\\
                \midrule
                00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
                01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
                02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
                03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
                04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
                05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
                06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
                07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
                08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
                09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
                10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
                11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
                12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
                13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
                14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
                15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
                16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
                17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
                18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
                19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
                20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
                21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
                \bottomrule
        \end{tabular}
        \caption{Songs used in the experiments}
\end{table}

\bibliographystyle{ieeetr}
\bibliography{asr}
\end{document}