asr.tex

   1 %&asr
   2 \usepackage[nonumberlist,acronyms]{glossaries}
   3 \makeglossaries%
   4 \newacronym{ANN}{ANN}{Artificial Neural Network}
   5 \newacronym{HMM}{HMM}{Hidden Markov Model}
   6 \newacronym{GMM}{GMM}{Gaussian Mixture Model}
   7 \newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
   8 \newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
   9 \newacronym{FA}{FA}{Forced alignment}
  10 \newacronym{MFC}{MFC}{Mel-frequency cepstrum}
  11 \newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
  12 \newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
  13 \newglossaryentry{dm}{name={Death Metal},
  14         description={is an extreme heavy metal music style with growling vocals and
  15         pounding drums}}
  16
  17 \begin{document}
  18 \frontmatter{}
  19
  20 \maketitleru[
  21         course={(Automatic) Speech Recognition},
  22         institute={Radboud University Nijmegen},
  23         authorstext={Author:},
  24         pagenr=1]
  25 \listoftodos[Todo]
  26
  27 \tableofcontents
  28
  29 %Glossaries
  30 %\glsaddall{}
  31 %\printglossaries
  32
  33 \mainmatter{}
  34 %Berenzweig and Ellis use acoustic classifiers from speech recognition as a
  35 %detector for singing lines. They achieve 80\% accuracy for forty 15-second
  36 %excerpts. They mention people that wrote signal features that discriminate
  37 %between speech and music. Neural net
  38 %\glspl{HMM}~\cite{berenzweig_locating_2001}.
  39 %
  40 %In 2014 Dzhambazov et al.\ applied state-of-the-art segmentation methods to
  41 %polyphonic Turkish music; this might be interesting to use for heavy metal.
  42 %They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
  43 %phone-level segmentation and the first 12 \glspl{MFCC}. They first do vocal/non-vocal
  44 %detection, then melody extraction, then alignment. They compare results with
  45 %Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
  46 %specialize in long syllables in a cappella. They use \glspl{DHMM} with
  47 %\glspl{GMM} and show that adding knowledge improves alignment (Beijing opera
  48 %has long syllables)~\cite{dzhambazov_automatic_2016}.
  49 %
  50
  51
  52 %Introduction, leading to a clearly defined research question
  53 \chapter{Introduction}
  54 \section{Introduction}
  55 The primary medium for music distribution is rapidly changing from physical
  56 media to digital media. The \gls{IFPI} stated that about $43\%$ of music
  57 revenue comes from digital distribution. Another $39\%$ comes from
  58 physical sales and the remaining $16\%$ is made through performance and
  59 synchronisation revenues. Digital formats overtook physical formats
  60 somewhere in 2015. Moreover, for the first time in twenty years the music
  61 industry has seen significant growth
  62 again.\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}
  63
  64 There has always been an interest in lyrics-to-music alignment, for use in,
  65 for example, karaoke. As early as the late 1980s karaoke machines were
  66 available to consumers. While the lyrics for a track are almost always
  67 available, an alignment is not, and creating such an alignment involves
  68 manual labour.
  69
  70 A lot of this musical distribution goes via non-official channels such as
  71 YouTube\footnote{\url{https://youtube.com}}, in which fans of the performers
  72 often accompany the music with synchronized lyrics. This means that there is an
  73 enormous treasure of lyrics-annotated music available, but not within our reach,
  74 since the subtitles are almost always hardcoded into the video stream and thus
  75 not directly usable as data. Because of this interest it is very useful to
  76 devise automatic techniques for segmenting the instrumental and vocal parts of a
  77 song and for applying forced alignment or even lyrics recognition to the audio.
  78
  79 Such techniques are heavily researched and working systems have been created.
  80 However, these techniques are designed to detect a clean singing voice and have
  81 not been tested on so-called \emph{extended vocal techniques} such as grunting
  82 or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
  83 but it must be noted that grunting is not a technique used only in extreme
  84 metal styles. Similar or equal techniques have been used in \emph{Beijing
  85 opera}, Japanese \emph{Noh} and also in more Western styles like jazz singing
  86 by Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back
  87 to Viking times. For example, an Arab merchant visiting a village in Denmark
  88 wrote in the tenth century~\cite{friis_vikings_2004}:
  89
  90 \begin{displayquote}
  91         Never before I have heard uglier songs than those of the Vikings in
  92         Slesvig. The growling sound coming from their throats reminds me of dogs
  93         howling, only more untamed.
  94 \end{displayquote}
  95
  96 \section{\gls{dm}}
  97
  98 %Literature overview / related work
  99 \section{Related work}
 100 The field of applying standard speech processing techniques to music started in
 101 the late 1990s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
 102 was found that music has different discriminating features compared to normal
 103 speech.
 104
 105 Berenzweig and Ellis expanded on the aforementioned research by trying to
 106 separate singing from instrumental music~\cite{berenzweig_locating_2001}.
 107
 108 \todo{Incorporate this in literary framing}%
 109 ~\cite{fujihara_automatic_2006}%
 110 ~\cite{fujihara_lyricsynchronizer:_2011}%
 111 ~\cite{fujihara_three_2008}%
 112 ~\cite{mauch_integrating_2012}%
 113 ~\cite{mesaros_adaptation_2009}%
 114 ~\cite{mesaros_automatic_2008}%
 115 ~\cite{mesaros_automatic_2010}%
 116 ~%\cite{muller_multimodal_2012}%
 117 ~\cite{pedone_phoneme-level_2011}%
 118 ~\cite{yang_machine_2012}%
 119
 120
 121
 122 \section{Research question}
 123 It is debatable whether the aforementioned techniques work on growling, because
 124 the spectral properties of a growling voice are different from the spectral
 125 properties of a clean singing voice. It has been found that growling voices
 126 have less prominent peaks in the frequency representation and are closer to
 127 noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
 128 research question:
 129
 130 \begin{center}\em%
 131         Are standard \gls{ANN}-based techniques for singing voice detection
 132         suitable for non-standard musical genres like \gls{dm}?
 133 \end{center}
 134
 135 \chapter{Methods}
 136 %Methodology
 137
 138 %Experiment(s) (set-up, data, results, discussion)
 139 \section{Data \& Preprocessing}
 140 To run the experiments, data has been collected from several \gls{dm} albums.
 141 The exact data used is listed in Appendix~\ref{app:data}. The albums are
 142 extracted from the audio CD and converted to a mono channel waveform with the
 143 correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
 144 The finished waveforms are then converted to \gls{MFCC} vectors
 145 using the \emph{python\_speech\_features}%
 146 \footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
 147 All these steps combined result in thirteen tab-separated features per line in
 148 a file for every source file. Every file is annotated using
 149 Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
 150 the audio. Examples of utterances are shown in
 151 Figures~\ref{fig:bloodstained} and~\ref{fig:abominations}. It is clearly visible that
 152 within the genre of death metal different spectral patterns
 153 occur.
 154
 155 \begin{figure}[ht]
 156         \centering
 157         \includegraphics[width=.7\linewidth]{cement}
 158         \caption{A vocal segment of the \emph{Cannibal Corpse} song
 159                 \emph{Bloodstained Cement}}\label{fig:bloodstained}
 160 \end{figure}
 161
 162 \begin{figure}[ht]
 163         \centering
 164         \includegraphics[width=.7\linewidth]{abominations}
 165         \caption{A vocal segment of the \emph{Disgorge} song
 166                 \emph{Enthroned Abominations}}\label{fig:abominations}
 167 \end{figure}
 168
 169 The data is collected from two\todo{more in the future}\ studio albums. The first
 170 band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
 171 25 years, creating the same type of music on every album. The singer of
 172 \emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
 173 comprehensible. The second band is called \emph{Disgorge} and makes even more
 174 violent music. The growls of the lead singer sound more like a coffee grinder
 175 and are more shallow. The lyrics are completely incomprehensible and therefore
 176 some parts are not annotated with lyrics because it was too difficult to hear
 177 what was being sung.
 178
 179 \section{Methods}
 180 \todo{To remove in final thesis}
 181 The initial planning is still up to date. About one and a half albums have been
 182 annotated and a framework for setting up experiments has been created.
 183 Moreover, the first exploratory experiments have already been executed and look
 184 promising. In April the experimental dataset will be expanded and I will try to
 185 mimic some of the experiments done in the literature to see whether they perform
 186 similarly on Death Metal.
 187 \begin{table}[ht]
 188         \centering
 189         \begin{tabular}{cl}
 190                 \toprule
 191                 Month & Description\\
 192                 \midrule
 193                 March
 194                         & Preparing the data\\
 195                         & Preparing an experiment platform\\
 196                         & Literature research\\
 197                 April
 198                         & Running the experiments\\
 199                         & Fiddle with parameters\\
 200                         & Explore the possibilities for forced alignment\\
 201                 May
 202                         & Write up the thesis\\
 203                         & Possibly do forced alignment\\
 204                 June
 205                         & Finish up thesis\\
 206                         & Wrap up\\
 207                 \bottomrule
 208         \end{tabular}
 209         \caption{Outline}
 210 \end{table}
 211
 212 \todo{Explain why MFCC and which parameters}
 213 \todo{Spectrals might be enough, no decorrelation}
 214
 215 \section{Experiments}
 216
 217 \section{Results}
 218
 219
 220 \chapter{Conclusion \& Discussion}
 221 %Discussion section
 222 \todo{Novelty}
 223 \todo{Weaknesses}
 224 \todo{Dataset is not very varied but\ldots}
 225
 226 \todo{Doom metal}
 227 %Conclusion section
 228 %Acknowledgements
 229 %Statement on authors' contributions
 230 %(Appendices)
 231 \appendix
 232 \chapter{Experimental data}\label{app:data}
 233 \begin{table}[h]
 234         \centering
 235         \begin{tabular}{cllll}
 236                 \toprule
 237                 Num. & Artist & Album & Song & Duration\\
 238                 \midrule
 239                 00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
 240                 01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
 241                 02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
 242                 03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
 243                 04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
 244                 05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
 245                 06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
 246                 07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
 247                 08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
 248                 09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
 249                 10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
 250                 11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
 251                 12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
 252                 13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
 253                 14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
 254                 15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
 255                 16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
 256                 17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
 257                 18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
 258                 19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
 259                 20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
 260                 21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
 261                 \bottomrule
 262         \end{tabular}
 263         \caption{Songs used in the experiments}
 264 \end{table}
 265
 266 \bibliographystyle{ieeetr}
 267 \bibliography{asr}
 268 \end{document}