%&asr
\usepackage[nonumberlist,acronyms]{glossaries}
\makeglossaries%
\newacronym{ANN}{ANN}{Artificial Neural Network}
\newacronym{HMM}{HMM}{Hidden Markov Model}
\newacronym{GMM}{GMM}{Gaussian Mixture Model}
\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
\newacronym{FA}{FA}{Forced alignment}
\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
\newglossaryentry{dm}{name={Death Metal},
        description={is an extreme heavy metal music style with growling vocals and
        pounding drums}}

\begin{document}
\frontmatter{}

\maketitleru[
        course={(Automatic) Speech Recognition},
        institute={Radboud University Nijmegen},
        authorstext={Author:},
        pagenr=1]
\listoftodos[Todo]

\tableofcontents

%Glossaries
%\glsaddall{}
%\printglossaries

\mainmatter{}
%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
%detector for singing lines.  They achieve 80\% accuracy for forty 15 second
%excerpts. They mention people that wrote signal features that discriminate
%between speech and music. Neural net
%\glspl{HMM}~\cite{berenzweig_locating_2001}.
%
%In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
%polyphonic Turkish music, this might be interesting to use for heavy metal.
%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
%detection, then melody extraction, then alignment. They compare results with
%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
%specialize in long syllables in a cappella. They use \glspl{DHMM} with
%\glspl{GMM} and show that adding knowledge increases alignment (Beijing opera
%has long syllables)~\cite{dzhambazov_automatic_2016}.
%


%Introduction, leading to a clearly defined research question
\chapter{Introduction}
\section{Introduction}
The \gls{IFPI} stated that about $43\%$ of music revenue comes from digital
distribution. Digital distribution overtook physical formats in the course of
2015, and over the past twenty years the music industry has seen significant
growth\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

A lot of this musical distribution goes via non-official channels such as
YouTube\footnote{\url{https://youtube.com}}, in which fans of the musical group
accompany the music with synchronized lyrics so that users can sing or read
along. Because of this interest it is very useful to devise automatic
techniques for segmenting the instrumental and vocal parts of a song and to
apply forced alignment or even lyrics recognition to the audio file.

Such techniques are heavily researched and working systems have been created.
However, these techniques are designed to detect a clean singing voice. Extreme
genres such as \gls{dm} use more extreme vocal techniques such as
grunting or growling. It must be noted that grunting is not a technique used
only in extreme metal styles. Similar or identical techniques have been used in
\emph{Beijing opera}, Japanese \emph{Noh}, and also in more Western styles such
as jazz singing by Louis Armstrong~\cite{sakakibara_growl_2004}. It might even
be traced back to Viking times. An Arab merchant wrote in the tenth
century~\cite{friis_vikings_2004}:

\begin{displayquote}
        Never before I have heard uglier songs than those of the Vikings in
        Slesvig. The growling sound coming from their throats reminds me of dogs
        howling, only more untamed.
\end{displayquote}

%A majority of the music is not only instrumental but also contains vocal
%segments.
%
%Music is a leading type of data distributed on the internet. Regular music
%distribution is almost entirely digital and services like Spotify and YouTube
%allow one to listen to almost any song within a few clicks. Moreover, there are
%myriads of websites offering lyrics of songs.
%
%\todo{explain relevancy, (preprocessing for lyric alignment)}
%
%This leads to the following research question:
%\begin{center}\em%
%       Are standard \gls{ANN} based techniques for singing voice detection
%       suitable for non-standard musical genres like Death metal.
%\end{center}

%Literature overview / related work
\section{Related work}
The application of standard speech processing techniques to music started in
the late 1990s~\cite{saunders_real-time_1996,scheirer_construction_1997}, and it
was found that music has different discriminating features than normal
speech.

Berenzweig and Ellis expanded on the aforementioned research by trying to
separate singing from instrumental music~\cite{berenzweig_locating_2001}.

\todo{Incorporate this in literary framing}%
~\cite{fujihara_automatic_2006}%
~\cite{fujihara_lyricsynchronizer:_2011}%
~\cite{fujihara_three_2008}%
~\cite{mauch_integrating_2012}%
~\cite{mesaros_adaptation_2009}%
~\cite{mesaros_automatic_2008}%
~\cite{mesaros_automatic_2010}%
~%\cite{muller_multimodal_2012}%
~\cite{pedone_phoneme-level_2011}%
~\cite{yang_machine_2012}%



\section{Research question}
It is debatable whether the aforementioned techniques work, because the
spectral properties of a growling voice are different from the spectral
properties of a clean singing voice. It has been found that growling voices
have less prominent peaks in the frequency representation and are closer to
noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
research question:

\begin{center}\em%
        Are standard \gls{ANN}-based techniques for singing voice detection
        suitable for non-standard musical genres like \gls{dm}?
\end{center}

\chapter{Methods}
%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono-channel waveform with the
correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
When the waveforms are finished they are converted to \gls{MFCC} vectors
using the \emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
All these steps combined result in thirteen tab-separated features per line in
a file for every source file. Every file is annotated using
Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
the audio. Examples of utterances are shown in
Figures~\ref{fig:bloodstained} and~\ref{fig:abominations}. It is clearly visible
that within the genre of death metal there are many different spectral
patterns.
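
As a sketch of what this conversion computes, and assuming the common
\gls{MFCC} definition rather than the exact settings used here (window length,
step size and number of filters are not fixed in this text), each analysis
frame is passed through $M$ triangular filters that are spaced evenly on the
mel scale,
\[
        \mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),
\]
where $f$ is the frequency in Hz. With $E_m$ the energy in the $m$-th filter,
the cepstral coefficients follow from a discrete cosine transform of the log
energies, up to a normalization constant:
\[
        c_n = \sum_{m=1}^{M} \log(E_m)
        \cos\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right],
        \qquad n = 0, \ldots, 12.
\]
Keeping only the first thirteen coefficients gives the thirteen features per
line mentioned above. The symbols $M$, $E_m$ and $c_n$ are introduced here for
illustration only.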

\begin{figure}[ht]
        \centering
        \includegraphics[width=.7\linewidth]{cement}
        \caption{A vocal segment of the \emph{Cannibal Corpse} song
                \emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
        \centering
        \includegraphics[width=.7\linewidth]{abominations}
        \caption{A vocal segment of the \emph{Disgorge} song
                \emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from two\todo{more in the future}\ studio albums. The first
band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
25 years, creating the same type of music on every album. The singer of
\emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
comprehensible. The second band is called \emph{Disgorge} and makes even more
violent music. The growls of the lead singer sound more like a coffee grinder
and are more shallow. The lyrics are completely incomprehensible and therefore
some parts are not annotated with lyrics because it was too difficult to hear
what was being sung.

\section{Methods}
\todo{To remove in final thesis}
The initial planning is still up to date. About one and a half albums have been
annotated and a framework for setting up experiments has been created.
Moreover, the first exploratory experiments have already been executed and look
promising. In April the experimental dataset will be expanded and I will try to
mimic some of the experiments done in the literature to see whether they
perform similarly on \gls{dm}.
\begin{table}[ht]
        \centering
        \begin{tabular}{ll}
                \toprule
                Month & Description\\
                \midrule
                March
                        & Preparing the data\\
                        & Preparing an experiment platform\\
                        & Literature research\\
                April
                        & Running the experiments\\
                        & Tuning the parameters\\
                        & Exploring the possibilities for forced alignment\\
                May
                        & Writing up the thesis\\
                        & Possibly doing forced alignment\\
                June
                        & Finishing up the thesis\\
                        & Wrapping up\\
                \bottomrule
        \end{tabular}
        \caption{Outline}
\end{table}

\todo{Explain why MFCC and which parameters}
\todo{Spectrals might be enough, no decorrelation}

\section{Experiments}

\section{Results}


\chapter{Conclusion \& Discussion}
%Discussion section
\todo{Novelty}
\todo{Weaknesses}
\todo{Dataset is not very varied but\ldots}

\todo{Doom metal}
%Conclusion section
%Acknowledgements
%Statement on authors' contributions
%(Appendices)
\appendix
\chapter{Experimental data}\label{app:data}
\begin{table}[h]
        \centering
        \begin{tabular}{cllll}
                \toprule
                Num. & Artist & Album & Song & Duration\\
                \midrule
                00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
                01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
                02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
                03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
                04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
                05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
                06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
                07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
                08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
                09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
                10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
                11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
                12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
                13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
                14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
                15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
                16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
                17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
                18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
                19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
                20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
                21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
                \bottomrule
        \end{tabular}
        \caption{Songs used in the experiments}
\end{table}

\bibliographystyle{ieeetr}
\bibliography{asr}
\end{document}