%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono-channel waveform with the
correct sample rate using
\emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
Every file is annotated using
Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which display
the waveform, the $1$--$8000$\,Hz spectrogram and the annotations. The figures
show that different spectral patterns occur within the genre of death metal.

\begin{figure}[ht]
        \centering
        \includegraphics[width=.7\linewidth]{cement}
        \caption{A vocal segment of the \emph{Cannibal Corpse} song
                \emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
        \centering
        \includegraphics[width=.7\linewidth]{abominations}
        \caption{A vocal segment of the \emph{Disgorge} song
                \emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The
first band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for
almost 25 years, keeping the same style on every album. The singer of
\emph{Cannibal Corpse} has very raspy growls, and the lyrics are quite
comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
regular shouting.

The second band is called \emph{Disgorge} and makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are more
shallow. The spectrogram clearly shows overtones that are
produced during some parts of the growling. The lyrics are completely
incomprehensible, and therefore some parts were not annotated with the actual
lyrics because it was not possible to determine what was being sung.

Lastly, a band from Moscow is chosen, bearing the name \emph{Who Dies in
Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
bands because they create \gls{dom}. \gls{dom} is characterized by a very
slow tempo and low-tuned guitars. The vocalist has a very characteristic growl
and performs in several Muscovite bands. This band also stands out because it
uses pianos and synthesizers. The droning synthesizers often operate in the
same frequency range as the vocals.

\section{\gls{MFCC} Features}
The waveforms themselves are not very suitable as features because of their
high dimensionality and correlation. Therefore we use the widely used
\gls{MFCC} feature vectors, which have been shown to be
suitable~\cite{rocamora_comparing_2007}. It has also been found that altering
the mel scale to better suit singing does not yield better
performance~\cite{you_comparative_2015}. The actual conversion is done using the
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are nature-inspired and are built up incrementally in
several steps.
\begin{enumerate}
        \item The first step in the process is converting the time representation
                of the signal to a spectral representation using a sliding window with
                overlap. The width of the window and the step size are two important
                parameters in the system. In classical phonetic analysis, window sizes
                of $25$\,ms with a step of $10$\,ms are often chosen because they are
                small enough to contain only subphone entities. Singing for only
                $25$\,ms is impossible, so this window size is arguably very small
                for sung material.
        \item The standard \gls{FT} gives a spectral representation that has
                linearly scaled frequencies. This scale is converted to the \gls{MS}
                using triangular overlapping windows.
        \item The logarithm is taken of the Mel filterbank energies. This step is
                inspired by the \emph{Weber-Fechner} law, which describes how humans
                perceive physical magnitudes\footnote{Fechner, Gustav Theodor (1860).
                \emph{Elemente der Psychophysik}.}.
        \item To decorrelate the signal, a \gls{DCT} is applied. The \gls{MFCC}
                features are then the amplitudes of the resulting spectrum.
\end{enumerate}
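As a rough illustration of the steps above, the computation can be written out
as follows. The symbols $T$, $w$, $s$, $H_k$, $X$ and $K$ are introduced here
for illustration only, and the mel-scale constants belong to one commonly used
variant of the scale; they are assumptions and not necessarily the exact
settings used in the experiments.
% Illustrative sketch only: symbols and constants below are assumptions.
A signal of duration $T$ that is cut into windows of width $w$ with step $s$,
without padding the final partial window, yields
\[
        N = 1 + \left\lfloor \frac{T - w}{s} \right\rfloor
\]
frames, so one second of audio with $w = 25$\,ms and $s = 10$\,ms gives
$N = 1 + \lfloor 97.5 \rfloor = 98$ frames. Implementations that zero-pad the
last window round up instead. A frequency $f$ in Hz is commonly mapped to the
mel scale by
\[
        m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),
\]
which maps $1000$\,Hz to approximately $1000$ mel. With $K$ triangular filters
$H_k$ applied to the power spectrum $|X(j)|^2$ of a frame, the log energies and
the cepstral coefficients are
\[
        E_k = \log \sum_j H_k(j)\,|X(j)|^2, \qquad
        c_n = \sum_{k=1}^{K} E_k
                \cos\left[\frac{\pi n}{K}\left(k - \frac{1}{2}\right)\right],
\]
where the normalization of the cosine transform differs between
implementations.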

\section{\gls{ANN} Classifier}
\todo{Spectrals might be enough, no decorrelation}

\section{Model training}

\section{Experiments}

\section{Results}
