%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data was collected from several \gls{dm} albums.
The exact data used is available in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono-channel waveform with the
correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
Every file is annotated using
Praat\cite{boersma_praat_2002}, where the utterances are manually aligned to
the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, where the
waveform, the $1$--$8000$\,Hz spectrogram and the annotations are shown. It is
clearly visible that different spectral patterns occur within the genre of
death metal.

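A minimal sketch of this conversion step is given below, assuming a target
sample rate of $16$\,kHz (the actual rate is not stated here) and a
hypothetical directory layout:

\begin{verbatim}
import subprocess
from pathlib import Path

RATE = 16000  # assumed target sample rate

for track in Path("albums").glob("*.flac"):  # hypothetical input layout
    out = Path("wav") / (track.stem + ".wav")
    # sox: resample to RATE (-r) and mix down to one channel (-c 1)
    subprocess.run(
        ["sox", str(track), "-r", str(RATE), "-c", "1", str(out)],
        check=True)
\end{verbatim}
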
\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
\caption{A vocal segment of the \emph{Cannibal Corpse} song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
\caption{A vocal segment of the \emph{Disgorge} song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The
first band, \emph{Cannibal Corpse}, has been producing \gls{dm} for
almost 25 years and has been creating the same type of music on every album.
The singer of \emph{Cannibal Corpse} has a very raspy growl and the lyrics are
quite comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
regular shouting.

The second band, \emph{Disgorge}, makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are
shallower. In the spectrograms it is clearly visible that overtones are
produced during some parts of the growling. The lyrics are completely
incomprehensible, and therefore some parts were not annotated with the actual
lyrics because it was not possible to tell what was being sung.

Lastly, a band from Moscow bearing the name \emph{Who Dies in
Siberian Slush} was chosen. This band is a little odd compared to the previous
\gls{dm} bands because they create \gls{dom}. \gls{dom} is characterized by
very slow tempos and low-tuned guitars. The vocalist has a very characteristic
growl and performs in several Muscovite bands. This band also stands out
because it uses pianos and synthesizers. The droning synthesizers often
operate in the same frequency range as the vocals.

The training and test data are divided as follows:
\begin{table}[H]
\centering
\begin{tabular}{cc}
\toprule
Singing & Instrumental\\
\midrule
0.59 & 0.41\\
\bottomrule
\end{tabular}
\quad
\begin{tabular}{lccc}
\toprule
Instrumental & CC & DG & WDISS\\
\midrule
\bottomrule
\end{tabular}
\end{table}

\section{\gls{MFCC} Features}
The waveforms themselves are not very suitable to be used as features due to
their high dimensionality and correlation. Therefore we use the widely used
\gls{MFCC} feature vectors, which have been shown to be
suitable\cite{rocamora_comparing_2007}. It has also been found that altering
the Mel scale to better suit singing does not yield better
performance\cite{you_comparative_2015}. The actual conversion is done using the
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are nature inspired and are built up incrementally in
several steps, sketched in code after the list.
\begin{enumerate}
\item The first step in the process is converting the time representation
of the signal to a spectral representation using a sliding window with
overlap. The width of the window and the step size are two important
parameters in the system. In classical phonetic analysis, window sizes
of $25ms$ with a step of $10ms$ are often chosen because they are small
enough to only contain subphone entities. Singing a sound for only $25ms$
is impossible, so it is arguable that this window size is very small for
singing.
\item The standard \gls{FT} gives a spectral representation that has
linearly scaled frequencies. This scale is converted to the \gls{MS}
using triangular overlapping windows.
\item The logarithm is taken of the Mel frequencies. This step is inspired
by the \emph{Weber-Fechner} law that describes how humans perceive
physical magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
Psychophysik}.
\item To decorrelate the signal, a \gls{DCT} is applied. The \gls{MFCC}
features are then the amplitudes of the resulting spectrum.
\end{enumerate}
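
The Mel conversion in step 2 commonly uses the mapping
$m = 2595\log_{10}\left(1 + \frac{f}{700}\right)$. Below is a minimal sketch
of the whole extraction using the \emph{python\_speech\_features} package and
the window parameters named in step 1; the file name is hypothetical.

\begin{verbatim}
from scipy.io import wavfile
from python_speech_features import mfcc

rate, signal = wavfile.read("wav/track01.wav")  # hypothetical file

# 25 ms windows with a 10 ms step, as described in step 1; 13 cepstral
# coefficients is the package default.
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01, numcep=13)
print(features.shape)  # (number of frames, 13)
\end{verbatim}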

\section{\gls{ANN} Classifier}
\todo{Spectrograms might be enough, no decorrelation}

106 \section{Experiments}
107 \subsection{\emph{Singing} voice detection}
108 The first type of experiment conducted is \emph{Singing} voice detection. This
109 is the act of segmenting an audio signal into segments that are labeled either
110 as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
111 feature vector and the output is the probability that singing is happening in
112 the sample.
113
\begin{figure}[H]
\centering
\includegraphics[width=.5\textwidth]{bcann}
\caption{Binary classifier network architecture}\label{fig:bcann}
\end{figure}

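A minimal sketch of such a binary classifier is given below, assuming a
single hidden layer and Keras as the framework; the layer sizes are
hypothetical and the actual architecture is the one shown in
Figure~\ref{fig:bcann}.

\begin{verbatim}
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

N_MFCC = 13  # assumed feature vector size, matching the extraction above

# Hypothetical layer sizes; see the figure above for the real layout.
model = Sequential([
    Dense(64, activation="relu", input_shape=(N_MFCC,)),
    Dense(1, activation="sigmoid"),  # P(singing) for one frame
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
\end{verbatim}
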
\subsection{\emph{Singer} voice detection}
The second type of experiment conducted is \emph{Singer} voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label.

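Extending the same hedged sketch to the \emph{Singer} task, the single
sigmoid output is replaced by a softmax over the three singers plus the
instrumental label; layer sizes remain hypothetical:

\begin{verbatim}
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

N_MFCC = 13    # assumed feature vector size
N_SINGERS = 3  # CC, DG and WDISS

# One output per singer plus one for the instrumental label.
model = Sequential([
    Dense(64, activation="relu", input_shape=(N_MFCC,)),
    Dense(N_SINGERS + 1, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
\end{verbatim}
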
\section{Results}
