%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data was collected from several \gls{dm} albums. The
exact data used is listed in Appendix~\ref{app:data}. The albums were
extracted from audio CD and converted to a mono-channel waveform at the
required sample rate using \emph{SoX}%
\footnote{\url{http://sox.sourceforge.net/}}. Every file is annotated using
Praat~\cite{boersma_praat_2002}, in which the utterances are manually aligned
to the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which show
the waveform, the $1$--$8000\,Hz$ spectrogram and the annotations. It is
clearly visible that, within the genre of death metal, different spectral
patterns occur over time.

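As an illustration, the conversion can be scripted as in the minimal sketch
below; the $16\,kHz$ target rate and the file names are assumptions, since
the text only specifies that the correct sample rate is used.

\begin{verbatim}
# Illustrative SoX invocation: resample to 16 kHz (assumed) and mix
# down to a single channel.  File names are hypothetical.
import subprocess
from pathlib import Path

def to_mono(src: Path, dst: Path, rate: int = 16000) -> None:
    subprocess.run(["sox", str(src), "-r", str(rate), "-c", "1", str(dst)],
                   check=True)

to_mono(Path("track01.flac"), Path("track01.wav"))
\end{verbatim}
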
\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{cement}
	\caption{A vocal segment of the \emph{Cannibal Corpse} song
		\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{abominations}
	\caption{A vocal segment of the \emph{Disgorge} song
		\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band is
\emph{Cannibal Corpse}, which has been producing \gls{dm} for almost 25 years
and has kept the same style on every album. The singer of \emph{Cannibal
Corpse} has a very raspy growl and the lyrics are quite comprehensible. The
vocals produced by \emph{Cannibal Corpse} border on regular shouting.

The second band is \emph{Disgorge}, which makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are
shallower. In the spectrograms it is clearly visible that overtones are
produced during some parts of the growling. The lyrics are completely
incomprehensible, and therefore some parts were not annotated with the actual
lyrics because it was not possible to determine what was being sung.

Lastly, a band from Moscow named \emph{Who Dies in Siberian Slush} was
chosen. This band is a little odd compared to the previous \gls{dm} bands
because it creates \gls{dom}. \gls{dom} is characterized by very slow tempos
and low-tuned guitars. The vocalist has a very characteristic growl and
performs in several Muscovite bands. This band also stands out because it
uses pianos and synthesizers. The droning synthesizers often operate in the
same frequency range as the vocals.

The training and test data is divided as follows:
\begin{table}[H]
	\centering
	\begin{tabular}{cc}
		\toprule
		Singing & Instrumental\\
		\midrule
		0.59 & 0.41\\
		\bottomrule
	\end{tabular}
	\quad
	\begin{tabular}{cccc}
		\toprule
		Instrumental & CC & DG & WDISS\\
		\midrule
		0.59 & 0.16 & 0.19 & 0.06\\
		\bottomrule
	\end{tabular}
\end{table}

\section{\gls{MFCC} Features}
The waveforms themselves are not very suitable as features due to their high
dimensionality and correlation. Therefore we use \glspl{MFCC} as feature
vectors, which have been shown to be
suitable~\cite{rocamora_comparing_2007}. It has also been found that altering
the mel scale to better suit singing does not yield better
performance~\cite{you_comparative_2015}. The actual conversion is done using
the \emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

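As an illustration, a minimal sketch of this conversion is given below,
assuming $16\,kHz$ mono input files produced by the preprocessing step; the
file name and the parameter values (the package defaults) are assumptions,
not necessarily the exact settings used in the experiments.

\begin{verbatim}
# Illustrative MFCC extraction with python_speech_features.
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("track01.wav")          # 16 kHz mono waveform
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,     # 25 ms window, 10 ms step
                numcep=13, nfilt=26)            # package defaults
print(features.shape)                           # (number of frames, 13)
\end{verbatim}
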
\gls{MFCC} features are inspired by human auditory processing and are built
incrementally in several steps; a code sketch of these steps follows the list.
\begin{enumerate}
	\item The first step in the process is converting the time representation
		of the signal to a spectral representation using a sliding window with
		overlap. The width of the window and the step size are two important
		parameters in the system. In classical phonetic analysis, window sizes
		of $25\,ms$ with a step of $10\,ms$ are often chosen because they are
		small enough to only contain subphone entities. Since it is impossible
		to sing anything within $25\,ms$, it is arguable that this window size
		is on the small side for singing.
	\item The standard \gls{FT} gives a spectral representation with linearly
		scaled frequencies. This scale is converted to the \gls{MS} using
		triangular overlapping windows to obtain a more tonotopic
		representation that tries to match the actual representation in the
		cochlea of the human ear.
	\item The \emph{Weber-Fechner} law describes how humans perceive physical
		magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
		Psychophysik}; it states that stimuli are perceived in logarithmic
		increments. This means that doubling the energy does not result in a
		doubling of the perceived loudness. Therefore, in this step the
		logarithm is taken of the energy or amplitude of the \gls{MS}
		frequency spectrum to more closely match human hearing.
	\item The amplitudes of the spectrum are highly correlated, and therefore
		the last step is a decorrelation step. \Gls{DCT} is applied to the
		amplitudes interpreted as a signal. \Gls{DCT} is a technique for
		describing a signal as a combination of several primitive cosine
		functions.
\end{enumerate}

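The code below is an illustrative re-implementation of the four steps using
NumPy and SciPy, not the extraction code actually used (that is handled by
the package mentioned above); the mel conversion uses the common formula
$m = 2595\log_{10}(1 + f/700)$ and all parameter values are assumptions.

\begin{verbatim}
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_sketch(signal, rate, winlen=0.025, winstep=0.01,
                nfilt=26, numcep=13, nfft=512):
    # Step 1: sliding window with overlap, power spectrum per frame.
    win, step = int(winlen * rate), int(winstep * rate)
    nframes = 1 + max(0, (len(signal) - win) // step)
    frames = np.stack([signal[i*step:i*step + win]
                       for i in range(nframes)])
    power = np.abs(np.fft.rfft(frames * np.hamming(win), nfft)) ** 2

    # Step 2: triangular filters spaced linearly on the mel scale.
    mels = np.linspace(hz_to_mel(0), hz_to_mel(rate / 2), nfilt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / rate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for j in range(nfilt):
        fbank[j, bins[j]:bins[j+1]] = np.linspace(
            0, 1, bins[j+1] - bins[j], endpoint=False)
        fbank[j, bins[j+1]:bins[j+2]] = np.linspace(
            1, 0, bins[j+2] - bins[j+1], endpoint=False)
    energies = power @ fbank.T

    # Step 3: logarithm of the filterbank energies (Weber-Fechner law).
    log_energies = np.log(energies + np.finfo(float).eps)

    # Step 4: decorrelate with the DCT, keep the first coefficients.
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :numcep]
\end{verbatim}
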
\section{\gls{ANN} Classifier}
\todo{Spectrals might be enough, no decorrelation}

\section{Experiments}
\subsection{\emph{Singing} voice detection}
The first type of experiment conducted is \emph{Singing} voice detection.
This is the act of segmenting an audio signal into segments that are labeled
either as \emph{Singing} or as \emph{Instrumental}. The input of the
classifier is a feature vector and the output is the probability that singing
is happening in the sample.

\begin{figure}[H]
	\centering
	\includegraphics[width=.5\textwidth]{bcann}
	\caption{Binary classifier network architecture}\label{fig:bcann}
\end{figure}

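As an illustration, a minimal Keras sketch of such a binary classifier is
given below; the framework, the hidden layer size and the input
dimensionality are assumptions and may differ from the architecture shown in
Figure~\ref{fig:bcann}.

\begin{verbatim}
# Hypothetical binary singing/instrumental classifier on MFCC frames.
# Layer sizes are assumptions, not the network from the figure above.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(13,)),  # 13 MFCCs per frame
    Dense(1, activation='sigmoid'),                   # P(singing) per frame
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
\end{verbatim}
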
\subsection{\emph{Singer} voice detection}
The second type of experiment conducted is \emph{Singer} voice detection.
This is the act of segmenting an audio signal into segments that are labeled
either with the name of the singer or as \emph{Instrumental}. The input of
the classifier is a feature vector and the outputs are probabilities for each
of the singers and a probability for the instrumental label.

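Under the same assumptions as the sketch in the previous subsection, only the
output layer changes: one softmax unit per singer plus one for the
instrumental label.

\begin{verbatim}
# Hypothetical four-way classifier: Instrumental, CC, DG and WDISS.
from keras.models import Sequential
from keras.layers import Dense

labels = ['Instrumental', 'CC', 'DG', 'WDISS']
model = Sequential([
    Dense(64, activation='relu', input_shape=(13,)),
    Dense(len(labels), activation='softmax'),  # one probability per label
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
\end{verbatim}
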
\section{Results}