%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To answer the research question, several experiments have been performed. Data
has been collected from several \gls{dm} and \gls{dom} albums. The exact data
used is available in Appendix~\ref{app:data}. The albums are extracted from
the audio CDs and converted to a mono channel waveform with the correct sample
rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. Every file
is annotated using Praat~\cite{boersma_praat_2002}, in which the lyrics are
manually aligned to the audio. Example utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which contain
the waveform, the $1$--$8000$ Hz spectrogram and the annotations. It is clearly
visible that within the genre of death metal different spectral patterns occur
over time.
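
To make the conversion step concrete, the sketch below shows how such a
conversion could be scripted around the SoX command line tool. The target
sample rate of 16 kHz and the file locations are assumptions for illustration;
the text above only states that a mono waveform with the correct sample rate
is produced.
\begin{verbatim}
import subprocess
from pathlib import Path

# Hypothetical locations; the actual albums are listed in Appendix A.
SOURCE = Path("rips")        # waveforms extracted from the audio CDs
TARGET = Path("mono16k")     # mono files at the assumed 16 kHz sample rate
TARGET.mkdir(exist_ok=True)

for track in sorted(SOURCE.glob("*.wav")):
    out = TARGET / track.name
    # sox <in> -r <rate> -c <channels> <out>: resample and downmix to mono
    subprocess.run(
        ["sox", str(track), "-r", "16000", "-c", "1", str(out)],
        check=True)
\end{verbatim}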

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
\caption{A vocal segment of the Cannibal Corpse song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
\caption{A vocal segment of the Disgorge song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band is called
\gls{CC} and has been producing \gls{dm} with a consistent style for almost 25
years. The singer of \gls{CC} has a very raspy growl and the lyrics are quite
comprehensible. The vocals produced by \gls{CC} are very close to regular
shouting.

The second band is called \gls{DG} and makes music that sounds even more
violent. The growls of the lead singer sound like a coffee grinder and are
less full. In the spectrograms it is clearly visible that overtones are
produced during some parts of the growling. The lyrics are completely
incomprehensible and therefore some parts were not annotated with the actual
lyrics because it was impossible to hear what was being sung.

The third band, bearing the name \gls{WDISS}, originates from Moscow. This
band is a little odd compared to the previous \gls{dm} bands because it
creates \gls{dom}. \gls{dom} is characterized by very slow tempos and low
tuned guitars. The vocalist has a very characteristic growl and performs in
several Muscovite bands. This band also stands out because it uses pianos and
synthesizers. The droning synthesizers often operate in the same frequency
range as the vocals.

Additional details about the dataset are listed in Appendix~\ref{app:data}.
The data is labeled as singing or instrumental, and the singing parts are
additionally labeled per band. The resulting distribution is shown in
Table~\ref{tbl:distribution}.
\begin{table}[H]
\centering
\begin{tabular}{cc}
\toprule
Instrumental & Singing\\
\midrule
0.59 & 0.41\\
\bottomrule
\end{tabular}
\quad
\begin{tabular}{cccc}
\toprule
Instrumental & \gls{CC} & \gls{DG} & \gls{WDISS}\\
\midrule
0.59 & 0.16 & 0.19 & 0.06\\
\bottomrule
\end{tabular}
\caption{Proportional data distribution}\label{tbl:distribution}
\end{table}

\section{Mel-frequency Cepstral Features}
The waveforms themselves are not very suitable as features due to their high
dimensionality and correlation in the temporal domain. Therefore the widely
used \glspl{MFCC} feature vectors are used, which have been shown to be
suitable for speech processing~\cite{rocamora_comparing_2007}. It has also
been found that altering the mel scale to better suit singing does not yield
better performance~\cite{you_comparative_2015}. The actual conversion is done
using the \emph{python\_speech\_features}\footnote{\url{%
https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are inspired by human auditory processing and are
created from a waveform incrementally in several steps:
\begin{enumerate}
\item The first step in the process is converting the time representation
of the signal to a spectral representation using a sliding analysis
window with overlap. The width of the window and the step size are two
important parameters in the system. In classical phonetic analysis
window sizes of $25$\,ms with a step of $10$\,ms are often chosen because
they are small enough to contain just one subphone event.
\item The standard \gls{FT} gives a spectral representation that has
linearly scaled frequencies. This scale is converted to the \gls{MS}
using triangular overlapping windows to get a more tonotopic
representation that tries to match the actual representation in the
cochlea of the human ear.
\item The \emph{Weber-Fechner} law describes how humans perceive physical
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
Psychophysik}. It states that energy is perceived in logarithmic
increments. This means that twice the amount of energy does not mean
twice the amount of perceived loudness. Therefore the logarithm of the
energy or amplitude of the \gls{MS} spectrum is taken to more closely
match human hearing.
\item The amplitudes of the spectrum are highly correlated and therefore
the last step is a decorrelation step. \Gls{DCT} is applied to the
amplitudes interpreted as a signal. \Gls{DCT} is a technique that
describes a signal as a combination of several primitive cosine
functions.
\end{enumerate}
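
To make these four steps concrete, the sketch below implements them with NumPy
and SciPy. It is a minimal illustration of the pipeline rather than the exact
implementation used; the actual features are computed with
\emph{python\_speech\_features}, and the FFT size ($512$) and number of mel
filters ($26$) shown here are assumed defaults rather than values stated above.
\begin{verbatim}
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700.0)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595.0) - 1)

def mfcc_sketch(signal, rate, winlen=0.025, winstep=0.01,
                nfilt=26, nfft=512, numcep=13):
    # Step 1: slide an analysis window over the signal and take the
    # power spectrum of every (windowed) frame.  The signal is assumed
    # to be at least one window long.
    flen, fstep = int(winlen * rate), int(winstep * rate)
    nframes = 1 + (len(signal) - flen) // fstep
    frames = np.stack([signal[i * fstep:i * fstep + flen]
                       for i in range(nframes)])
    power = np.abs(np.fft.rfft(frames * np.hamming(flen), nfft)) ** 2

    # Step 2: triangular overlapping filters spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(rate / 2), nfilt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / rate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for j in range(nfilt):
        left, centre, right = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, left:centre] = (np.arange(left, centre) - left) \
            / max(centre - left, 1)
        fbank[j, centre:right] = (right - np.arange(centre, right)) \
            / max(right - centre, 1)
    energies = power @ fbank.T

    # Step 3: logarithm of the filterbank energies (Weber-Fechner).
    log_energies = np.log(np.maximum(energies, np.finfo(float).eps))

    # Step 4: DCT to decorrelate; keep the first `numcep` coefficients.
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :numcep]
\end{verbatim}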

The default number of \gls{MFCC} parameters is twelve. However, often a
thirteenth value is added that represents the energy in the analysis window.
Here $c_0$ is chosen, the zeroth \gls{MFCC}, which represents the average over
all \gls{MS} bands. Another option would be $\log{(E)}$, the logarithm of the
raw energy of the sample.
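
With \emph{python\_speech\_features} the full conversion reduces to a single
call. The sketch below is one way this could look; the file name is
hypothetical, and keeping $c_0$ instead of $\log{(E)}$ is interpreted here as
leaving the package's \texttt{appendEnergy} option disabled.
\begin{verbatim}
from scipy.io import wavfile
from python_speech_features import mfcc

# Hypothetical input file produced by the SoX preprocessing step.
rate, signal = wavfile.read("mono16k/track01.wav")

# 13 coefficients; appendEnergy=False keeps c0 instead of replacing it
# with the logarithm of the raw frame energy (log(E)).
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,
                numcep=13, appendEnergy=False)

print(features.shape)   # (number of analysis windows, 13)
\end{verbatim}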

\section{Artificial Neural Network}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary or four-class problems, so it is
interesting to see where the bottleneck lies: how abstract can the abstraction
be made. The \gls{ANN} is built with Keras\footnote{\url{https://keras.io}}
using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
backend, which provides a high-level interface to the underlying networks.

The general architecture of the networks is shown in Figure~\ref{fig:bcann}
and Figure~\ref{fig:mcann} for the binary and multiclass classification
respectively. The inputs are fully connected to the hidden layer, which is
fully connected to the output layer. The activation function used in the
hidden layer is a \gls{RELU}. The \gls{RELU} function is a monotonic one-sided
function that is also known as the ramp function. The definition is given in
Equation~\ref{eq:relu}. \gls{RELU} has the downside that it can create dead
nodes in a deep network. This is not a problem in this network since it only
has one hidden layer. \gls{RELU} was also chosen because it is efficient to
compute and inspired by nature.

The activation function between the hidden layer and the output layer is the
sigmoid function in the case of binary classification, of which the definition
is shown in Equation~\ref{eq:sigmoid}. The sigmoid is a monotonic function
that is differentiable for all values of $x$ and always yields a non-negative
derivative. For the multiclass classification the softmax function is used
between the hidden layer and the output layer. Softmax is an activation
function suitable for multiple output nodes since it normalizes the outputs
into a probability distribution over the classes. The definition is given in
Equation~\ref{eq:softmax}.

The data is shuffled before being fed to the network to mitigate the risk of
overfitting on one album. Every model was trained for $10$ epochs, which means
that all training data is offered to the model $10$ times. The training and
test sets are separated by taking a $90\%$ slice of all the data for training
and keeping the remaining $10\%$ for testing.
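
The sketch below shows how the binary classifier and its training regime could
be expressed in Keras. Only the thirteen inputs, the \gls{RELU} hidden layer,
the sigmoid output, the cross-entropy loss used in the experiments below, the
$10$ epochs and the $90\%$/$10\%$ split follow from the text; the hidden layer
size and the optimizer are assumptions.
\begin{verbatim}
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

def build_binary_model(n_hidden=64):
    # n_hidden is an assumption; the text only fixes input (13) and output (1).
    model = Sequential()
    model.add(Dense(n_hidden, activation='relu', input_dim=13))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',        # optimizer not specified in the text
                  metrics=['accuracy'])
    return model

def train(model, features, labels):
    # Shuffle to avoid overfitting on one album, then use a 90/10 split.
    order = np.random.permutation(len(features))
    features, labels = features[order], labels[order]
    split = int(0.9 * len(features))
    model.fit(features[:split], labels[:split],
              epochs=10,
              validation_data=(features[split:], labels[split:]))
    return model
\end{verbatim}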

\begin{equation}\label{eq:relu}
f(x) = \left\{\begin{array}{rcl}
0 & \text{for} & x<0\\
x & \text{for} & x \geq 0\\
\end{array}\right.
\end{equation}

\begin{equation}\label{eq:sigmoid}
f(x) = \frac{1}{1+e^{-x}}
\end{equation}

\begin{equation}\label{eq:softmax}
\delta{(\boldsymbol{z})}_j = \frac{e^{z_j}}{\sum\limits^{K}_{k=1}e^{z_k}}
\end{equation}

\begin{figure}[H]
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{bcann}
\caption{Binary classifier network architecture}\label{fig:bcann}
\end{subfigure}%
%
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{mcann}
\caption{Multiclass classifier network architecture}\label{fig:mcann}
\end{subfigure}
\caption{Artificial Neural Network architectures.}
\end{figure}

\section{Experimental Setup}
\subsection{Features}
The thirteen \gls{MFCC} features are used as the input. The parameters of the
\gls{MFCC} features are varied in window step and window length. The default
speech processing parameters are tested, but also bigger window sizes, since
arguably the minimal length of a singing voice segment is a lot bigger than
the minimal length of the subphone components for which the default parameters
are tuned. The parameters chosen are as follows:

\begin{table}[H]
\centering
\begin{tabular}{lll}
\toprule
Step (ms) & Length (ms) & Notes\\
\midrule
10 & 25 & Standard speech processing\\
40 & 100 &\\
80 & 200 &\\
\bottomrule
\end{tabular}
\caption{\Gls{MFCC} parameter settings}
\end{table}
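
Assuming the \emph{python\_speech\_features} call shown earlier, the three
settings from the table could be generated as in the sketch below; the only
real content is the conversion from milliseconds to the seconds expected by
the package.
\begin{verbatim}
from python_speech_features import mfcc

# (window step, window length) in milliseconds, as in the table above.
SETTINGS = [(10, 25), (40, 100), (80, 200)]

def feature_sets(signal, rate):
    # One MFCC matrix per parameter setting; python_speech_features
    # expects the window parameters in seconds rather than milliseconds.
    return {(step, length): mfcc(signal, samplerate=rate,
                                 winstep=step / 1000.0,
                                 winlen=length / 1000.0,
                                 numcep=13, appendEnergy=False)
            for step, length in SETTINGS}
\end{verbatim}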

\subsection{\emph{Singing}-voice detection}
The first type of experiment conducted is \emph{Singing}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
feature vector and the output is the probability that singing occurs in the
sample. This results in an \gls{ANN} of the shape described in
Figure~\ref{fig:bcann}. The input dimension is thirteen and the output
dimension is one.

The \emph{cross-entropy} function is used as the loss function. The formula is
shown in Equation~\ref{eq:bincross}, where $p$ is the true distribution and
$q$ is the classification. Accuracy is the mean of the absolute differences
between the prediction $\hat{y}_i$ and the true value $y_i$. The formula is
shown in Equation~\ref{eq:binacc}.

\begin{equation}\label{eq:bincross}
H(p,q) = -\sum_x p(x)\log{q(x)}
\end{equation}

\begin{equation}\label{eq:binacc}
\frac{1}{n}\sum^{n}_{i=1} \lvert \hat{y}_i - y_i \rvert
\end{equation}
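
For reference, the two quantities can be written out directly in NumPy. This
is an illustrative reimplementation, not the Keras code used in the
experiments.
\begin{verbatim}
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # H(p, q) averaged over the samples; eps avoids log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

def mean_absolute_difference(y_true, y_pred):
    # The quantity from Equation eq:binacc.
    return np.mean(np.abs(y_pred - y_true))
\end{verbatim}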

\subsection{\emph{Singer}-voice detection}
The second type of experiment conducted is \emph{Singer}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label. This results in an
\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input
dimension is again thirteen and the output dimension is the number of
categories. The output is one-hot encoded, which means that the four
categories in the experiments are labeled as \texttt{1000}, \texttt{0100},
\texttt{0010} and \texttt{0001}.
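
In Keras this encoding could be produced with \texttt{to\_categorical}; the
integer class indices in the sketch below (instrumental, \gls{CC}, \gls{DG},
\gls{WDISS}) are an assumed ordering, since the text does not fix one.
\begin{verbatim}
import numpy as np
from keras.utils import to_categorical

# Assumed class ordering: 0 = instrumental, 1 = CC, 2 = DG, 3 = WDISS.
labels = np.array([0, 1, 1, 3, 2, 0])

one_hot = to_categorical(labels, num_classes=4)
# one_hot[0] == [1., 0., 0., 0.]  corresponds to the label 1000 above
\end{verbatim}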

The loss function is the same as in \emph{Singing}-voice detection. The
accuracy is calculated a little differently since the output of the network
is not one probability but a vector of probabilities. A sample counts as
correct when the index of the highest value in the predicted vector matches
the index of the one in the one-hot encoded label. The exact formula is shown
in Equation~\ref{eq:catacc}, where $[\cdot]$ is the Iverson bracket, which
evaluates to $1$ when the condition holds and to $0$ otherwise.

\begin{equation}\label{eq:catacc}
\frac{1}{n}\sum^{n}_{i=1} \left[\operatorname{argmax}(\hat{y}_i) = \operatorname{argmax}(y_i)\right]
\end{equation}
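
Written out in NumPy, this metric looks as follows; again this is an
illustration rather than the Keras implementation used in the experiments.
\begin{verbatim}
import numpy as np

def categorical_accuracy(y_true, y_pred):
    # Fraction of samples whose highest predicted probability falls on the
    # same class as the one in the one-hot encoded label (Equation eq:catacc).
    return np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))
\end{verbatim}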