%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To answer the research question, several experiments have been performed. Data
has been collected from several \gls{dm} and \gls{dom} albums. The exact data
used is available in Appendix~\ref{app:data}. The albums are extracted from the
audio CD and converted to a mono channel waveform with the correct sample rate
utilizing \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. Every file
is annotated using Praat~\cite{boersma_praat_2002}, in which the lyrics are
manually aligned to the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which display
the waveform, the $1$--$8000$\,Hz spectrogram and the annotations. They clearly
show that within the genre of death metal different spectral patterns occur
over time.
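
To make this preprocessing step concrete, a minimal sketch of the conversion is
given below. It calls \emph{SoX} from Python; the target sample rate of
16\,kHz and the file names are illustrative assumptions, as the exact rate is
not prescribed here.

\begin{verbatim}
import subprocess

def to_mono_wav(src, dst, rate=16000):
    # Convert an album track to a mono waveform at a fixed sample rate
    # using SoX. The 16 kHz rate and the file names are assumptions
    # chosen for illustration only.
    subprocess.run(["sox", src, "-c", "1", "-r", str(rate), dst],
                   check=True)

to_mono_wav("track01.flac", "track01.wav")
\end{verbatim}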

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
\caption{A vocal segment of the Cannibal Corpse song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
\caption{A vocal segment of the Disgorge song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band, \gls{CC}, has
been producing \gls{dm} for almost 25 years and has created albums with a
consistent style. The singer of \gls{CC} has a very raspy growl and the lyrics
are quite comprehensible. The vocals produced by \gls{CC} are very close to
regular shouting.

The second band, \gls{DG}, makes even more violent-sounding music. The growls
of the lead singer sound like a coffee grinder and are less full. In the
spectrograms it is clearly visible that overtones are produced during some
parts of the growling. The lyrics are completely incomprehensible; therefore
some parts were not annotated with the actual lyrics because it was impossible
to hear what was being sung.

The third band, bearing the name \gls{WDISS}, originates from Moscow. This
band is a little odd compared to the previous \gls{dm} bands because it
creates \gls{dom}. \gls{dom} is characterized by a very slow tempo and
low-tuned guitars. The vocalist has a very characteristic growl and performs
in several Muscovite bands. This band also stands out because it uses pianos
and synthesizers. The droning synthesizers often operate in the same frequency
range as the vocals.

Additional details about the dataset are listed in Appendix~\ref{app:data}.
The data is labeled as either singing or instrumental, and is additionally
labeled per band. The resulting distribution is shown in
Table~\ref{tbl:distribution}.
\begin{table}[H]
\centering
\begin{tabular}{cc}
\toprule
Instrumental & Singing\\
\midrule
0.59 & 0.41\\
\bottomrule
\end{tabular}
\quad
\begin{tabular}{cccc}
\toprule
Instrumental & \gls{CC} & \gls{DG} & \gls{WDISS}\\
\midrule
0.59 & 0.16 & 0.19 & 0.06\\
\bottomrule
\end{tabular}
\caption{Data distribution: singing versus instrumental (left) and the same
data split per band (right)}\label{tbl:distribution}
\end{table}

\section{Mel-frequency Cepstral Features}
The waveforms in themselves are not very suitable to be used as features due
to their high dimensionality and correlation in the temporal domain. Therefore
we use the widely used \glspl{MFCC} feature vectors, which have been shown to
be suitable for speech processing~\cite{rocamora_comparing_2007}. It has also
been found that altering the mel scale to better suit singing does not yield
better performance~\cite{you_comparative_2015}. The actual conversion is done
using the \emph{python\_speech\_features}\footnote{\url{%
https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are inspired by human auditory processing and are
created from a waveform incrementally in several steps (a minimal sketch of
these steps is shown after the list):
\begin{enumerate}
\item The first step in the process is converting the time representation
of the signal to a spectral representation using a sliding analysis
window with overlap. The width of the window and the step size are two
important parameters in the system. In classical phonetic analysis
window sizes of $25\,ms$ with a step of $10\,ms$ are often chosen because
they are small enough to contain just one subphone event. A sung segment
is much longer than $25\,ms$, so it might be necessary to increase the
window size.
\item The standard \gls{FT} gives a spectral representation with
linearly scaled frequencies. This scale is converted to the \gls{MS}
using triangular overlapping windows, yielding a more tonotopic
representation that better matches how frequencies are represented in
the cochlea of the human ear.
\item The \emph{Weber-Fechner} law describes how humans perceive physical
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
Psychophysik}. It states that intensity is perceived in logarithmic
increments: twice the amount of energy does not result in twice the
amount of perceived loudness. Therefore we take the logarithm of the
energy or amplitude of the \gls{MS} spectrum to more closely match
human hearing.
\item The amplitudes of the spectrum are highly correlated and therefore
the last step is a decorrelation step. The \gls{DCT} is applied to the
amplitudes, interpreted as a signal. The \gls{DCT} is a technique for
describing a signal as a combination of several primitive cosine
functions.
\end{enumerate}
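
As a minimal illustration of these four steps, the sketch below computes the
coefficients for a single pre-framed analysis window. The \gls{FT} size, the
number of mel filters and the use of \texttt{get\_filterbanks} from the
\emph{python\_speech\_features} package to build the triangular filters are
assumptions made for the sake of the example.

\begin{verbatim}
import numpy as np
from scipy.fftpack import dct
from python_speech_features.base import get_filterbanks

def mfcc_for_frame(frame, samplerate=16000, nfft=512, nfilt=26, numcep=13):
    # Step 1: spectral representation of one analysis window.
    power = np.abs(np.fft.rfft(frame, n=nfft)) ** 2 / nfft
    # Step 2: warp the linear frequency axis onto the mel scale with
    # triangular overlapping filters (shape: nfilt x (nfft/2 + 1)).
    filters = get_filterbanks(nfilt=nfilt, nfft=nfft, samplerate=samplerate)
    mel_energies = np.dot(filters, power)
    # Step 3: logarithmic compression (Weber-Fechner).
    log_energies = np.log(mel_energies + np.finfo(float).eps)
    # Step 4: decorrelate with the discrete cosine transform and keep
    # the first numcep coefficients.
    return dct(log_energies, type=2, norm='ortho')[:numcep]
\end{verbatim}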

The default number of \gls{MFCC} parameters is twelve. However, often a
thirteenth value is added that represents the energy in the analysis window.
Here $c_0$ is chosen: the zeroth \gls{MFCC}, which represents the overall
energy in the \gls{MS}. Another option would be $\log{(E)}$, the logarithm of
the raw energy of the sample.
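
In practice the whole chain is performed by a single call to the package. A
minimal sketch is shown below; the choice of \texttt{appendEnergy=False} is an
assumption that matches keeping $c_0$ as described above, since with
\texttt{appendEnergy=True} the package replaces $c_0$ by the logarithm of the
raw frame energy.

\begin{verbatim}
import scipy.io.wavfile as wav
from python_speech_features import mfcc

# Read the mono waveform produced in the preprocessing step
# (the file name is a placeholder).
(rate, signal) = wav.read("track01.wav")

# Thirteen coefficients per frame, keeping c_0 as the first one.
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,
                numcep=13, appendEnergy=False)
print(features.shape)  # (number of frames, 13)
\end{verbatim}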

\section{Artificial Neural Network}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary or four-class problems, so it is
interesting to see where the bottleneck lies and how abstract the
representation can be made. The \gls{ANN} is built with
Keras\footnote{\url{https://keras.io}} using the
TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}} backend;
Keras provides a high-level interface to the underlying network
implementation.

The general architecture of the networks is shown in Figure~\ref{fig:bcann}
and Figure~\ref{fig:mcann} for the binary and multiclass classification
respectively. The inputs are fully connected to the hidden layer, which is
fully connected to the output layer. The activation function used in the
hidden layer is a \gls{RELU}. The \gls{RELU} function is a monotonic one-sided
function that is also known as the ramp function. The definition is given in
Equation~\ref{eq:relu}. \gls{RELU} has the downside that it can create
unreachable (dead) nodes in a deep network. This is not a problem in this
network since it only has one hidden layer. \gls{RELU} was also chosen because
it is efficient to compute and biologically inspired.

The activation function between the hidden layer and the output layer is the
sigmoid function in the case of binary classification; its definition is shown
in Equation~\ref{eq:sigmoid}. The sigmoid is a monotonic function that is
differentiable for all values of $x$ and always yields a positive derivative.
For multiclass classification the softmax function is used between the hidden
layer and the output layer. Softmax is an activation function suitable for
multiple output nodes; its definition is given in Equation~\ref{eq:softmax}.

The data is shuffled before being fed to the network to mitigate the risk of
overfitting on one album. Every model was trained for $10$ epochs, which means
that all training data is offered to the model $10$ times. The training and
test sets are created by splitting off a $90\%$ slice of all the data for
training and keeping the remaining $10\%$ for testing.
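
A minimal Keras sketch of the binary set-up of Figure~\ref{fig:bcann} and the
training procedure described above is given below. The number of hidden nodes,
the optimizer and the use of \texttt{validation\_split} to hold out the $10\%$
test slice are illustrative assumptions, not choices prescribed by this text.

\begin{verbatim}
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# x: MFCC feature vectors (frames x 13), y: 0 = instrumental, 1 = singing.
# The file names are placeholders.
x = np.load("features.npy")
y = np.load("labels.npy")

model = Sequential()
# One hidden layer with ReLU, fully connected to a single sigmoid output.
model.add(Dense(13, activation='relu', input_dim=13))  # hidden size assumed
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',                        # optimizer assumed
              loss='binary_crossentropy', metrics=['accuracy'])

# Shuffle the data, hold out 10% and train for 10 epochs.
model.fit(x, y, epochs=10, shuffle=True, validation_split=0.1)
\end{verbatim}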

\begin{equation}\label{eq:relu}
f(x) = \left\{\begin{array}{rcl}
0 & \text{for} & x<0\\
x & \text{for} & x \geq 0\\
\end{array}\right.
\end{equation}

\begin{equation}\label{eq:sigmoid}
f(x) = \frac{1}{1+e^{-x}}
\end{equation}

\begin{equation}\label{eq:softmax}
\delta{(\boldsymbol{z})}_j = \frac{e^{z_j}}{\sum\limits^{K}_{k=1}e^{z_k}}
\end{equation}

\begin{figure}[H]
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{bcann}
\caption{Binary classifier network architecture}\label{fig:bcann}
\end{subfigure}%
%
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{mcann}
\caption{Multiclass classifier network architecture}\label{fig:mcann}
\end{subfigure}
\caption{Artificial Neural Network architectures.}
\end{figure}

\section{Experimental setup}
\subsection{Features}
The thirteen \gls{MFCC} features are used as the input. The parameters of the
\gls{MFCC} extraction are varied in window step and window length. The default
speech processing parameters are tested, but also bigger window sizes, since
arguably the minimal length of a singing voice segment is a lot larger than
the minimal length of the subphone component for which the default parameters
are tuned. The parameters chosen are as follows:

\begin{table}[H]
\centering
\begin{tabular}{lll}
\toprule
step (ms) & length (ms) & notes\\
\midrule
10 & 25 & Standard speech processing\\
40 & 100 &\\
80 & 200 &\\
\bottomrule
\end{tabular}
\caption{\Gls{MFCC} parameter settings}
\end{table}
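
The same extraction call can be repeated for each setting. A short sketch is
given below, with the step and length expressed in seconds as expected by
\emph{python\_speech\_features}.

\begin{verbatim}
from python_speech_features import mfcc

# (step, length) in seconds, matching the rows of the table above.
SETTINGS = [(0.010, 0.025),   # standard speech processing
            (0.040, 0.100),
            (0.080, 0.200)]

def extract_all(signal, rate):
    # One (frames x 13) feature matrix per parameter setting.
    return [mfcc(signal, samplerate=rate,
                 winstep=step, winlen=length, numcep=13)
            for (step, length) in SETTINGS]
\end{verbatim}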

\subsection{\emph{Singing}-voice detection}
The first type of experiment conducted is \emph{Singing}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
feature vector and the output is the probability that singing occurs in the
sample. This results in an \gls{ANN} of the shape described in
Figure~\ref{fig:bcann}: the input dimension is thirteen and the output
dimension is one.

\subsection{\emph{Singer}-voice detection}
The second type of experiment conducted is \emph{Singer}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label. This results in an
\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input
dimension is again thirteen and the output dimension is the number of
categories. The output is one-hot encoded, meaning that the categories are
labeled as \texttt{1000, 0100, 0010, 0001}.
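
A sketch of this multiclass set-up is given below. The \texttt{to\_categorical}
helper from Keras produces exactly this one-hot encoding; the label order, the
hidden-layer size and the optimizer are again illustrative assumptions.

\begin{verbatim}
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# y: integer class labels 0..3 (label order assumed: instrumental,
# CC, DG, WDISS); to_categorical turns them into the one-hot vectors
# 1000, 0100, 0010 and 0001. The file names are placeholders.
x = np.load("features.npy")
y = to_categorical(np.load("labels.npy"), num_classes=4)

model = Sequential()
model.add(Dense(13, activation='relu', input_dim=13))  # hidden size assumed
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x, y, epochs=10, shuffle=True, validation_split=0.1)
\end{verbatim}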