%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data was collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono channel waveform with the
correct sample rate utilizing \emph{SoX}%
\footnote{\url{http://sox.sourceforge.net/}}. Every file is annotated using
Praat~\cite{boersma_praat_2002}, in which the utterances are manually aligned
to the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, where the
waveform, the $1$--$8000$Hz spectrogram and the annotations are shown. They
clearly show that, even within the genre of death metal, different spectral
patterns are visible over time.
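
The annotations can be turned into frame-level training labels by assigning
each analysis frame the label of the interval in which its centre falls. The
sketch below is an illustration of this idea only: the interval tuples, the
frame step and the label names are hypothetical and are not taken from the
actual annotation files.

\begin{verbatim}
def frame_labels(intervals, n_frames, frame_step=0.010):
    """Label each frame with the annotation interval containing its centre;
    frames outside every interval count as instrumental."""
    labels = []
    for i in range(n_frames):
        centre = i * frame_step + frame_step / 2.0
        label = 'instrumental'
        for start, end, mark in intervals:
            if start <= centre < end and mark:
                label = mark
                break
        labels.append(label)
    return labels

# Hypothetical intervals (start, end, label) read from a Praat TextGrid
intervals = [(12.3, 15.8, 'singing'), (20.1, 24.6, 'singing')]
labels = frame_labels(intervals, n_frames=3000)
\end{verbatim}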

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
\caption{A vocal segment of the \acrlong{CC} song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
\caption{A vocal segment of the \acrlong{DG} song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band, \gls{CC}, has
been producing \gls{dm} for almost 25 years and has created albums with a
consistent style. The singer of \gls{CC} has a very raspy growl and the
lyrics are quite comprehensible. The vocals produced by \gls{CC} border on
regular shouting.

The second band, \gls{DG}, makes even more violent sounding music. The growls
of the lead singer sound like a coffee grinder and are shallower. The
spectrogram clearly shows that overtones are produced during some parts of
the growling. The lyrics are completely incomprehensible and therefore some
parts were not annotated with the actual lyrics because it was impossible to
hear what was being sung.

Lastly, a band from Moscow named \gls{WDISS} is chosen. This band is a little
odd compared to the previous \gls{dm} bands because it creates \gls{dom}.
\gls{dom} is characterized by a very slow tempo and low-tuned guitars. The
vocalist has a very characteristic growl and performs in several Muscovite
bands. This band also stands out because it uses pianos and synthesizers. The
droning synthesizers often operate in the same frequency range as the vocals.

The training and test data are divided as follows:
\begin{table}[H]
\centering
\begin{tabular}{cc}
\toprule
Singing & Instrumental\\
\midrule
0.59 & 0.41\\
\bottomrule
\end{tabular}
\quad
\begin{tabular}{cccc}
\toprule
Instrumental & \gls{CC} & \gls{DG} & \gls{WDISS}\\
\midrule
0.59 & 0.16 & 0.19 & 0.06\\
\bottomrule
\end{tabular}
\end{table}

\section{\acrlong{MFCC} Features}
The waveforms themselves are not very suitable as features due to their high
dimensionality and correlation. Therefore we use the widely used \gls{MFCC}
feature vectors, which have been shown to be suitable%
\cite{rocamora_comparing_2007}. It has also been found that altering the mel
scale to better suit singing does not yield better
performance~\cite{you_comparative_2015}. The actual conversion is done using
the \emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

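A minimal sketch of such a conversion with this package is shown below. The
file name and the use of \emph{SciPy} to read the waveform are illustrative
assumptions; the window parameters shown are the package defaults, which are
varied in the experiments described later.

\begin{verbatim}
from scipy.io import wavfile
from python_speech_features import mfcc

# Read the mono waveform produced by SoX (file name is hypothetical)
rate, signal = wavfile.read('bloodstained_cement.wav')

# 13 coefficients per frame; appendEnergy replaces the first coefficient
# with the log of the total frame energy
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,
                numcep=13, appendEnergy=True)
# 'features' is an array of shape (number of frames, 13)
\end{verbatim}
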
\gls{MFCC} features are inspired by human auditory processing and are created
from a waveform incrementally using several steps:
\begin{enumerate}
\item The first step in the process is converting the time representation
of the signal to a spectral representation using a sliding window with
overlap. The width of the window and the step size are two important
parameters in the system. In classical phonetic analysis window sizes
of $25ms$ with a step of $10ms$ are often chosen because they are small
enough to only contain subphone entities. Sustaining a sung sound for only
$25ms$ is impossible, so it is arguable that this window size is very small
for singing.
\item The standard \gls{FT} gives a spectral representation that has
linearly scaled frequencies. This scale is converted to the \gls{MS}
using triangular overlapping windows to get a more tonotopic
representation, trying to match the actual representation in the cochlea
of the human ear.
\item The \emph{Weber-Fechner} law describes how humans perceive physical
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
Psychophysik}. Weber and Fechner found that energy is perceived in
logarithmic increments. This means that doubling the physical energy does
not double the perceived loudness. Therefore we take the log of
the energy or amplitude of the \gls{MS} spectrum to more closely match
human hearing.
\item The amplitudes of the spectrum are highly correlated and therefore
the last step is a decorrelation step. The \Gls{DCT} is applied to the
amplitudes interpreted as a signal. The \Gls{DCT} describes a signal as a
weighted combination of cosine functions of different frequencies.
\end{enumerate}

The default number of \gls{MFCC} parameters is twelve. However, often a
thirteenth value is added that represents the energy of the frame.

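To make the four steps concrete, the sketch below re-implements them directly
with NumPy and SciPy. It is a simplified illustration (no pre-emphasis, window
function or liftering) and not the implementation used in the experiments;
the package mentioned above performs the same steps internally.

\begin{verbatim}
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def simple_mfcc(signal, rate, winlen=0.025, winstep=0.01,
                nfilt=26, nfft=512, numcep=13):
    # Step 1: slice the waveform into overlapping frames and take the FFT
    frame_len = int(winlen * rate)
    frame_step = int(winstep * rate)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_step)
    frames = np.stack([signal[i * frame_step:i * frame_step + frame_len]
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft

    # Step 2: triangular overlapping windows on the mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(rate / 2.0), nfilt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / rate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for m in range(1, nfilt + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):
            fbank[m - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[m - 1, k] = (right - k) / max(right - centre, 1)
    mel_energies = power @ fbank.T

    # Step 3: logarithmic compression (Weber-Fechner)
    log_mel = np.log(np.maximum(mel_energies, 1e-10))

    # Step 4: decorrelate with the DCT and keep the first coefficients
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :numcep]
\end{verbatim}
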
\section{Experimental setup}
\subsection{Features}
The thirteen \gls{MFCC} features are chosen as input to the classifier. The
parameters of the \gls{MFCC} extraction are varied in window step and window
length. The default speech processing parameters are tested, but also bigger
window sizes, since arguably the minimal duration of a singing voice segment
is a lot bigger than the minimal duration of the subphone components on which
the default parameters are tuned. The parameters chosen are as follows:

\begin{table}[H]
\centering
\begin{tabular}{lll}
\toprule
step (ms) & length (ms) & notes\\
\midrule
10 & 25 & Standard speech processing\\
40 & 100 &\\
80 & 200 &\\
\bottomrule
\end{tabular}
\caption{\Gls{MFCC} parameter settings}
\end{table}

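All three settings can, for example, be computed in one sweep; the fragment
below reuses the \texttt{mfcc} call from the earlier sketch and the variable
names are hypothetical.

\begin{verbatim}
# (step, length) in seconds, matching the table above
settings = [(0.010, 0.025),   # standard speech processing
            (0.040, 0.100),
            (0.080, 0.200)]

features = {}
for step, length in settings:
    features[(step, length)] = mfcc(signal, samplerate=rate,
                                    winlen=length, winstep=step,
                                    numcep=13, appendEnergy=True)
\end{verbatim}
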
\subsection{\emph{Singing} voice detection}
The first type of experiment conducted is \emph{Singing} voice detection.
This is the act of segmenting an audio signal into segments that are labeled
either as \emph{Singing} or as \emph{Instrumental}. The input of the
classifier is a feature vector and the output is the probability that singing
occurs in the sample. This results in an \gls{ANN} of the shape described in
Figure~\ref{fig:bcann}. The input dimension is thirteen and the output
dimension is one.

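A minimal Keras sketch of such a binary network is given below. The number of
hidden nodes and the \texttt{adam} optimizer are illustrative choices and are
not fixed by the architecture itself.

\begin{verbatim}
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# Hidden layer, fully connected to the 13 MFCC inputs, ReLU activation
model.add(Dense(13, activation='relu', input_dim=13))
# Single sigmoid output node giving the probability of singing
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
\end{verbatim}
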
\subsection{\emph{Singer} voice detection}
The second type of experiment conducted is \emph{Singer} voice detection.
This is the act of segmenting an audio signal into segments that are labeled
either with the name of the singer or as \emph{Instrumental}. The input of
the classifier is a feature vector and the outputs are probabilities for each
of the singers and a probability for the instrumental label. This results in
an \gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input
dimension is again thirteen and the output dimension is the number of
categories. The output is one-hot encoded, meaning that the categories are
labeled as \texttt{1000, 0100, 0010, 0001}.

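The multiclass network differs only in its output layer and loss function. In
the sketch below the integer class indices and their ordering are an
assumption made purely for illustration.

\begin{verbatim}
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# Hypothetical integer frame labels: 0 = instrumental, 1-3 = the singers
integer_labels = [0, 0, 1, 2, 3, 0]
y_onehot = to_categorical(integer_labels, num_classes=4)

model = Sequential()
model.add(Dense(13, activation='relu', input_dim=13))
# Four softmax outputs, one per category, matching the one-hot labels
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
\end{verbatim}
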
\subsection{\acrlong{ANN}}
The data is classified using standard \gls{ANN} techniques, namely
\glspl{MLP}. The classification problems are only binary and four-class, so
it is interesting to see where the bottleneck lies: how small the hidden
layer can be made while still solving the task. The \gls{ANN} is built with
Keras\footnote{\url{https://keras.io}} using the
TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}} backend,
which provides a high-level interface to the underlying network
implementation.

The general architecture of the networks is shown in Figure~\ref{fig:bcann}
and Figure~\ref{fig:mcann} for the binary and multiclass classification,
respectively. The inputs are fully connected to the hidden layer, which is in
turn fully connected to the output layer. The activation function used in the
hidden layer is the \gls{RELU}. The \gls{RELU} function is a monotonic
one-sided function that is also known as the ramp function. The definition is
given in Equation~\ref{eq:relu}. The \gls{RELU} was chosen because of its
efficient computation. The activation function between the hidden layer and
the output layer is the sigmoid function in the case of binary
classification, of which the definition is shown in
Equation~\ref{eq:sigmoid}. The sigmoid is a monotonic function that is
differentiable for all values of $x$ and always yields a non-negative
derivative. For the multiclass classification the softmax function is used
between the hidden layer and the output layer. Softmax is an activation
function suitable for multiple output nodes. The definition is given in
Equation~\ref{eq:softmax}.

The data is shuffled before being fed to the network to mitigate the risk of
overfitting on one album. Every model was trained using $10$ epochs and a
batch size of $32$.

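The training call that follows from these settings is sketched below; it
continues the binary classifier sketch above, and the random arrays are
hypothetical stand-ins for the real feature matrix and frame labels.

\begin{verbatim}
import numpy as np

# Hypothetical stand-ins for the real feature matrix and frame labels
X = np.random.rand(1000, 13).astype('float32')
y = np.random.randint(0, 2, size=(1000, 1))

# Shuffle the frames so that no single album dominates the training order
perm = np.random.permutation(len(X))
X, y = X[perm], y[perm]

model.fit(X, y, epochs=10, batch_size=32)
\end{verbatim}
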
\begin{equation}\label{eq:relu}
f(x) = \left\{\begin{array}{rcl}
0 & \text{for} & x<0\\
x & \text{for} & x \geq 0\\
\end{array}\right.
\end{equation}

\begin{equation}\label{eq:sigmoid}
f(x) = \frac{1}{1+e^{-x}}
\end{equation}

\begin{equation}\label{eq:softmax}
\delta{(\boldsymbol{z})}_j = \frac{e^{z_j}}{\sum\limits^{K}_{k=1}e^{z_k}}
\end{equation}

\begin{figure}[H]
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{bcann}
\caption{Binary classifier network architecture}\label{fig:bcann}
\end{subfigure}%
%
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{mcann}
\caption{Multiclass classifier network architecture}\label{fig:mcann}
\end{subfigure}
\caption{\acrlong{ANN} architectures.}
\end{figure}

\section{Results}
\subsection{\emph{Singing} voice detection}

\begin{table}[H]
\centering
\begin{tabular}{rccc}
\toprule
& \multicolumn{3}{c}{Parameters (step/length)}\\
& 10/25 & 40/100 & 80/200\\
\midrule
3h & 0.86 (0.34) & 0.87 (0.32) & 0.85 (0.35)\\
5h & 0.87 (0.31) & 0.88 (0.30) & 0.87 (0.32)\\
8h & 0.88 (0.30) & 0.88 (0.31) & 0.88 (0.29)\\
13h & 0.89 (0.28) & 0.89 (0.29) & 0.88 (0.30)\\
\bottomrule
\end{tabular}
\caption{Binary classification results (accuracy (loss))}
\end{table}

\begin{figure}[H]
\centering
\includegraphics[width=.6\linewidth]{bclass}
\caption{Plotting the classifier under the audio signal}
\end{figure}

\subsection{\emph{Singer} voice detection}

\begin{table}[H]
\centering
\begin{tabular}{rccc}
\toprule
& \multicolumn{3}{c}{Parameters (step/length)}\\
& 10/25 & 40/100 & 80/200\\
\midrule
3h & 0.83 (0.48) & 0.82 (0.48) & 0.82 (0.48)\\
5h & 0.85 (0.43) & 0.84 (0.44) & 0.84 (0.44)\\
8h & 0.86 (0.41) & 0.86 (0.39) & 0.86 (0.40)\\
13h & 0.87 (0.37) & 0.87 (0.38) & 0.86 (0.39)\\
\bottomrule
\end{tabular}
\caption{Multiclass classification results (accuracy (loss))}
\end{table}

\subsection{Alien data}
To test the generalizability of the models, the system is tested on alien
data. The data was retrieved from the album \emph{The Desperation} by
\emph{Godless Truth}. \emph{Godless Truth} is a so-called old-school \gls{dm}
band with very raspy vocals that are very up front in the mastering. This
means that the vocals are very prevalent in the recording and therefore no
difficulty is expected for the classifier. Figure~\ref{fig:alien1} shows that
the classifier indeed scores very accurately. Note that the spectrogram
settings have been adjusted slightly to make the picture clearer. The
spectrogram shows the frequency range from $0$ to $3000$Hz.

\begin{figure}[H]
\centering
\includegraphics[width=.6\linewidth]{alien1}
\caption{Plotting the classifier under similar alien data}\label{fig:alien1}
\end{figure}

To really test the limits, a song from the highly atmospheric doom metal band
\emph{Catacombs} has been tested on the system. The album \emph{Echoes
Through the Catacombs} is an album that has a lot of synthesizers, heavy
droning guitars and bass lines. The vocals are not mixed in a way that makes
them stand out. The models have never seen training data that is even
remotely similar to this type of metal. Figure~\ref{fig:alien2} shows a
segment of the data. Here it is clearly visible that the classifier cannot
distinguish singing from non-singing.

\begin{figure}[H]
\centering
\includegraphics[width=.6\linewidth]{alien2}
\caption{Plotting the classifier under different alien data}\label{fig:alien2}
\end{figure}