%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
Several experiments have been performed to gain insight into the research
question. For these experiments, data has been collected from several \gls{dm}
albums. The exact data used is listed in Appendix~\ref{app:data}. The albums
are extracted from the audio CD and converted to a mono channel waveform with
the required sample rate using \emph{SoX}%
\footnote{\url{http://sox.sourceforge.net/}}. Every file is annotated using
Praat~\cite{boersma_praat_2002}, in which the lyrics are manually aligned to
the audio. Example utterances are shown in Figure~\ref{fig:bloodstained} and
Figure~\ref{fig:abominations}, which show the waveform, the $1$--$8000$Hz
spectrogram and the annotation. They make clear that, within the genre of
death metal, different spectral patterns are visible over time.

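The conversion step can be scripted. The sketch below calls \emph{SoX} from
Python; the $16$kHz target rate and the file names are illustrative
assumptions, not the exact settings used in the experiments.

\begin{verbatim}
import subprocess

def to_mono_wav(src, dst, rate=16000):
    # Assumed target sample rate; SoX mixes down to one channel (-c 1)
    # and resamples to the requested rate (-r).
    subprocess.run(["sox", src, "-c", "1", "-r", str(rate), dst], check=True)

to_mono_wav("track01.flac", "track01.wav")
\end{verbatim}
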
\begin{figure}[ht]
    \centering
    \includegraphics[width=.7\linewidth]{cement}
    \caption{A vocal segment of the \acrlong{CC} song
        \emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=.7\linewidth]{abominations}
    \caption{A vocal segment of the \acrlong{DG} song
        \emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band, \gls{CC}, has
been producing \gls{dm} for almost 25 years and has created albums with a
consistent style. The singer of \gls{CC} has a very raspy growl and the lyrics
are quite comprehensible. The vocals produced by \gls{CC} are very close to
regular shouting.

The second band, \gls{DG}, makes even more violent-sounding music. The growls
of the lead singer sound like a coffee grinder and are less full. In the
spectrograms it is clearly visible that overtones are produced during some
parts of the growling. The lyrics are completely incomprehensible and
therefore some parts were not annotated with the actual lyrics, because it was
impossible to hear what was being sung.

The third band, originating from Moscow, is \gls{WDISS}. This band is a little
odd compared to the previous \gls{dm} bands because they create \gls{dom}.
\gls{dom} is characterized by a very slow tempo and low-tuned guitars. The
vocalist has a very characteristic growl and performs in several Muscovite
bands. This band also stands out because it uses pianos and synthesizers. The
droning synthesizers often operate in the same frequency range as the vocals.

The data is labeled as singing or instrumental, and additionally per band. The
distribution is shown in Table~\ref{tbl:distribution}. A random $10\%$ of the
data is held out as a test set.
\begin{table}[H]
    \centering
    \begin{tabular}{cc}
        \toprule
        Instrumental & Singing\\
        \midrule
        0.59 & 0.41\\
        \bottomrule
    \end{tabular}
    \quad
    \begin{tabular}{cccc}
        \toprule
        Instrumental & \gls{CC} & \gls{DG} & \gls{WDISS}\\
        \midrule
        0.59 & 0.16 & 0.19 & 0.06\\
        \bottomrule
    \end{tabular}
    \caption{Data distribution over the singing and instrumental classes
        (left) and with the singing split per band
        (right)}\label{tbl:distribution}
\end{table}

\section{\acrlong{MFCC} Features}
The waveforms themselves are not very suitable as features due to their high
dimensionality and correlation. Therefore we use the widely used \glspl{MFCC}
feature vectors, which have been shown to be suitable~%
\cite{rocamora_comparing_2007}. It has also been found that altering the mel
scale to better suit singing does not yield better
performance~\cite{you_comparative_2015}. The actual conversion is done using
the \emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are inspired by human auditory processing and are created
from a waveform incrementally in several steps (a code sketch of the pipeline
is given after the list):
\begin{enumerate}
    \item The first step in the process is converting the time representation
        of the signal to a spectral representation using a sliding window with
        overlap. The width of the window and the step size are two important
        parameters in the system. In classical phonetic analysis, window sizes
        of $25ms$ with a step of $10ms$ are often chosen because they are
        small enough to only contain subphone entities. A sung segment is
        never as short as $25ms$, so arguably this window size is very small
        for singing.
    \item The standard \gls{FT} gives a spectral representation that has
        linearly scaled frequencies. This scale is converted to the \gls{MS}
        using triangular overlapping windows to get a more tonotopic
        representation that tries to match the actual representation in the
        cochlea of the human ear.
    \item The \emph{Weber-Fechner} law describes how humans perceive physical
        magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
        Psychophysik}: energy is perceived in logarithmic increments, meaning
        that doubling the energy does not double the perceived loudness.
        Therefore the log of the energy or amplitude of the \gls{MS} spectrum
        is taken to more closely match human hearing.
    \item The amplitudes of the spectrum are highly correlated and therefore
        the last step is a decorrelation step. The \gls{DCT} is applied to the
        amplitudes, interpreted as a signal. The \gls{DCT} describes a signal
        as a combination of several primitive cosine functions.
\end{enumerate}
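
The pipeline can be reproduced with off-the-shelf building blocks. The sketch
below is a minimal illustration: it uses the filter bank routines of
\emph{python\_speech\_features} for steps 1--3 and applies the \gls{DCT} of
step 4 explicitly. The file name, the $16$kHz mono input and the $26$ mel
bands are assumptions; the library's own \texttt{mfcc} function additionally
applies a cepstral lifter and can replace the first coefficient with the frame
energy.

\begin{verbatim}
import scipy.io.wavfile as wav
from scipy.fftpack import dct
from python_speech_features import logfbank

# Steps 1-3: framing + FFT, mel-scale filter bank, log of the energies.
rate, signal = wav.read("track01.wav")   # assumed 16 kHz mono file
logfb = logfbank(signal, rate, winlen=0.025, winstep=0.01, nfilt=26)

# Step 4: decorrelate with a DCT and keep the first 13 coefficients.
mfcc = dct(logfb, type=2, axis=1, norm='ortho')[:, :13]
print(mfcc.shape)                        # (number of frames, 13)
\end{verbatim}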

The default number of \gls{MFCC} parameters is twelve. However, often a
thirteenth value is added that represents the energy in the data.

\section{Experimental setup}
\subsection{Features}
The thirteen \gls{MFCC} features are fed to the classifier. The window step
and window length used in extracting the \gls{MFCC} features are varied. The
default speech processing parameters are tested, but also bigger window sizes,
since the minimal duration of a singing voice segment is arguably much longer
than the minimal duration of the subphone units for which the default
parameters are tuned. The parameters chosen are as follows:

\begin{table}[H]
    \centering
    \begin{tabular}{lll}
        \toprule
        Step (ms) & Length (ms) & Notes\\
        \midrule
        10 & 25 & Standard speech processing\\
        40 & 100 &\\
        80 & 200 &\\
        \bottomrule
    \end{tabular}
    \caption{\Gls{MFCC} parameter settings}
\end{table}

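As a concrete illustration, these settings map directly onto the arguments of
the \texttt{mfcc} function of \emph{python\_speech\_features}. The sketch
below assumes a $16$kHz input file and an FFT size large enough for the
longest window; \texttt{appendEnergy=True} (the default) provides the
thirteenth energy value mentioned earlier.

\begin{verbatim}
from python_speech_features import mfcc
import scipy.io.wavfile as wav

rate, signal = wav.read("track01.wav")   # assumed 16 kHz mono file

# (step, length) in seconds, matching the table above.
settings = [(0.010, 0.025), (0.040, 0.100), (0.080, 0.200)]

features = {}
for step, length in settings:
    features[(step, length)] = mfcc(signal, rate,
                                    winlen=length, winstep=step,
                                    numcep=13, appendEnergy=True,
                                    nfft=4096)  # covers 200 ms at 16 kHz
\end{verbatim}
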
\subsection{\emph{Singing} voice detection}
The first type of experiment conducted is \emph{Singing} voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
feature vector and the output is the probability that the sample contains
singing. This results in an \gls{ANN} of the shape shown in
Figure~\ref{fig:bcann}. The input dimension is thirteen and the output
dimension is one.

\subsection{\emph{Singer} voice detection}
The second type of experiment conducted is \emph{Singer} voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label. This results in an
\gls{ANN} of the shape shown in Figure~\ref{fig:mcann}. The input dimension is
again thirteen and the output dimension is the number of categories. The
targets are one-hot encoded, meaning that the four categories are labeled as
\texttt{1000, 0100, 0010, 0001}.

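Such one-hot targets can be produced directly from integer class labels, for
example with the Keras utility function \texttt{to\_categorical}. The label
ordering in this sketch (instrumental first, then the three bands) is an
illustrative assumption.

\begin{verbatim}
from keras.utils import to_categorical

# 0 = instrumental, 1/2/3 = the three singers (assumed ordering).
labels = [0, 1, 1, 3, 2, 0]
targets = to_categorical(labels, num_classes=4)
# targets[0] -> [1. 0. 0. 0.], targets[3] -> [0. 0. 0. 1.], etc.
\end{verbatim}
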
\subsection{\acrlong{ANN}}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary and four-class, so it is
interesting to see where the bottleneck lies and how compact the hidden
representation can be made. The \glspl{ANN} are built with
Keras\footnote{\url{https://keras.io}}, which provides a high-level interface
on top of the
TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}} backend.

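A minimal sketch of the two network shapes follows. The hidden layer size used
here is a placeholder, not one of the reported configurations.

\begin{verbatim}
from keras.models import Sequential
from keras.layers import Dense

hidden = 8  # placeholder hidden layer size

# Binary classifier: 13 MFCC inputs, one sigmoid output (cf. fig:bcann).
binary = Sequential([
    Dense(hidden, activation='relu', input_dim=13),
    Dense(1, activation='sigmoid'),
])

# Multiclass classifier: 13 inputs, four softmax outputs (cf. fig:mcann).
multi = Sequential([
    Dense(hidden, activation='relu', input_dim=13),
    Dense(4, activation='softmax'),
])
\end{verbatim}
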
The general architecture of the networks is shown in Figure~\ref{fig:bcann}
and Figure~\ref{fig:mcann} for the binary and the multiclass classification
respectively. The inputs are fully connected to the hidden layer, which is
fully connected to the output layer. The activation function used in the
hidden layer is a \gls{RELU}. The \gls{RELU} function is a monotonic one-sided
function that is also known as the ramp function; its definition is given in
Equation~\ref{eq:relu}. \gls{RELU} was chosen because of its simplicity and
efficient computation. The activation function between the hidden layer and
the output layer is the sigmoid function in the case of binary classification,
of which the definition is shown in Equation~\ref{eq:sigmoid}. The sigmoid is
a monotonic function that is differentiable for all values of $x$ and always
has a non-negative derivative. For the multiclass classification the softmax
function is used between the hidden layer and the output layer. Softmax is an
activation function suitable for multiple output nodes; its definition is
given in Equation~\ref{eq:softmax}.

The data is shuffled before being fed to the network to mitigate the risk of
overfitting on one album. Every model was trained for $10$ epochs with a batch
size of $32$. The training set and test set are separated by taking a random
$90\%$ slice of all the data for training.

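A sketch of this training procedure for the binary model defined earlier is
given below. The feature matrix \texttt{X} and label vector \texttt{y} are
assumed to be NumPy arrays holding the \gls{MFCC} frames and their labels, and
the \texttt{adam} optimizer is an assumption, as the optimizer is not
specified here.

\begin{verbatim}
import numpy as np

idx = np.random.permutation(len(X))   # shuffle to avoid album ordering
X, y = X[idx], y[idx]

split = int(0.9 * len(X))             # 90% training, 10% held-out test
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

binary.compile(optimizer='adam', loss='binary_crossentropy',
               metrics=['accuracy'])
binary.fit(X_train, y_train, epochs=10, batch_size=32)
loss, acc = binary.evaluate(X_test, y_test)
\end{verbatim}
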
\begin{equation}\label{eq:relu}
    f(x) = \left\{\begin{array}{rcl}
        0 & \text{for} & x<0\\
        x & \text{for} & x \geq 0\\
    \end{array}\right.
\end{equation}

\begin{equation}\label{eq:sigmoid}
    f(x) = \frac{1}{1+e^{-x}}
\end{equation}

\begin{equation}\label{eq:softmax}
    \delta{(\boldsymbol{z})}_j = \frac{e^{z_j}}{\sum\limits^{K}_{k=1}e^{z_k}}
\end{equation}

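For reference, the three activation functions can also be written out
directly; this is a small illustrative sketch rather than code used in the
experiments.

\begin{verbatim}
import numpy as np

def relu(x):          # Equation eq:relu
    return np.maximum(0, x)

def sigmoid(x):       # Equation eq:sigmoid
    return 1 / (1 + np.exp(-x))

def softmax(z):       # Equation eq:softmax
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])).sum())   # sums to 1
\end{verbatim}
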
\begin{figure}[H]
    \begin{subfigure}{.5\textwidth}
        \centering
        \includegraphics[width=.8\linewidth]{bcann}
        \caption{Binary classifier network architecture}\label{fig:bcann}
    \end{subfigure}%
    %
    \begin{subfigure}{.5\textwidth}
        \centering
        \includegraphics[width=.8\linewidth]{mcann}
        \caption{Multiclass classifier network architecture}\label{fig:mcann}
    \end{subfigure}
    \caption{\acrlong{ANN} architectures.}
\end{figure}

\section{Results}
\subsection{\emph{Singing} voice detection}
Table~\ref{tbl:singing} shows the results for the singing-voice detection.
Figure~\ref{fig:bclass} shows an example of a segment of a song with the
classifier output plotted underneath to illustrate the performance.

\begin{table}[H]
    \centering
    \begin{tabular}{rccc}
        \toprule
        & \multicolumn{3}{c}{Parameters (step/length)}\\
        & 10/25 & 40/100 & 80/200\\
        \midrule
        3h & 0.86 (0.34) & 0.87 (0.32) & 0.85 (0.35)\\
        5h & 0.87 (0.31) & 0.88 (0.30) & 0.87 (0.32)\\
        8h & 0.88 (0.30) & 0.88 (0.31) & 0.88 (0.29)\\
        13h & 0.89 (0.28) & 0.89 (0.29) & 0.88 (0.30)\\
        \bottomrule
    \end{tabular}
    \caption{Binary classification results, reported as accuracy
        (loss)}\label{tbl:singing}
\end{table}

\begin{figure}[H]
    \centering
    \includegraphics[width=.7\linewidth]{bclass}
    \caption{Plotting the classifier output under the audio
        signal}\label{fig:bclass}
\end{figure}

\subsection{\emph{Singer} voice detection}
Table~\ref{tbl:singer} shows the results for the singer-voice detection.

\begin{table}[H]
    \centering
    \begin{tabular}{rccc}
        \toprule
        & \multicolumn{3}{c}{Parameters (step/length)}\\
        & 10/25 & 40/100 & 80/200\\
        \midrule
        3h & 0.83 (0.48) & 0.82 (0.48) & 0.82 (0.48)\\
        5h & 0.85 (0.43) & 0.84 (0.44) & 0.84 (0.44)\\
        8h & 0.86 (0.41) & 0.86 (0.39) & 0.86 (0.40)\\
        13h & 0.87 (0.37) & 0.87 (0.38) & 0.86 (0.39)\\
        \bottomrule
    \end{tabular}
    \caption{Multiclass classification results, reported as accuracy
        (loss)}\label{tbl:singer}
\end{table}

\subsection{Alien data}
To test the generalizability of the models, the system is tested on alien
data. The data was retrieved from the album \emph{The Desperation} by
\emph{Godless Truth}. \emph{Godless Truth} is a so-called old-school \gls{dm}
band with very raspy vocals that are put very up front in the mastering. This
means that the vocals are very prominent in the recording and therefore no
difficulty is expected for the classifier. Figure~\ref{fig:alien1} shows that
the classifier indeed scores very accurately. Note that the spectrogram
settings have been adjusted slightly to make the picture clearer; the
spectrogram shows the frequency range from $0$ to $3000$Hz.

\begin{figure}[H]
    \centering
    \includegraphics[width=.7\linewidth]{alien1}
    \caption{Plotting the classifier output under similar alien
        data}\label{fig:alien1}
\end{figure}

To really test the limits, a song from the highly atmospheric doom metal band
\emph{Catacombs} has been run through the system. The album \emph{Echoes
Through the Catacombs} features a lot of synthesizers, heavy droning guitars
and bass lines. The vocals are not mixed in a way that makes them stand out.
The models have never seen training data that is even remotely similar to this
type of metal. Figure~\ref{fig:alien2} shows a segment of the data. Here it is
clearly visible that the classifier cannot distinguish singing from
non-singing.

\begin{figure}[H]
    \centering
    \includegraphics[width=.7\linewidth]{alien2}
    \caption{Plotting the classifier output under different alien
        data}\label{fig:alien2}
\end{figure}