%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To answer the research question, several experiments have been performed. Data
has been collected from several \gls{dm} and \gls{dom} albums; the exact data
used is listed in Appendix~\ref{app:data}. The albums are extracted from the
audio CDs and converted to a mono waveform with the correct sample rate
using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. Every file
is annotated using Praat~\cite{boersma_praat_2002}, in which the lyrics are
manually aligned to the audio. Example utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which display
the waveform, the $1$--$8000\,$Hz spectrogram and the annotations. They clearly
show that, within the genre of death metal, different spectral patterns occur
over time.
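
The conversion can, for example, be scripted as in the minimal sketch below,
which calls \emph{SoX} from Python. The target sample rate of $16\,$kHz and the
file names are illustrative assumptions; only the mono conversion and
resampling themselves are prescribed above.
\begin{verbatim}
# Minimal sketch: convert a ripped track to a mono waveform at a fixed
# sample rate with SoX (the 16 kHz rate and file names are assumptions).
import subprocess

def to_mono_wav(src, dst, rate=16000):
    # -r sets the output sample rate, -c 1 mixes down to a single channel
    subprocess.run(["sox", src, "-r", str(rate), "-c", "1", dst], check=True)

to_mono_wav("rips/track01.flac", "data/track01_mono.wav")
\end{verbatim}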

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
\caption{A vocal segment of the \acrlong{CC} song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
\caption{A vocal segment of the \acrlong{DG} song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band, \gls{CC}, has
been producing \gls{dm} for almost 25 years and has created albums with a
consistent style. The singer of \gls{CC} has a very raspy growl and the lyrics
are quite comprehensible. The vocals produced by \gls{CC} are very close to
regular shouting.

The second band, \gls{DG}, makes even more violent-sounding music. The growls
of the lead singer resemble a coffee grinder and sound less full. The
spectrograms clearly show that overtones are produced during some parts of the
growling. The lyrics are completely incomprehensible; some parts were therefore
not annotated with the actual lyrics because it was impossible to hear what was
being sung.

The third band, \gls{WDISS}, originates from Moscow. It is a little odd
compared to the previous \gls{dm} bands because it creates \gls{dom}, a genre
characterized by its very slow tempo and low-tuned guitars. The vocalist has a
very characteristic growl and performs in several Muscovite bands. This band
also stands out because it uses pianos and synthesizers. The droning
synthesizers often operate in the same frequency range as the vocals.

Additional details about the dataset are listed in Appendix~\ref{app:data}.
The data is labeled as singing or instrumental, and additionally per band. The
resulting distribution is shown in Table~\ref{tbl:distribution}.
\begin{table}[H]
\centering
\begin{tabular}{cc}
\toprule
Instrumental & Singing\\
\midrule
0.59 & 0.41\\
\bottomrule
\end{tabular}
\quad
\begin{tabular}{cccc}
\toprule
Instrumental & \gls{CC} & \gls{DG} & \gls{WDISS}\\
\midrule
0.59 & 0.16 & 0.19 & 0.06\\
\bottomrule
\end{tabular}
\caption{Data distribution}\label{tbl:distribution}
\end{table}

\section{\acrlong{MFCC} Features}
The waveforms themselves are not very suitable as features due to their high
dimensionality and the correlation in the temporal domain. Therefore we use the
widely used \glspl{MFCC} feature vectors, which have been shown to be suitable
for speech processing~\cite{rocamora_comparing_2007}. It has also been found
that altering the mel scale to better suit singing does not yield better
performance~\cite{you_comparative_2015}. The actual conversion is done using
the \emph{python\_speech\_features}\footnote{\url{https://github.com/jameslyons/python_speech_features}}
package.

\gls{MFCC} features are inspired by human auditory processing and are
created from a waveform incrementally in several steps:
\begin{enumerate}
\item The first step is converting the time representation of the signal
to a spectral representation using a sliding analysis window with
overlap. The width of the window and the step size are two important
parameters of the system. In classical phonetic analysis a window size
of $25\,$ms with a step of $10\,$ms is often chosen because it is small
enough to contain just one subphone event. Singing for only $25\,$ms is
impossible, so it might be necessary to increase the window size.
\item The standard \gls{FT} gives a spectral representation with linearly
scaled frequencies. This scale is converted to the \gls{MS} using
triangular overlapping windows, yielding a more tonotopic
representation that better matches the frequency representation of the
cochlea in the human ear.
\item The \emph{Weber-Fechner} law describes how humans perceive physical
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
Psychophysik}. Weber and Fechner found that energy is perceived in
logarithmic increments: twice the amount of energy does not result in
twice the perceived loudness. Therefore the logarithm of the energy or
amplitude of the \gls{MS} spectrum is taken to more closely match human
hearing.
\item The amplitudes of the spectrum are highly correlated, so the last
step is a decorrelation step. The \gls{DCT} is applied to the
amplitudes, interpreted as a signal. The \gls{DCT} is a technique for
describing a signal as a combination of several primitive cosine
functions.
\end{enumerate}

The default number of \gls{MFCC} parameters is twelve. However, often a
thirteenth value is added that represents the energy in the analysis window.
Here $c_0$ is chosen: the zeroth \gls{MFCC}, which represents the overall
energy in the \gls{MS}. Another option would be $\log{(E)}$, the logarithm of
the raw energy of the sample.
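
As an illustration, the minimal sketch below extracts these features with the
\emph{python\_speech\_features} package named above. The file name is an
assumption, and \texttt{appendEnergy=False} keeps $c_0$ among the thirteen
coefficients instead of replacing it with $\log{(E)}$.
\begin{verbatim}
# Minimal sketch: thirteen MFCCs per analysis window (file name is assumed).
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("data/track01_mono.wav")   # mono waveform from SoX
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,    # 25 ms window, 10 ms step
                numcep=13,                     # c_0 .. c_12
                appendEnergy=False)            # keep c_0 rather than log(E)
# features has shape (number of windows, 13)
\end{verbatim}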

\section{\acrlong{ANN}}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary and four-class, so it is
interesting to see where the bottleneck lies: how abstract can the abstraction
be made? The \gls{ANN} is built with Keras\footnote{\url{https://keras.io}}
using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
backend, which provides a high-level interface to the underlying networks.

The general architecture of the networks is shown in Figure~\ref{fig:bcann}
and Figure~\ref{fig:mcann} for the binary and the multiclass classification
respectively. The inputs are fully connected to the hidden layer, which is
fully connected to the output layer. The activation function used in the
hidden layer is a \gls{RELU}. The \gls{RELU} function is a monotonic one-sided
function that is also known as the ramp function; its definition is given in
Equation~\ref{eq:relu}. \gls{RELU} was chosen because of its simplicity and
efficient computation. The activation function between the hidden layer and the
output layer is the sigmoid function in the case of binary classification, of
which the definition is shown in Equation~\ref{eq:sigmoid}. The sigmoid is a
monotonic function that is differentiable for all values of $x$ and always
yields a non-negative derivative. For the multiclass classification the softmax
function is used between the hidden layer and the output layer. Softmax is an
activation function suitable for multiple output nodes; its definition is given
in Equation~\ref{eq:softmax}.

The data is shuffled before being fed to the network to mitigate the risk of
overfitting on one album. Every model was trained for $10$ epochs with a batch
size of $32$. The training and test sets are separated by using a $90\%$ slice
of all the data for training and the remaining $10\%$ for testing.

\begin{equation}\label{eq:relu}
f(x) = \left\{\begin{array}{rcl}
0 & \text{for} & x<0\\
x & \text{for} & x \geq 0\\
\end{array}\right.
\end{equation}

\begin{equation}\label{eq:sigmoid}
f(x) = \frac{1}{1+e^{-x}}
\end{equation}

\begin{equation}\label{eq:softmax}
\delta{(\boldsymbol{z})}_j = \frac{e^{z_j}}{\sum\limits^{K}_{k=1}e^{z_k}}
\end{equation}

\begin{figure}[H]
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{bcann}
\caption{Binary classifier network architecture}\label{fig:bcann}
\end{subfigure}%
%
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{mcann}
\caption{Multiclass classifier network architecture}\label{fig:mcann}
\end{subfigure}
\caption{\acrlong{ANN} architectures.}
\end{figure}
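
To make the setup concrete, the sketch below builds the binary network of
Figure~\ref{fig:bcann} in Keras. The hidden-layer size and the optimizer are
not specified above and are assumptions; the multiclass variant only differs in
its output layer (four softmax nodes with a categorical cross-entropy loss).
\begin{verbatim}
# Minimal sketch of the binary classifier (hidden size and optimizer assumed).
from keras.models import Sequential
from keras.layers import Dense

def build_binary_model(input_dim=13, hidden=64):
    model = Sequential()
    model.add(Dense(hidden, activation="relu", input_dim=input_dim))  # ReLU
    model.add(Dense(1, activation="sigmoid"))                         # sigmoid
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# X: (n, 13) MFCC vectors, y: 0/1 singing labels, already shuffled.
# split = int(0.9 * len(X))
# model = build_binary_model()
# model.fit(X[:split], y[:split], epochs=10, batch_size=32)
# model.evaluate(X[split:], y[split:])
\end{verbatim}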

\section{Experimental setup}
\subsection{Features}
The thirteen \gls{MFCC} features are used as the input. The parameters of the
\gls{MFCC} extraction are varied in window step and window length. The default
speech processing parameters are tested, but also bigger window sizes, since
arguably the minimal duration of a singing-voice segment is much longer than
the minimal duration of the subphone component on which the default parameters
are tuned. The parameters chosen are as follows:

\begin{table}[H]
\centering
\begin{tabular}{lll}
\toprule
step (ms) & length (ms) & notes\\
\midrule
10 & 25 & Standard speech processing\\
40 & 100 &\\
80 & 200 &\\
\bottomrule
\end{tabular}
\caption{\Gls{MFCC} parameter settings}
\end{table}
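
The sketch below shows how these three settings can be passed to the
extraction. Growing \texttt{nfft} to cover the window is an added assumption:
with the package defaults, frames longer than the FFT size would be truncated.
\begin{verbatim}
# Sketch: extract MFCCs for the three (step, length) settings in the table.
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("data/track01_mono.wav")
settings = [(0.010, 0.025),   # standard speech processing
            (0.040, 0.100),
            (0.080, 0.200)]   # (step, length) in seconds

for step, length in settings:
    nfft = 512
    while nfft < int(length * rate):  # next power of two covering the window
        nfft *= 2
    feats = mfcc(signal, samplerate=rate, winstep=step, winlen=length,
                 numcep=13, appendEnergy=False, nfft=nfft)
\end{verbatim}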

\subsection{\emph{Singing}-voice detection}
The first type of experiment conducted is \emph{Singing}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
feature vector and the output is the probability that the sample contains
singing. This results in an \gls{ANN} of the shape described in
Figure~\ref{fig:bcann}. The input dimension is thirteen and the output
dimension is one.

\subsection{\emph{Singer}-voice detection}
The second type of experiment conducted is \emph{Singer}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label. This results in an
\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input dimension
is again thirteen and the output dimension is the number of categories. The
output uses one-hot encoding, meaning that the categories are labeled as
\texttt{1000, 0100, 0010, 0001}.
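
A small sketch of this encoding with the Keras utility function is shown below;
the integer order of the classes (Instrumental, \gls{CC}, \gls{DG},
\gls{WDISS}) is an assumption.
\begin{verbatim}
# Sketch of the one-hot target encoding; the class order is assumed.
import numpy as np
from keras.utils import to_categorical

labels = np.array([0, 1, 2, 3])   # Instrumental, CC, DG, WDISS (assumed)
print(to_categorical(labels, num_classes=4))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
\end{verbatim}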