true final

diff --git a/methods.tex b/methods.tex

index 45b677e..517ade5 100644
--- a/methods.tex
+++ b/methods.tex
@@ -2,58 +2,62 @@
  
  %Experiment(s) (set-up, data, results, discussion)
  \section{Data \& Preprocessing}
-To run the experiments data has been collected from several \gls{dm} albums.
-The exact data used is available in Appendix~\ref{app:data}. The albums are
-extracted from the audio CD and converted to a mono channel waveform with the
-correct samplerate utilizing \emph{SoX}%
-\footnote{\url{http://sox.sourceforge.net/}}.  Every file is annotated using
-Praat\cite{boersma_praat_2002} where the utterances are manually aligned to the
-audio. Examples of utterances are shown in Figure~\ref{fig:bloodstained} and
-Figure~\ref{fig:abominations} where the waveform, $1-8000$Hz spectrals and
-annotations are shown. It is clearly visible that within the genre of death
-metal there are different spectral patterns visible over time.
+To answer the research question, several experiments have been performed. Data
+has been collected from several \gls{dm} and \gls{dom} albums. The exact data
+used is available in Appendix~\ref{app:data}. The albums are extracted from the
+audio CD and converted to a mono-channel waveform with the correct sample rate
+using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. Every file
+is annotated using Praat~\cite{boersma_praat_2002}, where the lyrics are
+manually aligned to the audio. Examples of utterances are shown in
+Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which display
+the waveform, the $1$--$8000$\,Hz spectrogram and the annotations. The figures
+clearly show that the spectral patterns over time differ within the genre of
+death metal.
  
  \begin{figure}[ht]
         \centering
         \includegraphics[width=.7\linewidth]{cement}
-       \caption{A vocal segment of the \acrlong{CC} song
+       \caption{A vocal segment of the Cannibal Corpse song
                 \emph{Bloodstained Cement}}\label{fig:bloodstained}
  \end{figure}
  
  \begin{figure}[ht]
         \centering
         \includegraphics[width=.7\linewidth]{abominations}
-       \caption{A vocal segment of the \acrlong{DG} song
+       \caption{A vocal segment of the Disgorge song
                 \emph{Enthroned Abominations}}\label{fig:abominations}
  \end{figure}
  
  The data is collected from three studio albums. The first band is called
  \gls{CC} and has been producing \gls{dm} for almost 25 years and has been
-creating album with a consistent style. The singer of \gls{CC} has a very raspy
-growl and the lyrics are quite comprehensible. The vocals produced by \gls{CC}
-border regular shouting. 
+creating albums with a consistent style. The singer of \gls{CC} has a very
+raspy growl and the lyrics are quite comprehensible. The vocals produced by
+\gls{CC} are very close to regular shouting.
  
  The second band is called \gls{DG} and makes even more violently
  sounding music. The growls of the lead singer sound like a coffee grinder and
-are more shallow. In the spectrals it is clearly visible that there are
+are less full. In the spectrogram it is clearly visible that there are
  overtones produced during some parts of the growling. The lyrics are completely
  incomprehensible and therefore some parts were not annotated with the actual
  lyrics because it was impossible to hear what was being sung.
  
-Lastly a band from Moscow is chosen bearing the name \gls{WDISS}. This band is
-a little odd compared to the previous \gls{dm} bands because they create
-\gls{dom}. \gls{dom} is characterized by the very slow tempo and low tuned
-guitars. The vocalist has a very characteristic growl and performs in several
-Muscovite bands. This band also stands out because it uses piano's and
-synthesizers. The droning synthesizers often operate in the same frequency as
-the vocals.
-
-The training and test data is divided as follows:
+The third band, originating from Moscow, bears the name \gls{WDISS}. This band
+is a little odd compared to the previous \gls{dm} bands because they create
+\gls{dom}. \gls{dom} is characterized by a very slow tempo and low-tuned
+guitars. The vocalist has a very characteristic growl and performs in several
+Muscovite bands. This band also stands out because it uses pianos and
+synthesizers. The droning synthesizers often operate in the same frequency
+range as the vocals.
+
+Additional details about the dataset are listed in
+Appendix~\ref{app:data}. The data is labeled as singing and instrumental and
+labeled per band. The distribution for this is shown in
+Table~\ref{tbl:distribution}.
  \begin{table}[H]
         \centering
         \begin{tabular}{lcc}
                 \toprule
-               Singing & Instrumental\\
+               Instrumental & Singing\\
                 \midrule
                 0.59 & 0.41\\
                 \bottomrule
@@ -66,39 +70,39 @@ The training and test data is divided as follows:
                 0.59 & 0.16 & 0.19 & 0.06\\
                 \bottomrule
         \end{tabular}
+       \caption{Proportional data distribution}\label{tbl:distribution}
  \end{table}
  
-\section{\acrlong{MFCC} Features}
+\section{Mel-frequency Cepstral Features}
  The waveforms in itself are not very suitable to be used as features due to the
-high dimensionality and correlation. Therefore we use the often used
-\glspl{MFCC} feature vectors which have shown to be suitable%
-\cite{rocamora_comparing_2007}. It has also been found that altering the mel
-scale to better suit singing does not yield a better
-performance\cite{you_comparative_2015}. The actual conversion is done using the
-\emph{python\_speech\_features}%
-\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
-
-\gls{MFCC} features are inspired by human auditory processing inspired and are
+high dimensionality and correlation in the temporal domain. Therefore we use
+the commonly used \glspl{MFCC} feature vectors, which have been shown to be suitable in
+speech processing~\cite{rocamora_comparing_2007}. It has also been found that
+altering the mel scale to better suit singing does not yield a better
+performance~\cite{you_comparative_2015}. The actual conversion is done using
+the \emph{python\_speech\_features}\footnote{\url{%
+https://github.com/jameslyons/python_speech_features}} package.
+
+\gls{MFCC} features are inspired by human auditory processing and are
  created from a waveform incrementally using several steps:
  \begin{enumerate}
         \item The first step in the process is converting the time representation
-               of the signal to a spectral representation using a sliding window with
-               overlap. The width of the window and the step size are two important
-               parameters in the system. In classical phonetic analysis window sizes
-               of $25ms$ with a step of $10ms$ are often chosen because they are small
-               enough to only contain subphone entities. Singing for $25ms$ is
-               impossible so it is arguable that the window size is very small.
+               of the signal to a spectral representation using a sliding analysis
+               window with overlap. The width of the window and the step size are two
+               important parameters in the system. In classical phonetic analysis
+               window sizes of $25$\,ms with a step of $10$\,ms are often chosen because
+               they are small enough to contain just one subphone event.
         \item The standard \gls{FT} gives a spectral representation that has
                 linearly scaled frequencies. This scale is converted to the \gls{MS}
                 using triangular overlapping windows to get a more tonotopic
-               representation trying to match the actual representation in the cochlea
-               of the human ear.
+               representation trying to match the actual representation of the cochlea
+               in the human ear.
         \item The \emph{Weber-Fechner} law describes how humans perceive physical
                 magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
                 Psychophysik}. They found that energy is perceived in logarithmic
-               increments. This means that twice the amount of decibels does not mean
-               twice the amount of perceived loudness. Therefore we take the log of
-               the energy or amplitude of the \gls{MS} spectrum to closer match the
+               increments. This means that twice the amount of energy does not mean
+               twice the amount of perceived loudness. Therefore we take the logarithm
+               of the energy or amplitude of the \gls{MS} spectrum to closer match the
                 human hearing.
         \item The amplitudes of the spectrum are highly correlated and therefore
                 the last step is a decorrelation step. \Gls{DCT} is applied on the
@@ -108,53 +112,14 @@ created from a waveform incrementally using several steps:
  \end{enumerate}
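The steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used by python_speech_features: the function names `frame_layout`, `hz_to_mel`, `mel_to_hz` and `dct_ii` are hypothetical, the mel formula is the common 2595 log10(1 + f/700) variant, frames are assumed not to be padded, and the band energies are made-up numbers.

```python
import math


def frame_layout(n_samples, sample_rate, win_ms=25, step_ms=10):
    """Return (window size, step size, frame count) in samples, without padding."""
    win = int(round(sample_rate * win_ms / 1000))
    step = int(round(sample_rate * step_ms / 1000))
    if n_samples < win:
        return win, step, 0
    return win, step, 1 + (n_samples - win) // step


def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale (assumed formula variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def dct_ii(values):
    """Unnormalised DCT-II, standing in for the decorrelation step."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


# Logarithm of (made-up) mel band energies for a single analysis window.
log_energies = [math.log(e) for e in (4.0, 2.0, 1.0, 0.5)]
coefficients = dct_ii(log_energies)
```

With the unnormalised transform the zeroth coefficient equals the sum of the log band energies, which is why it behaves like an overall energy term for the analysis window.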
  
  The default number of \gls{MFCC} parameters is twelve. However, often a
-thirteenth value is added that represents the energy in the data.
-
-\section{Experimental setup}
-\subsection{Features}
-The thirteen \gls{MFCC} features are chosen to feed to the classifier. The
-parameters of the \gls{MFCC} features are varied in window step and window
-length. The default speech processing parameters are tested but also bigger
-window sizes since arguably the minimal size of a singing voice segment is a
-lot bigger than the minimal size of a subphone component on which the
-parameters are tuned.  The parameters chosen are as follows:
-
-\begin{table}[H]
-       \centering
-       \begin{tabular}{lll}
-               \toprule
-               step (ms) & length (ms) & notes\\
-               \midrule
-               10 & 25 & Standard speech processing\\
-               40 & 100 &\\ 
-               80 & 200 &\\
-               \bottomrule
-       \end{tabular}
-       \caption{\Gls{MFCC} parameter settings}
-\end{table}
+thirteenth value is added that represents the energy in the analysis window.
+Here $c_0$ is chosen for this purpose. $c_0$ is the zeroth \gls{MFCC} and
+represents the average of the log energies over all \gls{MS} bands. Another
+option would be $\log{(E)}$, the logarithm of the raw energy of the sample.
  
-\subsection{\emph{Singing} voice detection}
-The first type of experiment conducted is \emph{Singing} voice detection. This
-is the act of segmenting an audio signal into segments that are labeled either
-as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
-feature vector and the output is the probability that singing is happening in
-the sample. This results in an \gls{ANN} of the shape described in
-Figure~\ref{fig:bcann}. The input dimension is thirteen and the output is one.
-
-\subsection{\emph{Singer} voice detection}
-The second type of experiment conducted is \emph{Singer} voice detection. This
-is the act of segmenting an audio signal into segments that are labeled either
-with the name of the singer or as \emph{Instrumental}. The input of the
-classifier is a feature vector and the outputs are probabilities for each of
-the singers and a probability for the instrumental label. This results in an
-\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input dimension
-is yet again thirteen and the output dimension is the number of categories. The
-output is encoded in one-hot encoding. This means that the categories are
-labeled as \texttt{1000, 0100, 0010, 0001}.
-
-\subsection{\acrlong{ANN}}
+\section{Artificial Neural Network}
  The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
-The classification problems are only binary and four-class so it is
+The classification problems are only binary or four-class problems so it is
  interesting to see where the bottleneck lies; how abstract can the abstraction
  be made. The \gls{ANN} is built with the Keras\footnote{\url{https://keras.io}}
  using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
@@ -166,19 +131,24 @@ multiclass classification. The inputs are fully connected to the hidden layer
  which is fully connected too the output layer. The activation function used is
  a \gls{RELU}. The \gls{RELU} function is a monotonic symmetric one-sided
  function that is also known as the ramp function. The definition is given in
-Equation~\ref{eq:relu}. \gls{RELU} was chosen because of its symmetry and
-efficient computation. The activation function between the hidden layer and the
-output layer is the sigmoid function in the case of binary classification. Of
-which the definition is shown in Equation~\ref{eq:sigmoid}. The sigmoid is a
-monotonic function that is differentiable on all values of $x$ and always
-yields a non-negative derivative. For the multiclass classification the softmax
-function is used between the hidden layer and the output layer. Softmax is an
-activation function suitable for multiple output nodes. The definition is given
-in Equation~\ref{eq:softmax}.
+Equation~\ref{eq:relu}. \gls{RELU} has the downside that it can create
+unreachable nodes in a deep network. This is not a problem in this network
+since it only has one hidden layer. \gls{RELU} was chosen because of its
+efficient computation and its biological inspiration.
+
+The activation function between the hidden layer and the output layer is the
+sigmoid function in the case of binary classification; its definition is
+shown in Equation~\ref{eq:sigmoid}. The sigmoid is a monotonic function that
+is differentiable for all values of $x$ and always yields a non-negative
+derivative. For the multiclass classification the softmax function is used
+between the hidden layer and the output layer. Softmax is an activation
+function suitable for multiple output nodes. The definition is given in
+Equation~\ref{eq:softmax}.
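The three activation functions mentioned above can be written out in a few lines. The sketch below uses plain Python rather than the Keras implementations, and the function names are chosen for illustration only.

```python
import math


def relu(x):
    """Ramp function: zero for negative input, identity otherwise."""
    return x if x > 0 else 0.0


def sigmoid(x):
    """Logistic function (plain form, not guarded against overflow
    for very large negative x)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Map a list of scores to probabilities that sum to one."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([1.0, 2.0, 3.0, 4.0])
```

The sigmoid yields a single probability for the binary case, while softmax yields one probability per output node for the multiclass case.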
  
  The data is shuffled before fed to the network to mitigate the risk of
-overfitting on one album. Every model was trained using $10$ epochs and a
-batch size of $32$.
+overfitting on one album. Every model was trained for $10$ epochs, which means
+that all training data is offered to the model $10$ times. The training set and
+test set are separated by taking a $90\%$ slice of all the data.
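A shuffle followed by a slice could look like the sketch below. It assumes that the $90\%$ slice is the training set and the remainder is the test set, which the text does not state explicitly. The function name `shuffle_and_split` and the fixed seed are hypothetical.

```python
import random


def shuffle_and_split(samples, train_fraction=0.9, seed=0):
    """Shuffle a copy of the samples and cut it into two parts.

    Assumption: the first part (90% by default) is used for training.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = shuffle_and_split(range(100))
```

Shuffling before slicing ensures that frames from every album end up in both parts.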
  
  \begin{equation}\label{eq:relu}
         f(x) = \left\{\begin{array}{rcl}
@@ -207,86 +177,72 @@ batch size of $32$.
                 \includegraphics[width=.8\linewidth]{mcann}
                 \caption{Multiclass classifier network architecture}\label{fig:mcann}
         \end{subfigure}
-       \caption{\acrlong{ANN} architectures.}
+       \caption{Artificial Neural Network architectures.}
  \end{figure}
  
-\section{Results}
-\subsection{\emph{Singing} voice detection}
-Table~\ref{tbl:singing} shows the results for the singing-voice detection.
-Figure~\ref{fig:bclass} shows an example of a segment of a song with the
-classifier plotted underneath to illustrate the performance.
+\section{Experimental setup}
+\subsection{Features}
+The thirteen \gls{MFCC} features are used as the input. The parameters of the
+\gls{MFCC} features are varied in window step and window length. The default
+speech processing parameters are tested but also bigger window sizes since
+arguably the minimal size of a singing voice segment is a lot bigger than the
+minimal size of a subphone component on which the parameters are tuned. The
+parameters chosen are as follows:
  
  \begin{table}[H]
         \centering
-       \begin{tabular}{rccc}
+       \begin{tabular}{lll}
                 \toprule
-                  & \multicolumn{3}{c}{Parameters (step/length)}\\
-                   & 10/25 & 40/100 & 80/200\\
+               step (ms) & length (ms) & notes\\
                 \midrule
-               3h  & 0.86 (0.34) & 0.87 (0.32) & 0.85 (0.35)\\
-               5h  & 0.87 (0.31) & 0.88 (0.30) & 0.87 (0.32)\\
-               8h  & 0.88 (0.30) & 0.88 (0.31) & 0.88 (0.29)\\
-               13h & 0.89 (0.28) & 0.89 (0.29) & 0.88 (0.30)\\
+               10 & 25 & Standard speech processing\\
+               40 & 100 &\\ 
+               80 & 200 &\\
                 \bottomrule
         \end{tabular}
-       \caption{Binary classification results (accuracy
-               (loss))}\label{tbl:singing}
+       \caption{\Gls{MFCC} parameter settings}
  \end{table}
  
-\begin{figure}[H]
-       \centering
-       \includegraphics[width=.7\linewidth]{bclass}.
-       \caption{Plotting the classifier under the audio signal}\label{fig:bclass}
-\end{figure}
+\subsection{\emph{Singing}-voice detection}
+The first type of experiment conducted is \emph{Singing}-voice detection. This
+is the act of segmenting an audio signal into segments that are labeled either
+as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
+feature vector and the output is the probability that singing is happening in
+the sample. This results in an \gls{ANN} of the shape described in
+Figure~\ref{fig:bcann}. The input dimension is thirteen and the output
+dimension is one.
  
-\subsection{\emph{Singer} voice detection}
-Table~\ref{tbl:singer} shows the results for the singer-voice detection.
+The \emph{cross-entropy} function is used as the loss function. The
+formula is shown in Equation~\ref{eq:bincross} where $p$ is the true
+distribution and $q$ is the predicted distribution. Accuracy is the mean of the
+absolute differences between prediction and true value. The formula is shown in
+Equation~\ref{eq:binacc}.
  
-\begin{table}[H]
-       \centering
-       \begin{tabular}{rccc}
-               \toprule
-                  & \multicolumn{3}{c}{Parameters (step/length)}\\
-                   & 10/25 & 40/100 & 80/200\\
-               \midrule
-               3h  & 0.83 (0.48) & 0.82 (0.48) & 0.82 (0.48)\\
-               5h  & 0.85 (0.43) & 0.84 (0.44) & 0.84 (0.44)\\
-               8h  & 0.86 (0.41) & 0.86 (0.39) & 0.86 (0.40)\\
-               13h & 0.87 (0.37) & 0.87 (0.38) & 0.86 (0.39)\\
-               \bottomrule
-       \end{tabular}
-       \caption{Multiclass classification results (accuracy
-               (loss))}\label{tbl:singer}
-\end{table}
+\begin{equation}\label{eq:bincross}
+       H(p,q) = -\sum_x p(x)\log{q(x)}
+\end{equation}
  
-\subsection{Alien data}
-To test the generalizability of the models the system is tested on alien data.
-The data was retrieved from the album \emph{The Desperation} by \emph{Godless
-Truth}. \emph{Godless Truth} is a so called old-school \gls{dm} band that has
-very raspy vocals and the vocals are very up front in the mastering. This means
-that the vocals are very prevalent in the recording and therefore no difficulty
-is expected for the classifier. Figure~\ref{fig:alien1} shows that indeed the
-classifier scores very accurately. Note that the spectogram settings have been
-adapted a little bit to make the picture more clear. The spectogram shows the
-frequency range from $0$ to $3000Hz$.
+\begin{equation}\label{eq:binacc}
+       \frac{1}{n}\sum^n_{i=1} \left|\mathit{ypred}_i-y_i\right|
+\end{equation}
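For the binary case the two measures can be computed as in the sketch below. It assumes a Keras-style accuracy in which the predicted probability is first thresholded at $0.5$. Under that assumption the mean absolute difference between thresholded prediction and label is the error rate, and the accuracy is its complement. The function names and the example values are made up for illustration.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        q = min(max(q, eps), 1.0 - eps)  # keep log() away from zero
        total -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    return total / len(y_true)


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    hits = sum(1 for y, q in zip(y_true, y_pred) if (q > threshold) == (y == 1))
    return hits / len(y_true)


y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.2, 0.4, 0.8]
loss = binary_cross_entropy(y_true, y_pred)
accuracy = binary_accuracy(y_true, y_pred)
```

In this example the third sample is misclassified, so three out of four predictions count as correct.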
  
-\begin{figure}[H]
-       \centering
-       \includegraphics[width=.7\linewidth]{alien1}.
-       \caption{Plotting the classifier under similar alien data}\label{fig:alien1}
-\end{figure}
+\subsection{\emph{Singer}-voice detection}
+The second type of experiment conducted is \emph{Singer}-voice detection. This
+is the act of segmenting an audio signal into segments that are labeled either
+with the name of the singer or as \emph{Instrumental}. The input of the
+classifier is a feature vector and the outputs are probabilities for each of
+the singers and a probability for the instrumental label. This results in an
+\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input dimension
+is yet again thirteen and the output dimension is the number of categories. The
+output is encoded in one-hot encoding. This means that the four categories in
+the experiments are labeled as \texttt{1000, 0100, 0010, 0001}.
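The one-hot encoding can be illustrated with a short sketch. The label names and their order are hypothetical placeholders; the text only fixes that there are four categories.

```python
# Hypothetical label names and order, for illustration only.
LABELS = ["instrumental", "singer_a", "singer_b", "singer_c"]


def one_hot(label, labels=LABELS):
    """Encode a label as a vector with a single 1 at the label's position."""
    return [1 if candidate == label else 0 for candidate in labels]


def decode(vector, labels=LABELS):
    """Return the label at the position of the highest value."""
    return labels[vector.index(max(vector))]


encoded = one_hot("singer_b")
decoded = decode([0.1, 0.2, 0.6, 0.1])
```

Decoding a vector of output probabilities picks the category with the highest probability.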
  
-To really test the limits, a song from the highly atmospheric doom metal band
-called \emph{Catacombs} has been tested on the system. The album \emph{Echoes
-Through the Catacombs} is an album that has a lot of synthesizers, heavy
-droning guitars and bass lines. The vocals are not mixed in a way that makes
-them stand out. The models have never seen trainingsdata that is even remotely
-similar to this type of metal. Figure~\ref{fig:alien2} shows a segment of the
-data. Here it is clearly visible that the classifier can not distinguish
-singing from non singing.
+The loss function is the same as in \emph{Singing}-voice detection.
+The accuracy is calculated a little differently since the output of the network
+is not one probability but a vector of probabilities. For each sample only the
+position of the highest value in the output vector and in the one-hot encoded
+label is taken into account. The formula is shown in Equation~\ref{eq:catacc}.
  
-\begin{figure}[H]
-       \centering
-       \includegraphics[width=.7\linewidth]{alien2}.
-       \caption{Plotting the classifier under different alien data}\label{fig:alien2}
-\end{figure}
+\begin{equation}\label{eq:catacc}
+       \frac{1}{n}\sum^n_{i=1} \left|\mathrm{argmax}(\mathit{ypred}_i)-\mathrm{argmax}(y_i)\right|
+\end{equation}
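A common way to turn the argmax comparison into an accuracy, and the one assumed here, is to count the samples for which the two positions coincide instead of summing index differences. The sketch below follows that Keras-style reading. The function names and the example vectors are made up.

```python
def argmax(values):
    """Index of the highest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples where prediction and label share the same argmax."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if argmax(t) == argmax(p))
    return hits / len(y_true)


y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
y_pred = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1], [0.1, 0.1, 0.5, 0.3]]
score = categorical_accuracy(y_true, y_pred)
```

Here the third sample is assigned to the wrong category, so two of the three predictions count as correct.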