\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
	\caption{A vocal segment of the Cannibal Corpse song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}
\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
	\caption{A vocal segment of the Disgorge song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}
pianos and synthesizers. The droning synthesizers often operate in the same
frequency range as the vocals.
Additional details about the dataset are listed in Appendix~\ref{app:data}.
The data is labeled both as singing or instrumental and per band. The
distribution is shown in Table~\ref{tbl:distribution}.
\begin{table}[H]
\caption{Data distribution}\label{tbl:distribution}
\end{table}
\section{Mel-frequency Cepstral Features}
The waveforms themselves are not very suitable as features due to their high
dimensionality and the correlation in the temporal domain. Therefore we use
the widely applied \glspl{MFCC} feature vectors, which have been shown to be suitable in
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
	Psychophysik}. They found that loudness is perceived in logarithmic
	increments: twice the amount of energy does not correspond to twice
	the perceived loudness. Therefore we take the logarithm of the energy
	or amplitude of the \gls{MS} spectrum to more closely match human
	hearing.
\item The amplitudes of the spectrum are highly correlated; therefore the
	last step is a decorrelation step. \Gls{DCT} is applied on the
represents the overall energy in the \gls{MS}. Another option would be
$\log{(E)}$, which is the logarithm of the raw energy of the sample.
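To make the log-compression and decorrelation steps concrete, the following is
a minimal NumPy sketch of these last two stages; the band energies and the
number of coefficients are made up for illustration and are not the values
used in the actual feature extraction.

```python
import numpy as np

def log_dct(mel_energies, n_coeffs=3):
    """Log-compress mel-band energies, then decorrelate them with a DCT-II."""
    log_e = np.log(mel_energies)  # logarithmic compression (perceived loudness)
    n = len(log_e)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    # DCT-II basis: c_k = sum_m log_e[m] * cos(pi * k * (2m + 1) / (2n))
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return basis @ log_e

bands = np.array([1.0, 2.0, 4.0, 8.0])  # hypothetical mel-band energies
coeffs = log_dct(bands)
```

Note that the zeroth coefficient is simply the sum of the log energies, which
is why it represents the overall energy of the segment.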
\section{Artificial Neural Network}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary or four-class problems, so it is
interesting to see where the bottleneck lies: how abstract the representation
can be made. The \gls{ANN} is built with Keras\footnote{\url{https://keras.io}}
using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
which is fully connected to the output layer. The activation function used is
a \gls{RELU}. The \gls{RELU} function is a monotonic one-sided function,
also known as the ramp function. Its definition is given in
Equation~\ref{eq:relu}. \gls{RELU} has the downside that it can leave dead,
effectively unreachable nodes in a deep network; this is not a problem here
since the network has only one hidden layer. \gls{RELU} was also chosen for
its efficient computation and biologically inspired nature.

The activation function between the hidden layer and the output layer is the
sigmoid function in the case of binary classification; its definition is
shown in Equation~\ref{eq:sigmoid}. The sigmoid is a monotonic function that
is differentiable for all values of $x$ and always yields a non-negative
derivative. For multiclass classification the softmax function is used
between the hidden layer and the output layer. Softmax is an activation
function suitable for multiple output nodes. Its definition is given in
Equation~\ref{eq:softmax}.
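As a quick illustration of the three activation functions discussed above,
here is a minimal NumPy sketch; these are textbook definitions, not the
Keras implementations used in the experiments.

```python
import numpy as np

def relu(x):
    # Ramp function: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Monotonic, differentiable everywhere, derivative always non-negative.
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtract the max for numerical stability; outputs sum to one.
    e = np.exp(x - np.max(x))
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
r = relu(z)        # [0., 0., 2.]
s = sigmoid(0.0)   # 0.5
p = softmax(z)     # non-negative, sums to 1
```

The softmax outputs can be read as class probabilities, which is why it is
the natural choice for the multiclass output layer.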
The data is shuffled before being fed to the network to mitigate the risk of
overfitting on one album. Every model was trained for $10$ epochs, meaning
that all training data is offered to the model $10$ times, with a batch size
of $32$. The training set and test set are separated by taking a $90\%$ slice
of all the data.
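The shuffle-and-split step can be sketched as follows; the array sizes are
illustrative and not taken from the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Hypothetical feature matrix: 1000 samples of 13 MFCC coefficients.
features = rng.normal(size=(1000, 13))
labels = rng.integers(0, 2, size=1000)

# Shuffle features and labels together so album order cannot leak
# into the train/test boundary.
order = rng.permutation(len(features))
features, labels = features[order], labels[order]

# 90% training slice, remaining 10% for testing.
split = int(0.9 * len(features))
x_train, x_test = features[:split], features[split:]
y_train, y_test = labels[:split], labels[split:]
```

Shuffling before slicing matters here: without it, a contiguous $90\%$ slice
would align with whole albums and the held-out data would come from a single
album.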
\begin{equation}\label{eq:relu}
	f(x) = \left\{\begin{array}{rcl}
		0 & \text{for} & x < 0\\
		x & \text{for} & x \geq 0
	\end{array}\right.
\end{equation}
\includegraphics[width=.8\linewidth]{mcann}
\caption{Multiclass classifier network architecture}\label{fig:mcann}
\end{subfigure}
	\caption{Artificial Neural Network architectures.}
\end{figure}
\section{Experimental setup}
\subsection{Features}
The thirteen \gls{MFCC} features are used as the input. The parameters of the
\gls{MFCC} features are varied in window step and window length. The default
speech-processing parameters are tested, but also larger window sizes, since
arguably the minimal duration of a singing-voice segment is a lot larger than the