The data is collected from three studio albums. The first band is called
\emph{Cannibal Corpse} and has been producing \gls{dm} for almost 25 years and
has been creating albums with a consistent style. The singer of \emph{Cannibal
Corpse} has a very raspy growl and the lyrics are quite comprehensible. The
vocals produced by \emph{Cannibal Corpse} border regular shouting.
The second band is called \emph{Disgorge} and makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are more
shallow. In the spectrograms it is clearly visible that overtones are produced
during some parts of the growling. The lyrics are completely
incomprehensible and therefore some parts were not annotated with the actual
lyrics because it was impossible to hear what was being sung.
Lastly, a band from Moscow is chosen, bearing the name \emph{Who Dies in
Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
\section{\acrlong{MFCC} Features}
The waveforms by themselves are not very suitable as features due to their
high dimensionality and correlation. Therefore we use the widely used
\gls{MFCC} feature vectors, which have been shown to be suitable%
\cite{rocamora_comparing_2007}. It has also been found that altering the mel
scale to better suit singing does not yield a better
performance\cite{you_comparative_2015}. The actual conversion is done using the
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
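For illustration, the incremental construction described below can be sketched
from scratch in NumPy. The window length, step size, filter count and number
of cepstral coefficients here are common defaults, not necessarily the exact
settings used in our extraction:

```python
import numpy as np

def simple_mfcc(signal, rate, win_len=0.025, win_step=0.01,
                n_filt=26, n_cep=13):
    """Minimal MFCC sketch following the three steps described in the text."""
    # Step 1: sliding-window power spectrum, then a triangular mel
    # filterbank for a more tonotopic (cochlea-like) representation.
    n = int(win_len * rate)
    step = int(win_step * rate)
    frames = np.array([signal[i:i + n] * np.hamming(n)
                       for i in range(0, len(signal) - n + 1, step)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    n_fft = power.shape[1]

    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz2mel(rate / 2.0), n_filt + 2)
    bins = np.floor((n_fft - 1) * mel2hz(mel_pts) / (rate / 2.0)).astype(int)
    fbank = np.zeros((n_filt, n_fft))
    for j in range(n_filt):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    fb_energy = power @ fbank.T

    # Step 2: take the log of the filterbank energies (Weber-Fechner:
    # loudness is perceived in logarithmic increments).
    log_energy = np.log(fb_energy + 1e-10)

    # Step 3: decorrelate with a DCT-II and keep the first n_cep
    # coefficients per frame.
    k = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_cep), 2 * k + 1) / (2 * n_filt))
    return log_energy @ dct.T
```

In practice the library call does all of this in one step; the sketch only
makes the intermediate representations explicit.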
\gls{MFCC} features are inspired by human auditory processing and are
created from a waveform incrementally using several steps:
\begin{enumerate}
    \item The first step in the process is converting the time representation
        of the signal to a spectral representation using a sliding window
        with triangular overlapping windows, yielding a more tonotopic
        representation that tries to match the actual representation in the
        cochlea of the human ear.
    \item The \emph{Weber-Fechner} law describes how humans perceive physical
        magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
        Psychophysik}. It was found that energy is perceived in logarithmic
increments. This means that twice the amount of decibels does not mean
        twice the amount of perceived loudness. Therefore we take the log of
        the energy or amplitude of the \gls{MS} spectrum to more closely
        match human hearing.
\item The amplitudes of the spectrum are highly correlated and therefore
the last step is a decorrelation step. \Gls{DCT} is applied on the
amplitudes interpreted as a signal. \Gls{DCT} is a technique of
\subsection{\acrlong{ANN}}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary and four-class, so it is
interesting to see where the bottleneck lies and how abstract the
representation can become. The \gls{ANN} is built with Keras\footnote{\url{https://keras.io}}
using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
backend, which provides a high-level interface to the underlying networks.
in Equation~\ref{eq:softmax}.
The data is shuffled before being fed to the network to mitigate the risk of
overfitting on one album. Every model was trained using $10$ epochs and a
batch size of $32$.
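The two activation functions used in these networks can be written compactly.
The following is a plain NumPy sketch of the definitions, not the Keras
internals:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: passes positive values, zeroes out the rest.
    return np.maximum(0.0, x)

def softmax(x):
    # Numerically stable softmax: shift by the max before exponentiating,
    # then normalise so the outputs sum to one (class probabilities).
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```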
\begin{equation}\label{eq:relu}
\end{subfigure}%
%
\begin{subfigure}{.5\textwidth}
    \centering
\includegraphics[width=.8\linewidth]{mcann}
\caption{Multiclass classifier network architecture}\label{fig:mcann}
\end{subfigure}
\caption{Plotting the classifier under similar alien data}\label{fig:alien1}
\end{figure}
To really test the limits, a song from the highly atmospheric doom metal band
called \emph{Catacombs} has been tested on the system. The album \emph{Echoes
Through the Catacombs} features a lot of synthesizers, heavy droning guitars
and bass lines. The vocals are not mixed in a way that makes
\begin{figure}[H]
\centering
    \includegraphics[width=.6\linewidth]{alien2}
\caption{Plotting the classifier under different alien data}\label{fig:alien2}
\end{figure}