%&asr
\usepackage[nonumberlist,acronyms]{glossaries}
-\makeglossaries%
+%\makeglossaries%
\newacronym{ANN}{ANN}{Artificial Neural Network}
\newacronym{HMM}{HMM}{Hidden Markov Model}
\newacronym{GMM}{GMM}{Gaussian Mixture Models}
\newglossaryentry{dm}{name={Death Metal},
description={is an extreme heavy metal music style with growling vocals and
pounding drums}}
+\newglossaryentry{dom}{name={Doom Metal},
+ description={is an extreme heavy metal music style with growling vocals and
+ pounding drums played very slowly}}
\begin{document}
\frontmatter{}
%Introduction, leading to a clearly defined research question
\chapter{Introduction}
\section{Introduction}
-The \gls{IFPI} stated that about $43\%$ of music revenue rises from digital
-distribution. The overtake on physical formats took place somewhere in 2015 and
-since twenty years the music industry has seen significant
-growth~\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
+The primary medium for music distribution is rapidly changing from physical
+media to digital media. The \gls{IFPI} stated that about $43\%$ of music
+revenue arises from digital distribution. Another $39\%$ arises from
+physical sales and the remaining $16\%$ is made through performance and
+synchronisation revenues. Digital formats overtook physical formats
+somewhere in 2015. Moreover, after twenty years the music industry has seen
+significant growth
+again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
+
+There has always been an interest in lyrics-to-music alignment, for example
+for use in karaoke. As early as the late 1980s karaoke machines were
+available to consumers. While the lyrics for a track are almost always
+available, an alignment is not, and creating one involves manual labour.
A lot of this musical distribution goes via non-official channels such as
-YouTube~\footnote{\url{https://youtube.com}} in which fans of the musical group
-accompany the music with synchronized lyrics so that users can sing or read
-along. Because of this interest it is very useful to device automatic
-techniques for segmenting instrumental and vocal parts of a song and
-apply forced alignment or even lyrics recognition on the audio file.
+YouTube\footnote{\url{https://youtube.com}} in which fans of the performers
+often accompany the music with synchronized lyrics. This means that there is an
+enormous trove of lyrics-annotated music available, but it is not within our
+reach since the subtitles are almost always hardcoded into the video stream and
+thus not directly usable as data. Because of this interest it is very useful to
+devise automatic techniques for segmenting instrumental and vocal parts of a
+song and for applying forced alignment or even lyrics recognition to the audio
+file.
Such techniques are heavily researched and working systems have been created.
-However, these techniques are designed to detect a clean singing voice. Extreme
-genres such as \gls{dm} are using more extreme vocal techniques such as
-grunting or growling. It must be noted that grunting is not a technique only
-used in extreme metal styles. Similar or equal techniques have been used in
-\emph{Beijing opera}, Japanese \emph{Noh} and but also more western styles like
-jazz singing by Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be
-traced back to viking times. An arab merchant wrote in the tenth
-century~\cite{friis_vikings_2004}:
+However, these techniques are designed to detect a clean singing voice and have
+not been tested on so-called \emph{extended vocal techniques} such as grunting
+or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
+but it must be noted that grunting is not used only in extreme metal styles.
+Similar or equal techniques have been used in \emph{Beijing opera}, Japanese
+\emph{Noh} and also in more western styles like jazz singing
+by Louis Armstrong\cite{sakakibara_growl_2004}. It might even be traced back
+to Viking times. For example, an Arab merchant visiting a village in Denmark
+wrote in the tenth century\cite{friis_vikings_2004}:
\begin{displayquote}
Never before I have heard uglier songs than those of the Vikings in
howling, only more untamed.
\end{displayquote}
-%A majority of the music is not only instrumental but also contains vocal
-%segments.
-%
-%Music is a leading type of data distributed on the internet. Regular music
-%distribution is almost entirely digital and services like Spotify and YouTube
-%allow one to listen to almost any song within a few clicks. Moreover, there are
-%myriads of websites offering lyrics of songs.
-%
-%\todo{explain relevancy, (preprocessing for lyric alignment)}
-%
-%This leads to the following research question:
-%\begin{center}\em%
-% Are standard \gls{ANN} based techniques for singing voice detection
-% suitable for non-standard musical genres like Death metal.
-%\end{center}
+\section{\gls{dm}}
%Literature overview / related work
\section{Related work}
The field of applying standard speech processing techniques on music started in
-the late 90s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
+the late 90s\cite{saunders_real-time_1996,scheirer_construction_1997} and it
was found that music has different discriminating features compared to normal
speech.
To run the experiments data has been collected from several \gls{dm} albums.
The exact data used is available in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono channel waveform with the
-correct samplerate \emph{SoX}~\footnote{\url{http://sox.sourceforge.net/}}.
-When the waveforms are finished they are converted to \glspl{MFCC} vectors
-using the \emph{python\_speech\_features}%
-~\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
-All these steps combined results in thirteen tab separated features per line in
-a file for every source file. Every file is annotated using
-Praat~\cite{boersma_praat_2002} where the utterances are manually aligned to
-the audio. An example of an utterances are shown in
-Figures~\ref{fig:bloodstained,fig:abominations}. It is clearly visible that
-within the genre of death metal there are a lot of different spectral patterns
+correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
+Every file is annotated using
+Praat\cite{boersma_praat_2002} where the utterances are manually aligned to
+the audio. Examples of utterances are shown in
+Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations} where the
+waveform, $1$--$8000$\,Hz spectrogram and annotations are shown. It is clear
+that within the genre of death metal different spectral patterns are
visible.
\begin{figure}[ht]
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}
-The data is collected from two\todo{more in the future}\ studio albums. The first
-band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
-25 years and have been creating the same type every album. The singer of
+The data is collected from three studio albums. The
+first band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for
+almost 25 years, creating the same type of music on every album. The singer of
\emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
-comprehensible. The second band is called \emph{Disgorge} and make even more
-violent music. The growls of the lead singer sound more like a coffee grinder
-and are more shallow. The lyrics are completely incomprehensible and therefore
-some parts are not annotated with lyrics because it was too difficult to hear
-what was being sung.
-
-\section{Methods}
-\todo{To remove in final thesis}
-The initial planning is still up to date. About one and a half album has been
-annotated and a framework for setting up experiments has been created.
-Moreover, the first exploratory experiments are already been executed and
-promising. In April the experimental dataset will be expanded and I will try to
-mimic some of the experiments done in the literature to see whether it performs
-similar on Death Metal
-\begin{table}[ht]
- \centering
- \begin{tabular}{cll}
- \toprule
- Month & Description\\
- \midrule
- March
- & Preparing the data\\
- & Preparing an experiment platform\\
- & Literature research\\
- April
- & Running the experiments\\
- & Fiddle with parameters\\
- & Explore the possibilities for forced alignment\\
- May
- & Write up the thesis\\
- & Possibly do forced alignment\\
- June
- & Finish up thesis\\
- & Wrap up\\
- \bottomrule
- \end{tabular}
- \caption{Outline}
-\end{table}
+comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
+regular shouting.
+
+The second band is called \emph{Disgorge} and makes even more violent-sounding
+music. The growls of the lead singer sound like a coffee grinder and are more
+shallow. In the spectrogram it is clearly visible that overtones are
+produced during some parts of the growling. The lyrics are completely
+incomprehensible and therefore some parts were not annotated with the actual
+lyrics because it was not possible to make out what was being sung.
+
+Lastly, a band from Moscow is chosen bearing the name \emph{Who Dies in
+Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
+bands because they create \gls{dom}. \gls{dom} is characterized by a very
+slow tempo and low-tuned guitars. The vocalist has a very characteristic growl
+and performs in several Moscow-based bands. This band also stands out because
+it uses pianos and synthesizers. The droning synthesizers often operate in the
+same frequency range as the vocals.
+
+\section{\gls{MFCC} Features}
+The waveforms are converted to \gls{MFCC} feature vectors using the
+\emph{python\_speech\_features}%
+\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
+All these steps combined result in thirteen tab-separated features per line in
+a file for every source file. Technical details about the processing steps are
+given in the following sections.
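+
+As a rough illustration of the conversion described above, the sketch below
+extracts thirteen coefficients per frame and writes one frame per line with
+tab-separated values. The helper names \texttt{extract\_mfcc} and
+\texttt{write\_features}, the six-decimal formatting and the use of
+\texttt{scipy} to read the waveform are assumptions made for illustration; all
+other extraction parameters are left at the package defaults and may differ
+from the ones actually used.

```python
import io


def write_features(frames, out):
    """Write one feature vector per line, coefficients separated by tabs."""
    for frame in frames:
        out.write("\t".join(f"{c:.6f}" for c in frame) + "\n")


def extract_mfcc(wav_path, numcep=13):
    """Return a (frames x numcep) array of MFCCs for a mono waveform.

    Hypothetical helper: third-party packages are imported lazily so that
    write_features stays usable without them.
    """
    from scipy.io import wavfile
    from python_speech_features import mfcc

    rate, signal = wavfile.read(wav_path)
    # numcep=13 matches the thirteen features per line mentioned in the text.
    return mfcc(signal, samplerate=rate, numcep=numcep)


if __name__ == "__main__":
    # Demonstrate the output format with two made-up two-dimensional frames.
    buf = io.StringIO()
    write_features([[1.0, 2.5], [3.0, 4.0]], buf)
    print(buf.getvalue(), end="")
```

+For a real track one would pass the result of \texttt{extract\_mfcc} to
+\texttt{write\_features} together with an opened output file.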
\todo{Explain why MFCC and which parameters}
+
+\section{\gls{ANN} Classifier}
\todo{Spectrals might be enough, no decorrelation}
+\section{Model training}
+
\section{Experiments}
\section{Results}
19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
+ 22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
+ 23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
+ 24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
+ 25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
+ 26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
+ 27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
+ 28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
+ \midrule
+ & & & Total: & 02:13:40\\
\bottomrule
\end{tabular}
\caption{Songs used in the experiments}