X-Git-Url: https://git.martlubbers.net/?a=blobdiff_plain;f=asr.tex;h=3f5eaaf8f02197197b5e58c1fec243ed80cdfb15;hb=0ada197a78af4323b8cf5efc508a5aab3d80e4b2;hp=e7c62977fdba7bbd93c512890811a93603603488;hpb=f945aee6ab335b268bf25d476942ea8471916382;p=asr1617.git

diff --git a/asr.tex b/asr.tex
index e7c6297..3f5eaaf 100644
--- a/asr.tex
+++ b/asr.tex
@@ -1,6 +1,6 @@
 %&asr
 \usepackage[nonumberlist,acronyms]{glossaries}
-\makeglossaries%
+%\makeglossaries%
 \newacronym{ANN}{ANN}{Artificial Neural Network}
 \newacronym{HMM}{HMM}{Hidden Markov Model}
 \newacronym{GMM}{GMM}{Gaussian Mixture Models}
@@ -9,9 +9,18 @@
 \newacronym{FA}{FA}{Forced alignment}
 \newacronym{MFC}{MFC}{Mel-frequency cepstrum}
 \newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
+\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
 \newglossaryentry{dm}{name={Death Metal},
	description={is an extreme heavy metal music style with growling vocals and
	pounding drums}}
+\newglossaryentry{dom}{name={Doom Metal},
+	description={is an extreme heavy metal music style with growling vocals and
+	pounding drums played very slowly}}
+\newglossaryentry{FT}{name={Fourier Transform},
+	description={is a technique for converting a signal from a time
+	representation to a frequency representation}}
+\newglossaryentry{MS}{name={Mel-Scale},
+	description={is a frequency scale modelled on the human ear's perception
+	of pitch}}

 \begin{document}
 \frontmatter{}
@@ -26,80 +35,126 @@
 \tableofcontents

 %Glossaries
-\glsaddall{}
-\printglossaries%
+%\glsaddall{}
+%\printglossaries

 \mainmatter{}
-Berenzweig and Ellis use acoustic classifiers from speech recognition as a
-detector for singing lines. They achive 80\% accuracy for forty 15 second
-exerpts. They mention people that wrote signal features that discriminate
-between speech and music. Neural net
-\glspl{HMM}~\cite{berenzweig_locating_2001}. 
-
-In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
-polyphonic turkish music, this might be interesting to use for heavy metal.
-They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
-phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
-detection, then melody extraction, then alignment. They compare results with
-Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
-specialize in long syllables in a capella. They use \glspl{DHMM} with
-\glspl{GMM} and show that adding knowledge increases alignment (bejing opera
-has long syllables)~\cite{dzhambazov_automatic_2016}.

-t\cite{fujihara_automatic_2006}
-t\cite{fujihara_lyricsynchronizer:_2011}
-t\cite{fujihara_three_2008}
-t\cite{mauch_integrating_2012}
-t\cite{mesaros_adaptation_2009}
-t\cite{mesaros_automatic_2008}
-t\cite{mesaros_automatic_2010}
-t\cite{muller_multimodal_2012}
-t\cite{pedone_phoneme-level_2011}
-t\cite{yang_machine_2012}
+%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
+%detector for singing lines. They achieve 80\% accuracy on forty 15-second
+%excerpts. They mention earlier work on signal features that discriminate
+%between speech and music. Neural net
+%\glspl{HMM}~\cite{berenzweig_locating_2001}.
+%
+%In 2014 Dzhambazov et al.\ applied state-of-the-art segmentation methods to
+%polyphonic Turkish music; this might be interesting to use for heavy metal.
+%They mention Fujihara (2011) as having a similar \gls{FA} system. This method
+%uses phone-level segmentation with the first 12 \gls{MFCC}s. They first do
+%vocal/non-vocal detection, then melody extraction, then alignment. They
+%compare results with Mesaros \& Virtanen,
+%2008~\cite{dzhambazov_automatic_2014}. Later they specialize in long
+%syllables sung a cappella. They use \glspl{DHMM} with \glspl{GMM} and show
+%that adding knowledge improves alignment (Beijing opera has long
+%syllables)~\cite{dzhambazov_automatic_2016}. 
+
%
 %Introduction, leading to a clearly defined research question
 \chapter{Introduction}
 \section{Introduction}
-Music is a leading type of data distributed on the internet. Regular music
-distribution is almost entirely digital and services like Spotify and YouTube
-allow one to listen to almost any song within a few clicks. Moreover, there are
-myriads of websites offering lyrics of songs.
+The primary medium for music distribution is rapidly changing from physical
+media to digital media. The \gls{IFPI} reported that about $43\%$ of music
+revenue arises from digital distribution, another $39\%$ from physical sales
+and the remaining $16\%$ from performance and synchronisation revenues.
+Digital formats overtook physical formats sometime in 2015. Moreover, for the
+first time in twenty years the music industry has seen significant growth
+again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

-\todo{explain relevancy, (preprocessing for lyric alignment)}
+There has always been an interest in lyrics-to-music alignment, for example
+for karaoke. As early as the late 1980s, karaoke machines were available to
+consumers. While the lyrics for a track are almost always available, an
+alignment is not, and creating one involves manual labour.

-This leads to the following research question:
-\begin{center}\em%
-    Are standard \gls{ANN} based techniques for singing voice detection
-    suitable for non-standard musical genres like Death metal.
-\end{center}
+A lot of music distribution happens through unofficial channels such as
+YouTube\footnote{\url{https://youtube.com}}, where fans of the performers
+often accompany the music with synchronised lyrics. This means that an
+enormous treasure of lyrics-annotated music is available, but out of our
+reach, since the subtitles are almost always hardcoded into the video stream
+and thus not directly usable as data. 
Because of this interest it is very useful to
+devise automatic techniques for segmenting the instrumental and vocal parts
+of a song, applying forced alignment, or even performing lyrics recognition
+on the audio file.
+
+Such techniques are heavily researched and working systems have been created.
+However, these techniques are designed to detect a clean singing voice and
+have not been tested on so-called \emph{extended vocal techniques} such as
+grunting or growling. Growling is heavily used in extreme metal genres such
+as \gls{dm}, but it must be noted that grunting is not a technique used only
+in extreme metal styles. Similar or identical techniques have been used in
+\emph{Beijing opera} and Japanese \emph{Noh}, but also in more Western styles
+such as the jazz singing of Louis Armstrong~\cite{sakakibara_growl_2004}. It
+might even be traced back to Viking times. For example, an Arab merchant
+visiting a village in Denmark wrote in the tenth
+century~\cite{friis_vikings_2004}:
+
+\begin{displayquote}
+    Never before have I heard uglier songs than those of the Vikings in
+    Slesvig. The growling sound coming from their throats reminds me of dogs
+    howling, only more untamed.
+\end{displayquote}
+
+\section{\gls{dm}}

 %Literature overview / related work
 \section{Related work}
+The field of applying standard speech processing techniques to music started
+in the late 1990s~\cite{saunders_real-time_1996,scheirer_construction_1997},
+and it was found that music has different discriminating features than normal
+speech.
+
+Berenzweig and Ellis expanded on the aforementioned research by trying to
+separate singing from instrumental music~\cite{berenzweig_locating_2001}.

-Singing/non-singing detection has been fairecent topic of interest in the
-academia. Just in 2001 Berenzweig and Ellis~\cite{berenzweig_locating_2001}
-researched singing voice detection in stead of the more founded topic of
-discerning music from regular speech. 
In their research

+\todo{Incorporate this in literary framing}%
+~\cite{fujihara_automatic_2006}%
+~\cite{fujihara_lyricsynchronizer:_2011}%
+~\cite{fujihara_three_2008}%
+~\cite{mauch_integrating_2012}%
+~\cite{mesaros_adaptation_2009}%
+~\cite{mesaros_automatic_2008}%
+~\cite{mesaros_automatic_2010}%
+~%\cite{muller_multimodal_2012}%
+~\cite{pedone_phoneme-level_2011}%
+~\cite{yang_machine_2012}%
+
+\section{Research question}
+It is debatable whether the aforementioned techniques work on growling,
+because the spectral properties of a growling voice differ from the spectral
+properties of a clean singing voice. It has been found that growling voices
+have less prominent peaks in the frequency representation and are closer to
+noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
+research question:
+
+\begin{center}\em%
+    Are standard \gls{ANN}-based techniques for singing voice detection
+    suitable for non-standard musical genres like \gls{dm}?
+\end{center}

 \chapter{Methods}
 %Methodology
 %Experiment(s) (set-up, data, results, discussion)
 \section{Data \& Preprocessing}
-To run the experiments we have collected data from several \gls{dm} albums. The
-exact data used is available in Appendix~\ref{app:data}. The albums are
+To run the experiments, data has been collected from several \gls{dm} albums.
+The exact data used is available in Appendix~\ref{app:data}. The albums are
 extracted from the audio CD and converted to a mono channel waveform with the
-correct samplerate \emph{SoX}~\footnote{\url{http://sox.sourceforge.net/}}.
-When the waveforms are finished they are converted to \glspl{MFCC} vectors
-using the \emph{python\_speech\_features}%
-~\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
-All these steps combined results in thirteen tab separated features per line in
-a file for every source file. 
Every file is annotated using
-Praat~\cite{boersma_praat_2002} where the utterances are manually
-aligned to the audio. An example of an utterances are shown in
-Figures~\ref{fig:bloodstained,fig:abominations}. It is clearly visible that
-within the genre of death metal there are a lot of different spectral patterns
+correct samplerate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
+Every file is annotated using
+Praat~\cite{boersma_praat_2002}, in which the utterances are manually aligned
+to the audio. Examples of utterances are shown in
+Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, where the
+waveform, the $1$--$8000$Hz spectrogram and the annotations are shown. It is
+clearly visible that within the genre of death metal different spectral
+patterns are
 visible.

 \begin{figure}[ht]
@@ -116,18 +171,75 @@ visible.
     \emph{Enthroned Abominations}}\label{fig:abominations}
 \end{figure}

-The data is collected from two\todo{more in the future}\ studio albums. The first
-band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
-25 years and have been creating the same type every album. The singer of
+The data is collected from three studio albums. The
+first band is called \emph{Cannibal Corpse}, which has been producing \gls{dm}
+for almost 25 years and has kept the same style on every album. The singer of
 \emph{Cannibal Corpse} has a very raspy growl and the lyrics are quite
-comprehensible. The second band is called \emph{Disgorge} and make even more
-violent music. The growls of the lead singer sound more like a coffee grinder
-and are more shallow. The lyrics are completely incomprehensible and therefore
-some parts are not annotated with lyrics because it was too difficult to hear
-what was being sung.
+comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
+regular shouting.
+
+The second band is called \emph{Disgorge} and makes even more violent-sounding
+music. 
The growls of the lead singer sound like a coffee grinder and are more
+shallow. In the spectrogram it is clearly visible that overtones are produced
+during some parts of the growling. The lyrics are completely incomprehensible
+and therefore some parts were not annotated with the actual lyrics because it
+was not possible to hear what was being sung.
+
+Lastly, a band from Moscow bearing the name \emph{Who Dies in Siberian Slush}
+is chosen. This band is a little odd compared to the previous \gls{dm} bands
+because they create \gls{dom}. \gls{dom} is characterized by very slow tempos
+and low-tuned guitars. The vocalist has a very characteristic growl and
+performs in several Moscow-based bands. This band also stands out because it
+uses pianos and synthesizers. The droning synthesizers often operate in the
+same frequency range as the vocals.
+
+\section{\gls{MFCC} Features}
+The waveforms themselves are not very suitable to use as features due to
+their high dimensionality and correlation. Therefore we use the often used
+\glspl{MFCC} feature vectors.\todo{cite which papers use this} The actual
+conversion is done using the \emph{python\_speech\_features}%
+\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
+
+\gls{MFCC} features are nature-inspired and built incrementally in several
+steps.
+\begin{enumerate}
+    \item The first step in the process is converting the time representation
+        of the signal to a spectral representation using a sliding window with
+        overlap. The width of the window and the step size are two important
+        parameters in the system. In classical phonetic analysis, window sizes
+        of $25ms$ with a step of $10ms$ are often chosen because they are
+        small enough to only contain subphone entities. No sung phone is as
+        short as $25ms$, so it is arguable whether such a small window size is
+        optimal for singing.
+    \item The standard \gls{FT} gives a spectral representation that has
+        linearly scaled frequencies. 
This scale is converted to the \gls{MS}
+        using overlapping triangular windows, where a frequency $f$ in Hz is
+        mapped to mels as $m = 2595\log_{10}\left(1 + \frac{f}{700}\right)$.
+    \item Finally, the logarithm of the energy in every \gls{MS} band is
+        taken and the bands are decorrelated using a discrete cosine
+        transform. The first coefficients of this transform form the
+        \gls{MFCC} feature vector.
+\end{enumerate}
+
+\todo{Explain why MFCC and which parameters}
+
+\section{\gls{ANN} Classifier}
+\todo{Spectrals might be enough, no decorrelation}
+
+\section{Model training}
+
+\section{Experiments}
+
+\section{Results}
+
 \chapter{Conclusion \& Discussion}
+\section{Conclusion}
 %Discussion section
+
+\section{Discussion}
+
+\todo{Novelty}
+\todo{Weaknesses}
+\todo{Dataset is not very varied but\ldots}
+
+\todo{Doom metal}
 %Conclusion section
 %Acknowledgements
 %Statement on authors' contributions
@@ -162,6 +274,15 @@ what was being sung.
 	19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
 	20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
 	21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
+	22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
+	23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
+	24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
+	25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
+	26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
+	27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
+	28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
+	\midrule
+	& & & Total: & 02:13:40\\
 \bottomrule
 \end{tabular}
 \caption{Songs used in the experiments}