[asr1617.git] / asr.tex
diff --git a/asr.tex b/asr.tex

index 4899bd1..f149b1d 100644 (file)
--- a/asr.tex
+++ b/asr.tex
@@ -1,6 +1,6 @@
  %&asr
  \usepackage[nonumberlist,acronyms]{glossaries}
-\makeglossaries%
+%\makeglossaries%
  \newacronym{ANN}{ANN}{Artificial Neural Network}
  \newacronym{HMM}{HMM}{Hidden Markov Model}
  \newacronym{GMM}{GMM}{Gaussian Mixture Models}
@@ -52,62 +52,84 @@
  %Introduction, leading to a clearly defined research question
  \chapter{Introduction}
  \section{Introduction}
-The \gls{IFPI} stated that about $43\%$ of music revenue rises from digital
-distribution. The overtake on physical formats took place somewhere in 2015 and
-since twenty years the music industry has seen significant
-growth~\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
+The primary medium for music distribution is rapidly changing from physical
+media to digital media. The \gls{IFPI} stated that about $43\%$ of music
+revenue arises from digital distribution. Another $39\%$ arises from
+physical sales and the remaining $16\%$ is made through performance and
+synchronisation revenues. Digital formats overtook physical formats
+somewhere in 2015. Moreover, for the first time in twenty years the music
+industry has seen significant growth
+again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
+
+There has always been an interest in lyrics-to-music alignment for use in,
+for example, karaoke. As early as the late 1980s karaoke machines were
+available to consumers. While the lyrics for a track are almost always
+available, an alignment is not, and creating one involves manual
+labour.
  
  A lot of this musical distribution goes via non-official channels such as
-YouTube~\footnote{\url{https://youtube.com}} in which fans of the musical group
-accompany the music with synchronized lyrics so that users can sing or read
-along. Because of this interest it is very useful to device automatic
-techniques for segmenting instrumental and vocal parts of a song and
-apply forced alignment or even lyrics recognition on the audio file.
+YouTube\footnote{\url{https://youtube.com}}, where fans of the performers
+often accompany the music with synchronized lyrics. This means that there is an
+enormous treasure of lyrics-annotated music available but not within our reach,
+since the subtitles are almost always hardcoded into the video stream and thus
+not directly usable as data. Because of this interest it is very useful to
+devise automatic techniques for segmenting instrumental and vocal parts of a
+song and applying forced alignment or even lyrics recognition to the audio file.
  
+Such techniques are heavily researched and working systems have been created.
+However, these techniques are designed to detect a clean singing voice and have
+not been tested on so-called \emph{extended vocal techniques} such as grunting
+or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
+but it must be noted that grunting is not a technique only used in extreme
+metal styles. Similar or identical techniques have been used in \emph{Beijing
+opera}, Japanese \emph{Noh} and also in more western styles like jazz singing
+by Louis Armstrong\cite{sakakibara_growl_2004}. It might even be traced back
+to Viking times. For example, an Arab merchant visiting a village in Denmark
+wrote in the tenth century\cite{friis_vikings_2004}:
  
-%A majority of the music is not only instrumental but also contains vocal
-%segments.
-%
-%Music is a leading type of data distributed on the internet. Regular music
-%distribution is almost entirely digital and services like Spotify and YouTube
-%allow one to listen to almost any song within a few clicks. Moreover, there are
-%myriads of websites offering lyrics of songs.
-%
-%\todo{explain relevancy, (preprocessing for lyric alignment)}
-%
-%This leads to the following research question:
-%\begin{center}\em%
-%      Are standard \gls{ANN} based techniques for singing voice detection
-%      suitable for non-standard musical genres like Death metal.
-%\end{center}
+\begin{displayquote}
+       Never before I have heard uglier songs than those of the Vikings in
+       Slesvig. The growling sound coming from their throats reminds me of dogs
+       howling, only more untamed.
+\end{displayquote}
+
+\section{\gls{dm}}
  
  %Literature overview / related work
  \section{Related work}
  The field of applying standard speech processing techniques on music started in
-the late 90s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
+the late 90s\cite{saunders_real-time_1996,scheirer_construction_1997} and it
  was found that music has different discriminating features compared to normal
  speech.
  
  Berenzweig and Ellis expanded on the aforementioned research by trying to
  separate singing from instrumental music\cite{berenzweig_locating_2001}.
  
-\todo{Incorporate this in literary framing}
-~\cite{fujihara_automatic_2006}
-~\cite{fujihara_lyricsynchronizer:_2011}
-~\cite{fujihara_three_2008}
-~\cite{mauch_integrating_2012}
-~\cite{mesaros_adaptation_2009}
-~\cite{mesaros_automatic_2008}
-~\cite{mesaros_automatic_2010}
-~%\cite{muller_multimodal_2012}
-~\cite{pedone_phoneme-level_2011}
-~\cite{yang_machine_2012}
+\todo{Incorporate this in literary framing}%
+~\cite{fujihara_automatic_2006}%
+~\cite{fujihara_lyricsynchronizer:_2011}%
+~\cite{fujihara_three_2008}%
+~\cite{mauch_integrating_2012}%
+~\cite{mesaros_adaptation_2009}%
+~\cite{mesaros_automatic_2008}%
+~\cite{mesaros_automatic_2010}%
+~%\cite{muller_multimodal_2012}%
+~\cite{pedone_phoneme-level_2011}%
+~\cite{yang_machine_2012}%
+
+
  
  \section{Research question}
-This leads to the following research question:
+It is debatable whether the aforementioned techniques work, because the
+spectral properties of a growling voice are different from the spectral
+properties of a clean singing voice. It has been found that growling voices
+have less prominent peaks in the frequency representation and are closer to
+noise than clean singing\cite{kato_acoustic_2013}. This leads us to the
+research question:
+
  \begin{center}\em%
         Are standard \gls{ANN} based techniques for singing voice detection
-       suitable for non-standard musical genres like Death metal.
+       suitable for non-standard musical genres like \gls{dm}?
  \end{center}
  
  \chapter{Methods}
@@ -115,19 +137,21 @@ This leads to the following research question:
  
  %Experiment(s) (set-up, data, results, discussion)
  \section{Data \& Preprocessing}
-To run the experiments we have collected data from several \gls{dm} albums. The
-exact data used is available in Appendix~\ref{app:data}. The albums are
+To run the experiments, data has been collected from several \gls{dm} albums.
+The exact data used is available in Appendix~\ref{app:data}. The albums are
  extracted from the audio CD and converted to a mono channel waveform with the
-correct samplerate \emph{SoX}~\footnote{\url{http://sox.sourceforge.net/}}.
+correct samplerate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
  When the waveforms are finished they are converted to \glspl{MFCC} vectors
  using the \emph{python\_speech\_features}%
-~\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
+\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
  All these steps combined results in thirteen tab separated features per line in
-a file for every source file. Every file is annotated using
-Praat~\cite{boersma_praat_2002} where the utterances are manually
-aligned to the audio. An example of an utterances are shown in
-Figures~\ref{fig:bloodstained,fig:abominations}. It is clearly visible that
-within the genre of death metal there are a lot of different spectral patterns
+a file for every source file. Technical details about the processing steps are
+given in the following sections. Every file is annotated using
+Praat\cite{boersma_praat_2002}, where the utterances are manually aligned to
+the audio. Examples of utterances are shown in
+Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which show the
+waveform, the $1$--$8000$\,Hz spectrogram and the annotations. It is clear
+that within the genre of death metal there are many different spectral patterns
  visible.
  
  \begin{figure}[ht]
@@ -187,11 +211,24 @@ similar on Death Metal
         \caption{Outline}
  \end{table}
  
+\section{Features}
+
+
+\todo{Explain why MFCC and which parameters}
+\todo{Spectrals might be enough, no decorrelation}
+
+\section{Experiments}
+
  \section{Results}
  
  
  \chapter{Conclusion \& Discussion}
  %Discussion section
+\todo{Novelty}
+\todo{Weaknesses}
+\todo{Dataset is not very varied but\ldots}
+
+\todo{Doom metal}
  %Conclusion section
  %Acknowledgements
  %Statement on authors' contributions