mfcc

diff --git a/asr.tex b/asr.tex

index 2b812d1..3f5eaaf 100644 (file)
--- a/asr.tex
+++ b/asr.tex
@@ -1,6 +1,6 @@
  %&asr
  \usepackage[nonumberlist,acronyms]{glossaries}
-\makeglossaries%
+%\makeglossaries%
  \newacronym{ANN}{ANN}{Artificial Neural Network}
  \newacronym{HMM}{HMM}{Hidden Markov Model}
  \newacronym{GMM}{GMM}{Gaussian Mixture Models}
@@ -13,6 +13,14 @@
  \newglossaryentry{dm}{name={Death Metal},
         description={is an extreme heavy metal music style with growling vocals and
         pounding drums}}
+\newglossaryentry{dom}{name={Doom Metal},
+       description={is an extreme heavy metal music style with growling vocals and
+       pounding drums played very slowly}}
+\newglossaryentry{FT}{name={Fourier Transform},
+       description={is a technique for converting a time-domain representation of
+       a signal into a frequency-domain representation}}
+\newglossaryentry{MS}{name={Mel-Scale},
+       description={is a scale for spectral signals inspired by the human ear}}
  
  \begin{document}
  \frontmatter{}
@@ -52,27 +60,40 @@
  %Introduction, leading to a clearly defined research question
  \chapter{Introduction}
  \section{Introduction}
-The \gls{IFPI} stated that about $43\%$ of music revenue rises from digital
-distribution. The overtake on physical formats took place somewhere in 2015 and
-since twenty years the music industry has seen significant
-growth~\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
+The primary medium for music distribution is rapidly changing from physical
+media to digital media. The \gls{IFPI} stated that about $43\%$ of music
+revenue comes from digital distribution. Another $39\%$ comes from physical
+sales and the remaining $16\%$ is made through performance and
+synchronisation revenues. Digital formats overtook physical formats
+somewhere in 2015. Moreover, for the first time in twenty years the music
+industry has seen significant growth
+again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
+
+There has always been an interest in aligning lyrics to music, for example
+for use in karaoke. As early as the late 1980s karaoke machines were
+available to consumers. While the lyrics for a track are almost always
+available, an alignment is not, and creating such an alignment involves
+manual labour.
  
  A lot of this musical distribution goes via non-official channels such as
-YouTube~\footnote{\url{https://youtube.com}} in which fans of the musical group
-accompany the music with synchronized lyrics so that users can sing or read
-along. Because of this interest it is very useful to device automatic
-techniques for segmenting instrumental and vocal parts of a song and
-apply forced alignment or even lyrics recognition on the audio file.
+YouTube\footnote{\url{https://youtube.com}}, on which fans of the performers
+often accompany the music with synchronized lyrics. This means that there is an
+enormous treasure of lyrics-annotated music available, but it is not within our
+reach since the subtitles are almost always hardcoded into the video stream and
+thus not directly usable as data. Because of this interest it is very useful to
+devise automatic techniques for segmenting instrumental and vocal parts of a
+song, applying forced alignment or even recognising lyrics in the audio file.
  
  Such techniques are heavily researched and working systems have been created.
-However, these techniques are designed to detect a clean singing voice. Extreme
-genres such as \gls{dm} are using more extreme vocal techniques such as
-grunting or growling. It must be noted that grunting is not a technique only
-used in extreme metal styles. Similar or equal techniques have been used in
-\emph{Beijing opera}, Japanese \emph{Noh} and but also more western styles like
-jazz singing by Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be
-traced back to viking times. An arab merchant wrote in the tenth
-century~\cite{friis_vikings_2004}:
+However, these techniques are designed to detect a clean singing voice and have
+not been tested on so-called \emph{extended vocal techniques} such as grunting
+or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
+but it must be noted that grunting is not a technique used only in extreme
+metal styles. Similar or identical techniques have been used in \emph{Beijing
+opera}, Japanese \emph{Noh} and also in more Western styles like jazz singing
+by Louis Armstrong\cite{sakakibara_growl_2004}. It might even be traced back
+to Viking times. For example, an Arab merchant visiting a village in Denmark
+wrote in the tenth century\cite{friis_vikings_2004}:
  
  \begin{displayquote}
         Never before I have heard uglier songs than those of the Vikings in
@@ -80,26 +101,12 @@ century~\cite{friis_vikings_2004}:
         howling, only more untamed.
  \end{displayquote}
  
-%A majority of the music is not only instrumental but also contains vocal
-%segments.
-%
-%Music is a leading type of data distributed on the internet. Regular music
-%distribution is almost entirely digital and services like Spotify and YouTube
-%allow one to listen to almost any song within a few clicks. Moreover, there are
-%myriads of websites offering lyrics of songs.
-%
-%\todo{explain relevancy, (preprocessing for lyric alignment)}
-%
-%This leads to the following research question:
-%\begin{center}\em%
-%      Are standard \gls{ANN} based techniques for singing voice detection
-%      suitable for non-standard musical genres like Death metal.
-%\end{center}
+\section{\gls{dm}}
  
  %Literature overview / related work
  \section{Related work}
  The field of applying standard speech processing techniques on music started in
-the late 90s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
+the late 90s\cite{saunders_real-time_1996,scheirer_construction_1997} and it
  was found that music has different discriminating features compared to normal
  speech.
  
@@ -141,16 +148,13 @@ research question:
  To run the experiments data has been collected from several \gls{dm} albums.
  The exact data used is available in Appendix~\ref{app:data}. The albums are
  extracted from the audio CD and converted to a mono channel waveform with the
-correct samplerate \emph{SoX}~\footnote{\url{http://sox.sourceforge.net/}}.
-When the waveforms are finished they are converted to \glspl{MFCC} vectors
-using the \emph{python\_speech\_features}%
-~\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
-All these steps combined results in thirteen tab separated features per line in
-a file for every source file. Every file is annotated using
-Praat~\cite{boersma_praat_2002} where the utterances are manually aligned to
-the audio. An example of an utterances are shown in
-Figures~\ref{fig:bloodstained,fig:abominations}. It is clearly visible that
-within the genre of death metal there are a lot of different spectral patterns
+correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
+Every file is annotated using
+Praat\cite{boersma_praat_2002} where the utterances are manually aligned to
+the audio. Examples of utterances are shown in
+Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations} where the
+waveform, the $1$--$8000$~Hz spectrogram and the annotations are shown. It is clear
+that within the genre of death metal there are different spectral patterns
  visible.
  
  \begin{figure}[ht]
@@ -167,59 +171,70 @@ visible.
                 \emph{Enthroned Abominations}}\label{fig:abominations}
  \end{figure}
  
-The data is collected from two\todo{more in the future}\ studio albums. The first
-band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
-25 years and have been creating the same type every album. The singer of
+The data is collected from three studio albums. The
+first band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for
+almost 25 years, creating the same type of music on every album. The singer of
  \emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
-comprehensible. The second band is called \emph{Disgorge} and make even more
-violent music. The growls of the lead singer sound more like a coffee grinder
-and are more shallow. The lyrics are completely incomprehensible and therefore
-some parts are not annotated with lyrics because it was too difficult to hear
-what was being sung.
-
-\section{Methods}
-\todo{To remove in final thesis}
-The initial planning is still up to date. About one and a half album has been
-annotated and a framework for setting up experiments has been created.
-Moreover, the first exploratory experiments are already been executed and
-promising. In April the experimental dataset will be expanded and I will try to
-mimic some of the experiments done in the literature to see whether it performs
-similar on Death Metal
-\begin{table}[ht]
-       \centering
-       \begin{tabular}{cll}
-               \toprule
-               Month & Description\\
-               \midrule
-               March
-                       & Preparing the data\\
-                       & Preparing an experiment platform\\
-                       & Literature research\\
-               April
-                       & Running the experiments\\
-                       & Fiddle with parameters\\
-                       & Explore the possibilities for forced alignment\\
-               May
-                       & Write up the thesis\\
-                       & Possibly do forced alignment\\
-               June
-                       & Finish up thesis\\
-                       & Wrap up\\
-               \bottomrule
-       \end{tabular}
-       \caption{Outline}
-\end{table}
+comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
+regular shouting.
+
+The second band is called \emph{Disgorge} and makes even more violent-sounding
+music. The growls of the lead singer sound like a coffee grinder and are more
+shallow. In the spectrogram it is clearly visible that overtones are
+produced during some parts of the growling. The lyrics are completely
+incomprehensible and therefore some parts were not annotated with the actual
+lyrics because it was not possible to hear what was being sung.
+
+Lastly, a band from Moscow is chosen bearing the name \emph{Who Dies in
+Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
+bands because they create \gls{dom}. \gls{dom} is characterized by a very
+slow tempo and low-tuned guitars. The vocalist has a very characteristic growl
+and performs in several Moscow-based bands. This band also stands out because it
+uses pianos and synthesizers. The droning synthesizers often operate in the
+same frequency range as the vocals.
+
+\section{\gls{MFCC} Features}
+The waveforms themselves are not very suitable to be used as features due to
+their high dimensionality and correlation. Therefore we use the commonly used
+\glspl{MFCC} feature vectors.\todo{cite which papers use this} The actual
+conversion is done using the \emph{python\_speech\_features}%
+\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
+
+\gls{MFCC} features are nature-inspired and built incrementally in several
+steps.
+\begin{enumerate}
+       \item The first step in the process is converting the time representation
+               of the signal to a spectral representation using a sliding window with
+               overlap. The width of the window and the step size are two important
+               parameters in the system. In classical phonetic analysis window sizes
+               of $25$~ms with a step of $10$~ms are often chosen because they are small
+               enough to only contain subphone entities. Sung sounds last much longer
+               than $25$~ms, so it is arguable that this window size is very small for singing.
+       \item The standard \gls{FT} gives a spectral representation that has
+               linearly scaled frequencies. This scale is converted to the \gls{MS}
+               using triangular overlapping windows.
+       \item
+\end{enumerate}
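+
+As an illustration of the first two steps, and not necessarily the exact
+variant implemented by the package, a commonly used definition of the mel
+scale maps a frequency $f$ in Hz to
+\[
+       m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),
+\]
+so that the triangular windows are spaced evenly in $m$ and not in $f$.
+Likewise, assuming a sample rate of $f_s = 16$~kHz purely for the sake of the
+example, a window of $25$~ms with a step of $10$~ms corresponds to
+$0.025 f_s = 400$ samples per window and $0.010 f_s = 160$ samples per step.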
+
  
  \todo{Explain why MFCC and which parameters}
+
+\section{\gls{ANN} Classifier}
  \todo{Spectrals might be enough, no decorrelation}
  
+\section{Model training}
+
  \section{Experiments}
  
  \section{Results}
  
  
  \chapter{Conclusion \& Discussion}
+\section{Conclusion}
  %Discussion section
+
+\section{Discussion}
+
  \todo{Novelty}
  \todo{Weaknesses}
  \todo{Dataset is not very varied but\ldots}
@@ -259,6 +274,15 @@ similar on Death Metal
                 19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
                 20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
                 21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
+               22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
+               23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
+               24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
+               25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
+               26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
+               27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
+               28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
+               \midrule
+               & & & Total: & 02:13:40\\
                 \bottomrule
         \end{tabular}
         \caption{Songs used in the experiments}