mfcc

diff --git a/asr.tex b/asr.tex

index 2c557de..3f5eaaf 100644
--- a/asr.tex
+++ b/asr.tex
@@ -1,6 +1,6 @@
  %&asr
  \usepackage[nonumberlist,acronyms]{glossaries}
-\makeglossaries%
+%\makeglossaries%
  \newacronym{ANN}{ANN}{Artificial Neural Network}
  \newacronym{HMM}{HMM}{Hidden Markov Model}
  \newacronym{GMM}{GMM}{Gaussian Mixture Models}
@@ -13,6 +13,14 @@
  \newglossaryentry{dm}{name={Death Metal},
         description={is an extreme heavy metal music style with growling vocals and
         pounding drums}}
+\newglossaryentry{dom}{name={Doom Metal},
+       description={is an extreme heavy metal music style with growling vocals and
+       pounding drums played very slowly}}
+\newglossaryentry{FT}{name={Fourier Transform},
+       description={is a technique for converting a time-domain signal to a
+       frequency-domain representation}}
+\newglossaryentry{MS}{name={Mel-Scale},
+       description={is a scale for spectral signals inspired by the human ear}}
  
  \begin{document}
  \frontmatter{}
@@ -59,7 +67,7 @@ physical sale and the remaining $16\%$ is made through performance and
  synchronisation revenues. The overtake of digital formats on physical formats
  took place somewhere in 2015. Moreover, ever since twenty years the music
  industry has seen significant growth 
-again~\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
+again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
  
  There has always been an interest in lyrics to music alignment to be used in
  for example karaoke. As early as in the late 1980s karaoke machines were
@@ -68,7 +76,7 @@ available, a alignment is not and it involves manual labour to create such an
  alignment.
  
  A lot of this musical distribution goes via non-official channels such as
-YouTube~\footnote{\url{https://youtube.com}} in which fans of the performers
+YouTube\footnote{\url{https://youtube.com}} in which fans of the performers
  often accompany the music with synchronized lyrics. This means that there is an
  enormous treasure of lyrics-annotated music available but not within our reach
  since the subtitles are almost always hardcoded into the video stream and thus
@@ -83,9 +91,9 @@ or growling. Growling is heavily used in extreme metal genres such as \gls{dm}
  but it must be noted that grunting is not a technique only used in extreme
  metal styles. Similar or equal techniques have been used in \emph{Beijing
  opera}, Japanese \emph{Noh}, but also in more Western styles like jazz singing
-by Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back
+by Louis Armstrong\cite{sakakibara_growl_2004}. It might even be traced back
  to Viking times. For example, an Arab merchant visiting a village in Denmark
-wrote in the tenth century~\cite{friis_vikings_2004}:
+wrote in the tenth century\cite{friis_vikings_2004}:
  
  \begin{displayquote}
         Never before I have heard uglier songs than those of the Vikings in
@@ -98,7 +106,7 @@ wrote in the tenth century~\cite{friis_vikings_2004}:
  %Literature overview / related work
  \section{Related work}
  The field of applying standard speech processing techniques on music started in
-the late 90s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
+the late 90s\cite{saunders_real-time_1996,scheirer_construction_1997} and it
  was found that music has different discriminating features compared to normal
  speech.
  
@@ -140,16 +148,13 @@ research question:
  To run the experiments data has been collected from several \gls{dm} albums.
  The exact data used is available in Appendix~\ref{app:data}. The albums are
  extracted from the audio CD and converted to a mono channel waveform with the
-correct samplerate \emph{SoX}~\footnote{\url{http://sox.sourceforge.net/}}.
-When the waveforms are finished they are converted to \glspl{MFCC} vectors
-using the \emph{python\_speech\_features}%
-~\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
-All these steps combined results in thirteen tab separated features per line in
-a file for every source file. Every file is annotated using
-Praat~\cite{boersma_praat_2002} where the utterances are manually aligned to
+correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
+Every file is annotated using
+Praat\cite{boersma_praat_2002} where the utterances are manually aligned to
  the audio. Examples of utterances are shown in
-Figures~\ref{fig:bloodstained,fig:abominations}. It is clearly visible that
-within the genre of death metal there are a different spectral patterns
+Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations} where the
+waveform, $1$--$8000$\,Hz spectrogram and annotations are shown. Within the
+genre of death metal, clearly different spectral patterns are
  visible.
  
  \begin{figure}[ht]
@@ -166,59 +171,70 @@ visible.
                 \emph{Enthroned Abominations}}\label{fig:abominations}
  \end{figure}
  
-The data is collected from two\todo{more in the future}\ studio albums. The first
-band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
-25 years and have been creating the same type every album. The singer of
+The data is collected from three studio albums. The
+first band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for
+almost 25 years, creating the same type of music on every album. The singer of
  \emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
-comprehensible. The second band is called \emph{Disgorge} and make even more
-violent music. The growls of the lead singer sound more like a coffee grinder
-and are more shallow. The lyrics are completely incomprehensible and therefore
-some parts are not annotated with lyrics because it was too difficult to hear
-what was being sung.
-
-\section{Methods}
-\todo{To remove in final thesis}
-The initial planning is still up to date. About one and a half album has been
-annotated and a framework for setting up experiments has been created.
-Moreover, the first exploratory experiments are already been executed and
-promising. In April the experimental dataset will be expanded and I will try to
-mimic some of the experiments done in the literature to see whether it performs
-similar on Death Metal
-\begin{table}[ht]
-       \centering
-       \begin{tabular}{cll}
-               \toprule
-               Month & Description\\
-               \midrule
-               March
-                       & Preparing the data\\
-                       & Preparing an experiment platform\\
-                       & Literature research\\
-               April
-                       & Running the experiments\\
-                       & Fiddle with parameters\\
-                       & Explore the possibilities for forced alignment\\
-               May
-                       & Write up the thesis\\
-                       & Possibly do forced alignment\\
-               June
-                       & Finish up thesis\\
-                       & Wrap up\\
-               \bottomrule
-       \end{tabular}
-       \caption{Outline}
-\end{table}
+comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
+regular shouting.
+
+The second band is called \emph{Disgorge} and makes even more violent-sounding
+music. The growls of the lead singer sound like a coffee grinder and are more
+shallow. In the spectrograms it is clearly visible that overtones are
+produced during some parts of the growling. The lyrics are completely
+incomprehensible and therefore some parts were not annotated with the actual
+lyrics because it was not possible to hear what was being sung.
+
+Lastly, a band from Moscow is chosen bearing the name \emph{Who Dies in
+Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
+bands because they create \gls{dom}. \Gls{dom} is characterized by a very
+slow tempo and low-tuned guitars. The vocalist has a very characteristic growl
+and performs in several Moscow-based bands. This band also stands out because
+it uses pianos and synthesizers. The droning synthesizers often operate in the
+same frequency range as the vocals.
+
+\section{\gls{MFCC} Features}
+The waveforms themselves are not very suitable to be used as features due to
+their high dimensionality and correlation. Therefore we use the commonly used
+\gls{MFCC} feature vectors.\todo{cite which papers use this} The actual
+conversion is done using the \emph{python\_speech\_features}%
+\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
+
+\gls{MFCC} features are inspired by nature and are built incrementally in
+several steps.
+\begin{enumerate}
+       \item The first step in the process is converting the time representation
+               of the signal to a spectral representation using a sliding window with
+               overlap. The width of the window and the step size are two important
+               parameters in the system. In classical phonetic analysis window sizes
+               of $25$\,ms with a step of $10$\,ms are often chosen because they are
+               small enough to only contain subphone entities. No sung sound is as
+               short as $25$\,ms, so this window size is arguably very small for singing.
+       \item The standard \gls{FT} gives a spectral representation that has
+               linearly scaled frequencies. This scale is converted to the \gls{MS}
+               using triangular overlapping windows.
+       \item
+\end{enumerate}
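The two steps listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the code used for the thesis (the actual conversion is done with \emph{python\_speech\_features}). The function names, the 16\,kHz sample rate and the 26 filters in the usage note are hypothetical choices, and the $2595 \log_{10}(1 + f/700)$ formula is one common variant of the mel scale.

```python
import math


def frame_starts(num_samples, sample_rate, win_s=0.025, step_s=0.010):
    """Return (window length in samples, start index of every full window)."""
    win = int(round(win_s * sample_rate))    # e.g. 25 ms window
    step = int(round(step_s * sample_rate))  # e.g. 10 ms step, so windows overlap
    if num_samples < win:
        return win, []
    return win, list(range(0, num_samples - win + 1, step))


def hz_to_mel(hz):
    """Convert a frequency in Hz to mel (assumed 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_edges(num_filters, low_hz, high_hz):
    """Edge frequencies in Hz of filters spaced equally on the mel scale.

    Filter i spans edges[i]..edges[i + 2] and peaks at edges[i + 1],
    so neighbouring triangles overlap by half their width.
    """
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (num_filters + 1)
    return [mel_to_hz(low + i * step) for i in range(num_filters + 2)]


def triangular_weight(freq_hz, left, centre, right):
    """Weight of one triangular filter at freq_hz (1 at the centre, 0 outside)."""
    if left < freq_hz <= centre:
        return (freq_hz - left) / (centre - left)
    if centre < freq_hz < right:
        return (right - freq_hz) / (right - centre)
    return 0.0
```

For example, `frame_starts(16000, 16000)` describes one second of audio at the assumed 16\,kHz rate as overlapping windows of 400 samples taken every 160 samples, and `mel_filter_edges(26, 0.0, 8000.0)` gives edge frequencies that lie close together at low frequencies and further apart at high frequencies, which is the non-linear spacing the \gls{MS} is meant to provide.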
+
  
  \todo{Explain why MFCC and which parameters}
+
+\section{\gls{ANN} Classifier}
  \todo{Spectrals might be enough, no decorrelation}
  
+\section{Model training}
+
  \section{Experiments}
  
  \section{Results}
  
  
  \chapter{Conclusion \& Discussion}
+\section{Conclusion}
  %Discussion section
+
+\section{Discussion}
+
  \todo{Novelty}
  \todo{Weaknesses}
  \todo{Dataset is not very varied but\ldots}
@@ -258,6 +274,15 @@ similar on Death Metal
                 19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
                 20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
                 21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
+               22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
+               23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
+               24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
+               25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
+               26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
+               27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
+               28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
+               \midrule
+               & & & Total: & 02:13:40\\
                 \bottomrule
         \end{tabular}
         \caption{Songs used in the experiments}