devise automatic techniques for segmenting instrumental and vocal parts of a
song, apply forced alignment or even lyrics recognition to the audio file.
-Such techniques are heavily researched and working systems have been created.
-However, these techniques are designed to detect a clean singing voice and have
-not been testen on so-called \emph{extended vocal techniques} such as grunting
-or growling. Growling is heavily used in extreme metal genres such as \gls{dm}
-but it must be noted that grunting is not a technique only used in extreme
-metal styles. Similar or equal techniques have been used in \emph{Beijing
-opera}, Japanese \emph{Noh} and but also more western styles like jazz singing
-by Louis Armstrong\cite{sakakibara_growl_2004}. It might even be traced back
-to viking times. For example, an arab merchant visiting a village in Denmark
-wrote in the tenth century\cite{friis_vikings_2004}:
+These techniques are heavily researched, and working systems have been created
+for segmenting audio and even for forced alignment (e.g.\ LyricSynchronizer%
+\cite{fujihara_lyricsynchronizer:_2011}). However, these techniques are
+designed to detect a clean singing voice and have not been tested on so-called
+\emph{extended vocal techniques} such as grunting or growling. Growling is
+heavily used in extreme metal genres such as \gls{dm}, but it must be noted
+that grunting is not a technique used only in extreme metal styles. Similar or
+identical techniques have been used in \emph{Beijing opera} and Japanese
+\emph{Noh}, but also in more western styles such as the jazz singing of Louis
+Armstrong\cite{sakakibara_growl_2004}. It might even be traced back to Viking
+times. For example, an Arab merchant visiting a village in Denmark wrote in
+the tenth century\cite{friis_vikings_2004}:
\begin{displayquote}
Never before have I heard uglier songs than those of the Vikings in
already segmented.
Later, Berenzweig showed singing voice segments to be more useful for artist
-classification and used a \gls{MLP} using \gls{PLP} coefficients to separate
-detect singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed that there
-is not much difference in accuracy when using different features founded in
-speech processing. They tested several features and found accuracies differ
-less that a few percent. Moreover, they found that others have tried to tackle
-the problem using myriads of different approaches such as using \gls{ZCR},
-\gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or \glspl{GMM} as
-classifiers\cite{nwe_singing_2004}.
+classification and used an \gls{ANN} (\gls{MLP}) on \gls{PLP} coefficients to
+detect singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed that
+there is not much difference in accuracy between the various features rooted
+in speech processing. They tested several features and found that the
+accuracies differ by less than a few percent. Moreover, they found that
+others have tried to tackle the problem using a myriad of approaches, such as
+using \gls{ZCR}, \gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or
+\glspl{GMM} as classifiers\cite{nwe_singing_2004}.
Fujihara et al.\ took the idea to the next level by attempting to perform
\gls{FA} on music. Their approach consists of three steps. The first step is
reducing the
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is available in Appendix~\ref{app:data}. The albums are
extracted from the audio CDs and converted to a mono-channel waveform with the
-correct samplerate \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
-Every file is annotated using
-Praat\cite{boersma_praat_2002} where the utterances are manually aligned to
-the audio. Examples of utterances are shown in
-Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations} where the
-waveform, $1-8000$Hz spectrals and annotations are shown. It is clearly visible
-that within the genre of death metal there are a different spectral patterns
-visible.
+correct sample rate using \emph{SoX}%
+\footnote{\url{http://sox.sourceforge.net/}}. Every file is annotated using
+Praat\cite{boersma_praat_2002}, where the utterances are manually aligned to
+the audio. Examples of utterances are shown in Figure~\ref{fig:bloodstained}
+and Figure~\ref{fig:abominations}, where the waveform, $1-8000$Hz spectrograms
+and annotations are shown. It is clearly visible that, within the genre of
+\gls{dm}, different spectral patterns are present over time.
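+The conversion step can be scripted. The following is a minimal sketch, not
+the exact command used in the experiments: the helper name and the 16\,kHz
+target rate are assumptions, as the text only specifies ``the correct sample
+rate''.
+\begin{verbatim}
+# Minimal sketch: downmix a track to a mono waveform with SoX.
+# to_mono_wav and the 16 kHz rate are illustrative assumptions.
+import subprocess
+
+def to_mono_wav(src, dst, rate=16000):
+    # -c 1 downmixes to a single channel; -r resamples to the given rate.
+    subprocess.run(['sox', src, '-c', '1', '-r', str(rate), dst], check=True)
+\end{verbatim}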
\begin{figure}[ht]
\centering
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}
-The data is collected from three studio albums. The
-first band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for
-almost 25 years and have been creating the same type every album. The singer of
-\emph{Cannibal Corpse} has a very raspy growls and the lyrics are quite
-comprehensible. The vocals produced by \emph{Cannibal Corpse} are bordering
-regular shouting.
+The data is collected from three studio albums. The first band is called
+\emph{Cannibal Corpse} and has been producing \gls{dm} for almost 25 years,
+creating the same type of music on every album. The singer of \emph{Cannibal
+Corpse} has a very raspy growl and the lyrics are quite comprehensible. The
+vocals produced by \emph{Cannibal Corpse} border on regular shouting.
The second band is called \emph{Disgorge} and makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are more
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
-\gls{MFCC} features are nature inspired and built incrementally in several
-steps.
+\gls{MFCC} features are inspired by human auditory processing and are built
+incrementally in several steps; a short sketch following the list below
+illustrates them.
\begin{enumerate}
\item The first step in the process is converting the time representation
of the signal to a spectral representation using a sliding window with
    impossible, so it is arguable that the window size is very small.
\item The standard \gls{FT} gives a spectral representation that has
linearly scaled frequencies. This scale is converted to the \gls{MS}
- using triangular overlapping windows.
- \item The log is taken of the Mel frequencies. This step is inspired by the
- \emph{Weber-Fechner} law that describes how humans perceive physical
+    using triangular overlapping windows to obtain a more tonotopic
+    representation that tries to match the actual representation in the
+    cochlea of the human ear.
+    \item The \emph{Weber-Fechner} law describes how humans perceive physical
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
- Psychophysik}
- \item To decorrelate the signal a \gls{DCT} is applied. The \gls{MFCC}
- features are then the amplitudes of the spectrum.
+    Psychophysik}: it was found that intensity is perceived in logarithmic
+    increments. This means that twice the signal energy does not correspond to
+    twice the perceived loudness. Therefore, in this step the logarithm is
+    taken of the energy or amplitude of the \gls{MS} frequency spectrum to
+    more closely match human hearing.
+    \item The amplitudes of the spectrum are highly correlated, and therefore
+    the last step is a decorrelation step. The \gls{DCT} is applied to the
+    amplitudes, interpreted as a signal; it describes the signal as a
+    combination of several primitive cosine functions. The resulting
+    coefficients are the \gls{MFCC} features.
\end{enumerate}
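+To make the steps above concrete, the following is a minimal sketch of the
+pipeline in Python; it is not the implementation used in the experiments, and
+the window length, hop size, number of filters and transform size are
+assumptions.
+\begin{verbatim}
+# Minimal sketch of the MFCC pipeline; parameter values are assumptions.
+import numpy as np
+from scipy.fftpack import dct
+
+def hz_to_mel(hz):
+    return 2595.0 * np.log10(1.0 + hz / 700.0)
+
+def mel_to_hz(mel):
+    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
+
+def mfcc_sketch(signal, rate, winlen=400, hop=160, nfilt=26, numcep=13,
+                nfft=512):
+    # Step 1: sliding window over the signal, FFT per frame -> power spectrum.
+    frames = np.array([signal[i:i + winlen] * np.hamming(winlen)
+                       for i in range(0, len(signal) - winlen, hop)])
+    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
+
+    # Step 2: triangular overlapping filters, evenly spaced on the Mel scale.
+    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(rate / 2.0), nfilt + 2)
+    bins = np.floor((nfft + 1) * mel_to_hz(mels) / rate).astype(int)
+    fbank = np.zeros((nfilt, nfft // 2 + 1))
+    for m in range(1, nfilt + 1):
+        fbank[m - 1, bins[m - 1]:bins[m]] = np.linspace(
+            0.0, 1.0, bins[m] - bins[m - 1], endpoint=False)
+        fbank[m - 1, bins[m]:bins[m + 1]] = np.linspace(
+            1.0, 0.0, bins[m + 1] - bins[m], endpoint=False)
+    energies = power @ fbank.T
+
+    # Step 3: logarithm of the filter energies (Weber-Fechner).
+    log_energies = np.log(np.maximum(energies, 1e-10))
+
+    # Step 4: the DCT decorrelates the log energies; the first coefficients
+    # are kept as the MFCC features.
+    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :numcep]
+\end{verbatim}
+In practice the \emph{python\_speech\_features} package performs all of these
+steps in a single call, e.g.\ \texttt{mfcc(signal, samplerate)}.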
\section{\gls{ANN} Classifier}