From 32d40f826760917edc89039d53223a045da68caf Mon Sep 17 00:00:00 2001 From: Mart Lubbers Date: Thu, 18 May 2017 12:18:05 +0200 Subject: [PATCH] processed some of the comments --- asr.tex | 2 ++ intro.tex | 38 ++++++++++++++++++++------------------ methods.tex | 50 ++++++++++++++++++++++++++++---------------------- 3 files changed, 50 insertions(+), 40 deletions(-) diff --git a/asr.tex b/asr.tex index 22b567e..a051094 100644 --- a/asr.tex +++ b/asr.tex @@ -39,6 +39,8 @@ course={(Automatic) Speech Recognition}, institute={Radboud University Nijmegen}, authorstext={Author:}, + righttextheader={Supervisor:}, + righttext={Louis ten Bosch}, pagenr=1] \listoftodos[Todo] diff --git a/intro.tex b/intro.tex index a1133e7..10be98f 100644 --- a/intro.tex +++ b/intro.tex @@ -23,16 +23,18 @@ not directly usable as data. Because of this interest it is very useful to device automatic techniques for segmenting instrumental and vocal parts of a song, apply forced alignment or even lyrics recognition on the audio file. -Such techniques are heavily researched and working systems have been created. -However, these techniques are designed to detect a clean singing voice and have -not been testen on so-called \emph{extended vocal techniques} such as grunting -or growling. Growling is heavily used in extreme metal genres such as \gls{dm} -but it must be noted that grunting is not a technique only used in extreme -metal styles. Similar or equal techniques have been used in \emph{Beijing -opera}, Japanese \emph{Noh} and but also more western styles like jazz singing -by Louis Armstrong\cite{sakakibara_growl_2004}. It might even be traced back -to viking times. For example, an arab merchant visiting a village in Denmark -wrote in the tenth century\cite{friis_vikings_2004}: +These techniques are heavily researched and working systems have been created +for segmenting audio and even forced alignment (e.g.\ LyricSynchronizer% +\cite{fujihara_lyricsynchronizer:_2011}). 
However, these techniques are designed +to detect a clean singing voice and have not been tested on so-called +\emph{extended vocal techniques} such as grunting or growling. Growling is +heavily used in extreme metal genres such as \gls{dm} but it must be noted that +grunting is not a technique only used in extreme metal styles. Similar or identical +techniques have been used in \emph{Beijing opera}, Japanese \emph{Noh} and +also in more western styles like jazz singing by Louis +Armstrong\cite{sakakibara_growl_2004}. It might even be traced back to Viking +times. For example, an Arab merchant visiting a village in Denmark wrote in the +tenth century\cite{friis_vikings_2004}: \begin{displayquote} Never before I have heard uglier songs than those of the Vikings in @@ -61,14 +63,14 @@ separating speech from non-speech signals such as music. The data used was already segmented. Later, Berenzweig showed singing voice segments to be more useful for artist -classification and used a \gls{MLP} using \gls{PLP} coefficients to separate -detect singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed that there -is not much difference in accuracy when using different features founded in -speech processing. They tested several features and found accuracies differ -less that a few percent. Moreover, they found that others have tried to tackle -the problem using myriads of different approaches such as using \gls{ZCR}, -\gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or \glspl{GMM} as -classifiers\cite{nwe_singing_2004}. +classification and used an \gls{ANN} (\gls{MLP}) using \gls{PLP} coefficients to +detect singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed +that there is not much difference in accuracy when using different features +rooted in speech processing. They tested several features and found that accuracies +differ by less than a few percent. 
Moreover, they found that others have tried to +tackle the problem using myriads of different approaches such as using +\gls{ZCR}, \gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or \glspl{GMM} +as classifiers\cite{nwe_singing_2004}. Fujihara et al.\ took the idea to a next level by attempting to do \gls{FA} on music. Their approach is a three step approach. First step is reducing the diff --git a/methods.tex b/methods.tex index 37ca37d..078ebbf 100644 --- a/methods.tex +++ b/methods.tex @@ -5,14 +5,13 @@ To run the experiments data has been collected from several \gls{dm} albums. The exact data used is available in Appendix~\ref{app:data}. The albums are extracted from the audio CD and converted to a mono channel waveform with the -correct samplerate \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. -Every file is annotated using -Praat\cite{boersma_praat_2002} where the utterances are manually aligned to -the audio. Examples of utterances are shown in -Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations} where the -waveform, $1-8000$Hz spectrals and annotations are shown. It is clearly visible -that within the genre of death metal there are a different spectral patterns -visible. +correct samplerate utilizing \emph{SoX}% +\footnote{\url{http://sox.sourceforge.net/}}. Every file is annotated using +Praat\cite{boersma_praat_2002} where the utterances are manually aligned to the +audio. Examples of utterances are shown in Figure~\ref{fig:bloodstained} and +Figure~\ref{fig:abominations} where the waveform, $1-8000$Hz spectrals and +annotations are shown. It is clearly visible that within the genre of death +metal there are different spectral patterns visible over time. \begin{figure}[ht] \centering @@ -28,12 +27,11 @@ visible. \emph{Enthroned Abominations}}\label{fig:abominations} \end{figure} -The data is collected from three studio albums. 
The -first band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for -almost 25 years and have been creating the same type every album. The singer of -\emph{Cannibal Corpse} has a very raspy growls and the lyrics are quite -comprehensible. The vocals produced by \emph{Cannibal Corpse} are bordering -regular shouting. +The data is collected from three studio albums. The first band is called +\emph{Cannibal Corpse} and has been producing \gls{dm} for almost 25 years and +has been creating the same type of music on every album. The singer of \emph{Cannibal +Corpse} has a very raspy growl and the lyrics are quite comprehensible. The +vocals produced by \emph{Cannibal Corpse} are bordering on regular shouting. The second band is called \emph{Disgorge} and make even more violently sounding music. The growls of the lead singer sound like a coffee grinder and are more @@ -80,8 +78,8 @@ performance\cite{you_comparative_2015}. The actual conversion is done using the \emph{python\_speech\_features}% \footnote{\url{https://github.com/jameslyons/python_speech_features}} package. -\gls{MFCC} features are nature inspired and built incrementally in several -steps. +\gls{MFCC} features are inspired by human auditory processing and +built incrementally in several steps. \begin{enumerate} \item The first step in the process is converting the time representation of the signal to a spectral representation using a sliding window with @@ -92,13 +90,21 @@ steps. impossible so it is arguable that the window size is very small. \item The standard \gls{FT} gives a spectral representation that has linearly scaled frequencies. This scale is converted to the \gls{MS} - using triangular overlapping windows. - \item The log is taken of the Mel frequencies. 
This step is inspired by the - \emph{Weber-Fechner} law that describes how humans perceive physical + using triangular overlapping windows to get a more tonotopic + representation trying to match the actual representation in the cochlea + of the human ear. + \item The \emph{Weber-Fechner} law describes how humans perceive physical magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der - Psychophysik} - \item To decorrelate the signal a \gls{DCT} is applied. The \gls{MFCC} - features are then the amplitudes of the spectrum. + Psychophysik} and states that energy is perceived in logarithmic + increments. This means that twice the amount of energy does not mean + twice the amount of perceived loudness. Therefore, in this step the log is + taken of the energy or amplitude of the \gls{MS} frequency spectrum to + more closely match human hearing. + \item The amplitudes of the spectrum are highly correlated and therefore + the last step is a decorrelation step. \Gls{DCT} is applied on the + amplitudes interpreted as a signal. \Gls{DCT} is a technique for + describing a signal as a combination of several primitive cosine + functions. \end{enumerate} \section{\gls{ANN} Classifier} -- 2.20.1
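Note on the MFCC enumeration in methods.tex: the four steps it describes (windowed spectrum, Mel-scaled triangular filters, log, DCT) can be sketched for a single frame as below. This is a minimal illustration using only the Python standard library and is not the code of the python_speech_features package that the text refers to. The function names, the Hamming window, the 2595*log10 Mel formula, and the defaults of 26 filters and 13 coefficients are assumptions chosen for the example and do not come from the patch.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert Hz to Mel (assumed variant: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Step 1: window one frame and take its power spectrum (naive DFT)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Step 2: triangular overlapping filters, evenly spaced on the Mel scale."""
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # n_filters + 2 edge frequencies, mapped to (fractional) spectrum bins.
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [e / nyquist * (n_bins - 1) for e in edges]
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_bins):
            if left < k <= centre:
                row.append((k - left) / (centre - left))
            elif centre < k < right:
                row.append((right - k) / (right - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(values):
    """Step 4: unnormalised DCT-II, used here to decorrelate the log energies."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]


def mfcc_frame(frame, sample_rate, n_filters=26, n_ceps=13):
    """Run all four steps on one frame and keep the first n_ceps coefficients."""
    spectrum = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(spectrum), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
    # Step 3: log of the filterbank energies; the floor avoids log(0).
    log_energies = [math.log(max(e, 1e-10)) for e in energies]
    return dct_ii(log_energies)[:n_ceps]
```

Calling `mfcc_frame(frame, 16000)` on a list of samples returns one coefficient vector. A full feature extractor would slide this over the signal with overlapping frames, which is the part the first enumerate item describes and which is left out here for brevity.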