From: Mart Lubbers
Date: Wed, 7 Jun 2017 11:18:19 +0000 (+0200)
Subject: process comments for chapter 1 completely
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=df27f7c8cc1ea29b04747ee7bc2a87c367cd3940;p=asr1617.git

process comments for chapter 1 completely
---

diff --git a/acronyms.tex b/acronyms.tex
index b65e5c8..03f74aa 100644
--- a/acronyms.tex
+++ b/acronyms.tex
@@ -1,7 +1,7 @@
 \newacronym{ANN}{ANN}{Artificial Neural Network}
 \newacronym{DCT}{DCT}{Discrete Cosine Transform}
 \newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
-\newacronym{FA}{FA}{Forced alignment}
+\newacronym{FA}{FA}{Forced Alignment}
 \newacronym{GMM}{GMM}{Gaussian Mixture Models}
 \newacronym{HMM}{HMM}{Hidden Markov Model}
 \newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
diff --git a/intro.tex b/intro.tex
index b409685..eb482f1 100644
--- a/intro.tex
+++ b/intro.tex
@@ -14,28 +14,28 @@ available for consumers. Lyrics for tracks are in almost all cases amply
 available. However, a temporal alignment of the lyrics is not and creating it
 involves manual labour.
 
-A lot of the current day musical distribution goes via non-official channels such as
-YouTube\footnote{\url{https://youtube.com}} in which fans of the performers
-often accompany the music with synchronized lyrics. This means that there is an
-enormous treasure of lyrics-annotated music available. However, the data is not
-within our reach since the subtitles are almost always hardcoded into the video
-stream and thus not directly accessible as data. It sparks the ideas for
-creating automatic techniques for segmenting instrumental and vocal parts of a
-song, apply forced temporal alignment or possible even apply lyrics recognition
-audio data.
+A lot of present-day musical distribution goes via non-official channels
+such as YouTube\footnote{\url{https://youtube.com}}, on which fans of the
+performers often accompany the music with synchronized lyrics. This means that
+there is an enormous treasure of lyrics-annotated music available. However, the
+data is not within our reach since the subtitles are almost always hardcoded
+into the video stream and thus not directly accessible as data. This sparks
+the idea of creating automatic techniques for segmenting the instrumental and
+vocal parts of a song, applying forced temporal alignment, or possibly even
+applying lyrics recognition to the audio data.
 
 These techniques are heavily researched and working systems have been created
-for segmenting audio and even forced alignment (e.g.\ LyricSynchronizer~%
-\cite{fujihara_lyricsynchronizer:_2011}). However, these techniques are designed
-to detect a clean singing voice and have not been tested on so-called
-\emph{extended vocal techniques} such as grunting or growling. Growling is
-heavily used in extreme metal genres such as \gls{dm} but it must be noted that
-grunting is not a technique only used in extreme metal styles. Similar or equal
-techniques have been used in \emph{Beijing opera}, Japanese \emph{Noh} and but
-also more western styles like jazz singing by Louis
-Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back to viking
-times. For example, an arab merchant visiting a village in Denmark wrote in the
-tenth century~\cite{friis_vikings_2004}:
+for segmenting audio and even forced temporal alignment (e.g.\
+LyricSynchronizer~\cite{fujihara_lyricsynchronizer:_2011}). However, these
+techniques are designed to detect a clean singing voice and have not been
+tested on so-called \emph{extended vocal techniques} such as grunting or
+growling. Growling is heavily used in extreme metal genres such as \gls{dm},
+but it must be noted that grunting is not a technique used only in extreme
+metal styles. Similar or equal techniques have been used in \emph{Beijing
+opera}, Japanese \emph{Noh}, but also in more Western styles such as the jazz
+singing of Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be
+traced back to Viking times. For example, an Arab merchant visiting a village
+in Denmark wrote in the tenth century~\cite{friis_vikings_2004}:
 
 \begin{displayquote}
 	Never before I have heard uglier songs than those of the Vikings in
@@ -47,20 +47,21 @@ tenth century~\cite{friis_vikings_2004}:
 \section{Related work}
 Applying speech related processing and classification techniques on music
 already started in the late 90s. Saunders et al.\ devised a technique to
-classify audio in the categories \emph{Music} and \emph{Speech}. They was found
-that music has different properties than speech. Music has more bandwidth,
-tonality and regularity. Multivariate Gaussian classifiers were used to
-discriminate the categories with an average performance of $90\%%
-$~\cite{saunders_real-time_1996}.
+classify audio in the categories \emph{Music} and \emph{Speech}. They found
+that music has different properties than speech: it occupies a wider spectral
+bandwidth and contains more tonality and rhythm. Multivariate Gaussian
+classifiers were used to discriminate the categories with an average
+performance of $90\%$~\cite{saunders_real-time_1996}.
 
 Williams and Ellis were inspired by the aforementioned research and tried to
-separate the singing segments from the instrumental
-segments~\cite{williams_speech/music_1999}. This was later verified by
+separate the singing segments from the instrumental segments~%
+\cite{williams_speech/music_1999}. Their results were later verified by
 Berenzweig and Ellis~\cite{berenzweig_locating_2001}. The latter became the de
 facto literature on singing voice detection. Both show that features derived
-from \gls{PPF} such as energy and distribution are highly effective in
-separating speech from non-speech signals such as music. The data used was
-already segmented.
+from \gls{PPF} such as energy are highly effective in separating speech from
+non-speech signals such as music. The data used in the experiments was
+segmented into fragments that contained data from only one class. The
+classifier determined the class per sample.
 
 Later, Berenzweig showed singing voice segments to be more useful for artist
 classification and used an \gls{ANN} (\gls{MLP}) using \gls{PLP} coefficients
@@ -80,7 +81,8 @@ by applying \gls{Viterbi} alignment on the segregated signals with the lyrics.
 The system showed accuracy levels of $90\%$ on Japanese music~%
 \cite{fujihara_automatic_2006}. Later they improved hereupon~%
 \cite{fujihara_three_2008} and even made a ready to use karaoke application
-that can do the this online~\cite{fujihara_lyricsynchronizer:_2011}.
+that can perform the temporal lyrics alignment online~%
+\cite{fujihara_lyricsynchronizer:_2011}.
 
 Singing voice detection can also be seen as a binary genre recognition
 problem. Therefore the techniques used in that field might be of use. Genre recognition
@@ -98,12 +100,12 @@ classical Turkish music~\cite{dzhambazov_automatic_2014}.
 \section{Research question}
 It is debatable whether the aforementioned techniques work because the
 spectral properties of a growling voice is different from the spectral
-properties of a clean singing voice. It has been found that growling voices
-have less prominent peaks in the frequency representation and are closer to
-noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
+properties of a clean singing voice. It has been found that growling-like
+vocals have less prominent peaks in the frequency representation and are closer
+to noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
 research question:
 \begin{center}\em%
-	Are standard \gls{ANN} based techniques for singing voice detection
-	suitable for non-standard musical genres like \gls{dm} and \gls{dom}?
+	Are standard techniques for singing voice detection suitable for
+	non-standard musical genres containing extreme vocal styles?
 \end{center}