process comments for introduction

author Mart Lubbers <mart@martlubbers.net>

Wed, 7 Jun 2017 11:06:00 +0000 (13:06 +0200)

committer Mart Lubbers <mart@martlubbers.net>

Wed, 7 Jun 2017 11:06:00 +0000 (13:06 +0200)
diff --git a/acronyms.tex b/acronyms.tex
index d26f028..b65e5c8 100644
--- a/acronyms.tex
+++ b/acronyms.tex
@@ -6,10 +6,10 @@
  \newacronym{HMM}{HMM}{Hidden Markov Model}
  \newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
  \newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
-\newacronym{LPCC}{LPCC}{\acrlong{LPC} derived cepstrum}
+\newacronym{LPCC}{LPCC}{\acrlong{LPC} Derived Cepstrum}
  \newacronym{LPC}{LPC}{Linear Prediction Coefficients}
-\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
-\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
+\newacronym{MFCC}{MFCC}{\acrlong{MFC} Coefficient}
+\newacronym{MFC}{MFC}{Mel-frequency Cepstrum}
  \newacronym{MLP}{MLP}{Multi-layer Perceptron}
  \newacronym{PLP}{PLP}{Perceptual Linear Prediction}
  \newacronym{PPF}{PPF}{Posterior Probability Features}
diff --git a/glossaries.tex b/glossaries.tex
index b00ee19..38e5895 100644
--- a/glossaries.tex
+++ b/glossaries.tex
@@ -8,7 +8,8 @@
         description={is a technique of converting a time-domain signal to a
         frequency-domain representation}}
  \newglossaryentry{MS}{name={Mel-Scale},
-       description={is a human ear inspired scale for spectral signals}}
+       description={is a method of warping the spectral representation to more
+       closely match the human ear}}
  \newglossaryentry{Viterbi}{name={Viterbi},
         description={is a dynamic programming algorithm for finding the most likely
         sequence of hidden states in a \gls{HMM}}}
diff --git a/intro.tex b/intro.tex
index c649ba4..b409685 100644
--- a/intro.tex
+++ b/intro.tex
@@ -1,30 +1,31 @@
  \section{Introduction}
  The primary medium for music distribution is rapidly changing from physical
-media to digital media. The \gls{IFPI} stated that about $43\%$ of music
-revenue arises from digital distribution. Another $39\%$ arises from the
+media to digital media. In 2016, the \gls{IFPI} stated that about $43\%$ of
+music revenue arises from digital distribution. Another $39\%$ arises from the
  physical sales, and the remaining $16\%$ is made through performance and
  synchronisation revenues. Digital formats overtook physical formats
  somewhere in 2015. Moreover, for the first time in twenty years, the music
-industry has seen significant growth 
+industry has seen significant growth
  again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
  
  There has always been interest in lyrics-to-music alignment for use in
-for example karaoke. As early as in the late 1980s karaoke machines were
-available for consumers. While the lyrics for the track are almost always
-available, an alignment is not and it involves manual labour to create such an
-alignment.
+for example karaoke. As early as the late 1980s, karaoke machines became
+available to consumers. Lyrics are readily available for almost all tracks.
+However, a temporal alignment of the lyrics is not, and creating one
+involves manual labour.
  
-A lot of this musical distribution goes via non-official channels such as
+Much of present-day music distribution goes via non-official channels such as
  YouTube\footnote{\url{https://youtube.com}} in which fans of the performers
  often accompany the music with synchronized lyrics. This means that there is an
-enormous treasure of lyrics-annotated music available but not within our reach
-since the subtitles are almost always hardcoded into the video stream and thus
-not directly usable as data. Because of this interest it is very useful to
-device automatic techniques for segmenting instrumental and vocal parts of a
-song, apply forced alignment or even lyrics recognition on the audio file.
+enormous treasure of lyrics-annotated music available. However, the data is not
+within our reach, since the subtitles are almost always hardcoded into the video
+stream and thus not directly accessible as data. This motivates the creation of
+automatic techniques for segmenting a song into instrumental and vocal parts,
+applying forced temporal alignment, or possibly even applying lyrics recognition
+to audio data.
  
  These techniques are heavily researched and working systems have been created
-for segmenting audio and even forced alignment (e.g.\ LyricSynchronizer%
+for segmenting audio and even forced alignment (e.g.\ LyricSynchronizer~%
  \cite{fujihara_lyricsynchronizer:_2011}). However, these techniques are designed
  to detect a clean singing voice and have not been tested on so-called
  \emph{extended vocal techniques} such as grunting or growling. Growling is
@@ -32,9 +33,9 @@ heavily used in extreme metal genres such as \gls{dm} but it must be noted that
  grunting is not a technique used only in extreme metal styles. Similar or equal
  techniques have been used in \emph{Beijing opera} and Japanese \emph{Noh}, but
  also in more Western styles like jazz singing by Louis
-Armstrong\cite{sakakibara_growl_2004}. It might even be traced back to viking
+Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back to Viking
  times. For example, an Arab merchant visiting a village in Denmark wrote in the
-tenth century\cite{friis_vikings_2004}:
+tenth century~\cite{friis_vikings_2004}:
  
  \begin{displayquote}
         Never before I have heard uglier songs than those of the Vikings in
@@ -50,12 +51,12 @@ classify audio in the categories \emph{Music} and \emph{Speech}. They was found
  that music has different properties than speech. Music has more bandwidth,
  tonality and regularity. Multivariate Gaussian classifiers were used to
  discriminate the categories with an average performance of $90\%%
-$\cite{saunders_real-time_1996}.
+$~\cite{saunders_real-time_1996}.
  
  Williams and Ellis were inspired by the aforementioned research and tried to
  separate the singing segments from the instrumental
-segments\cite{williams_speech/music_1999}. This was later verified by
-Berenzweig and Ellis\cite{berenzweig_locating_2001}. The latter became the de
+segments~\cite{williams_speech/music_1999}. This was later verified by
+Berenzweig and Ellis~\cite{berenzweig_locating_2001}. The latter became the de
  facto reference on singing voice detection. Both show that features derived
  from \gls{PPF} such as energy and distribution are highly effective in
  separating speech from non-speech signals such as music. The data used was
@@ -63,43 +64,43 @@ already segmented.
  
  Later, Berenzweig showed singing voice segments to be more useful for artist
  classification and used an \gls{ANN} (\gls{MLP}) using \gls{PLP} coefficients
-to detect a singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed that
+to detect a singing voice~\cite{berenzweig_using_2002}. Nwe et al.\ showed that
  there is not much difference in accuracy when using different features rooted
  in speech processing. They tested several features and found accuracies differ
  by less than a few percent. Moreover, they found that others have tried to tackle
  the problem using a myriad of approaches, such as using \gls{ZCR},
  \gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or \glspl{GMM} as
-classifiers\cite{nwe_singing_2004}.
+classifiers~\cite{nwe_singing_2004}.
  
  Fujihara et al.\ took the idea to the next level by attempting to do \gls{FA} on
  music. Their approach consists of three steps. The first step is reducing the
  accompaniment levels; secondly, the vocal segments are separated from the
  non-vocal segments using a simple two-state \gls{HMM}. The chain is concluded
  by applying \gls{Viterbi} alignment on the segregated signals with the lyrics.
-The system showed accuracy levels of $90\%$ on Japanese music%
-\cite{fujihara_automatic_2006}. Later they improved hereupon%
+The system showed accuracy levels of $90\%$ on Japanese music~%
+\cite{fujihara_automatic_2006}. Later they improved upon this~%
  \cite{fujihara_three_2008} and even made a ready-to-use karaoke application
-that can do the this online\cite{fujihara_lyricsynchronizer:_2011}.
+that can do this online~\cite{fujihara_lyricsynchronizer:_2011}.
  
  Singing voice detection can also be seen as a binary genre recognition problem.
  Therefore, the techniques used in that field might be of use. Genre recognition
  has a long history that can be found in the survey by
-Sturm\cite{sturm_survey_2012}. It must be noted that of all the $485$ papers
+Sturm~\cite{sturm_survey_2012}. It must be noted that of all the $485$ papers
  cited by Sturm, only one, a master's thesis, applies genre recognition to heavy
-metal genres\cite{tsatsishvili_automatic_2011}.
+metal genres~\cite{tsatsishvili_automatic_2011}.
  
  Singing voice detection has been tried on less conventional styles in the past.
  Dzhambazov et al.\ proposed to align long syllables in Beijing Opera to the
-audio\cite{dzhambazov_automatic_2016}. Beijing Opera sometimes contains
+audio~\cite{dzhambazov_automatic_2016}. Beijing Opera sometimes contains
  growling-like vocals. Dzhambazov also tried aligning lyrics to audio in
-classical Turkish music\cite{dzhambazov_automatic_2014}.
+classical Turkish music~\cite{dzhambazov_automatic_2014}.
  
  \section{Research question}
  It is debatable whether the aforementioned techniques work, because the
  spectral properties of a growling voice are different from the spectral
  properties of a clean singing voice. It has been found that growling voices
  have less prominent peaks in the frequency representation and are closer to
-noise than clean singing\cite{kato_acoustic_2013}. This leads us to the
+noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
  research question:
  
  \begin{center}\em%