From: Mart Lubbers
Date: Wed, 7 Jun 2017 11:18:19 +0000 (+0200)
Subject: process comments for chapter 1 completely
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=df27f7c8cc1ea29b04747ee7bc2a87c367cd3940;p=asr1617.git

process comments for chapter 1 completely
---

diff --git a/acronyms.tex b/acronyms.tex
index b65e5c8..03f74aa 100644
--- a/acronyms.tex
+++ b/acronyms.tex
@@ -1,7 +1,7 @@
 \newacronym{ANN}{ANN}{Artificial Neural Network}
 \newacronym{DCT}{DCT}{Discrete Cosine Transform}
 \newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
-\newacronym{FA}{FA}{Forced alignment}
+\newacronym{FA}{FA}{Forced Alignment}
 \newacronym{GMM}{GMM}{Gaussian Mixture Models}
 \newacronym{HMM}{HMM}{Hidden Markov Model}
 \newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
diff --git a/intro.tex b/intro.tex
index b409685..eb482f1 100644
--- a/intro.tex
+++ b/intro.tex
@@ -14,28 +14,28 @@ available for consumers. Lyrics for tracks are in almost all cases amply
 available. However, a temporal alignment of the lyrics is not and creating it
 involves manual labour.
 
-A lot of the current day musical distribution goes via non-official channels such as
-YouTube\footnote{\url{https://youtube.com}} in which fans of the performers
-often accompany the music with synchronized lyrics. This means that there is an
-enormous treasure of lyrics-annotated music available. However, the data is not
-within our reach since the subtitles are almost always hardcoded into the video
-stream and thus not directly accessible as data. It sparks the ideas for
-creating automatic techniques for segmenting instrumental and vocal parts of a
-song, apply forced temporal alignment or possible even apply lyrics recognition
-audio data.
+A lot of present-day musical distribution goes via non-official channels
+such as YouTube\footnote{\url{https://youtube.com}}, on which fans of the
+performers often accompany the music with synchronized lyrics. This means that
+there is an enormous treasure of lyrics-annotated music available. However, the
+data is not within our reach since the subtitles are almost always hardcoded
+into the video stream and thus not directly accessible as data. This sparks
+the idea of creating automatic techniques for segmenting the instrumental and
+vocal parts of a song, applying forced temporal alignment, or possibly even
+applying lyrics recognition to the audio data.
 
 These techniques are heavily researched and working systems have been created
-for segmenting audio and even forced alignment (e.g.\ LyricSynchronizer~%
-\cite{fujihara_lyricsynchronizer:_2011}). However, these techniques are designed
-to detect a clean singing voice and have not been tested on so-called
-\emph{extended vocal techniques} such as grunting or growling. Growling is
-heavily used in extreme metal genres such as \gls{dm} but it must be noted that
-grunting is not a technique only used in extreme metal styles. Similar or equal
-techniques have been used in \emph{Beijing opera}, Japanese \emph{Noh} and but
-also more western styles like jazz singing by Louis
-Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back to viking
-times. For example, an arab merchant visiting a village in Denmark wrote in the
-tenth century~\cite{friis_vikings_2004}:
+for segmenting audio and even forced temporal alignment (e.g.\
+LyricSynchronizer~\cite{fujihara_lyricsynchronizer:_2011}). However, these
+techniques are designed to detect a clean singing voice and have not been
+tested on so-called \emph{extended vocal techniques} such as grunting or
+growling. Growling is heavily used in extreme metal genres such as \gls{dm},
+but it must be noted that grunting is not a technique used only in extreme
+metal styles. Similar or equal techniques have been used in \emph{Beijing
+opera}, Japanese \emph{Noh}, but also in more Western styles such as the jazz
+singing of Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be
+traced back to Viking times. For example, an Arab merchant visiting a village
+in Denmark wrote in the tenth century~\cite{friis_vikings_2004}:
 
 \begin{displayquote}
 	Never before I have heard uglier songs than those of the Vikings in
@@ -47,20 +47,21 @@ tenth century~\cite{friis_vikings_2004}:
 \section{Related work}
 Applying speech related processing and classification techniques on music
 already started in the late 90s. Saunders et al.\ devised a technique to
-classify audio in the categories \emph{Music} and \emph{Speech}. They was found
-that music has different properties than speech. Music has more bandwidth,
-tonality and regularity. Multivariate Gaussian classifiers were used to
-discriminate the categories with an average performance of $90\%%
-$~\cite{saunders_real-time_1996}.
+classify audio in the categories \emph{Music} and \emph{Speech}. They found
+that music has different properties than speech: it occupies a wider spectral
+bandwidth and contains more tonality and rhythm. Multivariate Gaussian
+classifiers were used to discriminate the categories with an average
+performance of $90\%$~\cite{saunders_real-time_1996}.
 
 Williams and Ellis were inspired by the aforementioned research and tried to
-separate the singing segments from the instrumental
-segments~\cite{williams_speech/music_1999}. This was later verified by
+separate the singing segments from the instrumental segments~%
+\cite{williams_speech/music_1999}. Their results were later verified by
 Berenzweig and Ellis~\cite{berenzweig_locating_2001}. The latter became the de
 facto literature on singing voice detection. Both show that features derived
-from \gls{PPF} such as energy and distribution are highly effective in
-separating speech from non-speech signals such as music. The data used was
-already segmented.
+from \gls{PPF} such as energy are highly effective in separating speech from
+non-speech signals such as music. The data used in the experiments was
+segmented into fragments that contained data from only one class. The
+classifier determined the class per sample.
 
 Later, Berenzweig showed singing voice segments to be more useful for artist
 classification and used an \gls{ANN} (\gls{MLP}) using \gls{PLP} coefficients
@@ -80,7 +81,8 @@ by applying \gls{Viterbi} alignment on the segregated signals with the lyrics.
 The system showed accuracy levels of $90\%$ on Japanese music~%
 \cite{fujihara_automatic_2006}. Later they improved hereupon~%
 \cite{fujihara_three_2008} and even made a ready to use karaoke application
-that can do the this online~\cite{fujihara_lyricsynchronizer:_2011}.
+that can perform the temporal lyrics alignment online~%
+\cite{fujihara_lyricsynchronizer:_2011}.
 
 Singing voice detection can also be seen as a binary genre recognition
 problem. Therefore the techniques used in that field might be of use. Genre recognition
@@ -98,12 +100,12 @@ classical Turkish music~\cite{dzhambazov_automatic_2014}.
 \section{Research question}
 It is debatable whether the aforementioned techniques work because the
 spectral properties of a growling voice is different from the spectral
-properties of a clean singing voice. It has been found that growling voices
-have less prominent peaks in the frequency representation and are closer to
-noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
+properties of a clean singing voice. It has been found that growling-like
+vocals have less prominent peaks in the frequency representation and are closer
+to noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
 research question:
 \begin{center}\em%
-	Are standard \gls{ANN} based techniques for singing voice detection
-	suitable for non-standard musical genres like \gls{dm} and \gls{dom}?
+	Are standard techniques for singing voice detection suitable for
+	non-standard musical genres containing extreme vocal styles?
 \end{center}