available. However, a temporal alignment of the lyrics is not and creating it
involves manual labour.
-A lot of the current day musical distribution goes via non-official channels such as
-YouTube\footnote{\url{https://youtube.com}} in which fans of the performers
-often accompany the music with synchronized lyrics. This means that there is an
-enormous treasure of lyrics-annotated music available. However, the data is not
-within our reach since the subtitles are almost always hardcoded into the video
-stream and thus not directly accessible as data. It sparks the ideas for
-creating automatic techniques for segmenting instrumental and vocal parts of a
-song, apply forced temporal alignment or possible even apply lyrics recognition
-audio data.
+A lot of the current day musical distribution goes via non-official channels
+such as YouTube\footnote{\url{https://youtube.com}} in which fans of the
+performers often accompany the music with synchronized lyrics. This means that
+there is an enormous treasure of lyrics-annotated music available. However, the
+data is not within our reach since the subtitles are almost always hardcoded
+into the video stream and thus not directly accessible as data. This sparks
+the idea of creating automatic techniques for segmenting instrumental and
+vocal parts of a song, applying forced temporal alignment or possibly even
+applying lyrics recognition to audio data.
These techniques are heavily researched and working systems have been created
-for segmenting audio and even forced alignment (e.g.\ LyricSynchronizer~%
-\cite{fujihara_lyricsynchronizer:_2011}). However, these techniques are designed
-to detect a clean singing voice and have not been tested on so-called
-\emph{extended vocal techniques} such as grunting or growling. Growling is
-heavily used in extreme metal genres such as \gls{dm} but it must be noted that
-grunting is not a technique only used in extreme metal styles. Similar or equal
-techniques have been used in \emph{Beijing opera}, Japanese \emph{Noh} and but
-also more western styles like jazz singing by Louis
-Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back to viking
-times. For example, an arab merchant visiting a village in Denmark wrote in the
-tenth century~\cite{friis_vikings_2004}:
+for segmenting audio and even forced temporal alignment (e.g.\
+LyricSynchronizer~\cite{fujihara_lyricsynchronizer:_2011}). However, these
+techniques are designed to detect a clean singing voice and have not been
+tested on so-called \emph{extended vocal techniques} such as grunting or
+growling. Growling is heavily used in extreme metal genres such as \gls{dm} but
+it must be noted that grunting is not a technique only used in extreme metal
+styles. Similar or equal techniques have been used in \emph{Beijing opera},
+Japanese \emph{Noh} and also in more Western styles like jazz singing by Louis
+Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back to
+Viking times. For example, an Arab merchant visiting a village in Denmark wrote
+in the tenth century~\cite{friis_vikings_2004}:
\begin{displayquote}
Never before I have heard uglier songs than those of the Vikings in
\section{Related work}
Applying speech related processing and classification techniques on music
already started in the late 90s. Saunders et al.\ devised a technique to
-classify audio in the categories \emph{Music} and \emph{Speech}. They was found
-that music has different properties than speech. Music has more bandwidth,
-tonality and regularity. Multivariate Gaussian classifiers were used to
-discriminate the categories with an average performance of $90\%%
-$~\cite{saunders_real-time_1996}.
+classify audio into the categories \emph{Music} and \emph{Speech}. They found
+that music has different properties than speech. Music uses a wider spectral
+bandwidth in which events happen. Music contains more tonality and rhythm.
+Multivariate Gaussian classifiers were used to discriminate the categories with
+an average performance of $90\%$~\cite{saunders_real-time_1996}.
Williams and Ellis were inspired by the aforementioned research and tried to
-separate the singing segments from the instrumental
-segments~\cite{williams_speech/music_1999}. This was later verified by
+separate the singing segments from the instrumental segments~%
+\cite{williams_speech/music_1999}. Their results were later verified by
Berenzweig and Ellis~\cite{berenzweig_locating_2001}. The latter became the de
facto literature on singing voice detection. Both show that features derived
-from \gls{PPF} such as energy and distribution are highly effective in
-separating speech from non-speech signals such as music. The data used was
-already segmented.
+from \gls{PPF} such as energy are highly effective in separating speech from
+non-speech signals such as music. The data used in the experiments was
+split into segments that only contained data from one class. The
+classifier determined the class per sample.
Later, Berenzweig showed singing voice segments to be more useful for artist
classification and used an \gls{ANN} (\gls{MLP}) using \gls{PLP} coefficients
The system showed accuracy levels of $90\%$ on Japanese music~%
\cite{fujihara_automatic_2006}. Later they improved hereupon~%
\cite{fujihara_three_2008} and even made a ready to use karaoke application
-that can do the this online~\cite{fujihara_lyricsynchronizer:_2011}.
+that can do the temporal lyrics alignment online~%
+\cite{fujihara_lyricsynchronizer:_2011}.
Singing voice detection can also be seen as a binary genre recognition problem.
Therefore the techniques used in that field might be of use. Genre recognition
\section{Research question}
It is debatable whether the aforementioned techniques work because the
spectral properties of a growling voice are different from the spectral
-properties of a clean singing voice. It has been found that growling voices
-have less prominent peaks in the frequency representation and are closer to
-noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
+properties of a clean singing voice. It has been found that growling-like
+vocals have less prominent peaks in the frequency representation and are closer
+to noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
research question:
\begin{center}\em%
- Are standard \gls{ANN} based techniques for singing voice detection
- suitable for non-standard musical genres like \gls{dm} and \gls{dom}?
+ Are standard techniques for singing voice detection suitable for
+ non-standard musical genres containing extreme vocal styles?
\end{center}