\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
media to digital media. In 2016 the \gls{IFPI} stated that about $50\%$ of
music revenue arises from digital distribution, another $34\%$ from physical
sales, and the remaining $16\%$ from performance and synchronisation
revenues. Digital formats overtook physical formats in the course of 2015.
Moreover, for the first time in twenty years the music industry has seen
significant growth again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

There has always been an interest in aligning lyrics to music, for use in
applications such as karaoke. As early as the late 1980s, karaoke machines
became available to consumers. Lyrics are amply available for almost all
tracks. However, a temporal alignment of the lyrics is not, and creating one
involves manual labour.

Much of present-day music distribution goes through unofficial channels such
as YouTube\footnote{\url{https://youtube.com}}, on which fans of the
performers often accompany the music with synchronised lyrics. This means
that there is an enormous treasure of lyrics-annotated music available.
However, this data is not within our reach since the subtitles are almost
always hardcoded into the video stream and thus not directly accessible as
data. This sparks the idea of creating automatic techniques for segmenting
the instrumental and vocal parts of a song, applying forced temporal
alignment, or possibly even applying lyrics recognition to the audio data.

These techniques are heavily researched, and working systems have been
created for segmenting audio and even for forced temporal alignment (e.g.\
LyricSynchronizer~\cite{fujihara_lyricsynchronizer:_2011}). However, these
techniques are designed to detect a clean singing voice and have not been
tested on so-called \emph{extended vocal techniques} such as grunting or
growling. Growling is heavily used in extreme metal genres such as \gls{dm},
but it must be noted that grunting is not a technique used only in extreme
metal styles. Similar or identical techniques have been used in
\emph{Beijing opera} and Japanese \emph{Noh}, but also in more Western
styles such as the jazz singing of Louis
Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back to
Viking times. For example, an Arab merchant visiting a village in Denmark
wrote in the tenth century~\cite{friis_vikings_2004}:
\begin{displayquote}
	Never before have I heard uglier songs than those of the Vikings in
	Slesvig. The growling sound coming from their throats reminds me of dogs
	howling, only more untamed.
\end{displayquote}

%Literature overview / related work
\section{Related work}
Applying speech-related processing and classification techniques to music
started as early as the late 1990s. Saunders et al.\ devised a technique to
classify audio into the categories \emph{Music} and \emph{Speech}. They
found that music has different properties than speech: music uses a wider
spectral bandwidth in which events happen, and it contains more tonality and
rhythm. Multivariate Gaussian classifiers were used to discriminate the
categories with an average accuracy of $90\%$~\cite{saunders_real-time_1996}.
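
The idea behind such a classifier can be sketched as follows: fit one
multivariate Gaussian per class on training features and assign a frame to
the class under which it is most likely. The two-dimensional features and
the data below are synthetic and purely illustrative; they are not the
features or data used by Saunders et al.

```python
import numpy as np

def fit_gaussian(X):
    """Fit a multivariate Gaussian (mean vector, covariance) to feature rows."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_likelihood(x, mu, cov):
    """Log density of x under N(mu, cov), up to class-independent constants."""
    diff = x - mu
    return -0.5 * (np.log(np.linalg.det(cov))
                   + diff @ np.linalg.solve(cov, diff))

# Toy 2-D features per frame (think: zero-crossing rate and energy);
# synthetic clusters stand in for real labelled training data.
rng = np.random.default_rng(0)
speech = rng.normal([0.1, 0.3], 0.05, size=(200, 2))
music = rng.normal([0.3, 0.6], 0.05, size=(200, 2))

models = {"speech": fit_gaussian(speech), "music": fit_gaussian(music)}

def classify(x):
    """Pick the class whose Gaussian gives the frame the highest likelihood."""
    return max(models, key=lambda c: log_likelihood(x, *models[c]))

print(classify(np.array([0.12, 0.28])))  # a frame near the speech cluster
```

In a real system the per-frame decisions would additionally be smoothed over
time, since adjacent frames rarely switch class.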

Williams and Ellis were inspired by the aforementioned research and tried to
separate the singing segments from the instrumental segments~%
\cite{williams_speech/music_1999}. Their results were later verified by
Berenzweig and Ellis~\cite{berenzweig_locating_2001}. The latter became the
de facto literature on singing voice detection. Both show that features
derived from \gls{PPF} such as energy are highly effective in separating
speech from non-speech signals such as music. The data used in the
experiments was segmented into segments that only contained data from one
class. The classifier determined the class per sample.

Later, Berenzweig showed singing voice segments to be more useful for artist
classification and used an \gls{ANN} (\gls{MLP}) on \gls{PLP} coefficients
to detect a singing voice~\cite{berenzweig_using_2002}. Nwe et al.\ showed
that there is not much difference in accuracy between the various features
founded in speech processing. They tested several features and found that
the accuracies differ by less than a few percent. Moreover, they found that
others have tried to tackle the problem using a myriad of different
approaches, such as using \gls{ZCR}, \gls{MFCC} and \gls{LPCC} as features
and \glspl{HMM} or \glspl{GMM} as classifiers~\cite{nwe_singing_2004}.

Fujihara et al.\ took the idea to the next level by attempting \gls{FA} on
music. Their approach consists of three steps: first, the accompaniment
levels are reduced; secondly, the vocal segments are separated from the
non-vocal segments using a simple two-state \gls{HMM}; finally, the chain is
concluded by applying \gls{Viterbi} alignment on the segregated signals with
the lyrics. The system showed accuracy levels of $90\%$ on Japanese music~%
\cite{fujihara_automatic_2006}. Later they improved upon this~%
\cite{fujihara_three_2008} and even made a ready-to-use karaoke application
that can perform the temporal lyrics alignment online~%
\cite{fujihara_lyricsynchronizer:_2011}.
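
The second step, vocal/non-vocal segmentation with a two-state \gls{HMM},
amounts to Viterbi decoding over per-frame likelihoods. The sketch below
illustrates this; all probabilities are invented for illustration and are
not Fujihara et al.'s actual model parameters, and a real system would
derive the emission scores from acoustic features.

```python
import numpy as np

# A minimal two-state (non-vocal / vocal) Viterbi decoder. The sticky
# transition matrix discourages rapid switching between states.
states = ["non-vocal", "vocal"]
log_trans = np.log(np.array([[0.9, 0.1],    # from non-vocal
                             [0.1, 0.9]]))  # from vocal
log_init = np.log(np.array([0.5, 0.5]))

def viterbi(log_emit):
    """log_emit: (T, 2) per-frame log-likelihoods; returns the best state path."""
    T = len(log_emit)
    delta = log_init + log_emit[0]          # best score per state so far
    back = np.zeros((T, 2), dtype=int)      # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # rows: previous state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack
        path.append(int(back[t, path[-1]]))
    return [states[s] for s in reversed(path)]

# Six frames: the first three favour non-vocal, the last three vocal.
log_emit = np.log(np.array([[0.8, 0.2]] * 3 + [[0.2, 0.8]] * 3))
print(viterbi(log_emit))
```

The same dynamic-programming machinery, with lyrics-derived phone states
instead of two coarse states, underlies the alignment step as well.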

Singing voice detection can also be seen as a binary genre recognition
problem, so the techniques used in that field might be of use. Genre
recognition has a long history, which can be found in the survey by
Sturm~\cite{sturm_survey_2012}. It must be noted that of all the $485$
papers cited by Sturm, only one master's thesis applies genre recognition to
heavy metal genres~\cite{tsatsishvili_automatic_2011}.

Singing voice detection has been tried on less conventional styles in the
past. Dzhambazov et al.\ proposed to align long syllables in Beijing opera
to the audio~\cite{dzhambazov_automatic_2016}. Beijing opera sometimes
contains growling-like vocals. Dzhambazov also tried aligning lyrics to
audio in classical Turkish music~\cite{dzhambazov_automatic_2014}.

\section{Research question}
It is debatable whether the aforementioned techniques work, because the
spectral properties of a growling voice are different from the spectral
properties of a clean singing voice. It has been found that growling-like
vocals have less prominent peaks in the frequency representation and are
closer to noise than clean singing~\cite{kato_acoustic_2013}. This leads us
to the research question:
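
One common way to make "closer to noise" concrete is spectral flatness: the
geometric mean of the power spectrum divided by its arithmetic mean, which
is low for tonal signals and much higher for noise-like ones. The sketch
below compares a synthetic pure tone with white noise; the signals are
purely illustrative, and this is not necessarily the measure used by Kato et
al.

```python
import numpy as np

def spectral_flatness(signal):
    """Geometric mean over arithmetic mean of the power spectrum.
    Tonal signals score near 0; noise-like signals score much higher."""
    power = np.abs(np.fft.rfft(signal)) ** 2 + 1e-12  # floor avoids log(0)
    return np.exp(np.mean(np.log(power))) / np.mean(power)

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)                 # clean, tonal signal
noise = np.random.default_rng(0).normal(size=fs)   # noise-like signal

print(f"tone:  {spectral_flatness(tone):.4f}")   # far below the noise value
print(f"noise: {spectral_flatness(noise):.4f}")
```

If growled vocals indeed behave more like the noise case, detectors tuned to
the peaked spectra of clean singing may struggle on them.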

\begin{center}\em%
	Are standard techniques for singing voice detection suitable for
	non-standard musical genres containing extreme vocal styles?
\end{center}