\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
media to digital media. The \gls{IFPI} stated that about $43\%$ of music
revenue arises from digital distribution. Another $39\%$ arises from
physical sales and the remaining $16\%$ is made through performance and
synchronisation revenues. Digital formats overtook physical formats in 2015.
Moreover, for the first time in twenty years the music industry has seen
significant growth
again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

There has always been an interest in lyrics-to-music alignment, for example
for use in karaoke. As early as the late 1980s, karaoke machines were
available to consumers. While the lyrics of a track are almost always
available, an alignment is not, and creating one involves manual labour.

A lot of this musical distribution goes via non-official channels such as
YouTube\footnote{\url{https://youtube.com}}, in which fans of the performers
often accompany the music with synchronized lyrics. This means that there is an
enormous treasure of lyrics-annotated music available, but it is not within our
reach since the subtitles are almost always hardcoded into the video stream and
thus not directly usable as data. Because of this interest it is very useful to
devise automatic techniques for segmenting instrumental and vocal parts of a
song, and for applying forced alignment or even lyrics recognition to the audio
file.

Such techniques are heavily researched and working systems have been created.
However, these techniques are designed to detect a clean singing voice and have
not been tested on so-called \emph{extended vocal techniques} such as grunting
or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
but it must be noted that grunting is not a technique used only in extreme
metal styles. Similar or equal techniques have been used in \emph{Beijing
opera}, Japanese \emph{Noh} and also in more western styles like jazz singing
by Louis Armstrong\cite{sakakibara_growl_2004}. It might even be traced back
to Viking times. For example, an Arab merchant visiting a village in Denmark
wrote in the tenth century\cite{friis_vikings_2004}:

\begin{displayquote}
        Never before I have heard uglier songs than those of the Vikings in
        Slesvig. The growling sound coming from their throats reminds me of dogs
        howling, only more untamed.
\end{displayquote}

\section{\gls{dm}}

%Literature overview / related work
\section{Related work}
Applying speech-related processing and classification techniques to music
already started in the late 1990s. Saunders et al.\ devised a technique to
classify audio into the categories \emph{Music} and \emph{Speech}. It was found
that music has different properties than speech: music has more bandwidth,
tonality and regularity. Multivariate Gaussian classifiers were used to
discriminate between the categories with an average performance of $90\%$.
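
To make the idea of a Gaussian classifier concrete, the following is a minimal
sketch and not the implementation of Saunders et al. It assumes a diagonal
covariance matrix and equal class priors for brevity, whereas a full
multivariate Gaussian would also model the correlations between features. The
two-dimensional feature vectors and all function names are hypothetical and
invented purely for illustration.

```python
import math


def fit_gaussian(samples):
    """Estimate per-dimension mean and variance (diagonal covariance)."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    variances = [
        # A small floor keeps every variance strictly positive.
        max(sum((s[d] - means[d]) ** 2 for s in samples) / n, 1e-9)
        for d in range(dims)
    ]
    return means, variances


def log_likelihood(x, model):
    """Log density of x under a diagonal-covariance Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        for xi, mu, var in zip(x, means, variances)
    )


def classify(x, models):
    """Return the label whose Gaussian gives x the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(x, models[label]))


# Hypothetical feature vectors, invented for illustration only.
music = [(0.9, 0.8), (0.8, 0.9), (0.85, 0.75), (0.95, 0.85)]
speech = [(0.3, 0.2), (0.2, 0.3), (0.25, 0.35), (0.35, 0.25)]
models = {"music": fit_gaussian(music), "speech": fit_gaussian(speech)}
print(classify((0.8, 0.8), models))
```

Each class is summarised by one Gaussian, and a new frame is assigned to the
class under which it is most likely.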

Williams and Ellis were inspired by the aforementioned research and tried to
separate the singing segments from the instrumental
segments\cite{williams_speech/music_1999}. This was later verified by
Berenzweig and Ellis\cite{berenzweig_locating_2001}. The latter became the de
facto standard reference on singing voice detection. Both show that features
derived from \gls{PPF} such as energy and distribution are highly effective in
separating speech from non-speech signals such as music. The data used was
already segmented.

Later, Berenzweig showed singing voice segments to be more useful for artist
classification and used a \gls{MLP} with \gls{PLP} coefficients to detect the
singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed that there
is not much difference in accuracy when using different features rooted in
speech processing. They tested several features and found that accuracies
differ by less than a few percent. Moreover, they found that others have tried
to tackle the problem using a myriad of different approaches, such as using
\gls{ZCR}, \gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or \glspl{GMM}
as classifiers\cite{nwe_singing_2004}.
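
Of the features listed above, the \gls{ZCR} is the simplest to compute. The
sketch below is illustrative only and is not taken from any of the cited
systems. It counts how often consecutive samples in a frame change sign. The
function name, the frame length of 64 samples and both test signals are
assumptions made for this example.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(frame) - 1)


# Two cycles of a sine wave; the half-sample offset avoids exact zeros.
tonal = [math.sin(2 * math.pi * 2 * (n + 0.5) / 64) for n in range(64)]
# A signal that flips sign on every sample, as a crude noise-like extreme.
alternating = [1.0 if n % 2 == 0 else -1.0 for n in range(64)]

print(zero_crossing_rate(tonal))
print(zero_crossing_rate(alternating))
```

A low-frequency tonal frame yields a low rate, while a signal that changes sign
on every sample yields the maximum rate of one.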

Fujihara et al.\ took the idea to the next level by attempting to do \gls{FA}
on music. Their approach consists of three steps. First, the accompaniment
levels are reduced. Second, the vocal segments are
separated from the non-vocal segments using a simple two-state \gls{HMM}.
Finally, \gls{Viterbi} alignment is applied to the segregated
signals and the lyrics. The system showed accuracy levels of $90\%$ on
Japanese music\cite{fujihara_automatic_2006}. Later they improved
on this\cite{fujihara_three_2008} and even made a ready-to-use karaoke
application that can do this online\cite{fujihara_lyricsynchronizer:_2011}.
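
The second step, a two-state \gls{HMM} over vocal and non-vocal frames, can be
illustrated with a generic decoder. The following is a minimal sketch under
stated assumptions and not the system of Fujihara et al.: the self-transition
probability \texttt{stay\_prob}, the uniform initial distribution and the
per-frame vocal probabilities are all hypothetical, and the state sequence is
decoded with the standard Viterbi algorithm.

```python
import math

STATES = ("nonvocal", "vocal")


def viterbi(log_emissions, stay_prob=0.9):
    """Most likely state sequence for a two-state HMM.

    log_emissions holds one pair per frame:
    (log P(frame | nonvocal), log P(frame | vocal)).
    stay_prob is the assumed probability of remaining in the same state.
    """
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)
    # Assumed uniform initial distribution over the two states.
    scores = [math.log(0.5) + e for e in log_emissions[0]]
    backpointers = []
    for frame in log_emissions[1:]:
        new_scores, pointers = [], []
        for cur in range(2):
            candidates = [
                scores[prev] + (log_stay if prev == cur else log_switch)
                for prev in range(2)
            ]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_scores.append(candidates[best_prev] + frame[cur])
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


def to_log_emissions(vocal_probs):
    """Turn per-frame vocal probabilities in (0, 1) into log emission pairs."""
    return [(math.log(1.0 - p), math.log(p)) for p in vocal_probs]


# Hypothetical per-frame vocal probabilities, invented for illustration.
print(viterbi(to_log_emissions([0.1, 0.1, 0.6, 0.1, 0.1])))
```

Because switching states is penalised, a single weakly vocal frame surrounded
by clearly non-vocal frames is smoothed away rather than reported as a separate
segment.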

Singing voice detection can also be seen as a binary genre recognition problem.
Therefore the techniques used in that field might be of use. Genre recognition
has a long history that can be found in the survey by
Sturm\cite{sturm_survey_2012}. It must be noted that of all the $485$ papers
cited by Sturm, only one master's thesis applies genre recognition to heavy
metal genres\cite{tsatsishvili_automatic_2011}.

Singing voice detection has been tried on less conventional styles in the past.
Dzhambazov et al.\ proposed to align long syllables in Beijing Opera to the
audio\cite{dzhambazov_automatic_2016}. Beijing Opera sometimes contains
growling-like vocals. Dzhambazov also tried aligning lyrics to audio in
classical Turkish music\cite{dzhambazov_automatic_2014}.

\section{Research question}
It is debatable whether the aforementioned techniques work, because the
spectral properties of a growling voice are different from the spectral
properties of a clean singing voice. It has been found that growling voices
have less prominent peaks in the frequency representation and are closer to
noise than clean singing\cite{kato_acoustic_2013}. This leads us to the
research question:

\begin{center}\em%
        Are standard \gls{ANN} based techniques for singing voice detection
        suitable for non-standard musical genres like \gls{dm} and \gls{dom}?
\end{center}