\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
media to digital media. In 2016 the \gls{IFPI} stated that about $43\%$ of
music revenue arises from digital distribution. Another $39\%$ arises from
physical sales and the remaining $16\%$ is made through performance and
synchronisation revenues. Digital formats overtook physical formats at some
point in 2015. Moreover, for the first time in twenty years the music industry
has seen significant growth
again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

There has always been an interest in lyrics-to-music alignment for use in, for
example, karaoke. As early as the late 1980s, karaoke machines became
available to consumers. Lyrics for tracks are in almost all cases amply
available. However, a temporal alignment of the lyrics is not, and creating it
involves manual labour.

A lot of present-day music distribution goes via non-official channels such as
YouTube\footnote{\url{https://youtube.com}} in which fans of the performers
often accompany the music with synchronised lyrics. This means that there is an
enormous treasure of lyrics-annotated music available. However, the data is not
within our reach since the subtitles are almost always hardcoded into the video
stream and thus not directly accessible as data. This sparks the idea of
creating automatic techniques for segmenting instrumental and vocal parts of a
song, applying forced temporal alignment or possibly even applying lyrics
recognition to audio data.

These techniques are heavily researched and working systems have been created
for segmenting audio and even forced alignment (e.g.\ LyricSynchronizer~%
\cite{fujihara_lyricsynchronizer:_2011}). However, these techniques are designed
to detect a clean singing voice and have not been tested on so-called
\emph{extended vocal techniques} such as grunting or growling. Growling is
heavily used in extreme metal genres such as \gls{dm}, but it must be noted that
grunting is not a technique used only in extreme metal styles. Similar or equal
techniques have been used in \emph{Beijing opera} and Japanese \emph{Noh}, but
also in more Western styles such as the jazz singing of Louis
Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back to Viking
times. For example, an Arab merchant visiting a village in Denmark wrote in the
tenth century~\cite{friis_vikings_2004}:

\begin{displayquote}
	Never before I have heard uglier songs than those of the Vikings in
	Slesvig. The growling sound coming from their throats reminds me of dogs
	howling, only more untamed.
\end{displayquote}

%Literature overview / related work
\section{Related work}
The application of speech-related processing and classification techniques to
music started in the late 1990s. Saunders et al.\ devised a technique to
classify audio into the categories \emph{Music} and \emph{Speech}. It was found
that music has different properties than speech: music has more bandwidth,
tonality and regularity. Multivariate Gaussian classifiers were used to
discriminate between the categories with an average performance of
$90\%$~\cite{saunders_real-time_1996}.

Williams and Ellis were inspired by the aforementioned research and tried to
separate the singing segments from the instrumental
segments~\cite{williams_speech/music_1999}. Their results were later verified
by Berenzweig and Ellis~\cite{berenzweig_locating_2001}. The latter became the
de facto standard literature on singing voice detection. Both show that
features derived from \gls{PPF} such as energy and distribution are highly
effective in separating speech from non-speech signals such as music. The data
used was already segmented.

Later, Berenzweig showed singing voice segments to be more useful for artist
classification and used an \gls{ANN} (\gls{MLP}) with \gls{PLP} coefficients
to detect a singing voice~\cite{berenzweig_using_2002}. Nwe et al.\ showed that
there is not much difference in accuracy when using different features rooted
in speech processing. They tested several features and found that the
accuracies differ by less than a few percent. Moreover, they found that others
have tried to tackle the problem using a myriad of different approaches, such
as using \gls{ZCR}, \gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or
\glspl{GMM} as classifiers~\cite{nwe_singing_2004}.
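To make one of these features concrete, the \gls{ZCR} of a single frame can be
written down as follows. This is the common textbook definition and not
necessarily the exact variant used in the cited work; the symbols $x[n]$ (the
$n$-th sample of the frame) and $N$ (the number of samples in the frame) are
our own notation.
\[
	\mathrm{ZCR} = \frac{1}{2(N-1)} \sum_{n=1}^{N-1}
		\left| \mathrm{sgn}(x[n]) - \mathrm{sgn}(x[n-1]) \right|,
	\qquad
	\mathrm{sgn}(x) =
		\left\{ \begin{array}{rl} 1 & x \geq 0 \\ -1 & x < 0 \end{array} \right.
\]
Every sign change between two adjacent samples contributes $2$ to the sum, so
the value is the fraction of the $N-1$ adjacent sample pairs in which the
signal crosses zero.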
%

Fujihara et al.\ took the idea to the next level by attempting to do \gls{FA}
on music. Their approach consists of three steps. First, the accompaniment
levels are reduced. Second, the vocal segments are separated from the
non-vocal segments using a simple two-state \gls{HMM}. The chain is concluded
by applying \gls{Viterbi} alignment on the segregated signals with the lyrics.
The system showed accuracy levels of $90\%$ on Japanese music~%
\cite{fujihara_automatic_2006}. Later they improved upon this~%
\cite{fujihara_three_2008} and even made a ready-to-use karaoke application
that can do this online~\cite{fujihara_lyricsynchronizer:_2011}.
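Decoding a two-state \gls{HMM} of this kind is commonly done with the Viterbi
recursion, sketched here in generic form. It is an illustration under our own
assumptions and not the exact model of Fujihara et al.: the state set
$\mathcal{S} = \{\mathrm{vocal}, \mathrm{nonvocal}\}$, the feature vectors
$o_1, \ldots, o_T$, the initial probabilities $\pi_j$, the transition
probabilities $a_{ij}$ and the emission likelihoods $b_j(o_t)$ are all our
notation.
\[
	\delta_1(j) = \pi_j \, b_j(o_1), \qquad
	\delta_t(j) = \max_{i \in \mathcal{S}}
		\bigl[ \delta_{t-1}(i) \, a_{ij} \bigr] \, b_j(o_t),
	\quad t = 2, \ldots, T.
\]
The most likely labelling of the frames is recovered by starting at
$\arg\max_{j} \delta_T(j)$ and backtracking through the maximising
predecessors $\psi_t(j) = \arg\max_{i \in \mathcal{S}} \delta_{t-1}(i) \,
a_{ij}$. The same recursion, run over a state sequence that is constrained by
the lyrics, is the usual basis of Viterbi-based forced alignment.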
%

Singing voice detection can also be seen as a binary genre recognition problem.
Therefore the techniques used in that field might be of use. Genre recognition
has a long history that can be found in the survey by
Sturm~\cite{sturm_survey_2012}. It must be noted that of all the $485$ papers
cited by Sturm only one master's thesis applies genre recognition to heavy
metal genres~\cite{tsatsishvili_automatic_2011}.

Singing voice detection has been tried on less conventional styles in the past.
Dzhambazov et al.\ proposed to align long syllables in Beijing Opera to the
audio~\cite{dzhambazov_automatic_2016}. Beijing Opera sometimes contains
growling-like vocals. Dzhambazov also tried aligning lyrics to audio in
classical Turkish music~\cite{dzhambazov_automatic_2014}.

\section{Research question}
It is debatable whether the aforementioned techniques also work for growling,
because the spectral properties of a growling voice are different from the
spectral properties of a clean singing voice. It has been found that growling
voices have less prominent peaks in the frequency representation and are closer
to noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
research question:

\begin{center}\em%
	Are standard \gls{ANN}-based techniques for singing voice detection
	suitable for non-standard musical genres like \gls{dm} and \gls{dom}?
\end{center}