1 %&asr
2 \usepackage[nonumberlist,acronyms]{glossaries}
3 %\makeglossaries%
4 \newacronym{ANN}{ANN}{Artificial Neural Network}
5 \newacronym{HMM}{HMM}{Hidden Markov Model}
6 \newacronym{GMM}{GMM}{Gaussian Mixture Model}
7 \newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
8 \newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
9 \newacronym{FA}{FA}{Forced alignment}
10 \newacronym{MFC}{MFC}{Mel-frequency cepstrum}
11 \newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
12 \newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
13 \newglossaryentry{dm}{name={Death Metal},
14 description={is an extreme heavy metal music style with growling vocals and
15 pounding drums}}
16 \newglossaryentry{dom}{name={Doom Metal},
17 description={is an extreme heavy metal music style with growling vocals and
18 pounding drums played very slowly}}
19
20 \begin{document}
21 \frontmatter{}
22
23 \maketitleru[
24 course={(Automatic) Speech Recognition},
25 institute={Radboud University Nijmegen},
26 authorstext={Author:},
27 pagenr=1]
28 \listoftodos[Todo]
29
30 \tableofcontents
31
32 %Glossaries
33 %\glsaddall{}
34 %\printglossaries
35
36 \mainmatter{}
37 %Berenzweig and Ellis use acoustic classifiers from speech recognition as a
38 %detector for singing lines. They achieve 80\% accuracy for forty 15 second
39 %excerpts. They mention people that wrote signal features that discriminate
40 %between speech and music. Neural net
41 %\glspl{HMM}~\cite{berenzweig_locating_2001}.
42 %
43 %In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
44 %polyphonic Turkish music, this might be interesting to use for heavy metal.
45 %They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
46 %phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
47 %detection, then melody extraction, then alignment. They compare results with
48 %Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
49 %specialize in long syllables in a cappella. They use \glspl{DHMM} with
50 %\glspl{GMM} and show that adding knowledge increases alignment (Beijing opera
51 %has long syllables)~\cite{dzhambazov_automatic_2016}.
52 %
53
54
55 %Introduction, leading to a clearly defined research question
56 \chapter{Introduction}
57 \section{Introduction}
58 The primary medium for music distribution is rapidly changing from physical
59 media to digital media. The \gls{IFPI} stated that about $43\%$ of music
60 revenue comes from digital distribution. Another $39\%$ comes from
61 physical sales and the remaining $16\%$ is made through performance and
62 synchronisation revenues. Digital formats overtook physical formats
63 somewhere in 2015. Moreover, for the first time in twenty years the music
64 industry has seen significant growth
65 again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
66
67 There has always been an interest in lyrics-to-music alignment, for example
68 for use in karaoke. As early as the late 1980s, karaoke machines were
69 available to consumers. While the lyrics for a track are almost always
70 available, an alignment is not, and creating one involves manual
71 labour.
72
73 A lot of this musical distribution goes via non-official channels such as
74 YouTube\footnote{\url{https://youtube.com}}, where fans of the performers
75 often accompany the music with synchronized lyrics. This means that there is an
76 enormous treasure of lyrics-annotated music available, but not within our reach
77 since the subtitles are almost always hardcoded into the video stream and thus
78 not directly usable as data. Because of this interest it is very useful to
79 devise automatic techniques for segmenting instrumental and vocal parts of a
80 song, and for applying forced alignment or even lyrics recognition to the audio file.
81
82 Such techniques are heavily researched and working systems have been created.
83 However, these techniques are designed to detect a clean singing voice and have
84 not been tested on so-called \emph{extended vocal techniques} such as grunting
85 or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
86 but it must be noted that grunting is not a technique used only in extreme
87 metal styles. Similar or equal techniques have been used in \emph{Beijing
88 opera}, Japanese \emph{Noh} and also in more Western styles like jazz singing
89 by Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back
90 to Viking times. For example, an Arab merchant visiting a village in Denmark
91 wrote in the tenth century~\cite{friis_vikings_2004}:
92
93 \begin{displayquote}
94 Never before I have heard uglier songs than those of the Vikings in
95 Slesvig. The growling sound coming from their throats reminds me of dogs
96 howling, only more untamed.
97 \end{displayquote}
98
99 \section{\gls{dm}}
100
101 %Literature overview / related work
102 \section{Related work}
103 The application of standard speech processing techniques to music started in
104 the late 1990s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
105 was found that music has different discriminating features compared to normal
106 speech.
107
108 Berenzweig and Ellis expanded on the aforementioned research by trying to
109 separate singing from instrumental music\cite{berenzweig_locating_2001}.
110
111 \todo{Incorporate this in literary framing}%
112 ~\cite{fujihara_automatic_2006}%
113 ~\cite{fujihara_lyricsynchronizer:_2011}%
114 ~\cite{fujihara_three_2008}%
115 ~\cite{mauch_integrating_2012}%
116 ~\cite{mesaros_adaptation_2009}%
117 ~\cite{mesaros_automatic_2008}%
118 ~\cite{mesaros_automatic_2010}%
119 ~%\cite{muller_multimodal_2012}%
120 ~\cite{pedone_phoneme-level_2011}%
121 ~\cite{yang_machine_2012}%
122
123
124
125 \section{Research question}
126 It is debatable whether the aforementioned techniques work on growling, because the
127 spectral properties of a growling voice are different from the spectral
128 properties of a clean singing voice. It has been found that growling voices
129 have less prominent peaks in the frequency representation and are closer to
130 noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
131 research question:
132
133 \begin{center}\em%
134 Are standard \gls{ANN}-based techniques for singing voice detection
135 suitable for non-standard musical genres like \gls{dm}?
136 \end{center}
137
138 \chapter{Methods}
139 %Methodology
140
141 %Experiment(s) (set-up, data, results, discussion)
142 \section{Data \& Preprocessing}
143 To run the experiments, data has been collected from several \gls{dm} albums.
144 The exact data used is listed in Appendix~\ref{app:data}. The albums are
145 extracted from the audio CD and converted to a mono channel waveform with the
146 correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
147 Every file is annotated using
148 Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
149 the audio. Examples of utterances are shown in
150 Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, where the
151 waveform, $1$--$8000$\,Hz spectrogram and annotations are shown. It is clearly visible
152 that within the genre of death metal there are different spectral
153 patterns.
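
The conversion step described above could be scripted along the following lines. This is only a sketch: the helper names \texttt{build\_sox\_command} and \texttt{convert}, the file names and the 16\,kHz target rate are assumptions made for illustration, since the exact sample rate is not stated here. The \texttt{-r} and \texttt{-c} options are the standard SoX options for sample rate and channel count.

```python
import subprocess


def build_sox_command(src, dst, rate=16000):
    """Build a SoX call that writes `src` as a mono file at `rate` Hz.

    The default rate of 16000 Hz is an assumption for illustration only.
    """
    # Options placed before the output file apply to the output:
    # -r sets the sample rate, -c 1 mixes down to a single channel.
    return ["sox", src, "-r", str(rate), "-c", "1", dst]


def convert(src, dst, rate=16000):
    """Run the conversion; requires the `sox` binary on the PATH."""
    subprocess.run(build_sox_command(src, dst, rate), check=True)


if __name__ == "__main__":
    # Hypothetical file names; the command is printed rather than executed.
    print(" ".join(build_sox_command("track00.wav", "track00_mono.wav")))
```

Keeping the command construction separate from its execution makes it possible to inspect the exact call before running it over a whole album.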
154
155 \begin{figure}[ht]
156 \centering
157 \includegraphics[width=.7\linewidth]{cement}
158 \caption{A vocal segment of the \emph{Cannibal Corpse} song
159 \emph{Bloodstained Cement}}\label{fig:bloodstained}
160 \end{figure}
161
162 \begin{figure}[ht]
163 \centering
164 \includegraphics[width=.7\linewidth]{abominations}
165 \caption{A vocal segment of the \emph{Disgorge} song
166 \emph{Enthroned Abominations}}\label{fig:abominations}
167 \end{figure}
168
169 The data is collected from three studio albums. The
170 first band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for
171 almost 25 years, keeping the same style on every album. The singer of
172 \emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
173 comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
174 regular shouting.
175
176 The second band is called \emph{Disgorge} and makes even more violent-sounding
177 music. The growls of the lead singer sound like a coffee grinder and are more
178 shallow. In the spectrograms it is clearly visible that overtones are
179 produced during some parts of the growling. The lyrics are completely
180 incomprehensible and therefore some parts were not annotated with the actual
181 lyrics because it was not possible to hear what was being sung.
182
183 Lastly, a band from Moscow is chosen, bearing the name \emph{Who Dies in
184 Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
185 bands because they create \gls{dom}. \gls{dom} is characterized by a very
186 slow tempo and low-tuned guitars. The vocalist has a very characteristic growl
187 and performs in several Moscow-based bands. This band also stands out because it
188 uses pianos and synthesizers. The droning synthesizers often operate in the
189 same frequency range as the vocals.
190
191 \section{\gls{MFCC} Features}
192 The waveforms are converted to \gls{MFCC} feature vectors using the
193 \emph{python\_speech\_features}%
194 \footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
195 All these steps combined result in thirteen tab-separated features per line in
196 a file for every source file. Technical details about the processing steps are
197 given in the following sections.
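
A minimal sketch of this feature extraction step is given below. It assumes the \texttt{mfcc} function of \emph{python\_speech\_features} with its default window settings and \texttt{scipy.io.wavfile} for reading the audio; the parameters actually used are not stated here. The helper names \texttt{extract\_mfcc}, \texttt{format\_features} and \texttt{write\_features}, as well as the six-decimal output format, are hypothetical.

```python
def extract_mfcc(wav_path):
    """Return one 13-dimensional MFCC vector per frame of `wav_path`.

    The imports live inside the function so that the formatting helpers
    below can be used without SciPy or python_speech_features installed.
    """
    from scipy.io import wavfile
    from python_speech_features import mfcc

    rate, signal = wavfile.read(wav_path)
    # numcep=13 is the package default and matches the thirteen features
    # mentioned in the text; all other parameters are left at their
    # defaults, which is an assumption.
    return mfcc(signal, samplerate=rate, numcep=13)


def format_features(frames):
    """Turn an iterable of feature vectors into tab-separated lines."""
    return "\n".join("\t".join(f"{v:.6f}" for v in frame) for frame in frames)


def write_features(frames, out_path):
    """Write one line of tab-separated features per frame to `out_path`."""
    with open(out_path, "w") as handle:
        handle.write(format_features(frames) + "\n")
```

With these helpers, a call such as \texttt{write\_features(extract\_mfcc("track00\_mono.wav"), "track00.mfcc")} would produce one feature file per source file; both file names are placeholders.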
198
199 \todo{Explain why MFCC and which parameters}
200
201 \section{\gls{ANN} Classifier}
202 \todo{Spectrals might be enough, no decorrelation}
203
204 \section{Model training}
205
206 \section{Experiments}
207
208 \section{Results}
209
210
211 \chapter{Conclusion \& Discussion}
212 %Discussion section
213 \todo{Novelty}
214 \todo{Weaknesses}
215 \todo{Dataset is not very varied but\ldots}
216
217 \todo{Doom metal}
218 %Conclusion section
219 %Acknowledgements
220 %Statement on authors' contributions
221 %(Appendices)
222 \appendix
223 \chapter{Experimental data}\label{app:data}
224 \begin{table}[h]
225 \centering
226 \begin{tabular}{cllll}
227 \toprule
228 Num. & Artist & Album & Song & Duration\\
229 \midrule
230 00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
231 01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
232 02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
233 03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
234 04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
235 05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
236 06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
237 07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
238 08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
239 09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
240 10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
241 11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
242 12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
243 13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
244 14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
245 15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
246 16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
247 17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
248 18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
249 19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
250 20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
251 21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
252 22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
253 23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
254 24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
255 25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
256 26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
257 27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
258 28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
259 \midrule
260 & & & Total: & 02:13:40\\
261 \bottomrule
262 \end{tabular}
263 \caption{Songs used in the experiments}
264 \end{table}
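
The total duration in the table can be checked with a few lines of Python. The sketch below sums the per-song durations from the table in centiseconds and rounds the result to whole seconds; the helper names are hypothetical.

```python
def to_centiseconds(stamp):
    """Convert an 'MM:SS.cc' string to an integer number of centiseconds."""
    minutes, rest = stamp.split(":")
    seconds, centis = rest.split(".")
    return (int(minutes) * 60 + int(seconds)) * 100 + int(centis)


def format_total(centiseconds):
    """Round to whole seconds and format as 'HH:MM:SS'."""
    seconds = (centiseconds + 50) // 100
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Durations copied from the table, in song order (00 to 28).
DURATIONS = """
04:06.91 03:17.31 03:50.67 03:38.77 03:01.25 05:05.23
03:41.89 03:16.24 03:25.15 03:41.99 03:47.40 03:05.80
05:13.20 04:05.39 02:57.36 04:17.20 02:01.72 04:38.85
04:57.59 05:03.33 05:42.37 04:59.15
06:35.60 06:53.63 07:20.56 04:26.49 08:46.76 02:31.56 09:10.49
""".split()

total = sum(to_centiseconds(d) for d in DURATIONS)
print(format_total(total))  # 02:13:40
```

Working in integer centiseconds avoids floating-point rounding when adding the 29 durations.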
265
266 \bibliographystyle{ieeetr}
267 \bibliography{asr}
268 \end{document}