%&asr
\usepackage[nonumberlist,acronyms]{glossaries}
%\makeglossaries%
\newacronym{ANN}{ANN}{Artificial Neural Network}
\newacronym{HMM}{HMM}{Hidden Markov Model}
\newacronym{GMM}{GMM}{Gaussian Mixture Model}
\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
\newacronym{FA}{FA}{Forced alignment}
\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
\newglossaryentry{dm}{name={Death Metal},
	description={is an extreme heavy metal music style with growling vocals and
	pounding drums}}

\begin{document}
\frontmatter{}

\maketitleru[
	course={(Automatic) Speech Recognition},
	institute={Radboud University Nijmegen},
	authorstext={Author:},
	pagenr=1]
\listoftodos[Todo]

\tableofcontents

%Glossaries
%\glsaddall{}
%\printglossaries

\mainmatter{}
%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
%detector for singing lines. They achieve 80\% accuracy for forty 15 second
%excerpts. They mention people that wrote signal features that discriminate
%between speech and music. Neural net
%\glspl{HMM}~\cite{berenzweig_locating_2001}.
%
%In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
%polyphonic Turkish music, this might be interesting to use for heavy metal.
%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
%detection, then melody extraction, then alignment. They compare results with
%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
%specialize in long syllables in a cappella. They use \glspl{DHMM} with
%\glspl{GMM} and show that adding knowledge increases alignment (Beijing opera
%has long syllables)~\cite{dzhambazov_automatic_2016}.
%


%Introduction, leading to a clearly defined research question
\chapter{Introduction}
\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
media to digital media. The \gls{IFPI} stated that about $43\%$ of music
revenue comes from digital distribution. Another $39\%$ comes from physical
sales and the remaining $16\%$ is made through performance and
synchronisation revenues. Digital formats overtook physical formats in 2015.
Moreover, for the first time in twenty years the music industry has seen
significant growth
again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

There has always been an interest in lyrics-to-music alignment, for example
for use in karaoke. As early as the late 1980s, karaoke machines were
available to consumers. While the lyrics of a track are almost always
available, an alignment is not, and creating one involves manual labour.

A lot of this music distribution goes via non-official channels such as
YouTube\footnote{\url{https://youtube.com}}, where fans of the performers
often accompany the music with synchronized lyrics. This means that there is an
enormous treasure of lyrics-annotated music available, but it is not within our
reach since the subtitles are almost always hardcoded into the video stream and
thus not directly usable as data. Because of this interest it is very useful to
devise automatic techniques for segmenting instrumental and vocal parts of a
song, applying forced alignment or even recognising the lyrics in the audio
file.

Such techniques are heavily researched and working systems have been created.
However, these techniques are designed to detect a clean singing voice and have
not been tested on so-called \emph{extended vocal techniques} such as grunting
or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
but it must be noted that grunting is not a technique used only in extreme
metal styles. Similar or equal techniques have been used in \emph{Beijing
opera}, Japanese \emph{Noh} and also in more western styles such as the jazz
singing of Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be
traced back to Viking times. For example, an Arab merchant visiting a village
in Denmark wrote in the tenth century~\cite{friis_vikings_2004}:

\begin{displayquote}
	Never before I have heard uglier songs than those of the Vikings in
	Slesvig. The growling sound coming from their throats reminds me of dogs
	howling, only more untamed.
\end{displayquote}

\section{\gls{dm}}

%Literature overview / related work
\section{Related work}
The application of standard speech processing techniques to music started in
the late 1990s~\cite{saunders_real-time_1996,scheirer_construction_1997}, and
it was found that music has different discriminating features compared to
normal speech.

Berenzweig and Ellis expanded on the aforementioned research by trying to
separate singing from instrumental music~\cite{berenzweig_locating_2001}.

\todo{Incorporate this in literary framing}%
~\cite{fujihara_automatic_2006}%
~\cite{fujihara_lyricsynchronizer:_2011}%
~\cite{fujihara_three_2008}%
~\cite{mauch_integrating_2012}%
~\cite{mesaros_adaptation_2009}%
~\cite{mesaros_automatic_2008}%
~\cite{mesaros_automatic_2010}%
~%\cite{muller_multimodal_2012}%
~\cite{pedone_phoneme-level_2011}%
~\cite{yang_machine_2012}%



\section{Research question}
It is debatable whether the aforementioned techniques work on growled vocals,
because the spectral properties of a growling voice differ from those of a
clean singing voice. It has been found that growling voices have less
prominent peaks in the frequency representation and are closer to noise than
clean singing~\cite{kato_acoustic_2013}. This leads us to the research
question:

\begin{center}\em%
	Are standard \gls{ANN}-based techniques for singing voice detection
	suitable for non-standard musical genres like \gls{dm}?
\end{center}

\chapter{Methods}
%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono channel waveform with the
correct sample rate using
\emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
The resulting waveforms are converted to \gls{MFCC} vectors
using the \emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
All these steps combined result in one file per source file, with thirteen tab
separated features per line. Technical info about the processing steps is
given in the following sections. Every file is annotated using
Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which show the
waveform, the spectrogram from $1$ to $8000$\,Hz and the annotations. It is
clearly visible that different spectral patterns occur even within the genre
of \gls{dm}.
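
The preprocessing steps described above can be sketched roughly as follows in
Python. This is only an illustration and not the exact scripts used for this
work: the function names, the default sample rate of 16\,kHz and the use of
\emph{SciPy} to read the waveform are assumptions. Only the use of SoX, the
\emph{python\_speech\_features} package and the thirteen tab separated
features per line are taken from the description above.

\begin{verbatim}
```python
import csv
import subprocess


def sox_command(src, dst, rate=16000):
    """Build a SoX call that writes `src` as mono audio at `rate` Hz.

    The 16000 Hz default is a placeholder (assumption); the text only
    speaks of "the correct sample rate".
    """
    return ["sox", src, "-c", "1", "-r", str(rate), dst]


def convert_to_mono(src, dst, rate=16000):
    """Run SoX; requires the `sox` binary to be installed."""
    subprocess.run(sox_command(src, dst, rate), check=True)


def extract_mfcc(wav_path, numcep=13):
    """Return one row of `numcep` MFCCs per analysis frame.

    The third-party imports are local so that the file helpers below
    can be used without SciPy or python_speech_features installed.
    """
    import scipy.io.wavfile as wav
    from python_speech_features import mfcc

    rate, signal = wav.read(wav_path)
    return mfcc(signal, samplerate=rate, numcep=numcep)


def write_features(path, frames):
    """Write one frame per line, coefficients separated by tabs."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for frame in frames:
            writer.writerow([repr(float(value)) for value in frame])


def read_features(path):
    """Read a tab separated feature file into a list of float lists."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [[float(value) for value in row] for row in reader]
```
\end{verbatim}

In this sketch a single song would be processed by calling
\texttt{convert\_to\_mono}, then \texttt{extract\_mfcc} on the resulting file
and finally \texttt{write\_features} on the returned frames. The analysis
window parameters are left at the defaults of the package here, since the
parameters that were actually used are not specified in this section.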

\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{cement}
	\caption{A vocal segment of the \emph{Cannibal Corpse} song
	\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{abominations}
	\caption{A vocal segment of the \emph{Disgorge} song
	\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from two\todo{more in the future}\ studio albums. The first
band is called \emph{Cannibal Corpse}; it has been producing \gls{dm} for almost
25 years and has kept to the same style on every album. The singer of
\emph{Cannibal Corpse} has a very raspy growl and the lyrics are quite
comprehensible. The second band is called \emph{Disgorge} and makes even more
violent music. The growls of the lead singer sound more like a coffee grinder
and are more shallow. The lyrics are completely incomprehensible and therefore
some parts are not annotated with lyrics because it was too difficult to hear
what was being sung.

\section{Methods}
\todo{To remove in final thesis}
The initial planning is still up to date. About one and a half albums have been
annotated and a framework for setting up experiments has been created.
Moreover, the first exploratory experiments have already been executed and the
results look promising. In April the experimental dataset will be expanded and
I will try to replicate some of the experiments from the literature to see
whether they perform similarly on \gls{dm}.
\begin{table}[ht]
	\centering
	\begin{tabular}{cl}
		\toprule
		Month & Description\\
		\midrule
		March
			& Preparing the data\\
			& Preparing an experiment platform\\
			& Literature research\\
		April
			& Running the experiments\\
			& Tuning the parameters\\
			& Exploring the possibilities for forced alignment\\
		May
			& Writing up the thesis\\
			& Possibly doing forced alignment\\
		June
			& Finishing the thesis\\
			& Wrapping up\\
		\bottomrule
	\end{tabular}
	\caption{Outline}
\end{table}

\section{Features}


\todo{Explain why MFCC and which parameters}
\todo{Spectrals might be enough, no decorrelation}

\section{Experiments}

\section{Results}


\chapter{Conclusion \& Discussion}
%Discussion section
\todo{Novelty}
\todo{Weaknesses}
\todo{Dataset is not very varied but\ldots}

\todo{Doom metal}
%Conclusion section
%Acknowledgements
%Statement on authors' contributions
%(Appendices)
\appendix
\chapter{Experimental data}\label{app:data}
\begin{table}[h]
	\centering
	\begin{tabular}{cllll}
		\toprule
		Num. & Artist & Album & Song & Duration\\
		\midrule
		00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
		01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
		02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
		03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
		04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
		05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
		06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
		07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
		08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
		09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
		10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
		11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
		12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
		13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
		14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
		15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
		16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
		17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
		18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
		19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
		20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
		21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
		\bottomrule
	\end{tabular}
	\caption{Songs used in the experiments}
\end{table}

\bibliographystyle{ieeetr}
\bibliography{asr}
\end{document}