%&asr
\usepackage[nonumberlist,acronyms]{glossaries}
\makeglossaries%
\newacronym{ANN}{ANN}{Artificial Neural Network}
\newacronym{HMM}{HMM}{Hidden Markov Model}
\newacronym{GMM}{GMM}{Gaussian Mixture Model}
\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
\newacronym{FA}{FA}{Forced alignment}
\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
\newglossaryentry{dm}{name={Death Metal},
description={is an extreme heavy metal music style with growling vocals and
pounding drums}}

\begin{document}
\frontmatter{}

\maketitleru[
course={(Automatic) Speech Recognition},
institute={Radboud University Nijmegen},
authorstext={Author:},
pagenr=1]
\listoftodos[Todo]

\tableofcontents

%Glossaries
%\glsaddall{}
%\printglossaries

\mainmatter{}
%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
%detector for singing lines. They achieve 80\% accuracy for forty 15 second
%excerpts. They mention people that wrote signal features that discriminate
%between speech and music. Neural net
%\glspl{HMM}~\cite{berenzweig_locating_2001}.
%
%In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
%polyphonic Turkish music, this might be interesting to use for heavy metal.
%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
%detection, then melody extraction, then alignment. They compare results with
%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
%specialize in long syllables in a cappella. They use \glspl{DHMM} with
%\glspl{GMM} and show that adding knowledge increases alignment (Beijing opera
%has long syllables)~\cite{dzhambazov_automatic_2016}.
%


%Introduction, leading to a clearly defined research question
\chapter{Introduction}
\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
media to digital media. The \gls{IFPI} stated that about $43\%$ of music
revenue arises from digital distribution. Another $39\%$ arises from physical
sales and the remaining $16\%$ is made through performance and synchronisation
revenues. Digital formats overtook physical formats somewhere in 2015.
Moreover, for the first time in twenty years the music industry has seen
significant growth
again.\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}

There has always been an interest in lyrics-to-music alignment, for use in, for
example, karaoke. As early as the late 1980s, karaoke machines were available
to consumers. While the lyrics of a track are almost always available, an
alignment is not, and creating one involves manual labour.

A lot of digital music distribution goes via non-official channels such as
YouTube\footnote{\url{https://youtube.com}}, on which fans of the performers
often accompany the music with synchronized lyrics. This means that there is an
enormous treasure of lyrics-annotated music available, but it is not within our
reach since the subtitles are almost always hardcoded into the video stream and
thus not directly usable as data. Because of this interest it is very useful to
devise automatic techniques for segmenting a song into instrumental and vocal
parts, and for applying forced alignment or even lyrics recognition to the
audio file.

Such techniques are heavily researched and working systems have been created.
However, these techniques are designed to detect a clean singing voice and have
not been tested on so-called \emph{extended vocal techniques} such as grunting
or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
but it must be noted that grunting is not a technique used only in extreme
metal styles. Similar or equal techniques have been used in \emph{Beijing
opera}, Japanese \emph{Noh}, and also in more western styles such as the jazz
singing of Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be traced
back to Viking times. For example, an Arab merchant visiting a village in
Denmark wrote in the tenth century~\cite{friis_vikings_2004}:

\begin{displayquote}
Never before I have heard uglier songs than those of the Vikings in
Slesvig. The growling sound coming from their throats reminds me of dogs
howling, only more untamed.
\end{displayquote}

\section{\gls{dm}}

%Literature overview / related work
\section{Related work}
The field of applying standard speech processing techniques to music started in
the late 1990s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
was found that music has different discriminating features compared to normal
speech.

Berenzweig and Ellis expanded on the aforementioned research by trying to
separate singing from instrumental music~\cite{berenzweig_locating_2001}.

\todo{Incorporate this in literary framing}%
~\cite{fujihara_automatic_2006}%
~\cite{fujihara_lyricsynchronizer:_2011}%
~\cite{fujihara_three_2008}%
~\cite{mauch_integrating_2012}%
~\cite{mesaros_adaptation_2009}%
~\cite{mesaros_automatic_2008}%
~\cite{mesaros_automatic_2010}%
~%\cite{muller_multimodal_2012}%
~\cite{pedone_phoneme-level_2011}%
~\cite{yang_machine_2012}%



\section{Research question}
It is debatable whether the aforementioned techniques work on growled vocals,
because the spectral properties of a growling voice differ from those of a
clean singing voice. It has been found that growling voices have less prominent
peaks in the frequency representation and are closer to noise than clean
singing~\cite{kato_acoustic_2013}. This leads us to the research question:

\begin{center}\em%
Are standard \gls{ANN}-based techniques for singing voice detection
suitable for non-standard musical genres like \gls{dm}?
\end{center}

\chapter{Methods}
%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono channel waveform with the
correct sample rate using
\emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. The resulting waveforms
are then converted to \gls{MFCC} vectors using the
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
All these steps combined result in one file per source file, containing
thirteen tab-separated features per line. Every file is annotated using
Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
the audio. Examples of utterances are shown in
Figures~\ref{fig:bloodstained} and~\ref{fig:abominations}. It is clearly visible
that even within the genre of death metal there are different spectral patterns.
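
As a rough illustration of the feature extraction step described above, the
following Python sketch turns a mono waveform into lines of thirteen
tab-separated coefficients. It is not the script used for this work: the
function names \texttt{format\_rows} and \texttt{wav\_to\_mfcc\_lines}, the
number of printed decimals and the use of \texttt{scipy} to read the waveform
are assumptions made for the example. Only the \texttt{mfcc} function and its
\texttt{samplerate} and \texttt{numcep} parameters come from
\emph{python\_speech\_features}; all other parameters are left at the package
defaults, which may differ from the settings used here.

```python
def format_rows(rows, ndigits=6):
    """Turn an iterable of feature vectors into tab-separated text lines."""
    return ["\t".join(f"{value:.{ndigits}f}" for value in row) for row in rows]


def wav_to_mfcc_lines(wav_path, numcep=13):
    """Compute MFCC vectors for a mono WAV file, one text line per vector.

    Assumption: wav_path points to a mono waveform that has already been
    converted to the desired sample rate (for example with SoX).
    """
    # Third-party imports are done lazily so that format_rows can be used
    # (and tested) without scipy or python_speech_features installed.
    from scipy.io import wavfile
    from python_speech_features import mfcc

    samplerate, signal = wavfile.read(wav_path)
    # numcep=13 gives the thirteen coefficients per line mentioned in the text.
    features = mfcc(signal, samplerate=samplerate, numcep=numcep)
    return format_rows(features)


# Hypothetical usage, writing one feature file per source file:
#
#     with open("song.mfcc", "w") as handle:
#         handle.write("\n".join(wav_to_mfcc_lines("song.wav")) + "\n")
```

Keeping the text formatting separate from the feature computation means the
output format can be checked without any audio data at hand.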

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
\caption{A vocal segment of the \emph{Cannibal Corpse} song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
\caption{A vocal segment of the \emph{Disgorge} song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from two\todo{more in the future}\ studio albums. The first
band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
25 years, creating the same type of music on every album. The singer of
\emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
comprehensible. The second band is called \emph{Disgorge} and makes even more
violent music. The growls of the lead singer sound more like a coffee grinder
and are more shallow. The lyrics are completely incomprehensible and therefore
some parts are not annotated with lyrics, because it was too difficult to hear
what was being sung.

\section{Methods}
\todo{To remove in final thesis}
The initial planning is still up to date. About one and a half albums have been
annotated and a framework for setting up experiments has been created.
Moreover, the first exploratory experiments have already been executed and look
promising. In April the experimental dataset will be expanded and I will try to
replicate some of the experiments from the literature to see whether they
perform similarly on Death Metal.
\begin{table}[ht]
\centering
\begin{tabular}{cll}
\toprule
Month & Description\\
\midrule
March
& Preparing the data\\
& Preparing an experiment platform\\
& Literature research\\
April
& Running the experiments\\
& Fiddle with parameters\\
& Explore the possibilities for forced alignment\\
May
& Write up the thesis\\
& Possibly do forced alignment\\
June
& Finish up thesis\\
& Wrap up\\
\bottomrule
\end{tabular}
\caption{Outline}
\end{table}

\todo{Explain why MFCC and which parameters}
\todo{Spectrals might be enough, no decorrelation}

\section{Experiments}

\section{Results}


\chapter{Conclusion \& Discussion}
%Discussion section
\todo{Novelty}
\todo{Weaknesses}
\todo{Dataset is not very varied but\ldots}

\todo{Doom metal}
%Conclusion section
%Acknowledgements
%Statement on authors' contributions
%(Appendices)
\appendix
\chapter{Experimental data}\label{app:data}
\begin{table}[h]
\centering
\begin{tabular}{cllll}
\toprule
Num. & Artist & Album & Song & Duration\\
\midrule
00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
\bottomrule
\end{tabular}
\caption{Songs used in the experiments}
\end{table}

\bibliographystyle{ieeetr}
\bibliography{asr}
\end{document}