%&asr
\usepackage[nonumberlist,acronyms]{glossaries}
\makeglossaries%
\newacronym{ANN}{ANN}{Artificial Neural Network}
\newacronym{HMM}{HMM}{Hidden Markov Model}
\newacronym{GMM}{GMM}{Gaussian Mixture Model}
\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
\newacronym{FA}{FA}{Forced alignment}
\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
\newglossaryentry{dm}{name={Death Metal},
  description={is an extreme heavy metal music style with growling vocals and
  pounding drums}}

\begin{document}
\frontmatter{}

\maketitleru[
  course={(Automatic) Speech Recognition},
  institute={Radboud University Nijmegen},
  authorstext={Author:},
  pagenr=1]
\listoftodos[Todo]

\tableofcontents

%Glossaries
%\glsaddall{}
%\printglossaries

\mainmatter{}
%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
%detector for singing lines. They achieve 80\% accuracy for forty 15 second
%excerpts. They mention people that wrote signal features that discriminate
%between speech and music. Neural net
%\glspl{HMM}~\cite{berenzweig_locating_2001}.
%
%In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
%polyphonic Turkish music, this might be interesting to use for heavy metal.
%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
%detection, then melody extraction, then alignment. They compare results with
%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
%specialize in long syllables in a cappella. They use \glspl{DHMM} with
%\glspl{GMM} and show that adding knowledge increases alignment (Beijing opera
%has long syllables)~\cite{dzhambazov_automatic_2016}.
%


%Introduction, leading to a clearly defined research question
\chapter{Introduction}
\section{Introduction}
The \gls{IFPI} states that about $43\%$ of music revenue comes from digital
distribution. Digital formats overtook physical formats somewhere in 2015, and
for the first time in twenty years the music industry has seen significant
growth.\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}

A lot of this music is distributed via non-official channels such as
YouTube\footnote{\url{https://youtube.com}}, where fans of a musical group
accompany the music with synchronized lyrics so that users can sing or read
along. Because of this interest it is very useful to devise automatic
techniques for segmenting a song into instrumental and vocal parts and for
applying forced alignment or even lyrics recognition to the audio file.

Such techniques are heavily researched and working systems have been created.
However, these techniques are designed to detect a clean singing voice. Extreme
genres such as \gls{dm} use more extreme vocal techniques such as grunting or
growling. It must be noted that grunting is not used only in extreme metal
styles. Similar or identical techniques are used in \emph{Beijing opera} and
Japanese \emph{Noh}, but also in more western styles such as the jazz singing
of Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back
to Viking times. An Arab merchant wrote in the tenth
century~\cite{friis_vikings_2004}:

\begin{displayquote}
  Never before I have heard uglier songs than those of the Vikings in
  Slesvig. The growling sound coming from their throats reminds me of dogs
  howling, only more untamed.
\end{displayquote}

%A majority of the music is not only instrumental but also contains vocal
%segments.
%
%Music is a leading type of data distributed on the internet. Regular music
%distribution is almost entirely digital and services like Spotify and YouTube
%allow one to listen to almost any song within a few clicks. Moreover, there are
%myriads of websites offering lyrics of songs.
%
%\todo{explain relevancy, (preprocessing for lyric alignment)}
%
%This leads to the following research question:
%\begin{center}\em%
%  Are standard \gls{ANN} based techniques for singing voice detection
%  suitable for non-standard musical genres like Death metal.
%\end{center}

%Literature overview / related work
\section{Related work}
The application of standard speech processing techniques to music started in
the late 90s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
was found that music has different discriminating features than normal
speech.

Berenzweig and Ellis expanded on the aforementioned research by trying to
separate singing from instrumental music~\cite{berenzweig_locating_2001}.

\todo{Incorporate this in literary framing}%
~\cite{fujihara_automatic_2006}%
~\cite{fujihara_lyricsynchronizer:_2011}%
~\cite{fujihara_three_2008}%
~\cite{mauch_integrating_2012}%
~\cite{mesaros_adaptation_2009}%
~\cite{mesaros_automatic_2008}%
~\cite{mesaros_automatic_2010}%
~%\cite{muller_multimodal_2012}%
~\cite{pedone_phoneme-level_2011}%
~\cite{yang_machine_2012}%



\section{Research question}
It is debatable whether the aforementioned techniques work for \gls{dm},
because the spectral properties of a growling voice differ from those of a
clean singing voice. It has been found that growling voices have less
prominent peaks in the frequency representation and are closer to noise than
clean singing~\cite{kato_acoustic_2013}. This leads us to the research
question:

\begin{center}\em%
  Are standard \gls{ANN} based techniques for singing voice detection
  suitable for non-standard musical genres like \gls{dm}?
\end{center}

\chapter{Methods}
%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono channel waveform with the
correct sample rate using
\emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. The resulting
waveforms are converted to \gls{MFCC} vectors using the
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
Combined, these steps result in one file per source file, with thirteen
tab-separated features per line. Every file is annotated using
Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
the audio. Examples of utterances are shown in
Figures~\ref{fig:bloodstained} and~\ref{fig:abominations}. It is clearly
visible that within the genre of death metal there are many different spectral
patterns.

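The feature files and the manual annotations described above can be combined
into one vocal/non-vocal label per feature line. The Python sketch below is
illustrative only and is not the code used for the experiments. The function
names \texttt{read\_features} and \texttt{frame\_labels} are hypothetical, the
annotations are assumed to be available as (start, end) intervals in seconds,
and the frame step of 10\,ms is an assumption (it is the default step of the
\texttt{mfcc} function in \emph{python\_speech\_features}, but the step
actually used is not stated here).

```python
import io


def read_features(handle, n_features=13):
    """Parse one feature vector per line; values are tab-separated floats."""
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        values = [float(v) for v in line.split("\t")]
        if len(values) != n_features:
            raise ValueError(
                "expected %d features, got %d" % (n_features, len(values)))
        rows.append(values)
    return rows


def frame_labels(intervals, n_frames, step_ms=10):
    """Return 1 for frames that start inside a vocal interval, else 0.

    intervals: (start, end) pairs in seconds (assumed annotation export).
    step_ms:   assumed distance between frame starts, in milliseconds.
    """
    labels = [0] * n_frames
    for start_s, end_s in intervals:
        # Work in whole milliseconds to avoid floating point comparisons.
        start_ms = round(start_s * 1000)
        end_ms = round(end_s * 1000)
        first = -(-start_ms // step_ms)  # first frame starting at or after start
        last = -(-end_ms // step_ms)     # first frame starting at or after end
        for i in range(max(first, 0), min(last, n_frames)):
            labels[i] = 1
    return labels


if __name__ == "__main__":
    # Two made-up feature lines stand in for a real feature file.
    demo = io.StringIO(
        "\t".join(["0.5"] * 13) + "\n" + "\t".join(["1.5"] * 13) + "\n")
    frames = read_features(demo)
    print(len(frames), frame_labels([(0.0, 0.01)], len(frames)))
```

A frame is labelled by its start time only. Labelling by the frame centre or
by majority overlap would be equally defensible and only affects frames at
interval boundaries.
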
\begin{figure}[ht]
  \centering
  \includegraphics[width=.7\linewidth]{cement}
  \caption{A vocal segment of the \emph{Cannibal Corpse} song
  \emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
  \centering
  \includegraphics[width=.7\linewidth]{abominations}
  \caption{A vocal segment of the \emph{Disgorge} song
  \emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from two\todo{more in the future}\ studio albums. The first
band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
25 years, creating the same type of music on every album. The singer of
\emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
comprehensible. The second band is called \emph{Disgorge} and makes even more
violent music. The growls of the lead singer sound more like a coffee grinder
and are more shallow. The lyrics are completely incomprehensible and therefore
some parts are not annotated with lyrics because it was too difficult to hear
what was being sung.

\section{Methods}
\todo{To remove in final thesis}
The initial planning is still up to date. About one and a half albums have been
annotated and a framework for setting up experiments has been created.
Moreover, the first exploratory experiments have already been executed and
look promising. In April the experimental dataset will be expanded and I will
try to replicate some of the experiments from the literature to see whether
they perform similarly on Death Metal.
\begin{table}[ht]
  \centering
  \begin{tabular}{cl}
    \toprule
    Month & Description\\
    \midrule
    March
    & Preparing the data\\
    & Preparing an experiment platform\\
    & Literature research\\
    April
    & Running the experiments\\
    & Fiddling with parameters\\
    & Exploring the possibilities for forced alignment\\
    May
    & Writing up the thesis\\
    & Possibly doing forced alignment\\
    June
    & Finishing up the thesis\\
    & Wrapping up\\
    \bottomrule
  \end{tabular}
  \caption{Outline}
\end{table}

\todo{Explain why MFCC and which parameters}
\todo{Spectrals might be enough, no decorrelation}

\section{Experiments}

\section{Results}


\chapter{Conclusion \& Discussion}
%Discussion section
\todo{Novelty}
\todo{Weaknesses}
\todo{Dataset is not very varied but\ldots}

\todo{Doom metal}
%Conclusion section
%Acknowledgements
%Statement on authors' contributions
%(Appendices)
\appendix
\chapter{Experimental data}\label{app:data}
\begin{table}[h]
  \centering
  \begin{tabular}{cllll}
    \toprule
    Num. & Artist & Album & Song & Duration\\
    \midrule
    00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
    01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
    02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
    03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
    04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
    05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
    06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
    07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
    08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
    09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
    10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
    11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
    12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
    13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
    14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
    15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
    16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
    17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
    18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
    19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
    20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
    21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
    \bottomrule
  \end{tabular}
  \caption{Songs used in the experiments}
\end{table}

\bibliographystyle{ieeetr}
\bibliography{asr}
\end{document}