%&asr
\usepackage[nonumberlist,acronyms]{glossaries}
%\makeglossaries%
\newacronym{ANN}{ANN}{Artificial Neural Network}
\newacronym{HMM}{HMM}{Hidden Markov Model}
\newacronym{GMM}{GMM}{Gaussian Mixture Model}
\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
\newacronym{FA}{FA}{Forced alignment}
\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
\newglossaryentry{dm}{name={Death Metal},
	description={is an extreme heavy metal music style with growling vocals and
	pounding drums}}
\newglossaryentry{dom}{name={Doom Metal},
	description={is an extreme heavy metal music style with growling vocals and
	pounding drums played very slowly}}
\newglossaryentry{FT}{name={Fourier Transform},
	description={is a technique for converting a signal from its time
	representation to a frequency representation}}
\newglossaryentry{MS}{name={Mel-Scale},
	description={is a scale for spectral signals inspired by the human ear}}

\begin{document}
\frontmatter{}

\maketitleru[
	course={(Automatic) Speech Recognition},
	institute={Radboud University Nijmegen},
	authorstext={Author:},
	pagenr=1]
\listoftodos[Todo]

\tableofcontents

%Glossaries
%\glsaddall{}
%\printglossaries

\mainmatter{}
%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
%detector for singing lines. They achieve 80\% accuracy for forty 15 second
%excerpts. They mention people that wrote signal features that discriminate
%between speech and music. Neural net
%\glspl{HMM}~\cite{berenzweig_locating_2001}.
%
%In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
%polyphonic Turkish music, this might be interesting to use for heavy metal.
%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
%detection, then melody extraction, then alignment. They compare results with
%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
%specialize in long syllables in a cappella. They use \glspl{DHMM} with
%\glspl{GMM} and show that adding knowledge increases alignment (Beijing opera
%has long syllables)~\cite{dzhambazov_automatic_2016}.
%


%Introduction, leading to a clearly defined research question
\chapter{Introduction}
\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
media to digital media. The \gls{IFPI} stated that about $43\%$ of music
revenue arises from digital distribution. Another $39\%$ arises from
physical sales and about $16\%$ is made through performance and
synchronisation revenues. Digital formats overtook physical formats
somewhere in 2015. Moreover, for the first time in twenty years the music
industry has seen significant growth
again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

There has always been an interest in lyrics-to-music alignment, for example
for use in karaoke. As early as the late 1980s karaoke machines were
available to consumers. While the lyrics of a track are almost always
available, an alignment is not, and creating one involves manual labour.

A lot of this musical distribution goes via non-official channels such as
YouTube\footnote{\url{https://youtube.com}} in which fans of the performers
often accompany the music with synchronized lyrics. This means that there is an
enormous treasure of lyrics-annotated music available, but it is not within our
reach since the subtitles are almost always hardcoded into the video stream and
thus not directly usable as data. Because of this interest it is very useful to
devise automatic techniques for segmenting instrumental and vocal parts of a
song and for applying forced alignment or even lyrics recognition to the audio
file.

Such techniques are heavily researched and working systems have been created.
However, these techniques are designed to detect a clean singing voice and have
not been tested on so-called \emph{extended vocal techniques} such as grunting
or growling. Growling is heavily used in extreme metal genres such as \gls{dm},
but it must be noted that grunting is not a technique used only in extreme
metal styles. Similar or equal techniques have been used in \emph{Beijing
opera}, Japanese \emph{Noh} and also in more western styles such as jazz singing
by Louis Armstrong~\cite{sakakibara_growl_2004}. It might even be traced back
to Viking times. For example, an Arab merchant visiting a village in Denmark
wrote in the tenth century~\cite{friis_vikings_2004}:

\begin{displayquote}
	Never before I have heard uglier songs than those of the Vikings in
	Slesvig. The growling sound coming from their throats reminds me of dogs
	howling, only more untamed.
\end{displayquote}

\section{\gls{dm}}

%Literature overview / related work
\section{Related work}
The field of applying standard speech processing techniques to music started in
the late 1990s~\cite{saunders_real-time_1996,scheirer_construction_1997} and it
was found that music has different discriminating features compared to normal
speech.

Berenzweig and Ellis expanded on the aforementioned research by trying to
separate singing from instrumental music~\cite{berenzweig_locating_2001}.

\todo{Incorporate this in literary framing}%
~\cite{fujihara_automatic_2006}%
~\cite{fujihara_lyricsynchronizer:_2011}%
~\cite{fujihara_three_2008}%
~\cite{mauch_integrating_2012}%
~\cite{mesaros_adaptation_2009}%
~\cite{mesaros_automatic_2008}%
~\cite{mesaros_automatic_2010}%
~%\cite{muller_multimodal_2012}%
~\cite{pedone_phoneme-level_2011}%
~\cite{yang_machine_2012}%



\section{Research question}
It is debatable whether the aforementioned techniques work, because the
spectral properties of a growling voice are different from those of a clean
singing voice. It has been found that growling voices
have less prominent peaks in the frequency representation and are closer to
noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
research question:

\begin{center}\em%
	Are standard \gls{ANN} based techniques for singing voice detection
	suitable for non-standard musical genres like \gls{dm}?
\end{center}

\chapter{Methods}
%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono channel waveform with the
correct sample rate using
\emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
Every file is annotated using
Praat~\cite{boersma_praat_2002}, where the utterances are manually aligned to
the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, where the
waveform, the $1$--$8000$\,Hz spectrogram and the annotations are shown. It is
clearly visible that different spectral patterns occur within the genre of
death metal.

\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{cement}
	\caption{A vocal segment of the \emph{Cannibal Corpse} song
	\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{abominations}
	\caption{A vocal segment of the \emph{Disgorge} song
	\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The
first band is called \emph{Cannibal Corpse}; it has been producing \gls{dm} for
almost 25 years and has been creating the same type of music on every album.
The singer of \emph{Cannibal Corpse} has very raspy growls and the lyrics are
quite comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
regular shouting.

The second band is called \emph{Disgorge} and makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are more
shallow. In the spectrogram it is clearly visible that overtones are
produced during some parts of the growling. The lyrics are completely
incomprehensible and therefore some parts were not annotated with the actual
lyrics because it was not possible to determine what was being sung.

Lastly, a band from Moscow is chosen, bearing the name \emph{Who Dies in
Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
bands because it creates \gls{dom}. \gls{dom} is characterized by a very
slow tempo and low-tuned guitars. The vocalist has a very characteristic growl
and performs in several Moscow-based bands. This band also stands out because
it uses pianos and synthesizers. The droning synthesizers often operate in the
same frequency range as the vocals.

\section{\gls{MFCC} Features}
The waveforms themselves are not very suitable as features due to their
high dimensionality and correlation. Therefore we use the often-used
\gls{MFCC} feature vectors.\todo{cite which papers use this} The actual
conversion is done using the \emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are nature-inspired and are built incrementally in several
steps.
\begin{enumerate}
	\item The first step in the process is converting the time representation
		of the signal to a spectral representation using a sliding window with
		overlap. The width of the window and the step size are two important
		parameters in the system. In classical phonetic analysis window sizes
		of $25$\,ms with a step of $10$\,ms are often chosen because they are
		small enough to contain only subphone entities. Singing a sound for
		only $25$\,ms is impossible, so it is arguable that this window size is
		very small for sung material.
	\item The standard \gls{FT} gives a spectral representation that has
		linearly scaled frequencies. This scale is converted to the \gls{MS}
		using triangular overlapping windows.
	\item
\end{enumerate}
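
To make the first two steps more concrete, the following is a rough sketch
under stated assumptions, not a description of the exact settings used in the
experiments. The symbols $N$ (number of samples), $f_s$ (sample rate), $w$
(window length in samples), $s$ (step in samples) and $T$ (number of frames)
are introduced here purely for illustration. With $w = 0.025\,f_s$ and
$s = 0.010\,f_s$, and assuming that the last partial window is zero-padded,
one common convention for the number of frames is
\[
	T = 1 + \left\lceil \frac{N - w}{s} \right\rceil \qquad (N > w).
\]
For an illustrative sample rate of $f_s = 16\,\mathrm{kHz}$ (an assumption;
the sample rate actually used is not stated above) this gives $w = 400$ and
$s = 160$, so one minute of audio ($N = 960000$) yields
$T = 1 + \lceil 5997.5 \rceil = 5999$ frames.

For the second step, a widely used closed form of the mel scale, one of
several variants in the literature and assumed here only as an example, maps
a frequency $f$ in Hz to
\[
	m(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right),
\]
so that $1000\,\mathrm{Hz}$ corresponds to approximately $1000$ mel. The
triangular windows are then spaced evenly in $m$ rather than in $f$, which
makes them narrow at low frequencies and wide at high frequencies.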


\todo{Explain why MFCC and which parameters}

\section{\gls{ANN} Classifier}
\todo{Spectrals might be enough, no decorrelation}

\section{Model training}

\section{Experiments}

\section{Results}


\chapter{Conclusion \& Discussion}
\section{Conclusion}
%Discussion section

\section{Discussion}

\todo{Novelty}
\todo{Weaknesses}
\todo{Dataset is not very varied but\ldots}

\todo{Doom metal}
%Conclusion section
%Acknowledgements
%Statement on authors' contributions
%(Appendices)
\appendix
\chapter{Experimental data}\label{app:data}
\begin{table}[h]
	\centering
	\begin{tabular}{cllll}
		\toprule
		Num. & Artist & Album & Song & Duration\\
		\midrule
		00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
		01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
		02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
		03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
		04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
		05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
		06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
		07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
		08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
		09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
		10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
		11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
		12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
		13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
		14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
		15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
		16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
		17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
		18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
		19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
		20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
		21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
		22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
		23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
		24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
		25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
		26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
		27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
		28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
		\midrule
		& & & Total: & 02:13:40\\
		\bottomrule
	\end{tabular}
	\caption{Songs used in the experiments}
\end{table}

\bibliographystyle{ieeetr}
\bibliography{asr}
\end{document}