%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono-channel waveform with the
correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
Every file is annotated using
Praat~\cite{boersma_praat_2002}, with the utterances manually aligned to
the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which show the
waveform, the $1$--$8000$\,Hz spectrogram and the annotations. The figures
clearly show that different spectral patterns occur within the genre of
death metal.
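
What the conversion to a mono channel amounts to can be illustrated with a
small sketch. The actual conversion was done by \emph{SoX}; the function below
is only an assumed illustration of one common downmix (averaging the channels
of interleaved samples), and \texttt{downmix\_to\_mono} is a hypothetical
helper name, not part of any tool used here.

```python
def downmix_to_mono(interleaved, channels=2):
    """Average every group of `channels` interleaved samples into one sample.

    Assumption: samples are interleaved per frame (L, R, L, R, ...), as in a
    stereo CD rip. This is an illustration only, not SoX's implementation.
    """
    if channels < 1 or len(interleaved) % channels != 0:
        raise ValueError("sample count must be a multiple of the channel count")
    return [
        sum(interleaved[i:i + channels]) / channels
        for i in range(0, len(interleaved), channels)
    ]


stereo = [100, 200, -50, 50]  # two stereo frames: (100, 200) and (-50, 50)
mono = downmix_to_mono(stereo)
```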

\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{cement}
	\caption{A vocal segment of the \emph{Cannibal Corpse} song
	\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{abominations}
	\caption{A vocal segment of the \emph{Disgorge} song
	\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The
first band is called \emph{Cannibal Corpse}; it has been producing \gls{dm} for
almost 25 years and has kept the same style on every album. The singer of
\emph{Cannibal Corpse} has very raspy growls and the lyrics are quite
comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
regular shouting.

The second band is called \emph{Disgorge} and makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are more
shallow. In the spectrograms it is clearly visible that overtones are
produced during some parts of the growling. The lyrics are completely
incomprehensible and therefore some parts were not annotated with the actual
lyrics, because it was not possible to make out what was being sung.

Lastly, a band from Moscow is chosen bearing the name \emph{Who Dies in
Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
bands because they create \gls{dom}. \gls{dom} is characterized by a very
slow tempo and low-tuned guitars. The vocalist has a very characteristic growl
and performs in several Muscovite bands. This band also stands out because it
uses pianos and synthesizers. The droning synthesizers often operate in the
same frequency range as the vocals.

\section{\gls{MFCC} Features}
The waveforms themselves are not very suitable as features because of their
high dimensionality and correlation. Therefore we use the widely used
\gls{MFCC} feature vectors, which have been shown to be
suitable~\cite{rocamora_comparing_2007}. It has also been found that altering
the mel scale to better suit singing does not yield better
performance~\cite{you_comparative_2015}. The actual conversion is done using the
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are inspired by human auditory perception and are built
incrementally in several steps.
\begin{enumerate}
	\item The first step in the process is converting the time representation
		of the signal to a spectral representation using a sliding window with
		overlap. The width of the window and the step size are two important
		parameters in the system. In classical phonetic analysis, window sizes
		of $25$\,ms with a step of $10$\,ms are often chosen because they are
		small enough to contain only subphone entities. It is impossible to
		sing a sound that lasts only $25$\,ms, so this window size is arguably
		very small for singing.
	\item The standard \gls{FT} gives a spectral representation that has
		linearly scaled frequencies. This scale is converted to the \gls{MS}
		using triangular overlapping windows.
	\item The log is taken of the energies in the Mel frequency bands. This
		step is inspired by the \emph{Weber--Fechner} law that describes how
		humans perceive physical magnitudes.\footnote{Fechner, Gustav Theodor
		(1860). \emph{Elemente der Psychophysik}.}
	\item To decorrelate the signal, a \gls{DCT} is applied. The \gls{MFCC}
		features are then the amplitudes of the resulting spectrum.
\end{enumerate}
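
The four steps above can be sketched in plain Python. This is a minimal
illustration using only the standard library, not the implementation of
\emph{python\_speech\_features}: it omits pre-emphasis, a tapering window
function and liftering. The function names, the 26 filters, the 13
coefficients and the $2595\log_{10}(1 + f/700)$ mel formula are assumptions
chosen for illustration.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert Hz to mel (assumed variant: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def frame_signal(signal, rate, win_len=0.025, win_step=0.010):
    """Step 1a: cut the signal into overlapping windows (25 ms, 10 ms step)."""
    size = int(round(win_len * rate))
    step = int(round(win_step * rate))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def power_spectrum(frame):
    """Step 1b: naive DFT power spectrum for bins 0 .. N/2 (slow but simple)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec


def mel_filterbank(n_filters, n_bins, rate):
    """Step 2: triangular overlapping filters, evenly spaced on the mel scale."""
    top = hz_to_mel(rate / 2.0)
    points = [mel_to_hz(top * i / (n_filters + 2 - 1)) for i in range(n_filters + 2)]
    # Filter edges and centres expressed as fractional spectrum bin indices.
    bins = [p / (rate / 2.0) * (n_bins - 1) for p in points]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising edge
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, n_out):
    """Step 4: unnormalised DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_out)]


def mfcc(signal, rate, n_filters=26, n_ceps=13):
    """Return one list of n_ceps coefficients per frame (assumed defaults)."""
    feats = []
    bank = None
    for frame in frame_signal(signal, rate):
        spec = power_spectrum(frame)
        if bank is None:
            bank = mel_filterbank(n_filters, len(spec), rate)
        energies = [sum(w * s for w, s in zip(row, spec)) for row in bank]
        # Step 3: log of the band energies; the offset avoids log(0).
        log_energies = [math.log(e + 1e-10) for e in energies]
        feats.append(dct2(log_energies, n_ceps))
    return feats
```

For a $0.1$\,s signal sampled at $8000$\,Hz, the $25$\,ms window spans 200
samples and the $10$\,ms step 80 samples, so \texttt{mfcc} returns 8 frames of
13 coefficients each.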

\section{\gls{ANN} Classifier}
\todo{Spectrals might be enough, no decorrelation}

\section{Model training}

\section{Experiments}

\section{Results}
