%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data was collected from several \gls{dm} albums.
The exact data used is available in Appendix~\ref{app:data}. The albums are
extracted from the audio CD and converted to a mono-channel waveform with the
correct sample rate using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
Every file is annotated using
Praat\cite{boersma_praat_2002}, where the utterances are manually aligned to
the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, where the
waveform, the $1$--$8000$\,Hz spectrogram and the annotations are shown. It is
clearly visible that different spectral patterns occur within the genre of
death metal.

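A minimal sketch of this conversion step is given below, assuming a target
sample rate of $16$\,kHz (the actual rate is not stated here) and a
hypothetical directory layout:

\begin{verbatim}
import subprocess
from pathlib import Path

RATE = 16000  # assumed target sample rate

for track in Path("albums").glob("*.flac"):  # hypothetical input layout
    out = Path("wav") / (track.stem + ".wav")
    # sox: resample to RATE (-r) and mix down to one channel (-c 1)
    subprocess.run(
        ["sox", str(track), "-r", str(RATE), "-c", "1", str(out)],
        check=True)
\end{verbatim}
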
\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
\caption{A vocal segment of the \emph{Cannibal Corpse} song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
\caption{A vocal segment of the \emph{Disgorge} song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The
first band, \emph{Cannibal Corpse}, has been producing \gls{dm} for
almost 25 years and has been creating the same type of music on every album.
The singer of \emph{Cannibal Corpse} has a very raspy growl and the lyrics are
quite comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
regular shouting.

The second band, \emph{Disgorge}, makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are
shallower. In the spectrograms it is clearly visible that overtones are
produced during some parts of the growling. The lyrics are completely
incomprehensible, and therefore some parts were not annotated with the actual
lyrics because it was not possible to tell what was being sung.

Lastly, a band from Moscow bearing the name \emph{Who Dies in
Siberian Slush} was chosen. This band is a little odd compared to the previous
\gls{dm} bands because they create \gls{dom}. \gls{dom} is characterized by
very slow tempos and low-tuned guitars. The vocalist has a very characteristic
growl and performs in several Muscovite bands. This band also stands out
because it uses pianos and synthesizers. The droning synthesizers often
operate in the same frequency range as the vocals.

The training and test data are divided as follows:
\begin{table}[H]
\centering
\begin{tabular}{cc}
\toprule
Singing & Instrumental\\
\midrule
0.59 & 0.41\\
\bottomrule
\end{tabular}
\quad
\begin{tabular}{lccc}
\toprule
Instrumental & CC & DG & WDISS\\
\midrule
\bottomrule
\end{tabular}
\end{table}

\section{\gls{MFCC} Features}
The waveforms themselves are not very suitable to be used as features due to
their high dimensionality and correlation. Therefore we use the widely used
\gls{MFCC} feature vectors, which have been shown to be
suitable\cite{rocamora_comparing_2007}. It has also been found that altering
the Mel scale to better suit singing does not yield better
performance\cite{you_comparative_2015}. The actual conversion is done using the
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are nature inspired and are built up incrementally in
several steps, sketched in code after the list.
\begin{enumerate}
\item The first step in the process is converting the time representation
of the signal to a spectral representation using a sliding window with
overlap. The width of the window and the step size are two important
parameters in the system. In classical phonetic analysis, window sizes
of $25ms$ with a step of $10ms$ are often chosen because they are small
enough to only contain subphone entities. Singing a sound for only $25ms$
is impossible, so it is arguable that this window size is very small for
singing.
\item The standard \gls{FT} gives a spectral representation that has
linearly scaled frequencies. This scale is converted to the \gls{MS}
using triangular overlapping windows.
\item The logarithm is taken of the Mel frequencies. This step is inspired
by the \emph{Weber-Fechner} law that describes how humans perceive
physical magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
Psychophysik}.
\item To decorrelate the signal, a \gls{DCT} is applied. The \gls{MFCC}
features are then the amplitudes of the resulting spectrum.
\end{enumerate}
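
The Mel conversion in step 2 commonly uses the mapping
$m = 2595\log_{10}\left(1 + \frac{f}{700}\right)$. Below is a minimal sketch
of the whole extraction using the \emph{python\_speech\_features} package and
the window parameters named in step 1; the file name is hypothetical.

\begin{verbatim}
from scipy.io import wavfile
from python_speech_features import mfcc

rate, signal = wavfile.read("wav/track01.wav")  # hypothetical file

# 25 ms windows with a 10 ms step, as described in step 1; 13 cepstral
# coefficients is the package default.
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01, numcep=13)
print(features.shape)  # (number of frames, 13)
\end{verbatim}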

\section{\gls{ANN} Classifier}
\todo{Spectrograms might be enough, no decorrelation}

106 \section{Experiments}
107 \subsection{\emph{Singing} voice detection}
108 The first type of experiment conducted is \emph{Singing} voice detection. This
109 is the act of segmenting an audio signal into segments that are labeled either
110 as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
111 feature vector and the output is the probability that singing is happening in
112 the sample.
113
\begin{figure}[H]
\centering
\includegraphics[width=.5\textwidth]{bcann}
\caption{Binary classifier network architecture}\label{fig:bcann}
\end{figure}

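A minimal sketch of such a binary classifier is given below, assuming a
single hidden layer and Keras as the framework; the layer sizes are
hypothetical and the actual architecture is the one shown in
Figure~\ref{fig:bcann}.

\begin{verbatim}
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

N_MFCC = 13  # assumed feature vector size, matching the extraction above

# Hypothetical layer sizes; see the figure above for the real layout.
model = Sequential([
    Dense(64, activation="relu", input_shape=(N_MFCC,)),
    Dense(1, activation="sigmoid"),  # P(singing) for one frame
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
\end{verbatim}
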
\subsection{\emph{Singer} voice detection}
The second type of experiment conducted is \emph{Singer} voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label.

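Extending the same hedged sketch to the \emph{Singer} task, the single
sigmoid output is replaced by a softmax over the three singers plus the
instrumental label; layer sizes remain hypothetical:

\begin{verbatim}
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

N_MFCC = 13    # assumed feature vector size
N_SINGERS = 3  # CC, DG and WDISS

# One output per singer plus one for the instrumental label.
model = Sequential([
    Dense(64, activation="relu", input_shape=(N_MFCC,)),
    Dense(N_SINGERS + 1, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
\end{verbatim}
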
\section{Results}
