%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CDs and converted to a mono-channel waveform at the
correct sample rate using \emph{SoX}%
\footnote{\url{http://sox.sourceforge.net/}}. Every file is annotated using
Praat~\cite{boersma_praat_2002}, in which the utterances are manually aligned
to the audio. Examples of utterances are shown in Figure~\ref{fig:bloodstained}
and Figure~\ref{fig:abominations}, where the waveform, the $1$--$8000$\,Hz
spectrogram and the annotations are shown. It is clearly visible that, within
the genre of death metal, different spectral patterns occur over time.

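As a minimal sketch of the conversion step, the \emph{SoX} invocation below
downmixes to one channel and resamples; the $16$\,kHz target rate, the
directory layout and the file names are assumptions and not taken from the
experiments.
\begin{verbatim}
import subprocess
from pathlib import Path

# Convert each extracted track to a mono waveform (-c 1) at a fixed
# sample rate (-r 16000). Paths and rate are illustrative only.
for track in Path("albums").glob("**/*.flac"):
    out = Path("wav") / (track.stem + ".wav")
    subprocess.run(["sox", str(track), "-c", "1", "-r", "16000", str(out)],
                   check=True)
\end{verbatim}
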
\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{cement}
	\caption{A vocal segment of the \emph{Cannibal Corpse} song
		\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
	\centering
	\includegraphics[width=.7\linewidth]{abominations}
	\caption{A vocal segment of the \emph{Disgorge} song
		\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band is
\emph{Cannibal Corpse}, which has been producing \gls{dm} for almost 25 years
and has kept to the same style on every album. The singer of \emph{Cannibal
Corpse} has a very raspy growl and the lyrics are quite comprehensible. The
vocals produced by \emph{Cannibal Corpse} border on regular shouting.

The second band is \emph{Disgorge}, which makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are
shallower. In the spectrograms it is clearly visible that overtones are
produced during some parts of the growling. The lyrics are completely
incomprehensible and therefore some parts were not annotated with the actual
lyrics, because it was not possible to make out what was being sung.

Lastly, a band from Moscow named \emph{Who Dies in Siberian Slush} is chosen.
This band is a little odd compared to the previous \gls{dm} bands because it
creates \gls{dom}. \gls{dom} is characterized by its very slow tempo and
low-tuned guitars. The vocalist has a very characteristic growl and performs
in several Muscovite bands. This band also stands out because it uses pianos
and synthesizers. The droning synthesizers often operate in the same frequency
range as the vocals.

The training and test data are divided as follows:
\begin{table}[H]
	\centering
	\begin{tabular}{cc}
		\toprule
		Singing & Instrumental\\
		\midrule
		0.59 & 0.41\\
		\bottomrule
	\end{tabular}
	\quad
	\begin{tabular}{cccc}
		\toprule
		Instrumental & CC & DG & WDISS\\
		\midrule
		0.59 & 0.16 & 0.19 & 0.06\\
		\bottomrule
	\end{tabular}
\end{table}

\section{\acrlong{MFCC} Features}
The waveforms in themselves are not very suitable as features due to their
high dimensionality and correlation. Therefore the widely used \gls{MFCC}
feature vectors are chosen, which have been shown to be
suitable~\cite{rocamora_comparing_2007}. It has also been found that altering
the mel scale to better suit singing does not yield better
performance~\cite{you_comparative_2015}. The actual conversion is done using
the \emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
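
A minimal extraction sketch is shown below; the file name is an example and
the window parameters are the package defaults, which are varied later in the
experimental setup.
\begin{verbatim}
from scipy.io import wavfile
from python_speech_features import mfcc

# Read the mono waveform produced by SoX (file name is an example).
rate, signal = wavfile.read('track.wav')

# Thirteen values per frame: the zeroth coefficient is replaced by the
# log frame energy, leaving twelve cepstral values plus the energy.
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,
                numcep=13, appendEnergy=True)
# 'features' is a (number of frames, 13) array.
\end{verbatim}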

\gls{MFCC} features are inspired by human auditory processing and are built
incrementally in several steps.
\begin{enumerate}
	\item The first step in the process is converting the time representation
		of the signal to a spectral representation using a sliding window with
		overlap. The width of the window and the step size are two important
		parameters in the system. In classical phonetic analysis, window sizes
		of $25\,ms$ with a step of $10\,ms$ are often chosen because they are
		small enough to only contain subphone entities. Since it is impossible
		to sing anything meaningful within $25\,ms$, it is arguable that this
		window size is very small for singing.
	\item The standard \gls{FT} gives a spectral representation that has
		linearly scaled frequencies. This scale is converted to the \gls{MS}
		using triangular overlapping windows to get a more tonotopic
		representation, trying to match the actual representation in the
		cochlea of the human ear (a common conversion formula is shown below
		this list).
	\item The \emph{Weber--Fechner} law\footnote{Fechner, Gustav Theodor
		(1860). Elemente der Psychophysik} describes how humans perceive
		physical magnitudes: energy is perceived in logarithmic increments.
		This means that twice the amount of decibels does not mean twice the
		amount of perceived loudness. Therefore, in this step the logarithm is
		taken of the energy or amplitude of the \gls{MS} frequency spectrum to
		closer match human hearing.
	\item The amplitudes of the spectrum are highly correlated and therefore
		the last step is a decorrelation step. The \gls{DCT} is applied to the
		amplitudes, interpreted as a signal. The \gls{DCT} is a technique for
		describing a signal as a combination of several primitive cosine
		functions.
\end{enumerate}
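
As an illustration of the second step, a commonly used formula for converting
a frequency $f$ in Hertz to the \gls{MS} is the following (this particular
formulation is the conventional one and is given here only as an example):
\begin{equation}\label{eq:mel}
	m = 2595\log_{10}\left(1 + \frac{f}{700}\right)
\end{equation}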

The default number of \gls{MFCC} parameters is twelve. However, often a
thirteenth value is added that represents the energy in the data.

\section{Experimental setup}
\subsection{Features}
The thirteen \gls{MFCC} features are chosen as input to the classifier. The
parameters of the \gls{MFCC} features are varied in window step and window
length. The default speech-processing parameters are tested, but also bigger
window sizes, since arguably the minimal length of a singing-voice segment is
a lot bigger than the minimal length of the subphone components for which the
default parameters are tuned. The parameters chosen are as follows:

\begin{table}[H]
	\centering
	\begin{tabular}{lll}
		\toprule
		step (ms) & length (ms) & notes\\
		\midrule
		10 & 25 & Standard speech processing\\
		40 & 100 &\\
		80 & 200 &\\
		\bottomrule
	\end{tabular}
	\caption{\Gls{MFCC} parameter settings}
\end{table}
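
The three settings can be generated in a loop such as the sketch below. For
the longer windows the FFT size must be increased beyond the package default
of 512 samples, otherwise frames are truncated; the file name and the loop are
illustrative only.
\begin{verbatim}
from scipy.io import wavfile
from python_speech_features import mfcc

rate, signal = wavfile.read('track.wav')   # example file name

# (step, length) in seconds, matching the table above.
settings = [(0.010, 0.025), (0.040, 0.100), (0.080, 0.200)]
for winstep, winlen in settings:
    nfft = 1
    while nfft < winlen * rate:            # smallest power of two that
        nfft *= 2                          # holds one analysis frame
    feats = mfcc(signal, samplerate=rate, winlen=winlen,
                 winstep=winstep, numcep=13, nfft=nfft,
                 appendEnergy=True)
\end{verbatim}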

\subsection{\emph{Singing} voice detection}
The first type of experiment conducted is \emph{Singing} voice detection. This
is the task of segmenting an audio signal into segments that are labeled
either as \emph{Singing} or as \emph{Instrumental}. The input of the
classifier is a feature vector and the output is the probability that singing
occurs in the sample. This results in an \gls{ANN} of the shape described in
Figure~\ref{fig:bcann}. The input dimension is thirteen and the output
dimension is one.

\subsection{\emph{Singer} voice detection}
The second type of experiment conducted is \emph{Singer} voice detection. This
is the task of segmenting an audio signal into segments that are labeled
either with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label. This results in an
\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input
dimension is again thirteen and the output dimension is the number of
categories. The output is encoded using one-hot encoding, meaning that the
categories are labeled as \texttt{1000, 0100, 0010, 0001}.
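
In Keras this encoding can be produced from integer class labels with
\texttt{to\_categorical}; the ordering of the labels below is an assumption
made only for illustration.
\begin{verbatim}
from keras.utils import to_categorical

# 0 = Instrumental, 1 = CC, 2 = DG, 3 = WDISS (ordering is an assumption)
labels = [0, 1, 2, 3, 0]
onehot = to_categorical(labels, num_classes=4)
# onehot[1] == [0., 1., 0., 0.]
\end{verbatim}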

\subsection{\acrlong{ANN}}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary and four-class, so it is
interesting to see where the bottleneck lies and how far the abstraction can
be pushed. The \gls{ANN} is built with Keras\footnote{\url{https://keras.io}}
using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
backend, which provides a high-level interface to the underlying networks.

The general architecture of the networks is shown in Figure~\ref{fig:bcann}
and Figure~\ref{fig:mcann} for the binary and the multiclass classification
respectively. The inputs are fully connected to the hidden layer, which is
fully connected to the output layer. The activation function used in the
hidden layer is a \gls{RELU}. The \gls{RELU} function is a monotonic one-sided
function that is also known as the ramp function; its definition is given in
Equation~\ref{eq:relu}. \gls{RELU} was chosen because of its simplicity and
efficient computation. The activation function between the hidden layer and
the output layer is the sigmoid function in the case of binary classification,
of which the definition is shown in Equation~\ref{eq:sigmoid}. The sigmoid is
a monotonic function that is differentiable for all values of $x$ and always
yields a non-negative derivative. For the multiclass classification the
softmax function is used between the hidden layer and the output layer.
Softmax is an activation function suitable for multiple output nodes. Its
definition is given in Equation~\ref{eq:softmax}.

The data is shuffled before being fed to the network to mitigate the risk of
overfitting on a single album. Every model was trained for $10$ epochs with a
batch size of $32$.

\begin{equation}\label{eq:relu}
	f(x) = \left\{\begin{array}{rcl}
		0 & \text{for} & x < 0\\
		x & \text{for} & x \geq 0\\
	\end{array}\right.
\end{equation}

\begin{equation}\label{eq:sigmoid}
	f(x) = \frac{1}{1+e^{-x}}
\end{equation}

\begin{equation}\label{eq:softmax}
	\delta{(\boldsymbol{z})}_j = \frac{e^{z_j}}{\sum\limits^{K}_{k=1}e^{z_k}}
\end{equation}

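A minimal sketch of the two network variants is given below. The hidden layer
size, the optimizer and the exact loss functions are not stated here, so the
Adam optimizer and the standard cross-entropy losses are assumptions; the
\texttt{features} and \texttt{labels} arrays stand for the \gls{MFCC} frames
and their annotations.
\begin{verbatim}
from keras.models import Sequential
from keras.layers import Dense

def build_model(hidden, n_out):
    # Thirteen MFCC inputs, one hidden RELU layer, sigmoid or softmax output.
    model = Sequential()
    model.add(Dense(hidden, activation='relu', input_dim=13))
    if n_out == 1:                        # binary: singing vs. instrumental
        model.add(Dense(1, activation='sigmoid'))
        loss = 'binary_crossentropy'
    else:                                 # multiclass: one node per category
        model.add(Dense(n_out, activation='softmax'))
        loss = 'categorical_crossentropy'
    # The Adam optimizer is an assumption; the text does not name one.
    model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])
    return model

# Shuffled training for 10 epochs with a batch size of 32, as described above.
model = build_model(hidden=13, n_out=1)
model.fit(features, labels, epochs=10, batch_size=32, shuffle=True)
\end{verbatim}
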
\begin{figure}[H]
	\begin{subfigure}{.5\textwidth}
		\centering
		\includegraphics[width=.8\linewidth]{bcann}
		\caption{Binary classifier network architecture}\label{fig:bcann}
	\end{subfigure}%
	%
	\begin{subfigure}{.5\textwidth}
		\centering
		\includegraphics[width=.8\linewidth]{mcann}
		\caption{Multiclass classifier network architecture}\label{fig:mcann}
	\end{subfigure}
	\caption{\acrlong{ANN} architectures.}
\end{figure}

\section{Results}
\subsection{\emph{Singing} voice detection}

\begin{table}[H]
	\centering
	\begin{tabular}{rccc}
		\toprule
		& \multicolumn{3}{c}{Parameters (step/length)}\\
		& 10/25 & 40/100 & 80/200\\
		\midrule
		3h & 0.86 (0.34) & 0.87 (0.32) & 0.85 (0.35)\\
		5h & 0.87 (0.31) & 0.88 (0.30) & 0.87 (0.32)\\
		8h & 0.88 (0.30) & 0.88 (0.31) & 0.88 (0.29)\\
		13h & 0.89 (0.28) & 0.89 (0.29) & 0.88 (0.30)\\
		\bottomrule
	\end{tabular}
	\caption{Binary classification results (accuracy (loss))}
\end{table}

\begin{figure}[H]
	\centering
	\includegraphics[width=.6\linewidth]{bclass}
	\caption{The classifier output plotted under the audio signal}
\end{figure}

\subsection{\emph{Singer} voice detection}

\begin{table}[H]
	\centering
	\begin{tabular}{rccc}
		\toprule
		& \multicolumn{3}{c}{Parameters (step/length)}\\
		& 10/25 & 40/100 & 80/200\\
		\midrule
		3h & 0.83 (0.48) & 0.82 (0.48) & 0.82 (0.48)\\
		5h & 0.85 (0.43) & 0.84 (0.44) & 0.84 (0.44)\\
		8h & 0.86 (0.41) & 0.86 (0.39) & 0.86 (0.40)\\
		13h & 0.87 (0.37) & 0.87 (0.38) & 0.86 (0.39)\\
		\bottomrule
	\end{tabular}
	\caption{Multiclass classification results (accuracy (loss))}
\end{table}

\subsection{Alien data}