%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To answer the research question, several experiments have been performed. Data
has been collected from several \gls{dm} and \gls{dom} albums. The exact data
used is available in Appendix~\ref{app:data}. The albums are extracted from the
audio CD and converted to a mono channel waveform with the correct sample rate
utilizing \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. Every file
is annotated using Praat~\cite{boersma_praat_2002}, in which the lyrics are
manually aligned to the audio. Examples of utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which display
the waveform, the $1$--$8000$\,Hz spectrogram and the annotations. They clearly
show that within the genre of death metal different spectral patterns occur
over time.
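
To make this preprocessing step concrete, a minimal sketch of the conversion is
given below. It calls \emph{SoX} from Python; the target sample rate of
16\,kHz and the file names are illustrative assumptions, as the exact rate is
not prescribed here.

\begin{verbatim}
import subprocess

def to_mono_wav(src, dst, rate=16000):
    # Convert an album track to a mono waveform at a fixed sample rate
    # using SoX. The 16 kHz rate and the file names are assumptions
    # chosen for illustration only.
    subprocess.run(["sox", src, "-c", "1", "-r", str(rate), dst],
                   check=True)

to_mono_wav("track01.flac", "track01.wav")
\end{verbatim}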

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
\caption{A vocal segment of the Cannibal Corpse song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
\caption{A vocal segment of the Disgorge song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band, \gls{CC}, has
been producing \gls{dm} for almost 25 years and has created albums with a
consistent style. The singer of \gls{CC} has a very raspy growl and the lyrics
are quite comprehensible. The vocals produced by \gls{CC} are very close to
regular shouting.

The second band, \gls{DG}, makes even more violent-sounding music. The growls
of the lead singer sound like a coffee grinder and are less full. In the
spectrograms it is clearly visible that overtones are produced during some
parts of the growling. The lyrics are completely incomprehensible; therefore
some parts were not annotated with the actual lyrics because it was impossible
to hear what was being sung.

The third band, bearing the name \gls{WDISS}, originates from Moscow. This
band is a little odd compared to the previous \gls{dm} bands because it
creates \gls{dom}. \gls{dom} is characterized by a very slow tempo and
low-tuned guitars. The vocalist has a very characteristic growl and performs
in several Muscovite bands. This band also stands out because it uses pianos
and synthesizers. The droning synthesizers often operate in the same frequency
range as the vocals.

Additional details about the dataset are listed in Appendix~\ref{app:data}.
The data is labeled as either singing or instrumental, and is additionally
labeled per band. The resulting distribution is shown in
Table~\ref{tbl:distribution}.
\begin{table}[H]
\centering
\begin{tabular}{cc}
\toprule
Instrumental & Singing\\
\midrule
0.59 & 0.41\\
\bottomrule
\end{tabular}
\quad
\begin{tabular}{cccc}
\toprule
Instrumental & \gls{CC} & \gls{DG} & \gls{WDISS}\\
\midrule
0.59 & 0.16 & 0.19 & 0.06\\
\bottomrule
\end{tabular}
\caption{Data distribution: singing versus instrumental (left) and the same
data split per band (right)}\label{tbl:distribution}
\end{table}

\section{Mel-frequency Cepstral Features}
The waveforms in themselves are not very suitable to be used as features due
to their high dimensionality and correlation in the temporal domain. Therefore
we use the widely used \glspl{MFCC} feature vectors, which have been shown to
be suitable for speech processing~\cite{rocamora_comparing_2007}. It has also
been found that altering the mel scale to better suit singing does not yield
better performance~\cite{you_comparative_2015}. The actual conversion is done
using the \emph{python\_speech\_features}\footnote{\url{%
https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are inspired by human auditory processing and are
created from a waveform incrementally in several steps (a minimal sketch of
these steps is shown after the list):
\begin{enumerate}
\item The first step in the process is converting the time representation
of the signal to a spectral representation using a sliding analysis
window with overlap. The width of the window and the step size are two
important parameters in the system. In classical phonetic analysis
window sizes of $25\,ms$ with a step of $10\,ms$ are often chosen because
they are small enough to contain just one subphone event. A sung segment
is much longer than $25\,ms$, so it might be necessary to increase the
window size.
\item The standard \gls{FT} gives a spectral representation with
linearly scaled frequencies. This scale is converted to the \gls{MS}
using triangular overlapping windows, yielding a more tonotopic
representation that better matches how frequencies are represented in
the cochlea of the human ear.
\item The \emph{Weber-Fechner} law describes how humans perceive physical
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
Psychophysik}. It states that intensity is perceived in logarithmic
increments: twice the amount of energy does not result in twice the
amount of perceived loudness. Therefore we take the logarithm of the
energy or amplitude of the \gls{MS} spectrum to more closely match
human hearing.
\item The amplitudes of the spectrum are highly correlated and therefore
the last step is a decorrelation step. The \gls{DCT} is applied to the
amplitudes, interpreted as a signal. The \gls{DCT} is a technique for
describing a signal as a combination of several primitive cosine
functions.
\end{enumerate}
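
As a minimal illustration of these four steps, the sketch below computes the
coefficients for a single pre-framed analysis window. The \gls{FT} size, the
number of mel filters and the use of \texttt{get\_filterbanks} from the
\emph{python\_speech\_features} package to build the triangular filters are
assumptions made for the sake of the example.

\begin{verbatim}
import numpy as np
from scipy.fftpack import dct
from python_speech_features.base import get_filterbanks

def mfcc_for_frame(frame, samplerate=16000, nfft=512, nfilt=26, numcep=13):
    # Step 1: spectral representation of one analysis window.
    power = np.abs(np.fft.rfft(frame, n=nfft)) ** 2 / nfft
    # Step 2: warp the linear frequency axis onto the mel scale with
    # triangular overlapping filters (shape: nfilt x (nfft/2 + 1)).
    filters = get_filterbanks(nfilt=nfilt, nfft=nfft, samplerate=samplerate)
    mel_energies = np.dot(filters, power)
    # Step 3: logarithmic compression (Weber-Fechner).
    log_energies = np.log(mel_energies + np.finfo(float).eps)
    # Step 4: decorrelate with the discrete cosine transform and keep
    # the first numcep coefficients.
    return dct(log_energies, type=2, norm='ortho')[:numcep]
\end{verbatim}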

The default number of \gls{MFCC} parameters is twelve. However, often a
thirteenth value is added that represents the energy in the analysis window.
Here $c_0$ is chosen: the zeroth \gls{MFCC}, which represents the overall
energy in the \gls{MS}. Another option would be $\log{(E)}$, the logarithm of
the raw energy of the sample.
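
In practice the whole chain is performed by a single call to the package. A
minimal sketch is shown below; the choice of \texttt{appendEnergy=False} is an
assumption that matches keeping $c_0$ as described above, since with
\texttt{appendEnergy=True} the package replaces $c_0$ by the logarithm of the
raw frame energy.

\begin{verbatim}
import scipy.io.wavfile as wav
from python_speech_features import mfcc

# Read the mono waveform produced in the preprocessing step
# (the file name is a placeholder).
(rate, signal) = wav.read("track01.wav")

# Thirteen coefficients per frame, keeping c_0 as the first one.
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,
                numcep=13, appendEnergy=False)
print(features.shape)  # (number of frames, 13)
\end{verbatim}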

\section{Artificial Neural Network}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary or four-class problems, so it is
interesting to see where the bottleneck lies and how abstract the
representation can be made. The \gls{ANN} is built with
Keras\footnote{\url{https://keras.io}} using the
TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}} backend;
Keras provides a high-level interface to the underlying network
implementation.

The general architecture of the networks is shown in Figure~\ref{fig:bcann}
and Figure~\ref{fig:mcann} for the binary and multiclass classification
respectively. The inputs are fully connected to the hidden layer, which is
fully connected to the output layer. The activation function used in the
hidden layer is a \gls{RELU}. The \gls{RELU} function is a monotonic one-sided
function that is also known as the ramp function. The definition is given in
Equation~\ref{eq:relu}. \gls{RELU} has the downside that it can create
unreachable (dead) nodes in a deep network. This is not a problem in this
network since it only has one hidden layer. \gls{RELU} was also chosen because
it is efficient to compute and biologically inspired.

The activation function between the hidden layer and the output layer is the
sigmoid function in the case of binary classification; its definition is shown
in Equation~\ref{eq:sigmoid}. The sigmoid is a monotonic function that is
differentiable for all values of $x$ and always yields a positive derivative.
For multiclass classification the softmax function is used between the hidden
layer and the output layer. Softmax is an activation function suitable for
multiple output nodes; its definition is given in Equation~\ref{eq:softmax}.

The data is shuffled before being fed to the network to mitigate the risk of
overfitting on one album. Every model was trained for $10$ epochs, which means
that all training data is offered to the model $10$ times. The training and
test sets are created by splitting off a $90\%$ slice of all the data for
training and keeping the remaining $10\%$ for testing.
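
A minimal Keras sketch of the binary set-up of Figure~\ref{fig:bcann} and the
training procedure described above is given below. The number of hidden nodes,
the optimizer and the use of \texttt{validation\_split} to hold out the $10\%$
test slice are illustrative assumptions, not choices prescribed by this text.

\begin{verbatim}
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# x: MFCC feature vectors (frames x 13), y: 0 = instrumental, 1 = singing.
# The file names are placeholders.
x = np.load("features.npy")
y = np.load("labels.npy")

model = Sequential()
# One hidden layer with ReLU, fully connected to a single sigmoid output.
model.add(Dense(13, activation='relu', input_dim=13))  # hidden size assumed
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',                        # optimizer assumed
              loss='binary_crossentropy', metrics=['accuracy'])

# Shuffle the data, hold out 10% and train for 10 epochs.
model.fit(x, y, epochs=10, shuffle=True, validation_split=0.1)
\end{verbatim}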

\begin{equation}\label{eq:relu}
f(x) = \left\{\begin{array}{rcl}
0 & \text{for} & x<0\\
x & \text{for} & x \geq 0\\
\end{array}\right.
\end{equation}

\begin{equation}\label{eq:sigmoid}
f(x) = \frac{1}{1+e^{-x}}
\end{equation}

\begin{equation}\label{eq:softmax}
\delta{(\boldsymbol{z})}_j = \frac{e^{z_j}}{\sum\limits^{K}_{k=1}e^{z_k}}
\end{equation}

\begin{figure}[H]
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{bcann}
\caption{Binary classifier network architecture}\label{fig:bcann}
\end{subfigure}%
%
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{mcann}
\caption{Multiclass classifier network architecture}\label{fig:mcann}
\end{subfigure}
\caption{Artificial Neural Network architectures.}
\end{figure}

\section{Experimental setup}
\subsection{Features}
The thirteen \gls{MFCC} features are used as the input. The parameters of the
\gls{MFCC} extraction are varied in window step and window length. The default
speech processing parameters are tested, but also bigger window sizes, since
arguably the minimal length of a singing voice segment is a lot larger than
the minimal length of the subphone component for which the default parameters
are tuned. The parameters chosen are as follows:

\begin{table}[H]
\centering
\begin{tabular}{lll}
\toprule
step (ms) & length (ms) & notes\\
\midrule
10 & 25 & Standard speech processing\\
40 & 100 &\\
80 & 200 &\\
\bottomrule
\end{tabular}
\caption{\Gls{MFCC} parameter settings}
\end{table}
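
The same extraction call can be repeated for each setting. A short sketch is
given below, with the step and length expressed in seconds as expected by
\emph{python\_speech\_features}.

\begin{verbatim}
from python_speech_features import mfcc

# (step, length) in seconds, matching the rows of the table above.
SETTINGS = [(0.010, 0.025),   # standard speech processing
            (0.040, 0.100),
            (0.080, 0.200)]

def extract_all(signal, rate):
    # One (frames x 13) feature matrix per parameter setting.
    return [mfcc(signal, samplerate=rate,
                 winstep=step, winlen=length, numcep=13)
            for (step, length) in SETTINGS]
\end{verbatim}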

\subsection{\emph{Singing}-voice detection}
The first type of experiment conducted is \emph{Singing}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
feature vector and the output is the probability that singing occurs in the
sample. This results in an \gls{ANN} of the shape described in
Figure~\ref{fig:bcann}: the input dimension is thirteen and the output
dimension is one.

\subsection{\emph{Singer}-voice detection}
The second type of experiment conducted is \emph{Singer}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label. This results in an
\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input
dimension is again thirteen and the output dimension is the number of
categories. The output is one-hot encoded, meaning that the categories are
labeled as \texttt{1000, 0100, 0010, 0001}.
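
A sketch of this multiclass set-up is given below. The \texttt{to\_categorical}
helper from Keras produces exactly this one-hot encoding; the label order, the
hidden-layer size and the optimizer are again illustrative assumptions.

\begin{verbatim}
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

# y: integer class labels 0..3 (label order assumed: instrumental,
# CC, DG, WDISS); to_categorical turns them into the one-hot vectors
# 1000, 0100, 0010 and 0001. The file names are placeholders.
x = np.load("features.npy")
y = to_categorical(np.load("labels.npy"), num_classes=4)

model = Sequential()
model.add(Dense(13, activation='relu', input_dim=13))  # hidden size assumed
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x, y, epochs=10, shuffle=True, validation_split=0.1)
\end{verbatim}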