%Methodology

%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To answer the research question, several experiments have been performed. Data
has been collected from several \gls{dm} and \gls{dom} albums; the exact data
used is listed in Appendix~\ref{app:data}. The albums are extracted from the
audio CDs and converted to a mono waveform with the correct sample rate
using \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. Every file
is annotated using Praat~\cite{boersma_praat_2002}, in which the lyrics are
manually aligned to the audio. Example utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, which display
the waveform, the $1$--$8000\,$Hz spectrogram and the annotations. They clearly
show that, within the genre of death metal, different spectral patterns occur
over time.
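
The conversion can, for example, be scripted as in the minimal sketch below,
which calls \emph{SoX} from Python. The target sample rate of $16\,$kHz and the
file names are illustrative assumptions; only the mono conversion and
resampling themselves are prescribed above.
\begin{verbatim}
# Minimal sketch: convert a ripped track to a mono waveform at a fixed
# sample rate with SoX (the 16 kHz rate and file names are assumptions).
import subprocess

def to_mono_wav(src, dst, rate=16000):
    # -r sets the output sample rate, -c 1 mixes down to a single channel
    subprocess.run(["sox", src, "-r", str(rate), "-c", "1", dst], check=True)

to_mono_wav("rips/track01.flac", "data/track01_mono.wav")
\end{verbatim}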

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{cement}
\caption{A vocal segment of the \acrlong{CC} song
\emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
\centering
\includegraphics[width=.7\linewidth]{abominations}
\caption{A vocal segment of the \acrlong{DG} song
\emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band, \gls{CC}, has
been producing \gls{dm} for almost 25 years and has created albums with a
consistent style. The singer of \gls{CC} has a very raspy growl and the lyrics
are quite comprehensible. The vocals produced by \gls{CC} are very close to
regular shouting.

The second band, \gls{DG}, makes even more violent-sounding music. The growls
of the lead singer resemble a coffee grinder and sound less full. The
spectrograms clearly show that overtones are produced during some parts of the
growling. The lyrics are completely incomprehensible; some parts were therefore
not annotated with the actual lyrics because it was impossible to hear what was
being sung.

The third band, \gls{WDISS}, originates from Moscow. It is a little odd
compared to the previous \gls{dm} bands because it creates \gls{dom}, a genre
characterized by its very slow tempo and low-tuned guitars. The vocalist has a
very characteristic growl and performs in several Muscovite bands. This band
also stands out because it uses pianos and synthesizers. The droning
synthesizers often operate in the same frequency range as the vocals.

Additional details about the dataset are listed in Appendix~\ref{app:data}.
The data is labeled as singing or instrumental, and additionally per band. The
resulting distribution is shown in Table~\ref{tbl:distribution}.
\begin{table}[H]
\centering
\begin{tabular}{cc}
\toprule
Instrumental & Singing\\
\midrule
0.59 & 0.41\\
\bottomrule
\end{tabular}
\quad
\begin{tabular}{cccc}
\toprule
Instrumental & \gls{CC} & \gls{DG} & \gls{WDISS}\\
\midrule
0.59 & 0.16 & 0.19 & 0.06\\
\bottomrule
\end{tabular}
\caption{Data distribution}\label{tbl:distribution}
\end{table}

\section{\acrlong{MFCC} Features}
The waveforms themselves are not very suitable as features due to their high
dimensionality and the correlation in the temporal domain. Therefore we use the
widely used \glspl{MFCC} feature vectors, which have been shown to be suitable
for speech processing~\cite{rocamora_comparing_2007}. It has also been found
that altering the mel scale to better suit singing does not yield better
performance~\cite{you_comparative_2015}. The actual conversion is done using
the \emph{python\_speech\_features}\footnote{\url{https://github.com/jameslyons/python_speech_features}}
package.

\gls{MFCC} features are inspired by human auditory processing and are
created from a waveform incrementally in several steps:
\begin{enumerate}
\item The first step is converting the time representation of the signal
to a spectral representation using a sliding analysis window with
overlap. The width of the window and the step size are two important
parameters of the system. In classical phonetic analysis a window size
of $25\,$ms with a step of $10\,$ms is often chosen because it is small
enough to contain just one subphone event. Singing for only $25\,$ms is
impossible, so it might be necessary to increase the window size.
\item The standard \gls{FT} gives a spectral representation with linearly
scaled frequencies. This scale is converted to the \gls{MS} using
triangular overlapping windows, yielding a more tonotopic
representation that better matches the frequency representation of the
cochlea in the human ear.
\item The \emph{Weber-Fechner} law describes how humans perceive physical
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
Psychophysik}. Weber and Fechner found that energy is perceived in
logarithmic increments: twice the amount of energy does not result in
twice the perceived loudness. Therefore the logarithm of the energy or
amplitude of the \gls{MS} spectrum is taken to more closely match human
hearing.
\item The amplitudes of the spectrum are highly correlated, so the last
step is a decorrelation step. The \gls{DCT} is applied to the
amplitudes, interpreted as a signal. The \gls{DCT} is a technique for
describing a signal as a combination of several primitive cosine
functions.
\end{enumerate}

The default number of \gls{MFCC} parameters is twelve. However, often a
thirteenth value is added that represents the energy in the analysis window.
Here $c_0$ is chosen: the zeroth \gls{MFCC}, which represents the overall
energy in the \gls{MS}. Another option would be $\log{(E)}$, the logarithm of
the raw energy of the sample.
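
As an illustration, the minimal sketch below extracts these features with the
\emph{python\_speech\_features} package named above. The file name is an
assumption, and \texttt{appendEnergy=False} keeps $c_0$ among the thirteen
coefficients instead of replacing it with $\log{(E)}$.
\begin{verbatim}
# Minimal sketch: thirteen MFCCs per analysis window (file name is assumed).
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("data/track01_mono.wav")   # mono waveform from SoX
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.01,    # 25 ms window, 10 ms step
                numcep=13,                     # c_0 .. c_12
                appendEnergy=False)            # keep c_0 rather than log(E)
# features has shape (number of windows, 13)
\end{verbatim}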

\section{\acrlong{ANN}}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary and four-class, so it is
interesting to see where the bottleneck lies: how abstract can the abstraction
be made? The \gls{ANN} is built with Keras\footnote{\url{https://keras.io}}
using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
backend, which provides a high-level interface to the underlying networks.

The general architecture of the networks is shown in Figure~\ref{fig:bcann}
and Figure~\ref{fig:mcann} for the binary and the multiclass classification
respectively. The inputs are fully connected to the hidden layer, which is
fully connected to the output layer. The activation function used in the
hidden layer is a \gls{RELU}. The \gls{RELU} function is a monotonic one-sided
function that is also known as the ramp function; its definition is given in
Equation~\ref{eq:relu}. \gls{RELU} was chosen because of its simplicity and
efficient computation. The activation function between the hidden layer and the
output layer is the sigmoid function in the case of binary classification, of
which the definition is shown in Equation~\ref{eq:sigmoid}. The sigmoid is a
monotonic function that is differentiable for all values of $x$ and always
yields a non-negative derivative. For the multiclass classification the softmax
function is used between the hidden layer and the output layer. Softmax is an
activation function suitable for multiple output nodes; its definition is given
in Equation~\ref{eq:softmax}.

The data is shuffled before being fed to the network to mitigate the risk of
overfitting on one album. Every model was trained for $10$ epochs with a batch
size of $32$. The training and test sets are separated by using a $90\%$ slice
of all the data for training and the remaining $10\%$ for testing.

\begin{equation}\label{eq:relu}
f(x) = \left\{\begin{array}{rcl}
0 & \text{for} & x<0\\
x & \text{for} & x \geq 0\\
\end{array}\right.
\end{equation}

\begin{equation}\label{eq:sigmoid}
f(x) = \frac{1}{1+e^{-x}}
\end{equation}

\begin{equation}\label{eq:softmax}
\delta{(\boldsymbol{z})}_j = \frac{e^{z_j}}{\sum\limits^{K}_{k=1}e^{z_k}}
\end{equation}

\begin{figure}[H]
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{bcann}
\caption{Binary classifier network architecture}\label{fig:bcann}
\end{subfigure}%
%
\begin{subfigure}{.5\textwidth}
\centering
\includegraphics[width=.8\linewidth]{mcann}
\caption{Multiclass classifier network architecture}\label{fig:mcann}
\end{subfigure}
\caption{\acrlong{ANN} architectures.}
\end{figure}
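
To make the setup concrete, the sketch below builds the binary network of
Figure~\ref{fig:bcann} in Keras. The hidden-layer size and the optimizer are
not specified above and are assumptions; the multiclass variant only differs in
its output layer (four softmax nodes with a categorical cross-entropy loss).
\begin{verbatim}
# Minimal sketch of the binary classifier (hidden size and optimizer assumed).
from keras.models import Sequential
from keras.layers import Dense

def build_binary_model(input_dim=13, hidden=64):
    model = Sequential()
    model.add(Dense(hidden, activation="relu", input_dim=input_dim))  # ReLU
    model.add(Dense(1, activation="sigmoid"))                         # sigmoid
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# X: (n, 13) MFCC vectors, y: 0/1 singing labels, already shuffled.
# split = int(0.9 * len(X))
# model = build_binary_model()
# model.fit(X[:split], y[:split], epochs=10, batch_size=32)
# model.evaluate(X[split:], y[split:])
\end{verbatim}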

\section{Experimental setup}
\subsection{Features}
The thirteen \gls{MFCC} features are used as the input. The parameters of the
\gls{MFCC} extraction are varied in window step and window length. The default
speech processing parameters are tested, but also bigger window sizes, since
arguably the minimal duration of a singing-voice segment is much longer than
the minimal duration of the subphone component on which the default parameters
are tuned. The parameters chosen are as follows:

\begin{table}[H]
\centering
\begin{tabular}{lll}
\toprule
step (ms) & length (ms) & notes\\
\midrule
10 & 25 & Standard speech processing\\
40 & 100 &\\
80 & 200 &\\
\bottomrule
\end{tabular}
\caption{\Gls{MFCC} parameter settings}
\end{table}
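
The sketch below shows how these three settings can be passed to the
extraction. Growing \texttt{nfft} to cover the window is an added assumption:
with the package defaults, frames longer than the FFT size would be truncated.
\begin{verbatim}
# Sketch: extract MFCCs for the three (step, length) settings in the table.
import scipy.io.wavfile as wav
from python_speech_features import mfcc

rate, signal = wav.read("data/track01_mono.wav")
settings = [(0.010, 0.025),   # standard speech processing
            (0.040, 0.100),
            (0.080, 0.200)]   # (step, length) in seconds

for step, length in settings:
    nfft = 512
    while nfft < int(length * rate):  # next power of two covering the window
        nfft *= 2
    feats = mfcc(signal, samplerate=rate, winstep=step, winlen=length,
                 numcep=13, appendEnergy=False, nfft=nfft)
\end{verbatim}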

\subsection{\emph{Singing}-voice detection}
The first type of experiment conducted is \emph{Singing}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
feature vector and the output is the probability that the sample contains
singing. This results in an \gls{ANN} of the shape described in
Figure~\ref{fig:bcann}. The input dimension is thirteen and the output
dimension is one.

\subsection{\emph{Singer}-voice detection}
The second type of experiment conducted is \emph{Singer}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label. This results in an
\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input dimension
is again thirteen and the output dimension is the number of categories. The
output uses one-hot encoding, meaning that the categories are labeled as
\texttt{1000, 0100, 0010, 0001}.
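
A small sketch of this encoding with the Keras utility function is shown below;
the integer order of the classes (Instrumental, \gls{CC}, \gls{DG},
\gls{WDISS}) is an assumption.
\begin{verbatim}
# Sketch of the one-hot target encoding; the class order is assumed.
import numpy as np
from keras.utils import to_categorical

labels = np.array([0, 1, 2, 3])   # Instrumental, CC, DG, WDISS (assumed)
print(to_categorical(labels, num_classes=4))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
\end{verbatim}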