pianos and synthesizers. The droning synthesizers often operate in the same
frequency range as the vocals.
Additional details about the dataset are listed in Appendix~\ref{app:data}.
The data is labeled as either singing or instrumental, and it is also labeled
per band. The distribution is shown in Table~\ref{tbl:distribution}. A random
$10\%$ of the data is extracted as a held-out test set.
\begin{table}[H]
\centering
\begin{tabular}{lcc}
\section{\acrlong{MFCC} Features}
The waveforms themselves are not very suitable as features due to their high
dimensionality and correlation in the temporal domain. Therefore we use the
widely used \glspl{MFCC} feature vectors, which have been shown to be suitable
in speech processing~\cite{rocamora_comparing_2007}. It has also been found
that altering the mel scale to better suit singing does not yield better
performance~\cite{you_comparative_2015}. The actual conversion is done using
the \emph{python\_speech\_features}\footnote{\url{%
https://github.com/jameslyons/python_speech_features}} package.
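As an illustration, the snippet below extracts such feature vectors with this
package. It is only a sketch: the file name is hypothetical and the settings
shown are the package defaults (13 coefficients, $25ms$ windows every $10ms$),
not necessarily the exact configuration used in this work.
\begin{verbatim}
from scipy.io import wavfile
from python_speech_features import mfcc

# Hypothetical input file; any mono wav file works the same way.
(rate, signal) = wavfile.read('song.wav')

# 13 coefficients per frame, 25 ms windows taken every 10 ms.
# appendEnergy=False keeps the zeroth coefficient c0 itself;
# True would replace it with the log of the raw frame energy.
features = mfcc(signal, samplerate=rate,
                winlen=0.025, winstep=0.010,
                numcep=13, appendEnergy=False)

print(features.shape)  # (number of frames, 13)
\end{verbatim}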
\gls{MFCC} features are inspired by human auditory processing and are
created from a waveform incrementally using several steps, sketched in code
after the list:
\begin{enumerate}
    \item The first step in the process is converting the time representation
        of the signal to a spectral representation using a sliding analysis
        window with overlap. The width of the window and the step size are two
        important parameters in the system. In classical phonetic analysis,
        window sizes of $25ms$ with a step of $10ms$ are often chosen because
        they are small enough to contain just one subphone event. Singing for
        only $25ms$ is impossible, so it might be necessary to increase the
        window size.
\item The standard \gls{FT} gives a spectral representation that has
linearly scaled frequencies. This scale is converted to the \gls{MS}
using triangular overlapping windows to get a more tonotopic
        representation, trying to match the actual representation found in the
        cochlea of the human ear.
\item The \emph{Weber-Fechner} law describes how humans perceive physical
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
        Psychophysik}. Fechner found that energy is perceived in logarithmic
        increments: twice the amount of energy does not mean twice the amount
        of perceived loudness. Therefore we take the logarithm of the energy or
        amplitude of the \gls{MS} spectrum to more closely match human hearing.
\end{enumerate}
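The sketch below walks through the same three steps for a single analysis
window using plain NumPy. The window length, number of filters and FFT size
are example values only, and the mel conversion uses the common
$2595\log_{10}(1 + f/700)$ formula.
\begin{verbatim}
import numpy as np

def hz_to_mel(f):
    # Convert frequency in Hz to the mel scale.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrum(frame, rate, nfilt=26, nfft=512):
    # Step 1: spectral representation of one windowed analysis frame.
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2

    # Step 2: triangular overlapping filters spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(rate / 2.0), nfilt + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / rate).astype(int)
    fbank = np.zeros((nfilt, nfft // 2 + 1))
    for i in range(nfilt):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):
            fbank[i, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i, k] = (right - k) / max(right - centre, 1)

    # Step 3: take the logarithm of the filterbank energies (Weber-Fechner).
    return np.log(fbank @ spectrum + np.finfo(float).eps)

# Example: one 25 ms frame of noise at a 16 kHz sampling rate.
rate = 16000
frame = np.random.randn(int(0.025 * rate))
print(log_mel_spectrum(frame, rate).shape)  # -> (26,)
\end{verbatim}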
The default number of \gls{MFCC} parameters is twelve. However, often a
thirteenth value is added that represents the energy in the analysis window.
Here the coefficient $c_0$ is chosen: $c_0$ is the zeroth \gls{MFCC} and
represents the overall energy in the \gls{MS}. Another option would be
$\log{(E)}$, the logarithm of the raw energy of the frame.
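To make the difference concrete, the display below writes out both options
under the DCT-II convention that most \gls{MFCC} implementations use. The
symbols are introduced only for this illustration: $S_m$ are the $M$ mel
filterbank energies and $x_n$ are the $N$ samples in the analysis window.
\[
    c_n = \sum_{m=1}^{M} \log(S_m)
        \cos\!\left(\frac{\pi n (m - \frac{1}{2})}{M}\right),
    \qquad
    c_0 = \sum_{m=1}^{M} \log(S_m),
    \qquad
    \log(E) = \log\sum_{n=1}^{N} x_n^2
\]
In the \emph{python\_speech\_features} package this corresponds to the
\texttt{appendEnergy} flag: leaving it \texttt{False} keeps $c_0$, while
\texttt{True} substitutes the log of the raw frame energy for it.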
\section{\acrlong{ANN}}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
The classification problems are only binary and four-class, so it is
interesting to see where the bottleneck lies and how abstract the learned
representation can be. The activation function between the input layer and the
hidden layer is the \gls{RELU} function, also known as the ramp function. The
definition is given in
Equation~\ref{eq:relu}. \gls{RELU} was chosen because of its simplicity and
efficient computation. The activation function between the hidden layer and the
output layer is the sigmoid function in the case of binary classification, of
which the definition is shown in Equation~\ref{eq:sigmoid}. The sigmoid is a
monotonic function that is differentiable for all values of $x$ and always
yields a non-negative derivative. For the multiclass classification the softmax
function is used on the output layer.
\caption{\acrlong{ANN} architectures.}
\end{figure}
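For reference next to the equations referred to above, the sketch below
implements the three activation functions named in this section with NumPy;
the function names are illustrative only.
\begin{verbatim}
import numpy as np

def relu(x):
    # Ramp function: zero for negative inputs, identity for positive ones.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Monotonic, differentiable everywhere, derivative always non-negative.
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Normalises a vector of scores into class probabilities.
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / np.sum(e)

scores = np.array([1.0, -2.0, 0.5, 0.1])
print(relu(scores), sigmoid(scores), softmax(scores))
\end{verbatim}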
\section{Experimental setup}
\subsection{Features}
The thirteen \gls{MFCC} features are used as the input. The parameters of the
\gls{MFCC} features are varied in window step and window length. The default
speech processing parameters are tested, but also bigger window sizes, since
the minimal duration of a singing-voice segment is arguably much longer than
the minimal duration of the subphone components for which the default
parameters are tuned. The parameters chosen are as follows:
\begin{table}[H]
\centering
    \begin{tabular}{lll}
        \toprule
        step (ms) & length (ms) & notes\\
        \midrule
        10 & 25 & Standard speech processing\\
        40 & 100 &\\
        80 & 200 &\\
        \bottomrule
    \end{tabular}
    \caption{\Gls{MFCC} parameter settings}
\end{table}
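A sketch of how these three settings could be passed to the feature extraction
package mentioned earlier is given below. The file name is hypothetical, and
the FFT size is enlarged only because the package default is too small for the
longer analysis windows.
\begin{verbatim}
from scipy.io import wavfile
from python_speech_features import mfcc

(rate, signal) = wavfile.read('song.wav')  # hypothetical input file

# (step, length) pairs in seconds, matching the table above.
settings = [(0.010, 0.025),   # standard speech processing
            (0.040, 0.100),
            (0.080, 0.200)]

for step, length in settings:
    nfft = 1
    while nfft < length * rate:   # FFT size must cover the whole window
        nfft *= 2
    features = mfcc(signal, samplerate=rate, winlen=length,
                    winstep=step, numcep=13, nfft=nfft)
    print(step, length, features.shape)
\end{verbatim}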
\subsection{\emph{Singing}-voice detection}
The first type of experiment conducted is \emph{Singing}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
feature vector and the output is the probability that singing is happening in
the sample. This results in an \gls{ANN} of the shape described in
Figure~\ref{fig:bcann}. The input dimension is thirteen and the output is one.
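A minimal sketch of such a network is given below. This excerpt does not name
a framework or a loss function, so Keras and binary cross-entropy are
assumptions here, and the hidden-layer size and the toy data are purely
illustrative.
\begin{verbatim}
import numpy as np
import tensorflow as tf

n_features = 13   # one MFCC feature vector per frame
n_hidden = 5      # illustrative hidden-layer size

# Binary singing/instrumental classifier: 13 inputs, a ReLU hidden layer and
# a single sigmoid output giving the probability of singing.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(n_hidden, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Toy data standing in for MFCC frames and their singing (1) or
# instrumental (0) labels.
x = np.random.randn(256, n_features).astype('float32')
y = np.random.randint(0, 2, size=(256, 1)).astype('float32')
model.fit(x, y, epochs=1, batch_size=32, verbose=0)
\end{verbatim}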
-\begin{figure}[H]
- \centering
- \includegraphics[width=.7\linewidth]{alien2}.
- \caption{Plotting the classifier under different alien data}\label{fig:alien2}
-\end{figure}
\subsection{\emph{Singer}-voice detection}
The second type of experiment conducted is \emph{Singer}-voice detection. This
is the act of segmenting an audio signal into segments that are labeled either
with the name of the singer or as \emph{Instrumental}. The input of the
classifier is a feature vector and the outputs are probabilities for each of
the singers and a probability for the instrumental label. This results in an
\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input dimension
is again thirteen and the output dimension is the number of categories. The
output uses one-hot encoding, meaning that the categories are labeled as
\texttt{1000, 0100, 0010, 0001}.
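The multiclass variant differs only in the output layer and the label
encoding. The sketch below again assumes Keras and categorical cross-entropy,
with four categories matching the example labels above.
\begin{verbatim}
import numpy as np
import tensorflow as tf

n_features = 13
n_classes = 4     # e.g. three singers plus the instrumental label

# Multiclass singer/instrumental classifier: a softmax output layer with one
# probability per category.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(5, activation='relu'),
    tf.keras.layers.Dense(n_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# One-hot encoding of integer category labels, e.g. 2 -> [0, 0, 1, 0].
labels = np.array([0, 2, 1, 3])
one_hot = tf.keras.utils.to_categorical(labels, num_classes=n_classes)
print(one_hot)
\end{verbatim}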
\section{Results}
\subsection{\emph{Singing}-voice detection}
Table~\ref{tbl:singing} shows the results for the singing-voice detection.
Figure~\ref{fig:bclass} shows an example of a segment of a song with the
classifier output plotted underneath to illustrate the performance. The
performance is given by the accuracy and the loss, where the accuracy is the
percentage of correctly classified samples.
\begin{table}[H]
    \centering
    \begin{tabular}{rccc}
        \toprule
        & \multicolumn{3}{c}{Parameters (step/length in ms)}\\
        Hidden nodes & 10/25 & 40/100 & 80/200\\
        \midrule
        3 & 0.86 (0.34) & 0.87 (0.32) & 0.85 (0.35)\\
        5 & 0.87 (0.31) & 0.88 (0.30) & 0.87 (0.32)\\
        8 & 0.88 (0.30) & 0.88 (0.31) & 0.88 (0.29)\\
        13 & 0.89 (0.28) & 0.89 (0.29) & 0.88 (0.30)\\
        \bottomrule
    \end{tabular}
    \caption{Binary classification results as accuracy
        (loss)}\label{tbl:singing}
\end{table}
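Written out, the two reported quantities are as follows, where $N$ is the
number of test samples, $N_{\mathrm{correct}}$ the number of correctly
classified ones, $y_i \in \{0, 1\}$ the true label and $\hat{y}_i$ the
predicted probability of singing. The loss is assumed here to be the
cross-entropy matching the sigmoid output layer; the excerpt does not state
this explicitly.
\[
    \mathrm{accuracy} = \frac{N_{\mathrm{correct}}}{N},
    \qquad
    \mathrm{loss} = -\frac{1}{N}\sum_{i=1}^{N}
        \left(y_i \log\hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right)
\]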
Plotting the classifier output under a segment of the data results in
Figure~\ref{fig:bclass}.
\begin{figure}[H]
    \centering
    \includegraphics[width=.7\linewidth]{bclass}
    \caption{Plotting the classifier under the audio signal}\label{fig:bclass}
\end{figure}
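A plot like Figure~\ref{fig:bclass} can be reproduced along the following
lines. This is a sketch only: the input file and the saved model file are
hypothetical, matplotlib is assumed for plotting, and the window settings are
one of the tested configurations.
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from scipy.io import wavfile
from python_speech_features import mfcc

(rate, signal) = wavfile.read('song.wav')         # hypothetical input file
model = tf.keras.models.load_model('singing.h5')  # hypothetical saved model

step, length = 0.040, 0.100                       # one of the tested settings
features = mfcc(signal, samplerate=rate, winlen=length, winstep=step,
                numcep=13, nfft=8192)
probabilities = model.predict(features).flatten()

# Waveform on top, classifier output underneath, on a shared time axis.
fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
top.plot(np.arange(len(signal)) / rate, signal)
bottom.plot(np.arange(len(probabilities)) * step, probabilities)
bottom.set_ylim(0, 1)
bottom.set_xlabel('time (s)')
plt.show()
\end{verbatim}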
\subsection{\emph{Singer}-voice detection}
Table~\ref{tbl:singer} shows the results for the singer-voice detection. The
same metrics are used as for the \emph{Singing}-voice detection.
\begin{table}[H]
    \centering
    \begin{tabular}{rccc}
        \toprule
        & \multicolumn{3}{c}{Parameters (step/length in ms)}\\
        Hidden nodes & 10/25 & 40/100 & 80/200\\
        \midrule
        3 & 0.83 (0.48) & 0.82 (0.48) & 0.82 (0.48)\\
        5 & 0.85 (0.43) & 0.84 (0.44) & 0.84 (0.44)\\
        8 & 0.86 (0.41) & 0.86 (0.39) & 0.86 (0.40)\\
        13 & 0.87 (0.37) & 0.87 (0.38) & 0.86 (0.39)\\
        \bottomrule
    \end{tabular}
    \caption{Multiclass classification results as accuracy
        (loss)}\label{tbl:singer}
\end{table}
\subsection{Alien data}
To test the generalizability of the models, the system is tested on alien data.
The data was retrieved from the album \emph{The Desperation} by \emph{Godless
Truth}. \emph{Godless Truth} is a so-called old-school \gls{dm} band that has
very raspy vocals that are placed very up front in the mastering. This means
that the vocals are very prevalent in the recording and therefore no difficulty
is expected for the classifier. Figure~\ref{fig:alien1} shows that the
classifier indeed scores very accurately. Note that the spectrogram settings
have been adjusted slightly to make the picture clearer. The spectrogram shows
the frequency range from $0$ to $3000Hz$.
\begin{figure}[H]
    \centering
    \includegraphics[width=.7\linewidth]{alien1}
    \caption{Plotting the classifier with alien data containing familiar vocal
        styles}\label{fig:alien1}
\end{figure}
To really test the limits, a song from the highly atmospheric doom metal band
\emph{Catacombs} has been tested on the system. Their album \emph{Echoes
Through the Catacombs} features a lot of synthesizers, heavy droning guitars
and bass lines, and the vocals are not mixed in a way that makes them stand
out. The models have never seen training data that is even remotely similar to
this type of metal. Figure~\ref{fig:alien2} shows a segment of the data. It is
clearly visible that the classifier cannot distinguish singing from
non-singing.
\begin{figure}[H]
    \centering
    \includegraphics[width=.7\linewidth]{alien2}
    \caption{Plotting the classifier with alien data containing strange vocal
        styles}\label{fig:alien2}
\end{figure}