From 70d6baa9f4eb56446ac82484010548f099f84de6 Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Wed, 7 Jun 2017 15:02:00 +0200
Subject: [PATCH] split out results, process comments

---
 asr.tex        |   3 +
 conclusion.tex |   2 +-
 methods.tex    | 193 +++++++++++++++----------------------
 results.tex    |  89 +++++++++++++++++++++++
 4 files changed, 151 insertions(+), 136 deletions(-)
 create mode 100644 results.tex

diff --git a/asr.tex b/asr.tex
index 938078f..dc11df3 100644
--- a/asr.tex
+++ b/asr.tex
@@ -28,6 +28,9 @@
 \chapter{Methods}
 \input{methods}
 
+\chapter{Results}
+\input{results}
+
 \chapter{Conclusion \& Discussion}
 \input{conclusion}
 
diff --git a/conclusion.tex b/conclusion.tex
index 96d4fba..8f53277 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,4 +1,4 @@
-\section{Conclusion \& Future Research}
+\section{Conclusion}
 This study shows that existing techniques for singing-voice detection designed
 for regular singing-voices also work respectably on extreme singing styles
 like grunting. With a standard \gls{ANN} classifier using \gls{MFCC}
diff --git a/methods.tex b/methods.tex
index 74c60af..23ae993 100644
--- a/methods.tex
+++ b/methods.tex
@@ -49,9 +49,9 @@ performs in several Muscovite bands. This band also stands out because it uses
 pianos and synthesizers. The droning synthesizers often operate in the same
 frequency range as the vocals.
+Additional details about the dataset are listed in Appendix~\ref{app:data}.
 The data is labeled as singing and instrumental and labeled per band. The
-distribution for this is shown in Table~\ref{tbl:distribution}. A random $10\%$
-of the data is extracted for a held out test set.
+distribution for this is shown in Table~\ref{tbl:distribution}.
 \begin{table}[H]
     \centering
     \begin{tabular}{lcc}
         \toprule
@@ -74,33 +74,34 @@ of the data is extracted for a held out test set.
 \section{\acrlong{MFCC} Features}
 The waveforms in themselves are not very suitable to be used as features due to the
-high dimensionality and correlation. Therefore we use the often used
-\glspl{MFCC} feature vectors which have shown to be suitable~%
-\cite{rocamora_comparing_2007}. It has also been found that altering the mel
-scale to better suit singing does not yield a better
-performance~\cite{you_comparative_2015}. The actual conversion is done using the
-\emph{python\_speech\_features}%
-\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
-
-\gls{MFCC} features are inspired by human auditory processing inspired and are
+high dimensionality and correlation in the temporal domain. Therefore we use
+the widely used \glspl{MFCC} feature vectors, which have been shown to be
+suitable in speech processing~\cite{rocamora_comparing_2007}. It has also been
+found that altering the mel scale to better suit singing does not yield a
+better performance~\cite{you_comparative_2015}. The actual conversion is done
+using the \emph{python\_speech\_features}\footnote{\url{%
+https://github.com/jameslyons/python_speech_features}} package.
+
+\gls{MFCC} features are inspired by human auditory processing and are
 created from a waveform incrementally using several steps:
 \begin{enumerate}
     \item The first step in the process is converting the time representation
-        of the signal to a spectral representation using a sliding window with
-        overlap. The width of the window and the step size are two important
-        parameters in the system. In classical phonetic analysis window sizes
In classical phonetic analysis window sizes - of $25ms$ with a step of $10ms$ are often chosen because they are small - enough to only contain subphone entities. Singing for $25ms$ is - impossible so it is arguable that the window size is very small. + of the signal to a spectral representation using a sliding analysis + window with overlap. The width of the window and the step size are two + important parameters in the system. In classical phonetic analysis + window sizes of $25ms$ with a step of $10ms$ are often chosen because + they are small enough to contain just one subphone event. Singing for + $25ms$ is impossible so it might be necessary to increase the window + size. \item The standard \gls{FT} gives a spectral representation that has linearly scaled frequencies. This scale is converted to the \gls{MS} using triangular overlapping windows to get a more tonotopic - representation trying to match the actual representation in the cochlea - of the human ear. + representation trying to match the actual representation of the cochlea + in the human ear. \item The \emph{Weber-Fechner} law describes how humans perceive physical magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der Psychophysik}. They found that energy is perceived in logarithmic - increments. This means that twice the amount of decibels does not mean + increments. This means that twice the amount of energy does not mean twice the amount of perceived loudness. Therefore we take the log of the energy or amplitude of the \gls{MS} spectrum to closer match the human hearing. @@ -112,51 +113,12 @@ created from a waveform incrementally using several steps: \end{enumerate} The default number of \gls{MFCC} parameters is twelve. However, often a -thirteenth value is added that represents the energy in the data. - -\section{Experimental setup} -\subsection{Features} -The thirteen \gls{MFCC} features are chosen to feed to the classifier. The -parameters of the \gls{MFCC} features are varied in window step and window -length. The default speech processing parameters are tested but also bigger -window sizes since arguably the minimal size of a singing voice segment is a -lot bigger than the minimal size of a subphone component on which the -parameters are tuned. The parameters chosen are as follows: - -\begin{table}[H] - \centering - \begin{tabular}{lll} - \toprule - step (ms) & length (ms) & notes\\ - \midrule - 10 & 25 & Standard speech processing\\ - 40 & 100 &\\ - 80 & 200 &\\ - \bottomrule - \end{tabular} - \caption{\Gls{MFCC} parameter settings} -\end{table} - -\subsection{\emph{Singing} voice detection} -The first type of experiment conducted is \emph{Singing} voice detection. This -is the act of segmenting an audio signal into segments that are labeled either -as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a -feature vector and the output is the probability that singing is happening in -the sample. This results in an \gls{ANN} of the shape described in -Figure~\ref{fig:bcann}. The input dimension is thirteen and the output is one. +thirteenth value is added that represents the energy in the analysis window. +The $c_0$ is chosen is this example. $c_0$ is the zeroth \gls{MFCC}. It +represents the overall energy in the \gls{MS}. Another option would be +$\log{(E)}$ which is the logarithm of the raw energy of the sample. -\subsection{\emph{Singer} voice detection} -The second type of experiment conducted is \emph{Singer} voice detection. 
-is the act of segmenting an audio signal into segments that are labeled either
-with the name of the singer or as \emph{Instrumental}. The input of the
-classifier is a feature vector and the outputs are probabilities for each of
-the singers and a probability for the instrumental label. This results in an
-\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input dimension
-is yet again thirteen and the output dimension is the number of categories. The
-output is encoded in one-hot encoding. This means that the categories are
-labeled as \texttt{1000, 0100, 0010, 0001}.
-
-\subsection{\acrlong{ANN}}
+\section{\acrlong{ANN}}
 The data is classified using standard \gls{ANN} techniques, namely
 \glspl{MLP}. The classification problems are only binary and four-class, so it
 is interesting to see where the bottleneck lies; how abstract can the abstraction
@@ -172,7 +134,7 @@ a \gls{RELU}. The \gls{RELU} function is a monotonic symmetric one-sided
 function that is also known as the ramp function. The definition is given in
 Equation~\ref{eq:relu}. \gls{RELU} was chosen because of its simplicity and
 efficient computation. The activation function between the hidden layer and the
-output layer is the sigmoid function in the case of binary classification. Of
+output layer is the sigmoid function in the case of binary classification, of
 which the definition is shown in Equation~\ref{eq:sigmoid}. The sigmoid is a
 monotonic function that is differentiable for all values of $x$ and always
 yields a non-negative derivative. For the multiclass classification the softmax
@@ -215,83 +177,44 @@ $90\%$ slice of all the data.
     \caption{\acrlong{ANN} architectures.}
 \end{figure}
 
-\section{Results}
-\subsection{\emph{Singing} voice detection}
-Table~\ref{tbl:singing} shows the results for the singing-voice detection.
-Figure~\ref{fig:bclass} shows an example of a segment of a song with the
-classifier plotted underneath to illustrate the performance.
-
-\begin{table}[H]
-    \centering
-    \begin{tabular}{rccc}
-        \toprule
-        & \multicolumn{3}{c}{Parameters (step/length)}\\
-        & 10/25 & 40/100 & 80/200\\
-        \midrule
-        3h & 0.86 (0.34) & 0.87 (0.32) & 0.85 (0.35)\\
-        5h & 0.87 (0.31) & 0.88 (0.30) & 0.87 (0.32)\\
-        8h & 0.88 (0.30) & 0.88 (0.31) & 0.88 (0.29)\\
-        13h & 0.89 (0.28) & 0.89 (0.29) & 0.88 (0.30)\\
-        \bottomrule
-    \end{tabular}
-    \caption{Binary classification results (accuracy
-    (loss))}\label{tbl:singing}
-\end{table}
-
-\begin{figure}[H]
-    \centering
-    \includegraphics[width=.7\linewidth]{bclass}.
-    \caption{Plotting the classifier under the audio signal}\label{fig:bclass}
-\end{figure}
-
-\subsection{\emph{Singer} voice detection}
-Table~\ref{tbl:singer} shows the results for the singer-voice detection.
+\section{Experimental setup}
+\subsection{Features}
+The thirteen \gls{MFCC} features are used as the input. The parameters of the
+\gls{MFCC} features are varied in window step and window length. The default
+speech processing parameters are tested, but also bigger window sizes, since
+the minimal duration of a singing-voice segment is arguably much longer than
+that of the subphone components for which the default parameters are tuned. The
+parameters chosen are as follows:
+
 \begin{table}[H]
     \centering
-    \begin{tabular}{rccc}
+    \begin{tabular}{lll}
         \toprule
-        & \multicolumn{3}{c}{Parameters (step/length)}\\
-        & 10/25 & 40/100 & 80/200\\
+        step (ms) & length (ms) & notes\\
         \midrule
-        3h & 0.83 (0.48) & 0.82 (0.48) & 0.82 (0.48)\\
-        5h & 0.85 (0.43) & 0.84 (0.44) & 0.84 (0.44)\\
-        8h & 0.86 (0.41) & 0.86 (0.39) & 0.86 (0.40)\\
-        13h & 0.87 (0.37) & 0.87 (0.38) & 0.86 (0.39)\\
+        10 & 25 & Standard speech processing\\
+        40 & 100 &\\
+        80 & 200 &\\
         \bottomrule
     \end{tabular}
-    \caption{Multiclass classification results (accuracy
-    (loss))}\label{tbl:singer}
+    \caption{\Gls{MFCC} parameter settings}
 \end{table}
 
-\subsection{Alien data}
-To test the generalizability of the models the system is tested on alien data.
-The data was retrieved from the album \emph{The Desperation} by \emph{Godless
-Truth}. \emph{Godless Truth} is a so called old-school \gls{dm} band that has
-very raspy vocals and the vocals are very up front in the mastering. This means
-that the vocals are very prevalent in the recording and therefore no difficulty
-is expected for the classifier. Figure~\ref{fig:alien1} shows that indeed the
-classifier scores very accurately. Note that the spectogram settings have been
-adapted a little bit to make the picture more clear. The spectogram shows the
-frequency range from $0$ to $3000Hz$.
-
-\begin{figure}[H]
-    \centering
-    \includegraphics[width=.7\linewidth]{alien1}.
-    \caption{Plotting the classifier under similar alien data}\label{fig:alien1}
-\end{figure}
-
-To really test the limits, a song from the highly atmospheric doom metal band
-called \emph{Catacombs} has been tested on the system. The album \emph{Echoes
-Through the Catacombs} is an album that has a lot of synthesizers, heavy
-droning guitars and bass lines. The vocals are not mixed in a way that makes
-them stand out. The models have never seen trainingsdata that is even remotely
-similar to this type of metal. Figure~\ref{fig:alien2} shows a segment of the
-data. Here it is clearly visible that the classifier can not distinguish
-singing from non singing.
-
-\begin{figure}[H]
-    \centering
-    \includegraphics[width=.7\linewidth]{alien2}.
-    \caption{Plotting the classifier under different alien data}\label{fig:alien2}
-\end{figure}
+\subsection{\emph{Singing}-voice detection}
+The first type of experiment conducted is \emph{Singing}-voice detection. This
+is the act of segmenting an audio signal into segments that are labeled either
+as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
+feature vector and the output is the probability that singing occurs in the
+sample. This results in an \gls{ANN} of the shape described in
+Figure~\ref{fig:bcann}. The input dimension is thirteen and the output is one.
+
+\subsection{\emph{Singer}-voice detection}
+The second type of experiment conducted is \emph{Singer}-voice detection. This
+is the act of segmenting an audio signal into segments that are labeled either
+with the name of the singer or as \emph{Instrumental}. The input of the
+classifier is a feature vector and the outputs are probabilities for each of
+the singers and a probability for the instrumental label. This results in an
+\gls{ANN} of the shape described in Figure~\ref{fig:mcann}. The input dimension
+is again thirteen and the output dimension is the number of categories. The
+output uses one-hot encoding, meaning that the categories are labeled as
+\texttt{1000, 0100, 0010, 0001}.
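As an illustration of the feature extraction described in methods.tex above, the short Python sketch below computes thirteen MFCCs with the python_speech_features package mentioned in the text, once for each (step, length) setting from the parameter table. It is a minimal sketch and not taken from the thesis: the input file song.wav, reading audio with scipy, the mono mix-down and the FFT-size handling are all assumptions. With appendEnergy=False the zeroth coefficient $c_0$ is kept, matching the choice described in the text; appendEnergy=True would substitute log(E) instead.

# Minimal sketch (not part of the patch): 13 MFCCs per analysis window for the
# three (step, length) settings from the parameter table. 'song.wav' is a
# hypothetical input file.
import math

from scipy.io import wavfile
from python_speech_features import mfcc

rate, signal = wavfile.read('song.wav')   # hypothetical input file
if signal.ndim > 1:
    signal = signal.mean(axis=1)          # mix stereo down to mono

settings = [(10, 25), (40, 100), (80, 200)]   # (step, length) in ms

features = {}
for step, length in settings:
    # Choose an FFT size large enough for the analysis window, since the
    # longer windows exceed the default of 512 samples.
    nfft = 2 ** math.ceil(math.log2(rate * length / 1000.0))
    features[(step, length)] = mfcc(signal,
                                    samplerate=rate,
                                    winlen=length / 1000.0,
                                    winstep=step / 1000.0,
                                    numcep=13,
                                    nfft=nfft,
                                    appendEnergy=False)  # keep c0, not log(E)
    print((step, length), features[(step, length)].shape)  # (frames, 13)

Each setting yields one feature matrix with thirteen coefficients per frame; only the number of frames differs between the three window configurations.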
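The patch does not show how the MLPs themselves are built. The sketch below is one possible Keras realization of the two architectures described in methods.tex: thirteen MFCC inputs, one ReLU hidden layer, and either a single sigmoid output for singing-voice detection or a four-way softmax matching the one-hot labels for singer-voice detection. The choice of Keras, the Adam optimizer, the cross-entropy losses and the hidden-layer size used in the example are assumptions; only the input/output dimensions and the activation functions come from the text.

# Hypothetical Keras sketch of the two MLP architectures; framework, optimizer
# and losses are assumptions, not taken from the patch.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_mlp(hidden_nodes, n_outputs=1):
    model = Sequential()
    # One hidden layer with ReLU (ramp) activation on the 13 MFCC inputs.
    model.add(Dense(hidden_nodes, activation='relu', input_shape=(13,)))
    if n_outputs == 1:
        # Binary case: a single sigmoid output giving P(singing).
        model.add(Dense(1, activation='sigmoid'))
        model.compile(optimizer='adam', loss='binary_crossentropy',
                      metrics=['accuracy'])
    else:
        # Multiclass case: softmax over the singers plus the instrumental
        # label, matching the one-hot encoding 1000/0100/0010/0001.
        model.add(Dense(n_outputs, activation='softmax'))
        model.compile(optimizer='adam', loss='categorical_crossentropy',
                      metrics=['accuracy'])
    return model

# Hidden-layer sizes 3, 5, 8 and 13 correspond to the rows of the results
# tables; 13 is used here purely as an example.
singing_model = build_mlp(hidden_nodes=13)               # singing vs. instrumental
singer_model = build_mlp(hidden_nodes=13, n_outputs=4)   # singers + instrumental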
diff --git a/results.tex b/results.tex
new file mode 100644
index 0000000..d6cf238
--- /dev/null
+++ b/results.tex
@@ -0,0 +1,89 @@
+\section{\emph{Singing}-voice detection}
+Table~\ref{tbl:singing} shows the results for the singing-voice detection. The
+performance is reported as accuracy and loss; the accuracy is the proportion of
+correctly classified samples.
+
+\begin{table}[H]
+    \centering
+    \begin{tabular}{rrccc}
+        \toprule
+        & & \multicolumn{3}{c}{Parameters (step/length)}\\
+        & & 10/25 & 40/100 & 80/200\\
+        \midrule
+        \multirow{4}{*}{Hidden Nodes}
+        & 3 & 0.86 (0.34) & 0.87 (0.32) & 0.85 (0.35)\\
+        & 5 & 0.87 (0.31) & 0.88 (0.30) & 0.87 (0.32)\\
+        & 8 & 0.88 (0.30) & 0.88 (0.31) & 0.88 (0.29)\\
+        & 13 & 0.89 (0.28) & 0.89 (0.29) & 0.88 (0.30)\\
+        \bottomrule
+    \end{tabular}
+    \caption{Binary classification results (accuracy
+    (loss))}\label{tbl:singing}
+\end{table}
+
+Figure~\ref{fig:bclass} shows a segment of a song with the classifier output
+plotted underneath to illustrate the performance.
+
+\begin{figure}[H]
+    \centering
+    \includegraphics[width=.7\linewidth]{bclass}
+    \caption{Plotting the classifier under the audio signal}\label{fig:bclass}
+\end{figure}
+
+\section{\emph{Singer}-voice detection}
+Table~\ref{tbl:singer} shows the results for the singer-voice detection. The
+same metrics are used as in \emph{Singing}-voice detection.
+
+\begin{table}[H]
+    \centering
+    \begin{tabular}{rrccc}
+        \toprule
+        & & \multicolumn{3}{c}{Parameters (step/length)}\\
+        & & 10/25 & 40/100 & 80/200\\
+        \midrule
+        \multirow{4}{*}{Hidden Nodes}
+        & 3 & 0.83 (0.48) & 0.82 (0.48) & 0.82 (0.48)\\
+        & 5 & 0.85 (0.43) & 0.84 (0.44) & 0.84 (0.44)\\
+        & 8 & 0.86 (0.41) & 0.86 (0.39) & 0.86 (0.40)\\
+        & 13 & 0.87 (0.37) & 0.87 (0.38) & 0.86 (0.39)\\
+        \bottomrule
+    \end{tabular}
+    \caption{Multiclass classification results (accuracy
+    (loss))}\label{tbl:singer}
+\end{table}
+
+\section{Alien data}
+To test the generalizability of the models, the system is tested on alien data.
+The data was retrieved from the album \emph{The Desperation} by \emph{Godless
+Truth}. \emph{Godless Truth} is a so-called old-school \gls{dm} band that has
+very raspy vocals, and the vocals are very up front in the mastering. This
+means that the vocals are very prevalent in the recording and therefore no
+difficulty is expected for the classifier. Figure~\ref{fig:alien1} shows that
+the classifier indeed scores very accurately. Note that the spectrogram
+settings have been adjusted slightly to make the picture clearer. The
+spectrogram shows the frequency range from $0$ to $3000Hz$.
+
+\begin{figure}[H]
+    \centering
+    \includegraphics[width=.7\linewidth]{alien1}
+    \caption{Plotting the classifier with alien data containing familiar vocal
+    styles}\label{fig:alien1}
+\end{figure}
+
+To really test the limits, a song from the highly atmospheric doom metal band
+\emph{Catacombs} has been tested on the system. The album \emph{Echoes Through
+the Catacombs} has a lot of synthesizers, heavy droning guitars and bass
+lines. The vocals are not mixed in a way that makes them stand out. The models
+have never seen training data that is even remotely similar to this type of
+metal. Figure~\ref{fig:alien2} shows a segment of the data. It is visible that
+the classifier cannot distinguish singing from non-singing.
+
+\begin{figure}[H]
+    \centering
+    \includegraphics[width=.7\linewidth]{alien2}
+    \caption{Plotting the classifier with alien data containing unfamiliar
+    vocal styles}\label{fig:alien2}
+\end{figure}
-- 
2.20.1
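The figures in results.tex plot the classifier output underneath the audio signal. The sketch below shows one way such a plot could be produced with matplotlib; it assumes the signal, rate, features and singing_model names from the earlier sketches, a trained model, and the 40 ms frame step. None of these names or settings are taken from the thesis.

# Hypothetical sketch of a "classifier under the audio signal" plot as used in
# results.tex. `signal`, `rate`, `features` and `singing_model` are assumed to
# come from the earlier sketches and the model is assumed to be trained.
import numpy as np
import matplotlib.pyplot as plt

step_s = 0.040                                  # assumed 40 ms frame step
frames = features[(40, 100)]                    # MFCC frames from the first sketch
probs = singing_model.predict(frames).ravel()   # P(singing) per analysis frame

fig, (ax_wave, ax_prob) = plt.subplots(2, 1, sharex=True, figsize=(10, 4))
ax_wave.plot(np.arange(len(signal)) / rate, signal, linewidth=0.5)
ax_wave.set_ylabel('amplitude')

ax_prob.plot(np.arange(len(probs)) * step_s, probs)
ax_prob.set_ylim(0, 1)
ax_prob.set_ylabel('P(singing)')
ax_prob.set_xlabel('time (s)')

fig.tight_layout()
fig.savefig('bclass_sketch.png')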