From 92673ab8acdf076aa3da1c2b31e29b654d38ccc5 Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Thu, 15 Jun 2017 12:32:49 +0200
Subject: [PATCH] process last comments

---
 asr.pre        |  1 +
 asr.tex        |  2 +-
 conclusion.tex | 65 +++++++++++++++++++++++++++-----------------------
 methods.tex    | 52 +++++++++++++++++++++++++++++-----------
 results.tex    | 11 +++++----
 5 files changed, 82 insertions(+), 49 deletions(-)

diff --git a/asr.pre b/asr.pre
index fc97983..6a82310 100644
--- a/asr.pre
+++ b/asr.pre
@@ -4,6 +4,7 @@
 \usepackage[utf8]{inputenc}
 \usepackage[british]{babel}
+\usepackage{amsmath}  % Extra math functions
 \usepackage{geometry} % Papersize
 \usepackage{hyperref} % Hyperlinks
 \usepackage{booktabs} % Better looking tables
diff --git a/asr.tex b/asr.tex
index 8e054ae..608eb1e 100644
--- a/asr.tex
+++ b/asr.tex
@@ -38,7 +38,7 @@
 \chapter{Results}
 \input{results}
 
-\chapter{Conclusion \& Discussion}
+\chapter{Discussion \& Conclusion}
 \input{conclusion}
 
 %(Appendices)
diff --git a/conclusion.tex b/conclusion.tex
index 3cc834f..dac8812 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,25 +1,3 @@
-\section{Conclusion}
-This study shows that existing techniques for singing-voice detection
-designed for regular singing-voices also work on \gls{dm} and \gls{dom} that
-contain extreme singing styles like grunting. With a standard \gls{ANN}
-classifier using \gls{MFCC} features a performance of $85\%$ can be achieved
-which is similar to the same techniques used on regular singing. This means
-that it might also be suitable as a pre-processing step for lyrics forced
-alignment. Moreover, the \emph{singer}-voice recognition experiments scored
-similarly.
-
-To determine whether the model generalizes, alien data has been offered to the
-model to see how it performs. It was shown that for similar singing styles the
-models perform similar. The alien data offered containing different singing
-styles, atmospheric noise and accompaniment is classified worse.
-
-From the results we can conclude that the model generalizes well over the
-trainings set, even with little hidden nodes. The models with 3 or 5 hidden
-nodes score a little worse than their bigger brothers but there is hardly any
-difference between the performance of a model with 8 or 13 nodes. Moreover,
-contrary than expected the window size does not seem to be doing much in the
-performance.
-
 %Discussion section
 \section{Discussion}
 The dataset used only contains three albums and might not be considered varied.
@@ -28,13 +6,19 @@ Therefore the resulting model can be very general. On the other side, it could
 also result in a model that is overfitted to the three islands in the entire
 space of grunting voices.
 
-In this case it seems that the model generalizes well. The alien data --- similar
-to the training data --- offered to the model, results in a good performance.
-However, alien data that has a very different style does not perform as good.
-While testing \emph{Catacombs} the performance was very poor. Adding
-\emph{Catacombs} or a similar style to the training set can probably overcome
-this performance issue. Thus, the performance on alien data can probably be
-increased by having a bigger and more varied dataset.
+In this case it seems that the model generalizes well. The alien data ---
+similar to the training data --- offered to the model results in a good
+performance. However, alien data that has a very different style does not
+perform as well. While testing \emph{Catacombs} the performance was very poor.
+Adding \emph{Catacombs} or a similar style to the training set can probably
+overcome this performance issue. Thus, the performance on alien data can
+probably be increased by having a bigger and more varied dataset that includes
+more outliers in the plane of growling voices.
+
+The performance reached in the experiments is very similar to that reported in
+the literature. This was expected because growling voices have different
+spectral characteristics but are still produced by the vocal tract and
+physically limited by it.
 
 \section{Future research}
 \paragraph{Forced alignment: }
@@ -74,7 +58,7 @@ be tackled using the methods used. In the literature, similar methods have
 been used to discriminate genres in regular music styles and it even has been
 attempted to discriminate genres within extreme music styles. Therefore it
 might be interesting to figure out whether this specific method is
-generalizable to general genre recognition.  This requires more data from
+generalizable to general genre recognition. This requires more data from
 different genres to be added to the dataset and the models to be retrained.
 Again, it would be interesting to see what comes out of the models when
 offering regular music and the other way around. Maybe the characteristics of
@@ -89,3 +73,24 @@ would be interesting to try using existing speech models on singing-voice
 recognition in extreme singing styles to see whether the phone models can say
 anything about a growling voice.
 
+\section{Conclusion}
+This study shows that existing techniques for singing-voice detection
+designed for regular singing-voices also work on \gls{dm} and \gls{dom} that
+contain extreme singing styles like grunting. With a standard \gls{ANN}
+classifier using \gls{MFCC} features a performance of $85\%$ can be achieved,
+which is similar to the same techniques used on regular singing. This means
+that it might also be suitable as a pre-processing step for forced alignment
+of lyrics. Moreover, the \emph{singer}-voice recognition experiments scored
+similarly.
+
+To determine whether the model generalizes, alien data has been offered to the
+model to see how it performs. It was shown that for similar singing styles the
+models perform similarly. The alien data containing different singing styles,
+atmospheric noise and accompaniment was classified worse.
+
+From the results we can conclude that the model generalizes well over the
+training set, even with a small number of hidden nodes. The models with 3 or 5
+hidden nodes score slightly worse than their larger counterparts but there is
+hardly any difference between the performance of a model with 8 or 13 nodes.
+Moreover, contrary to expectations, the window size does not seem to have much
+effect on the performance.
diff --git a/methods.tex b/methods.tex
index 0f694c5..517ade5 100644
--- a/methods.tex
+++ b/methods.tex
@@ -10,9 +10,9 @@ utilizing \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
 Every file is annotated using Praat~\cite{boersma_praat_2002} where the
 lyrics are manually aligned to the audio. Examples of utterances are shown in
 Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations} where the
-waveform, $1-8000$Hz spectrals and annotations are shown. It is clearly visible
-that within the genre of death metal there are different spectral patterns
-visible over time.
+waveform, $1-8000$Hz spectrogram and annotations are shown. It is clearly
+visible that within the genre of death metal there are different spectral
+patterns visible over time.
 
 \begin{figure}[ht]
 	\centering
@@ -49,9 +49,10 @@ performs in several Muscovite bands. This band also stands out because it uses
 piano's and synthesizers. The droning synthesizers often operate in the same
 frequency as the vocals.
 
-Additional details about the dataset are listed in Appendix~\ref{app:data}.
-The data is labeled as singing and instrumental and labeled per band. The
-distribution for this is shown in Table~\ref{tbl:distribution}.
+Additional details about the dataset are listed in
+Appendix~\ref{app:data}. The data is labeled as singing and instrumental and
+labeled per band. The distribution for this is shown in
+Table~\ref{tbl:distribution}.
 \begin{table}[H]
 	\centering
 	\begin{tabular}{lcc}
@@ -69,7 +70,7 @@
 		0.59 & 0.16 & 0.19 & 0.06\\
 		\bottomrule
 	\end{tabular}
-	\caption{Data distribution}\label{tbl:distribution}
+	\caption{Proportional data distribution}\label{tbl:distribution}
 \end{table}
 
 \section{Mel-frequencey Cepstral Features}
@@ -90,9 +91,7 @@ created from a waveform incrementally using several steps:
 	window with overlap. The width of the window and the step size are two
 	important parameters in the system. In classical phonetic analysis window
 	sizes of $25ms$ with a step of $10ms$ are often chosen because
-	they are small enough to contain just one subphone event. Singing for
-	$25ms$ is impossible so it might be necessary to increase the window
-	size.
+	they are small enough to contain just one subphone event.
 	\item The standard \gls{FT} gives a spectral representation that has
 	linearly scaled frequencies. This scale is converted to the \gls{MS}
 	using triangular overlapping windows to get a more tonotopic
@@ -115,7 +114,7 @@
 The default number of \gls{MFCC} parameters is twelve. However, often a
 thirteenth value is added that represents the energy in the analysis window.
 The $c_0$ is chosen is this example. $c_0$ is the zeroth \gls{MFCC}. It
-represents the overall energy in the \gls{MS}. Another option would be
+represents the average over all \gls{MS} bands. Another option would be
 $\log{(E)}$ which is the logarithm of the raw energy of the sample.
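+
+As an illustration of how the steps and parameters above fit together, the
+following sketch extracts such a feature matrix. The
+\texttt{python\_speech\_features} package and the file name are assumptions
+for the example and not necessarily the toolchain used in this study.
+\begin{verbatim}
+# Sketch only; package, file name and parameter values are assumptions.
+import scipy.io.wavfile as wav
+from python_speech_features import mfcc
+
+(rate, signal) = wav.read('track.wav')    # placeholder file name
+features = mfcc(signal, samplerate=rate,
+                winlen=0.025,             # 25ms analysis window
+                winstep=0.010,            # 10ms window step
+                numcep=13,                # coefficients c_0 up to c_12
+                appendEnergy=False)       # keep c_0 rather than log(E)
+# one thirteen-dimensional feature vector per analysis window
+\end{verbatim}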
 
 \section{Artificial Neural Network}
@@ -210,7 +209,22 @@ is the act of segmenting an audio signal into segments that are labeled either
 as \emph{Singing} or as \emph{Instrumental}. The input of the classifier is a
 feature vector and the output is the probability that singing is happening in
 the sample. This results in an \gls{ANN} of the shape described in
-Figure~\ref{fig:bcann}. The input dimension is thirteen and the output is one.
+Figure~\ref{fig:bcann}. The input dimension is thirteen and the output
+dimension is one.
+
+The \emph{cross-entropy} function is used as the loss function. The
+formula is shown in Equation~\ref{eq:bincross} where $p$ is the true
+distribution and $q$ is the classification. Accuracy is one minus the mean of
+the absolute differences between the rounded prediction $\hat{y}_i$ and the
+true value $y_i$. The formula is shown in Equation~\ref{eq:binacc}.
+
+\begin{equation}\label{eq:bincross}
+	H(p,q) = -\sum_x p(x)\log{q(x)}
+\end{equation}
+
+\begin{equation}\label{eq:binacc}
+	1-\frac{1}{n}\sum^n_{i=1} \left|\operatorname{round}(\hat{y}_i)-y_i\right|
+\end{equation}
 
 \subsection{\emph{Singer}-voice detection}
 The second type of experiment conducted is \emph{Singer}-voice detection. This
@@ -220,5 +234,15 @@ classifier is a feature vector and the outputs are probabilities for each of
 the singers and a probability for the instrumental label.
 This results in an \gls{ANN} of the shape described in Figure~\ref{fig:mcann}.
 The input dimension is yet again thirteen and the output dimension is the number of categories. The
-output is encoded in one-hot encoding. This means that the categories are
-labeled as \texttt{1000, 0100, 0010, 0001}.
+output is encoded in one-hot encoding. This means that the four categories in
+the experiments are labeled as \texttt{1000, 0100, 0010, 0001}.
+
+The loss function is the same as in \emph{Singing}-voice detection.
+The accuracy is calculated a little differently since the output of the
+network is not one probability but a vector of probabilities. The accuracy is
+calculated for each sample by comparing the index of the highest value in the
+predicted vector with that of the true one-hot vector. The exact formula is
+shown in Equation~\ref{eq:catacc}, where $[\cdot]$ is $1$ if the condition
+holds and $0$ otherwise.
+
+\begin{equation}\label{eq:catacc}
+	\frac{1}{n}\sum^n_{i=1} \left[\operatorname{argmax}(\hat{y}_i)=\operatorname{argmax}(y_i)\right]
+\end{equation}
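+
+The following sketch illustrates how both network shapes and the metrics of
+Equations~\ref{eq:bincross}, \ref{eq:binacc} and~\ref{eq:catacc} could be
+implemented. It is an illustration only: the framework (Keras with numpy),
+the hidden-layer activation and the optimizer are assumptions, not a
+restatement of the setup used in the experiments.
+\begin{verbatim}
+# Sketch only; framework, activation and optimizer are assumptions.
+import numpy as np
+from keras.models import Sequential
+from keras.layers import Dense
+
+def build_model(hidden, outputs):
+    # 13 MFCC inputs; 1 sigmoid output for singing-voice detection,
+    # 4 softmax outputs (one-hot) for singer-voice detection.
+    m = Sequential()
+    m.add(Dense(hidden, activation='sigmoid', input_dim=13))
+    if outputs == 1:
+        m.add(Dense(1, activation='sigmoid'))
+        m.compile(loss='binary_crossentropy', optimizer='sgd')
+    else:
+        m.add(Dense(outputs, activation='softmax'))
+        m.compile(loss='categorical_crossentropy', optimizer='sgd')
+    return m
+
+def binary_accuracy(y, ypred):
+    # Equation eq:binacc: one minus the mean absolute difference
+    # between the rounded predictions and the true labels.
+    return 1.0 - np.mean(np.abs(np.round(ypred) - y))
+
+def categorical_accuracy(y, ypred):
+    # Equation eq:catacc: the fraction of samples whose highest-valued
+    # output coincides with the true one-hot category.
+    return np.mean(np.argmax(ypred, axis=1) == np.argmax(y, axis=1))
+\end{verbatim}
+For example, \texttt{build\_model(8, 1)} would give the singing-voice network
+of Figure~\ref{fig:bcann} with eight hidden nodes.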
diff --git a/results.tex b/results.tex
index ac27776..1f13895 100644
--- a/results.tex
+++ b/results.tex
@@ -3,7 +3,8 @@ Table~\ref{tbl:singing} shows the results for the singing-voice detection.
 The performance is given by the fraction of correctly classified samples
 (accuracy). The rows represent the count of hidden nodes, the columns represent
 the analysis window step size and the analysis window length in the \gls{MFCC}
-extraction.
+extraction. A ceiling effect was observed after two epochs for every hidden
+node configuration.
 
 \begin{table}[H]
 	\centering
@@ -39,7 +40,9 @@ drops. This phenomenon is visible throughout the songs.
 
 \section{\emph{Singer}-voice detection}
 Table~\ref{tbl:singer} shows the results for the singer-voice detection. The
-same metrics are used as in \emph{Singing}-voice detection.
+same metrics are used as in \emph{Singing}-voice detection. In these
+experiments a ceiling effect was observed after two to three epochs for every
+hidden node configuration.
 
 \begin{table}[H]
 	\centering
@@ -89,6 +92,6 @@ singing.
 \begin{figure}[H]
 	\centering
 	\includegraphics[width=.7\linewidth]{alien2}.
-	\caption{Plotting the classifier with alien data containing strange vocal
-	styles}\label{fig:alien2}
+	\caption{Plotting the classifier with alien data containing unobserved
+	vocal styles}\label{fig:alien2}
 \end{figure}
-- 
2.20.1