From b457899616085d201dfb3f7ee219d7b6d9bed413 Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Wed, 7 Jun 2017 19:35:53 +0200
Subject: [PATCH] elaborate on genre recognition

---
 conclusion.tex | 93 +++++++++++++++++++++++++++++++-------------------
 results.tex    | 19 +++++++----
 2 files changed, 70 insertions(+), 42 deletions(-)

diff --git a/conclusion.tex b/conclusion.tex
index 700b5e8..778b4fc 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -5,7 +5,8 @@ contain extreme singing styles like grunting. With a standard \gls{ANN}
 classifier using \gls{MFCC} features a performance of $85\%$ can be achieved
 which is similar to the same techniques used on regular singing. This means
 that it might also be suitable as a pre-processing step for lyrics forced
-alignment.
+alignment. Moreover, the \emph{singer}-voice recognition experiments showed
+similar performance.
 
 To determine whether the model generalizes, alien data has been offered to the
 model to see how it performs. It was shown that for similar singing styles the
@@ -19,50 +20,72 @@ difference between the performance of a model with 8 or 13 nodes. Moreover,
 contrary than expected the window size does not seem to be doing much in the
 performance.
 
+%Discussion section
+\section{Discussion}
+The dataset used contains only three albums and might not be considered
+varied. However, the albums were picked to represent the ends of the growling
+spectrum, so the resulting model can be very general. On the other hand, it
+could also result in a model that is overfitted to three islands in the entire
+space of grunting voices.
+
+In this case the model seems to generalize well: alien data similar to the
+training data results in good performance. However, alien data with a very
+different style performs considerably worse. When testing on \emph{Catacombs}
+the performance was very poor. Adding \emph{Catacombs} or a similar style to
+the training set can probably overcome this issue. Thus, the performance on
+alien data can probably be increased by using a bigger and more varied
+dataset.
+
 \section{Future research}
 \paragraph{Forced alignment: }
-Future interesting research includes doing the actual forced alignment. This
-probably requires entirely different models. The models used for real speech
-are probably not suitable because the acoustic properties of a regular
-singing-voice are very different from those of a growling voice, let alone
-speech.
+Interesting future research includes performing the actual forced alignment.
+It was found that pre-segmenting the audio makes lyrics forced alignment
+easier. Attempting the alignment itself will require different models, because
+the current models are not based on phones.
+
+Growling voices are acoustically very different from regular singing voices.
+Therefore, phonetic models created for speech will probably not be useful, and
+new models will have to be made when attempting forced alignment.
 
 \paragraph{Generalization: }
 Secondly, it would be interesting if a model could be trained that could
-discriminate a singing voice for all styles of singing including growling.
-Moreover, it is possible to investigate the performance of detecting growling
-on regular singing-voice trained models and the other way around.
+discriminate a singing voice in all styles of singing, including growling.
+This can be done by also training the models on regular singing styles.
+
+To really explore the limits of the method it would be interesting to
+investigate the performance of detecting one style with the other. This means
+using existing models that were trained on regular singing voices to detect
+grunting. The same experiments can be done the other way around as well.
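The cross-style experiment proposed in the paragraph above is easy to prototype. The sketch below is illustrative only: it assumes librosa for MFCC extraction and scikit-learn's MLPClassifier for the ANN, and the file lists and label arrays are placeholders standing in for the real annotated dataset.

```python
# Illustrative sketch of the cross-style experiment: train the frame
# classifier on regular singing voices and evaluate it on grunting.
# Assumptions: librosa for MFCC extraction, scikit-learn's MLPClassifier as
# the ANN, and placeholder (wav path, frame-label) pairs for the real data.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def mfcc_frames(path, n_mfcc=13, win_ms=40, step_ms=100, sr=22050):
    """One n_mfcc-dimensional feature vector per analysis window."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(sr * win_ms / 1000),
                                hop_length=int(sr * step_ms / 1000))
    return mfcc.T                                   # shape: (frames, n_mfcc)

def build(pairs):
    """pairs: iterable of (wav path, per-frame 0/1 singing labels)."""
    X = np.vstack([mfcc_frames(p) for p, _ in pairs])
    y = np.concatenate([labels for _, labels in pairs])
    return X, y

# Placeholder file lists; the real annotations live elsewhere.
regular_pairs = [("regular_song.wav", np.load("regular_song_labels.npy"))]
grunt_pairs = [("grunt_song.wav", np.load("grunt_song_labels.npy"))]

X_reg, y_reg = build(regular_pairs)
X_gru, y_gru = build(grunt_pairs)

clf = MLPClassifier(hidden_layer_sizes=(13,), max_iter=500)
clf.fit(X_reg, y_reg)                               # trained on regular singing
print("accuracy on grunting:", clf.score(X_gru, y_gru))
```

Running the same two calls with the datasets swapped gives the opposite direction of the experiment.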
 
 \paragraph{Decorrelation }
-Another interesting research continuation would be to investigate whether the
-decorrelation step of the feature extraction is necessary. This transformation
-might be inefficient or unnatural. The first layer of weights in the model
-could be seen as a first processing step. If another layer is added that layer
-could take over the role of the decorrelating. The downside of this is that
-training the model is tougher because there are a many more weights to train.
+Adding another layer to the \gls{MLP} can be seen as applying an extra
+processing step to the input data. It could be that the last step in
+converting the waveforms to \gls{MFCC} features, the decorrelation, can be
+performed by the neural network itself. The current decorrelation step might
+be inefficient or unnatural, whereas the \gls{ANN} trains its weights in such
+a way that the performance is maximized. It would be interesting to see
+whether this results in a different decorrelation step. The downside is that
+training the model becomes harder because there are many more weights to
+train.
 
 \paragraph{Genre detection: }
 \emph{Singing}-voice detection and \emph{singer}-voice can be seen as a crude
-way of genre-detection. Therefore it might be interesting to figure out whether
-this is generalizable to general genre recognition. This requires more data
-from different genres to be added to the dataset and the models to be
-retrained.
+way of genre-detection. The results have shown that this is a problem that
+can be tackled with the presented methods. In the literature, similar methods
+have been used to discriminate genres in regular music styles, and it has even
+been attempted to discriminate genres within extreme music styles. Therefore
+it might be interesting to figure out whether this specific method generalizes
+to genre recognition in general. This requires more data from different genres
+to be added to the dataset and the models to be retrained. Again, it would be
+interesting to see what the models output when offered regular music, and the
+other way around. Maybe the characteristics of some regular music genres are
+similar to those of extreme music genres.
 
 \paragraph{\glspl{HMM}: }
-A lot of similar research on singing-voice detection uses \glspl{HMM} and
-existing phone models. It would be interesting to try the same approach on
-extreme singing styles to see whether the phone models can say anything about a
-growling voice.
+A lot of similar research on singing-voice detection has used \glspl{HMM} as
+the basis for the models. Moreover, some have used existing speech-trained
+phone models directly to discriminate music from non-music. An \gls{HMM}
+approach would probably perform similarly to the current method. It would be
+interesting to try using existing speech models for singing-voice recognition
+in extreme singing styles, to see whether the phone models can say anything
+about a growling voice.
 
-%Discussion section
-\section{Discussion}
-The dataset used is not very big. Only three albums are annotated and used
-as training data. The albums chosen do represent the ends of the spectrum and
-therefore the resulting model can be very general. However, it could also mean
-that the model is able to recognize three islands in the entire space of
-grunting. This does not seem the case since the results show that almost all
-alien data also has a good performance. However, the data has been picked to
-represent the edges of the spectrum. While testing \emph{Catacombs} it seemed
-that this was not the case since the performance was very poor. Adding
-\emph{Catacombs} or a similar style to the training set can probably overcome
-this limitation.
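The decorrelation idea above can be prototyped in the same style. Dropping the final DCT of the MFCC pipeline leaves log-mel filterbank energies, and an extra hidden layer is then free to learn its own decorrelation. The sketch below is illustrative, reuses the placeholder helpers and file lists of the previous sketch, and again assumes librosa and scikit-learn.

```python
# Illustrative sketch: let the network absorb the decorrelation (DCT) step by
# training on log-mel filterbank energies with one extra hidden layer, and
# compare against the MFCC baseline. Reuses build()/mfcc_frames() and the
# placeholder file lists defined in the previous sketch.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def logmel_frames(path, n_mels=26, win_ms=40, step_ms=100, sr=22050):
    """Log-mel filterbank energies: MFCCs without the final DCT."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         n_fft=int(sr * win_ms / 1000),
                                         hop_length=int(sr * step_ms / 1000))
    return librosa.power_to_db(mel).T               # shape: (frames, n_mels)

def build_mel(pairs):
    X = np.vstack([logmel_frames(p) for p, _ in pairs])
    y = np.concatenate([labels for _, labels in pairs])
    return X, y

X_mel, y_mel = build_mel(regular_pairs + grunt_pairs)
X_mfcc, y_mfcc = build(regular_pairs + grunt_pairs)

baseline = MLPClassifier(hidden_layer_sizes=(13,), max_iter=500)    # MFCC input
learned = MLPClassifier(hidden_layer_sizes=(26, 13), max_iter=500)  # extra layer
baseline.fit(X_mfcc, y_mfcc)
learned.fit(X_mel, y_mel)
# Compare both on a held-out set (not shown) to judge the learned decorrelation.
```

Evaluating both classifiers on the same held-out frames would show whether the learned transformation can match the fixed DCT.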
diff --git a/results.tex b/results.tex
index bb05b6b..57f6779 100644
--- a/results.tex
+++ b/results.tex
@@ -1,12 +1,9 @@
 \section{\emph{Singing}-voice detection}
 Table~\ref{tbl:singing} shows the results for the singing-voice detection. The
-performance is given by the accuracy (and loss). The accuracy is the percentage
-of correctly classified samples.
-
-Figure~\ref{fig:bclass} shows an example of a segment of a song with the
-classifier plotted underneath. For this illustration the $13$ node model is
-used with a analysis window size and step of $40$ and $100ms$ respectively. The
-output is smoothed using a hanning window.
+performance is given as the fraction of correctly classified samples
+(accuracy). The rows show the number of hidden nodes; the columns show the
+analysis window step size and the analysis window length used in the
+\gls{MFCC} extraction.
 
 \begin{table}[H]
 \centering
@@ -26,6 +23,14 @@ output is smoothed using a hanning window.
 \label{tbl:singing}
 \end{table}
 
+Figure~\ref{fig:bclass} shows an example of a segment of a song with the
+classifier output plotted underneath. For this illustration the $13$-node
+model is used with an analysis window size and step of $40$ and $100ms$
+respectively. The output is smoothed using a Hanning window. The figure shows
+that the model focuses on the frequencies around $300Hz$ that contain the
+growling. When there is a brief silence in between the growls, the classifier
+output immediately drops. This phenomenon is visible throughout the songs.
+
 \begin{figure}[H]
 \centering
 \includegraphics[width=1\linewidth]{bclass}
-- 
2.20.1
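The Hanning-window smoothing mentioned in the new figure paragraph is a small post-processing step. The sketch below is illustrative; the toy probability track stands in for the real per-frame classifier output, and the window width is an arbitrary example value.

```python
# Illustrative sketch of the post-processing described above: smooth the
# per-frame classifier output with a normalised Hanning window.
import numpy as np

def smooth(probs, width=11):
    """Smooth a 1-D probability track; width is the Hanning window length."""
    win = np.hanning(width)
    win /= win.sum()                     # normalise so the output stays in [0, 1]
    return np.convolve(probs, win, mode="same")

# Toy track standing in for the per-frame P(singing) values.
probs = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1], dtype=float)
print(np.round(smooth(probs, width=5), 2))
```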