classifier using \gls{MFCC} features a performance of $85\%$ can be achieved,
which is comparable to the performance of the same techniques on regular
singing. This means
that it might also be suitable as a pre-processing step for lyrics forced
-alignment.
+alignment. Moreover, the \emph{singer}-voice recognition experiments scored
+similarly.
To determine whether the model generalizes, alien data has been offered to the
model to see how it performs. It was shown that for similar singing styles the
model still performs well. Contrary to what was expected, the window size does
not seem to influence the performance much.
+%Discussion section
+\section{Discussion}
+The dataset used contains only three albums and might not be considered varied.
+However, the albums were picked to represent the ends of the growling spectrum.
+Therefore, the resulting model can be very general. On the other hand, it could
+also result in a model that is overfitted to these three islands in the entire
+space of grunting voices.
+
+In this case it seems that the model generalizes well. Alien data that is
+similar to the training data results in a good performance. However, alien data
+with a very different style does not perform as well. When testing on
+\emph{Catacombs}, the performance was very poor. Adding \emph{Catacombs} or a
+similar style to the training set can probably overcome this issue. Thus, the
+performance on alien data can likely be increased with a bigger and more varied
+dataset.
+
\section{Future research}
\paragraph{Forced alignment: }
-Future interesting research includes doing the actual forced alignment. This
-probably requires entirely different models. The models used for real speech
-are probably not suitable because the acoustic properties of a regular
-singing-voice are very different from those of a growling voice, let alone
-speech.
+Interesting future research includes performing the actual forced alignment.
+It was found that pre-segmenting the audio makes lyrics forced alignment
+easier. Attempting the alignment itself will require different models, because
+the models used here are not based on phones.
+
+Growling voices are acoustically very different from regular singing voices.
+Therefore, phonetic models created for regular speech will probably not be
+useful, and new models will have to be made when attempting forced alignment.
\paragraph{Generalization: }
Secondly, it would be interesting if a model could be trained that could
-discriminate a singing voice for all styles of singing including growling.
-Moreover, it is possible to investigate the performance of detecting growling
-on regular singing-voice trained models and the other way around.
+discriminate the singing voice for all styles of singing, including growling.
+This could be achieved by also training the models on regular singing styles.
+
+To really explore the limits of the methods, it would be interesting to
+investigate the performance of detecting one style with models trained on the
+other. This means using existing models that were trained on regular singing
+voices to detect grunting. The same experiment can also be done the other way
+around.
\paragraph{Decorrelation: }
-Another interesting research continuation would be to investigate whether the
-decorrelation step of the feature extraction is necessary. This transformation
-might be inefficient or unnatural. The first layer of weights in the model
-could be seen as a first processing step. If another layer is added that layer
-could take over the role of the decorrelating. The downside of this is that
-training the model is tougher because there are a many more weights to train.
+Adding another layer to the \gls{MLP} can be seen as applying an extra
+transformation to the input data. It could be that the last step in converting
+the waveforms to \gls{MFCC}, the decorrelation, can be performed by the neural
+network. The current decorrelation step might be inefficient or unnatural. The
+\gls{ANN} trains its weights in such a way that performance is maximized, so it
+would be interesting to see whether this results in a different transformation.
+The downside is that training the model becomes harder because there are many
+more weights to train.
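+A minimal sketch of this idea, assuming the log-mel filterbank energies of each
+analysis frame are available and assuming a Keras-style \gls{MLP}; the layer
+sizes and function names below are illustrative assumptions, not the
+configuration used in this work:
+\begin{verbatim}
+# Classic final MFCC step: a fixed DCT-II decorrelates the log-mel energies.
+from scipy.fftpack import dct
+
+def mfcc_from_logmel(logmel, n_coeffs=13):
+    return dct(logmel, type=2, axis=-1, norm='ortho')[..., :n_coeffs]
+
+# Sketched alternative: feed the log-mel energies directly to the MLP and
+# prepend a linear layer that is free to learn its own decorrelating
+# transform instead of applying the fixed DCT.
+from tensorflow.keras import layers, models
+
+def build_mlp(n_mels=26, n_hidden=13):
+    return models.Sequential([
+        layers.Dense(n_mels, activation=None,
+                     input_shape=(n_mels,)),   # learned replacement for the DCT
+        layers.Dense(n_hidden, activation='sigmoid'),
+        layers.Dense(1, activation='sigmoid'), # singing / no singing
+    ])
+\end{verbatim}
+The extra linear layer is exactly where the additional weights mentioned above
+come from.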
\paragraph{Genre detection: }
\emph{Singing}-voice detection and \emph{singer}-voice recognition can be seen as a crude
-way of genre-detection. Therefore it might be interesting to figure out whether
-this is generalizable to general genre recognition. This requires more data
-from different genres to be added to the dataset and the models to be
-retrained.
+way of genre detection. The results have shown that this is a problem that can
+be tackled with the methods used here. In the literature, similar methods have
+been used to discriminate genres in regular music styles, and it has even been
+attempted to discriminate genres within extreme music styles. Therefore it
+might be interesting to figure out whether this specific method generalizes to
+genre recognition in general. This requires more data from different genres to
+be added to the dataset and the models to be retrained. Again, it would be
+interesting to see what comes out of the models when they are offered regular
+music, and the other way around. Maybe the characteristics of some regular
+music genres are similar to those of extreme music genres.
\paragraph{\glspl{HMM}: }
-A lot of similar research on singing-voice detection uses \glspl{HMM} and
-existing phone models. It would be interesting to try the same approach on
-extreme singing styles to see whether the phone models can say anything about a
-growling voice.
+A lot of similar research on singing-voice detection has used \glspl{HMM} as
+the basis for the models. Moreover, some studies have used existing phone
+models, trained on speech, directly to discriminate music from non-music. An
+\gls{HMM} approach would probably perform similarly to the current method. It
+would be interesting to try using existing speech models for singing-voice
+recognition in extreme singing styles, to see whether the phone models can say
+anything about a growling voice.
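+One possible shape of such an \gls{HMM}-based variant is sketched below. It is
+only an illustration: it assumes per-frame \gls{MFCC} features and uses the
+hmmlearn library, neither of which is prescribed by this work, and it trains
+fresh models rather than reusing speech-trained phone models.
+\begin{verbatim}
+# Hypothetical sketch: fit one Gaussian HMM per class on sequences of MFCC
+# frames and classify a segment by comparing log-likelihoods.
+import numpy as np
+from hmmlearn import hmm
+
+def train_class_hmm(sequences, n_states=3):
+    # sequences: list of (n_frames, n_mfcc) arrays belonging to one class
+    X = np.vstack(sequences)
+    lengths = [len(s) for s in sequences]
+    model = hmm.GaussianHMM(n_components=n_states, covariance_type='diag')
+    model.fit(X, lengths)
+    return model
+
+def contains_growl(segment, hmm_growl, hmm_other):
+    # The model with the higher log-likelihood wins.
+    return hmm_growl.score(segment) > hmm_other.score(segment)
+\end{verbatim}
+The phone-model experiment described above would replace the freshly trained
+models with existing speech-trained ones.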
-%Discussion section
-\section{Discussion}
-The dataset used is not very big. Only three albums are annotated and used
-as training data. The albums chosen do represent the ends of the spectrum and
-therefore the resulting model can be very general. However, it could also mean
-that the model is able to recognize three islands in the entire space of
-grunting. This does not seem the case since the results show that almost all
-alien data also has a good performance. However, the data has been picked to
-represent the edges of the spectrum. While testing \emph{Catacombs} it seemed
-that this was not the case since the performance was very poor. Adding
-\emph{Catacombs} or a similar style to the training set can probably overcome
-this limitation.
\section{\emph{Singing}-voice detection}
Table~\ref{tbl:singing} shows the results for the singing-voice detection. The
-performance is given by the accuracy (and loss). The accuracy is the percentage
-of correctly classified samples.
-
-Figure~\ref{fig:bclass} shows an example of a segment of a song with the
-classifier plotted underneath. For this illustration the $13$ node model is
-used with a analysis window size and step of $40$ and $100ms$ respectively. The
-output is smoothed using a hanning window.
+performance is given by the fraction of correctly classified samples
+(accuracy). The rows represent the number of hidden nodes; the columns
+represent the analysis window step size and the analysis window length used in
+the \gls{MFCC} extraction.
\begin{table}[H]
\centering
\label{tbl:singing}
\end{table}
+Figure~\ref{fig:bclass} shows an example of a segment of a song with the
+classifier output plotted underneath. For this illustration the $13$ node model
+is used with an analysis window size and step of $40$ and $100ms$ respectively.
+The output is smoothed using a Hanning window. This figure shows that the model
+focuses on the frequencies around $300Hz$, which contain the growling. When
+there is a brief silence in between the growls, the classifier output
+immediately drops. This phenomenon is visible throughout the songs.
+
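+A minimal sketch of this smoothing step, assuming the classifier produces one
+probability per analysis frame; the window length below is an arbitrary
+illustrative choice:
+\begin{verbatim}
+import numpy as np
+
+def smooth(frame_probs, window_len=11):
+    # Normalized Hanning window keeps the smoothed values in [0, 1].
+    win = np.hanning(window_len)
+    win /= win.sum()
+    return np.convolve(frame_probs, win, mode='same')
+\end{verbatim}
+A wider window gives a smoother curve but reacts more slowly to the short
+silences between growls.
+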
\begin{figure}[H]
\centering
\includegraphics[width=1\linewidth]{bclass}