true final

[asr1617.git] / conclusion.tex
diff --git a/conclusion.tex b/conclusion.tex

index 9d5176f..c2ea759 100644 (file)
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,18 +1,96 @@
-\section{Conclusion}
-This research shows that existing techniques for singing-voice detection
-designed for regular singing voices also work respectably on extreme singing
-styles like grunting. With a standard \gls{ANN} classifier using \gls{MFCC}
-features a performance of $85\%$ can be achieved. When applying smoothing this
-can be increased until\todo{results}.
-
  %Discussion section
  \section{Discussion}
-Singing-voice detection can be seen as a crude way of
-genre-discrimination.\todo{finish}
+The dataset contains only three albums and might not be considered varied.
+However, the albums were picked to represent the ends of the growling spectrum.
+Therefore the resulting model can be very general. On the other hand, it could
+also result in a model that is overfitted to three islands in the entire
+space of grunting voices.
+
+In this case it seems that the model generalizes well. Alien data that is
+similar to the training data is classified well when offered to the model.
+However, alien data in a very different style is not classified as well.
+When testing on \emph{Catacombs} the performance was very poor.
+Adding \emph{Catacombs} or a similar style to the training set can probably
+overcome this performance issue. Thus, the performance on alien data can
+probably be increased by having a bigger and more varied dataset that includes
+more outliers in the space of growling voices.
+
+The performance reached in the experiments is very similar to that reported in
+the literature. This was expected: growling voices have different spectral
+characteristics but are still produced by the vocal tract and physically
+limited by it.
+
+\section{Future research}
+\paragraph{Forced alignment: }
+Interesting future research includes performing the actual forced alignment. It
+was found that pre-segmenting the audio made forced alignment of lyrics easier.
+Attempting this will require different models because the current models are
+not based on phones.
+
+Growling voices are acoustically very different from regular singing voices.
+Therefore, regular phonetic models created for speech will probably not be
+useful and new models must be made when attempting forced alignment.
+
+\paragraph{Generalization: }
+Secondly, it would be interesting to train a model that can detect a singing
+voice in all styles of singing, including growling. This can be done by also
+training the models on regular singing styles.
+
+To really explore the limits of the methods it would be interesting to
+investigate the performance of detecting one with the other. This means using
+existing models that were trained on regular singing voices to detect grunting.
+The same experiments can be done the other way around as well.
+
+\paragraph{Decorrelation: }
+Adding another layer to the \gls{MLP} can be seen as applying an extra
+normalization step to the input data. It could be that the last step in
+converting the waveforms to \gls{MFCC} can be performed by the neural network.
+The current decorrelation step might be inefficient or unnatural. The \gls{ANN}
+trains its weights in such a way that performance is maximized. It would be
+interesting to see whether this results in a different normalization step. The
+downside of this is that training the model is more complex because there are
+many more weights to train.
+
+\paragraph{Genre detection: }
+\emph{Singing}-voice detection and \emph{singer}-voice recognition can be seen
+as a crude form of genre detection. The results have shown that this problem
+can be tackled with the methods used. In the literature, similar methods have
+been used to discriminate genres in regular music styles, and attempts have
+even been made to discriminate genres within extreme music styles. Therefore
+it might be interesting to find out whether this specific method
+generalizes to general genre recognition. This requires more data from
+different genres to be added to the dataset and the models to be retrained.
+Again, it would be interesting to see what the models produce when offered
+regular music, and the other way around. Maybe the characteristics of some
+regular music genres are similar to those of extreme music genres.
+
+\paragraph{\glspl{HMM}: }
+A lot of similar research on singing-voice detection has used \glspl{HMM} as
+the basis for the models. Moreover, some have used existing speech-trained
+phone models directly to discriminate music from non-music. A
+\gls{HMM} approach would probably perform similarly to the current method. It
+would be interesting to try using existing speech models on singing-voice
+recognition in extreme singing styles to see whether the phone models can say
+anything about a growling voice.
+
+\section{Conclusion}
+This study shows that existing techniques for singing-voice detection
+designed for regular singing voices also work on \gls{dm} and \gls{dom} that
+contain extreme singing styles like grunting. With a standard \gls{ANN}
+classifier using \gls{MFCC} features a performance of $85\%$ can be achieved,
+which is similar to that of the same techniques on regular singing. This means
+that it might also be suitable as a pre-processing step for forced alignment
+of lyrics. Moreover, the \emph{singer}-voice recognition experiments scored
+similarly.
  
-\todo{Novelty}
-\todo{Weaknesses}
-\todo{Dataset is not very varied but\ldots}
+To determine whether the model generalizes, alien data was offered to the
+model to see how it performs. It was shown that for similar singing styles the
+models perform similarly. Alien data containing different singing
+styles, atmospheric noise and accompaniment is classified worse.
  
-\todo{Doom metal}
-%Conclusion section
+From the results we can conclude that the model generalizes well over the
+training set, even with a small number of hidden nodes. The models with 3 or 5
+hidden nodes score a little worse than the larger models, but there is
+hardly any difference between the performance of models with 8 and 13 nodes.
+Moreover, contrary to expectations the window size does not seem to have much
+effect on the performance.