\section{Conclusion}
This study shows that existing singing-voice detection techniques designed for
regular singing voices also work respectably on extreme singing styles such as
grunting. With a standard \gls{ANN} classifier using \gls{MFCC} features, a
performance of $85\%$ can be achieved, which is similar to what the same
techniques reach on regular singing. This suggests that the approach might be
suitable as a pre-processing step for forced alignment of lyrics. The model
performs well on alien data that uses singing techniques similar to those in
the training set. However, it does not cope well with different singing
techniques or with data that contains a lot of atmospheric noise and
accompaniment.

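To make the pipeline concrete, the following is a minimal sketch of frame-wise
detection with \gls{MFCC} features and a small feed-forward network. The file
names, the 13-coefficient setting and the use of the \texttt{librosa} and
\texttt{scikit-learn} libraries are illustrative assumptions, not the exact
implementation used in this study.
\begin{verbatim}
# Minimal sketch: frame-wise singing-voice detection with MFCC features
# and a small feed-forward ANN. File names and labels are hypothetical.
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

def mfcc_frames(path, n_mfcc=13):
    """Return one MFCC vector per analysis frame of the audio file."""
    samples, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                        # shape: (n_frames, n_mfcc)

X = mfcc_frames("training_track.wav")    # hypothetical annotated track
y = np.load("frame_labels.npy")          # 1 = voice frame, 0 = no voice

clf = MLPClassifier(hidden_layer_sizes=(13,), max_iter=500)
clf.fit(X, y)
print("frame accuracy:", clf.score(X, y))
\end{verbatim}
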
From the results we conclude that the model generalizes well over the training
set, even with few hidden nodes. The models with 3 or 5 hidden nodes score
slightly worse than the larger ones, but there is hardly any difference in
performance between a model with 8 and one with 13 nodes. Moreover, contrary to
expectations, the window size does not seem to have much effect on performance.

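A rough illustration of how such a comparison could be run, reusing the
hypothetical \texttt{X} and \texttt{y} from the sketch above; the train/test
split, the sizes and the scoring call are assumptions, and the analysis-window
size would be varied analogously during feature extraction.
\begin{verbatim}
# Retrain the same classifier with different numbers of hidden nodes.
# X and y are the MFCC frames and labels from the previous sketch.
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

for n_hidden in (3, 5, 8, 13):
    clf = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=500)
    clf.fit(X_train, y_train)
    print(n_hidden, clf.score(X_test, y_test))
\end{verbatim}
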
\subsection{Future research}
\paragraph{Forced alignment: }
An interesting direction for future research is performing the actual forced
alignment. This probably requires entirely different models. The models used
for regular speech are probably not suitable because the acoustic properties of
a growling voice are very different from those of a regular singing voice, let
alone from those of speech.

\paragraph{Generalization: }
Secondly, it would be interesting to see whether a model can be trained that
detects singing voices across all styles of singing, including growling.
Moreover, one could investigate how well models trained on regular singing
voices detect growling, and vice versa.

\paragraph{Decorrelation: }
Another interesting continuation would be to investigate whether the
decorrelation step of the feature extraction is necessary. This transformation
might be inefficient or unnatural. The first layer of weights in the model can
be seen as a first processing step; if another layer is added, that layer could
take over the role of the decorrelation. The downside is that training the
model becomes harder because there are many more weights to train.

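Concretely, the decorrelation in \gls{MFCC} extraction is the discrete cosine
transform applied to the log-mel energies. A minimal sketch of the proposed
experiment, under the same hypothetical libraries and file names as above,
would feed the undecorrelated log-mel energies into a network with one extra
hidden layer.
\begin{verbatim}
# Skip the DCT (decorrelation) step: use log-mel energies directly and
# give the network an extra hidden layer that can learn a similar
# transform itself. Layer sizes and file names are illustrative.
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

def logmel_frames(path, n_mels=40):
    """Log-mel filterbank energies per frame, i.e. MFCCs without the DCT."""
    samples, sr = librosa.load(path, sr=None, mono=True)
    mel = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).T    # shape: (n_frames, n_mels)

X_mel = logmel_frames("training_track.wav")
y = np.load("frame_labels.npy")

# The first hidden layer gets the chance to take over the decorrelation;
# the price is a larger number of weights to train.
clf = MLPClassifier(hidden_layer_sizes=(40, 13), max_iter=500)
clf.fit(X_mel, y)
\end{verbatim}
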
\paragraph{Genre detection: }
\emph{Singing}-voice detection and \emph{singer}-voice detection can be seen as
a crude form of genre detection. It might therefore be interesting to
investigate whether this approach generalizes to genre recognition in general.
This requires adding more data from different genres to the dataset and
retraining the models.

\paragraph{\glspl{HMM}: }
Much similar research on singing-voice detection uses \glspl{HMM} and existing
phone models. It would be interesting to apply the same approach to extreme
singing styles to see whether the phone models can say anything meaningful
about a growling voice.

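As a very rough stand-in, and explicitly not the phone-model approach used in
that literature, a two-state Gaussian \gls{HMM} over \gls{MFCC} frames already
gives an HMM-style segmentation; the \texttt{hmmlearn} library and the
two-state setup are assumptions made only to illustrate the idea.
\begin{verbatim}
# Two-state Gaussian HMM over MFCC frames, decoded with Viterbi.
# Which state corresponds to "voice" has to be determined by inspection.
import librosa
from hmmlearn import hmm

samples, sr = librosa.load("unseen_track.wav", sr=None, mono=True)
X = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13).T

model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
model.fit(X)
states = model.predict(X)                # per-frame state sequence
\end{verbatim}
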
%Discussion section
\section{Discussion}
The dataset used is not very big: only three albums are annotated and used as
training data. The albums were chosen to represent the ends of the spectrum, so
the resulting model could be quite general. However, it could also mean that
the model only recognizes three islands in the entire space of grunting. This
does not seem to be the case, since the results show good performance on almost
all alien data as well. Still, the data was picked to represent the edges of
the spectrum, and the test on \emph{Catacombs} suggested that this coverage is
incomplete, since performance on that album was very poor. Adding
\emph{Catacombs} or a similar style to the training set can probably overcome
this limitation.