\section{Conclusion}
This study shows that existing singing-voice detection techniques, designed
for regular singing voices, also work on \gls{dm} and \gls{dom}, which
contain extreme singing styles such as grunting. A standard \gls{ANN}
classifier using \gls{MFCC} features achieves a performance of $85\%$, which
is similar to the performance of the same techniques on regular singing. This
suggests that the approach might also be suitable as a pre-processing step
for forced alignment of lyrics.

To determine whether the model generalizes, it was evaluated on alien data.
For similar singing styles the models perform similarly. Alien data
containing different singing styles, atmospheric noise and accompaniment is
classified less accurately.

From the results we can conclude that the model generalizes well over the
training set, even with few hidden nodes. The models with 3 or 5 hidden
nodes score slightly worse than the larger models, but there is hardly any
difference between the performance of models with 8 and 13 nodes. Moreover,
contrary to expectation, the window size does not seem to have much effect on
the performance.

\section{Future research}
\paragraph{Forced alignment: }
An interesting direction for future research is performing the actual forced
alignment. This probably requires entirely different models. The models used
for real speech are probably not suitable because the acoustic properties of
a regular singing voice are very different from those of a growling voice,
let alone speech.

\paragraph{Generalization: }
Secondly, it would be interesting to train a model that can detect a singing
voice across all styles of singing, including growling. Moreover, one could
investigate how well models trained on regular singing voices detect
growling, and vice versa.

\paragraph{Decorrelation: }
Another interesting continuation would be to investigate whether the
decorrelation step of the feature extraction is necessary. This
transformation might be inefficient or unnatural. The first layer of weights
in the model can be seen as a first processing step. If another layer is
added, that layer could take over the role of the decorrelation. The downside
is that training the model becomes harder because there are many more weights
to train.

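For reference, and assuming the standard \gls{MFCC} pipeline (an assumption;
the exact feature configuration is not restated here), the decorrelation step
is a discrete cosine transform of the log filterbank energies. With $M$ mel
filters with energies $E_m$ and $N$ retained coefficients (illustrative
symbols, not taken from the experiments), the $n$-th coefficient is
\begin{equation}
c_n = \sum_{m=1}^{M} \log(E_m)
      \cos\left(\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right),
      \qquad n = 0, \ldots, N-1.
\end{equation}
Since this is a fixed linear map of the log energies, a network that is fed
the log energies directly could in principle learn it, or a more suitable
transformation, in an additional first layer of weights.
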
\paragraph{Genre detection: }
\emph{Singing}-voice detection and \emph{singer}-voice detection can be seen
as a crude form of genre detection. Therefore it might be interesting to
investigate whether this generalizes to general genre recognition. This
requires adding more data from different genres to the dataset and retraining
the models.

\paragraph{\glspl{HMM}: }
Much similar research on singing-voice detection uses \glspl{HMM} and
existing phone models. It would be interesting to try the same approach on
extreme singing styles to see whether the phone models can say anything about
a growling voice.

%Discussion section
\section{Discussion}
The dataset used is not very big: only three albums were annotated and used
as training data. The albums were chosen to represent the ends of the
spectrum, and therefore the resulting model can be very general. However, it
could also mean that the model only recognizes three islands in the entire
space of grunting. This does not seem to be the case, since the results show
that the model also performs well on almost all alien data. A caveat is that
the training data was picked to represent the edges of the spectrum, and
testing on \emph{Catacombs} suggested that it does not fully do so, since the
performance there was very poor. Adding \emph{Catacombs} or a similar style
to the training set can probably overcome this limitation.