%Discussion section
\section{Discussion}
The dataset contains only three albums and might therefore not be considered
varied. However, the albums were picked to represent the extremes of the
growling spectrum, so the resulting model could be very general. On the other
hand, it could also result in a model that is overfitted to three islands in
the entire space of grunting voices.

In this case the model seems to generalize well. Alien data that is similar to
the training data is classified well. However, alien data in a very different
style is not classified as well: performance on \emph{Catacombs} was very
poor. Adding \emph{Catacombs} or an album in a similar style to the training
set can probably overcome this issue. Thus, performance on alien data can
probably be increased by using a bigger and more varied dataset that includes
more outliers in the space of growling voices.

The performance reached in the experiments is very similar to that reported in
the literature. This was expected: although growling voices have different
spectral characteristics, they are still produced by the vocal tract and
physically limited by it.

\section{Future research}
\paragraph{Forced alignment: }
Interesting future research includes performing the actual forced alignment.
Pre-segmenting the audio has been found to make forced alignment of lyrics
easier. Attempting this will require different models because the current
models are not based on phones.

Growling voices are acoustically very different from regular singing voices.
Therefore, regular phonetic models created for speech will probably not be
useful, and new models must be made when attempting forced alignment.

\paragraph{Generalization: }
Secondly, it would be interesting to train a model that can discriminate a
singing voice in all styles of singing, including growling. This can be done
by also training the models on regular singing styles.

To really explore the limits of the methods, it would be interesting to
investigate how well one style can be detected with a model trained on the
other. This means using existing models that were trained on regular singing
voices to detect grunting. The same experiment can be done the other way
around as well.

\paragraph{Decorrelation: }
Adding another layer to the \gls{MLP} can be seen as applying an extra
normalization step to the input data. The last step in converting the
waveforms to \gls{MFCC} could then be performed by the neural network itself.
The current decorrelation step might be inefficient or unnatural, whereas the
\gls{ANN} trains its weights in such a way that performance is maximized. It
would be interesting to see whether this results in a different normalization
step. The downside is that training the model becomes more complex because
there are many more weights to train.
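%Sketch of the decorrelation step as a network layer

As a minimal sketch of this idea, assume the usual \gls{MFCC} pipeline in
which the final, decorrelating step is a discrete cosine transform (DCT) of
the log mel filterbank energies. The exact variant used in the experiments is
not stated here, and the symbols $M$, $K$, $H$, $m_j$, $c_k$, $D$, $W$ and $V$
are introduced only for illustration. With $M$ filterbank energies
$m_1,\dots,m_M$, the $K$ coefficients are
\begin{equation}
  c_k = \sum_{j=1}^{M} \log(m_j)
        \cos\left[\frac{\pi k}{M}\left(j-\frac{1}{2}\right)\right],
  \qquad k = 0,\dots,K-1 .
\end{equation}
This is a fixed linear map $\mathbf{c} = D\,\mathbf{x}$ with
$x_j = \log(m_j)$ and
$D_{kj} = \cos\left[\frac{\pi k}{M}\left(j-\frac{1}{2}\right)\right]$, in
other words a network layer without bias or nonlinearity whose weights are
not trained. Replacing $D$ by a trainable $K \times M$ matrix $W$ would give
\begin{equation}
  \mathbf{h} = \sigma\left(V\,W\,\mathbf{x} + \mathbf{b}\right),
\end{equation}
where $V$ (of size $H \times K$), $\mathbf{b}$ and $\sigma$ stand for the
weights, bias and activation function of the existing first hidden layer with
$H$ nodes. Setting $W = D$ recovers the current features, so under these
assumptions the extra layer could be initialized with the DCT and then be left
free to move away from it during training. The price is $K \cdot M$
additional trainable weights.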

\paragraph{Genre detection: }
\emph{Singing}-voice detection and \emph{singer}-voice recognition can be seen
as a crude form of genre detection. The results show that this problem can be
tackled with the methods used. In the literature, similar methods have been
used to discriminate genres in regular music styles, and it has even been
attempted to discriminate genres within extreme music styles. Therefore it
might be interesting to find out whether this specific method generalizes to
general genre recognition. This requires data from more genres to be added to
the dataset and the models to be retrained. Again, it would be interesting to
see what the models output when they are offered regular music, and the other
way around. Maybe the characteristics of some regular music genres are similar
to those of extreme music genres.

\paragraph{\glspl{HMM}: }
Much similar research on singing-voice detection has used \glspl{HMM} as the
basis for the models. Moreover, some studies have directly used existing
phone models, trained on speech, to discriminate music from non-music. An
\gls{HMM} approach would probably perform similarly to the current method. It
would be interesting to try existing speech models on singing-voice
recognition in extreme singing styles to see whether the phone models can say
anything about a growling voice.

\section{Conclusion}
This study shows that existing techniques for singing-voice detection,
designed for regular singing voices, also work on \gls{dm} and \gls{dom} that
contain extreme singing styles like grunting. With a standard \gls{ANN}
classifier using \gls{MFCC} features, a performance of $85\%$ can be achieved,
which is similar to that of the same techniques on regular singing. This
means that it might also be suitable as a pre-processing step for forced
alignment of lyrics. Moreover, the \emph{singer}-voice recognition experiments
scored similarly.

To determine whether the model generalizes, alien data was offered to the
model. For similar singing styles the models perform similarly. Alien data
containing different singing styles, atmospheric noise and accompaniment is
classified worse.

From the results we can conclude that the model generalizes well over the
training set, even with a small number of hidden nodes. The models with 3 or 5
hidden nodes score a little worse than their larger counterparts, but there is
hardly any difference between the performance of a model with 8 and one with
13 nodes. Moreover, contrary to expectations, the window size does not seem to
have much effect on the performance.