\section{Conclusion}
This study shows that existing techniques for singing-voice detection
designed for regular singing voices also work on \gls{dm} and \gls{dom} that
contain extreme singing styles like grunting. With a standard \gls{ANN}
classifier using \gls{MFCC} features, a performance of $85\%$ can be achieved,
which is similar to the performance of the same techniques on regular singing.
This means that the method might also be suitable as a pre-processing step for
forced alignment of lyrics. Moreover, the \emph{singer}-voice recognition
experiments scored similarly.

To determine whether the model generalizes, alien data was offered to the
model to see how it performs. For similar singing styles the models perform
similarly. Alien data containing different singing styles, atmospheric noise
and accompaniment is classified less accurately.

From the results we can conclude that the model generalizes well over the
training set, even with few hidden nodes. The models with 3 or 5 hidden
nodes score a little worse than their bigger brothers, but there is hardly any
difference between the performance of a model with 8 or 13 nodes. Moreover,
contrary to expectations, the window size does not seem to have much effect on
the performance.

%Discussion section
\section{Discussion}
The dataset used contains only three albums and might not be considered varied.
However, the albums were picked to represent the ends of the growling spectrum.
Therefore the resulting model can be very general. On the other hand, it could
also result in a model that is overfitted to three islands in the entire space
of grunting voices.

In this case it seems that the model generalizes well. Alien data that is
similar to the training data results in good performance when offered to the
model. However, the model does not perform as well on alien data that has a
very different style. When testing on \emph{Catacombs} the performance was very
poor. Adding \emph{Catacombs} or a similar style to the training set can
probably overcome this performance issue. Thus, the performance on alien data
can probably be increased by using a bigger and more varied dataset.

\section{Future research}
\paragraph{Forced alignment: }
Interesting future research includes doing the actual forced alignment. It was
found that pre-segmenting the audio made forced alignment of lyrics easier.
Attempting this will require different models because the current models are
not based on phones.

Growling voices are acoustically very different from regular singing voices.
Therefore, regular phonetic models created for speech will probably not be
useful, and new models must be made when attempting forced alignment.

\paragraph{Generalization: }
Secondly, it would be interesting to see whether a model can be trained that
discriminates a singing voice for all styles of singing, including growling.
This can be done by also training the models on regular singing styles.

To really explore the limits of the methods, it would be interesting to
investigate the performance of detecting one style with a model trained on the
other. This means using existing models that were trained on regular singing
voices to detect grunting. The same experiment can be done the other way
around as well.

\paragraph{Decorrelation: }
Adding another layer to the \gls{MLP} can be seen as applying an extra
normalization step to the input data. It could be that the last step in
converting the waveforms to \gls{MFCC} features can be performed by the neural
network. The current decorrelation step might be inefficient or unnatural. The
\gls{ANN} trains the weights in such a way that performance is maximized. It
would be interesting to see whether this results in a different normalization
step. The downside is that training the model becomes more complex because
there are many more weights to train.

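As a hedged illustration of this idea, assume the usual formulation in which
the final \gls{MFCC} step is a discrete cosine transform of the $M$ log
filterbank energies $E_m$. The symbols $E_m$, $M$, $N$, $W_{nm}$ and $b_n$ are
introduced here for illustration only and do not refer to the implementation
used in this study.
\begin{equation}
c_n = \sum_{m=1}^{M} \log(E_m)
      \cos\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right],
\qquad n = 1, \ldots, N.
\end{equation}
Ignoring the activation function, a single network layer computes
\begin{equation}
h_n = \sum_{m=1}^{M} W_{nm} x_m + b_n .
\end{equation}
With inputs $x_m = \log(E_m)$, weights
$W_{nm} = \cos\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right]$ and
biases $b_n = 0$, this layer reproduces the coefficients $c_n$ exactly. Under
these assumptions, a network that receives the log filterbank energies and has
one extra layer contains the fixed transform as a special case, and training
is free to replace it with a different set of $N \times M$ weights.
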
\paragraph{Genre detection: }
\emph{Singing}-voice detection and \emph{singer}-voice recognition can be seen
as a crude form of genre detection. The results have shown that this problem
can be tackled with the methods used. In the literature, similar methods have
been used to discriminate genres in regular music styles, and it has even been
attempted to discriminate genres within extreme music styles. Therefore it
might be interesting to find out whether this specific method generalizes to
general genre recognition. This requires adding more data from different
genres to the dataset and retraining the models. Again, it would be
interesting to see what comes out of the models when offering regular music to
models trained on extreme music, and the other way around. Maybe the
characteristics of some regular music genres are similar to those of extreme
music genres.

\paragraph{\glspl{HMM}: }
Much similar research on singing-voice detection has used \glspl{HMM} as
the basis for the models. Moreover, some have used existing speech-trained
phone models directly to discriminate music from non-music. A
\gls{HMM} approach would probably perform similarly to the current method. It
would be interesting to try existing speech models for singing-voice
recognition in extreme singing styles to see whether the phone models can say
anything about a growling voice.
