\section{Conclusion \& Future Research}
This study shows that existing singing-voice detection techniques, designed
for regular singing voices, also work respectably on extreme singing styles
such as grunting. A standard \gls{ANN} classifier using \gls{MFCC} features
achieves a performance of $85\%$, which is similar to what the same techniques
achieve on regular singing. The approach might therefore be suitable as a
pre-processing step for forced alignment of lyrics. The model performs well on
alien data that uses singing techniques similar to those in the training set.
However, it does not cope well with different singing techniques or with data
that contains a lot of atmospheric noise and accompaniment.

\subsection{Future research}
\paragraph{Forced alignment: }
Interesting future research includes performing the actual forced alignment.
This probably requires entirely different models. Models used for regular
speech are probably not suitable, because the acoustic properties of a growling
voice are very different from those of a regular singing voice, let alone those
of speech.

\paragraph{Generalization: }
Secondly, it would be interesting to train a model that can discriminate a
singing voice across all styles of singing, including growling. Moreover, one
could investigate how well models trained on regular singing voices detect
growling, and the other way around.

\paragraph{Decorrelation: }
Another interesting continuation would be to investigate whether the
decorrelation step of the feature extraction is necessary. This transformation
might be inefficient or unnatural. The first layer of weights in the model
could be seen as a first processing step. If another layer is added, that layer
could take over the role of the decorrelation. The downside is that training
the model becomes harder because there are many more weights to train.
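To make this concrete, assume the decorrelation step is the discrete cosine
transform commonly used to compute \gls{MFCC} features; the symbols below are
illustrative and are not taken from the feature extraction used in this study.
With $E_m$ the energy in the $m$-th of $M$ mel filter-bank channels, the $n$-th
coefficient would be
\[
  c_n = \sum_{m=1}^{M} \log(E_m)\,
        \cos\left(\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right).
\]
Under this assumption the step is a fixed linear map applied to the log
energies, which is the kind of operation a layer of trainable weights can
represent.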

\paragraph{Genre detection: }
\emph{Singing}-voice detection and \emph{singer}-voice detection can be seen as
a crude form of genre detection. It might therefore be interesting to
investigate whether this generalizes to genre recognition in general. This
requires adding more data from different genres to the dataset and retraining
the models.

\paragraph{\glspl{HMM}: }
Much of the similar research on singing-voice detection uses \glspl{HMM} and
existing phone models. It would be interesting to try the same approach on
extreme singing styles to see whether the phone models can say anything about a
growling voice.

%Discussion section
\section{Discussion}
The dataset used is not very big: only three albums are annotated and used as
training data. The albums were chosen to represent the ends of the spectrum, so
the resulting model can be very general. However, it could also mean that the
model only recognizes three islands in the entire space of grunting. This does
not seem to be the case, since the results show that the model performs well on
almost all alien data. Nevertheless, although the data was picked to represent
the edges of the spectrum, testing on \emph{Catacombs} suggested that it does
not fully do so, since the performance there was very poor. Adding
\emph{Catacombs} or a similar style to the training set can probably overcome
this limitation.