% conclusion.tex

%Discussion section
\section{Discussion}
The dataset contains only three albums and might therefore not be considered
varied. However, the albums were picked to represent the extremes of the
growling spectrum, so the resulting model can be very general. On the other
hand, it could also result in a model that is overfitted to three islands in
the entire space of grunting voices.

In this case the model seems to generalize well. Alien data that is similar to
the training data is classified well. However, alien data in a very different
style is not classified as well: when testing on \emph{Catacombs} the
performance was very poor. Adding \emph{Catacombs} or an album in a similar
style to the training set can probably overcome this issue. Thus, the
performance on alien data can probably be increased by using a bigger and more
varied dataset that includes more outliers in the space of growling voices.

The performance reached in the experiments is very similar to that reported in
the literature. This was expected because growling voices have different
spectral characteristics than regular singing voices but are still produced by
the vocal tract and are physically limited by it.

\section{Future research}
\paragraph{Forced alignment: }
An interesting direction for future research is performing the actual forced
alignment. It was found that pre-segmenting the audio made forced alignment of
lyrics easier. Attempting this will require different models because the
current models are not based on phones.

Growling voices are acoustically very different from regular singing voices.
Therefore, regular phonetic models created for speech will probably not be
useful, and new models must be made when attempting forced alignment.

\paragraph{Generalization: }
Secondly, it would be interesting to train a model that can discriminate a
singing voice for all styles of singing, including growling. This can be done
by also training the models on regular singing styles.

To really explore the limits of the methods, it would be interesting to
investigate how well one style can be detected with models trained on the
other. This means using existing models that were trained on regular singing
voices to detect grunting. The same experiment can be done the other way
around.

\paragraph{Decorrelation: }
Adding another layer to the \gls{MLP} can be seen as applying an extra
normalization step to the input data. It could be that the last step in
converting the waveforms to \gls{MFCC} can be performed by the neural network.
The current decorrelation step might be inefficient or unnatural. The \gls{ANN}
trains the weights in such a way that performance is maximized. It would be
interesting to see whether this results in a different normalization step. The
downside is that training the model becomes more complex because there are
many more weights to train.

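
As a sketch of why this is plausible, assume the common \gls{MFCC} formulation
in which the final step is a discrete cosine transform of the $M$ log
filterbank energies $x_m$. This is an assumption, since the exact front end is
not restated in this section. Up to a scaling factor, the $N$ coefficients are
then
\begin{equation}
	c_n = \sum_{m=1}^{M} x_m \cos\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right],
	\qquad n = 1, \ldots, N.
\end{equation}
This is a fixed linear map $c = Dx$ with
$D_{nm} = \cos\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right]$. A
fully connected layer without a non-linearity computes
\begin{equation}
	h = Wx + b,
\end{equation}
which reproduces the coefficients exactly for $W = D$ and $b = 0$. Training $W$
and $b$ instead would let the network choose its own transform, at the cost of
$N(M + 1)$ extra parameters for that layer. The symbols $M$, $N$, $D$, $W$ and
$b$ are illustrative and are not taken from the experiments.
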
\paragraph{Genre detection: }
\emph{Singing}-voice detection and \emph{singer}-voice recognition can be seen
as a crude form of genre detection. The results have shown that this problem
can be tackled with the methods used. In the literature, similar methods have
been used to discriminate genres in regular music styles, and attempts have
even been made to discriminate genres within extreme music styles. Therefore it
might be interesting to find out whether this specific method generalizes to
general genre recognition. This requires adding more data from different genres
to the dataset and retraining the models. Again, it would be interesting to see
what the models output when they are offered regular music, and the other way
around. Maybe the characteristics of some regular music genres are similar to
those of extreme music genres.

\paragraph{\glspl{HMM}: }
Much similar research on singing-voice detection has used \glspl{HMM} as the
basis for the models. Moreover, some studies have used existing, speech-trained
phone models directly to discriminate music from non-music. A \gls{HMM}
approach would probably perform similarly to the current method. It would be
interesting to try existing speech models on singing-voice detection in extreme
singing styles to see whether the phone models can say anything about a
growling voice.

\section{Conclusion}
This study shows that existing techniques for singing-voice detection
designed for regular singing voices also work on \gls{dm} and \gls{dom} that
contain extreme singing styles like grunting. With a standard \gls{ANN}
classifier using \gls{MFCC} features, a performance of $85\%$ can be achieved,
which is similar to that of the same techniques applied to regular singing.
This means that it might also be suitable as a pre-processing step for forced
alignment of lyrics. Moreover, the \emph{singer}-voice recognition experiments
scored similarly.

To determine whether the model generalizes, alien data was offered to the
model. It was shown that for similar singing styles the models perform
similarly. Alien data containing different singing styles, atmospheric noise
and accompaniment is classified worse.

From the results we can conclude that the model generalizes well over the
training set, even with a small number of hidden nodes. The models with 3 or 5
hidden nodes score a little worse than the larger models, but there is hardly
any difference between the performance of a model with 8 nodes and one with 13
nodes. Moreover, contrary to expectations, the window size does not seem to
have much influence on the performance.