\section{Conclusion}
This study shows that existing singing-voice detection techniques, designed for
regular singing voices, also work on \gls{dm} and \gls{dom}, which contain
extreme singing styles such as grunting. A standard \gls{ANN} classifier using
\gls{MFCC} features achieves a performance of $85\%$, which is similar to what
the same techniques achieve on regular singing. This suggests that the method
might also be suitable as a pre-processing step for forced alignment of lyrics.
Moreover, the \emph{singer}-voice recognition experiments scored similarly.

To determine whether the model generalizes, alien data was offered to the
model. For similar singing styles the models perform similarly. Alien data
containing different singing styles, atmospheric noise and accompaniment is
classified less accurately.

From the results we can conclude that the model generalizes well over the
training set, even with few hidden nodes. The models with 3 or 5 hidden nodes
score slightly worse than the larger models, but there is hardly any
difference between the performance of a model with 8 or 13 nodes. Moreover,
contrary to expectations, the window size does not seem to have much influence
on the performance.

% Discussion section
\section{Discussion}
The dataset used contains only three albums and might not be considered varied.
However, the albums were picked to represent the ends of the growling spectrum.
Therefore, the resulting model can be very general. On the other hand, it could
also result in a model that is overfitted to three islands in the entire space
of grunting voices.

In this case it seems that the model generalizes well. Alien data that is
similar to the training data results in a good performance. However, the model
does not perform as well on alien data with a very different style. On
\emph{Catacombs}, the performance was very poor. Adding \emph{Catacombs} or a
similar style to the training set can probably overcome this issue. Thus, the
performance on alien data can probably be increased by using a bigger and more
varied dataset.

\section{Future research}
\paragraph{Forced alignment: }
An interesting direction for future research is performing the actual forced
alignment. It was found that pre-segmenting the audio made forced alignment of
lyrics easier. Attempting this will require different models because the
current models are not based on phones.

Growling voices are acoustically very different from regular singing voices.
Therefore, regular phonetic models created for speech will probably not be
useful, and new models must be made when attempting forced alignment.

\paragraph{Generalization: }
Secondly, it would be interesting to see whether a single model could be
trained to detect a singing voice in all styles of singing, including growling.
This can be done by also training the models on regular singing styles.

To really explore the limits of the methods, it would be interesting to
investigate how well one style can be detected with a model trained on the
other. This means using existing models that were trained on regular singing
voices to detect grunting. The same experiment can be done the other way around
as well.

\paragraph{Decorrelation: }
Adding another layer to the \gls{MLP} can be seen as applying an extra
normalization step to the input data. The last step in converting the
waveforms to \gls{MFCC}, the decorrelation step, could then be performed by the
neural network. The current decorrelation step might be inefficient or
unnatural. The \gls{ANN} trains its weights in such a way that performance is
maximized. It would be interesting to see whether this results in a different
normalization step. The downside is that training the model becomes more
complex because there are many more weights to train.

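As a minimal sketch of this idea, assume the common \gls{MFCC} formulation in
which the last step is a discrete cosine transform of the $M$ log filterbank
energies. The symbols $E_m$, $M$, $D$, $W$ and $\mathbf{b}$ are notation
introduced here for illustration only and are not taken from the experiments.
Under that assumption the $n$-th coefficient is
\[
  c_n = \sum_{m=1}^{M} \log(E_m)
        \cos\left[\frac{\pi n}{M}\left(m - \frac{1}{2}\right)\right].
\]
Writing this as $\mathbf{c} = D\mathbf{x}$ with
$\mathbf{x} = (\log E_1, \ldots, \log E_M)^\top$, the input to the first hidden
layer, before the activation function, becomes
\[
  W\mathbf{c} + \mathbf{b} = (WD)\,\mathbf{x} + \mathbf{b}.
\]
In this view the fixed transform $D$ is just one particular linear map, so a
network fed with $\mathbf{x}$ directly could in principle learn the product
$WD$, or a different map altogether, during training.
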
\paragraph{Genre detection: }
\emph{Singing}-voice detection and \emph{singer}-voice recognition can be seen
as a crude form of genre detection. The results show that this problem can be
tackled with the methods used. In the literature, similar methods have been
used to discriminate genres in regular music styles, and it has even been
attempted to discriminate genres within extreme music styles. Therefore, it
might be interesting to find out whether this specific method generalizes to
general genre recognition. This requires adding more data from different genres
to the dataset and retraining the models. Again, it would be interesting to see
how models trained on extreme music respond to regular music and the other way
around. Maybe the characteristics of some regular music genres are similar to
those of extreme music genres.

\paragraph{\glspl{HMM}: }
Much similar research on singing-voice detection has used \glspl{HMM} as the
basis for the models. Moreover, some studies have used existing speech-trained
phone models directly to discriminate music from non-music. An approach based
on \glspl{HMM} would probably perform similarly to the current method. It would
be interesting to apply existing speech models to singing-voice recognition in
extreme singing styles to see whether the phone models can say anything about a
growling voice.
