conclusion.tex

\section{Conclusion}
This study shows that existing singing-voice detection techniques, designed for
regular singing voices, also work on \gls{dm} and \gls{dom}, which contain
extreme singing styles such as grunting. A standard \gls{ANN} classifier using
\gls{MFCC} features achieves a performance of $85\%$, which is similar to what
the same techniques achieve on regular singing. This suggests that the approach
might also be suitable as a pre-processing step for forced alignment of lyrics.

To determine whether the model generalizes, alien data was offered to the
model. For similar singing styles the models perform similarly. Alien data
containing different singing styles, atmospheric noise and accompaniment is
classified less accurately.

From the results we can conclude that the model generalizes well over the
training set, even with few hidden nodes. The models with 3 or 5 hidden nodes
score slightly worse than the larger models, but there is hardly any
difference between the performance of the models with 8 and 13 nodes. Moreover,
contrary to expectation, the window size does not seem to have much influence
on the performance.

\section{Future research}
\paragraph{Forced alignment: }
An interesting direction for future research is to perform the actual forced
alignment. This probably requires entirely different models. The models used
for regular speech are probably not suitable, because the acoustic properties
of a growling voice are very different from those of a regular singing voice,
let alone from those of speech.

\paragraph{Generalization: }
Secondly, it would be interesting to see whether a single model can be trained
to detect the singing voice in all styles of singing, including growling.
Moreover, one could investigate how well models trained on regular singing
voices detect growling, and the other way around.

\paragraph{Decorrelation: }
Another interesting continuation would be to investigate whether the
decorrelation step of the feature extraction is necessary. This transformation
might be inefficient or unnatural. The first layer of weights in the model
can be seen as a first processing step. If another layer is added, that layer
could take over the role of the decorrelation. The downside is that training
the model becomes harder because there are many more weights to train.

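As a sketch of why a layer of weights could absorb the decorrelation, assume
(this is an assumption, not something established above) that the
decorrelation step is the discrete cosine transform that is standard in
\gls{MFCC} extraction, applied to $M$ log filter-bank energies and truncated to
$K$ coefficients. The symbols $\mathbf{m}$, $\mathbf{c}$, $\mathbf{D}$,
$\mathbf{W}$, $\mathbf{b}$, $\sigma$, $M$, $K$ and $H$ are introduced here for
illustration only. Under this assumption the transform is a fixed linear map,
\begin{equation}
  c_k = \sum_{n=0}^{M-1} m_n
        \cos\left[\frac{\pi}{M}\left(n + \frac{1}{2}\right) k\right],
  \qquad k = 0, \ldots, K-1,
\end{equation}
which can be written as $\mathbf{c} = \mathbf{D}\mathbf{m}$ with a
$K \times M$ matrix $\mathbf{D}$. A first layer with $H$ hidden nodes, an
$H \times K$ weight matrix $\mathbf{W}$, bias $\mathbf{b}$ and activation
function $\sigma$ then computes
\begin{equation}
  \mathbf{h} = \sigma\left(\mathbf{W}\mathbf{c} + \mathbf{b}\right)
             = \sigma\left(\mathbf{W}\mathbf{D}\mathbf{m} + \mathbf{b}\right).
\end{equation}
A layer that operates directly on $\mathbf{m}$ with weights
$\mathbf{W}' = \mathbf{W}\mathbf{D}$ can therefore represent the same
function. Such a layer has $HM$ weights instead of $HK$, and a separate linear
layer in the role of $\mathbf{D}$ would add $KM$ weights. Since typically
$K < M$, this is where the extra training cost comes from. Whether training
actually finds a comparably useful transform remains an open question.
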
\paragraph{Genre detection: }
\emph{Singing}-voice and \emph{singer}-voice detection can be seen as crude
forms of genre detection. Therefore it might be interesting to investigate
whether the approach generalizes to general genre recognition. This requires
adding more data from different genres to the dataset and retraining the
models.

\paragraph{\glspl{HMM}: }
Much of the similar research on singing-voice detection uses \glspl{HMM} and
existing phone models. It would be interesting to try the same approach on
extreme singing styles to see whether the phone models can say anything about a
growling voice.

%Discussion section
\section{Discussion}
The dataset used is not very big: only three albums are annotated and used
as training data. The albums were chosen to represent the ends of the spectrum,
and therefore the resulting model can be very general. However, it could also
mean that the model only recognizes three islands in the entire space of
grunting. This does not seem to be the case, since the results show that the
model performs well on almost all alien data. Although the data was picked to
represent the edges of the spectrum, testing on \emph{Catacombs} suggested that
it does not do so completely, since the performance there was very poor. Adding
\emph{Catacombs} or a similar style to the training set can probably overcome
this limitation.