\section{Conclusion}
This study shows that existing singing-voice detection techniques designed for
regular singing voices also work respectably on extreme singing styles such as
grunting. With a standard \gls{ANN} classifier using \gls{MFCC} features, a
performance of $85\%$ can be achieved, which is similar to what the same
techniques reach on regular singing. This suggests that the approach might be
suitable as a pre-processing step for forced alignment of lyrics. The model
performs well on alien data that uses singing techniques similar to those in
the training set. However, it does not cope well with different singing
techniques or with data that contains a lot of atmospheric noise and
accompaniment.

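To make the pipeline concrete, the following is a minimal sketch of frame-wise
detection with \gls{MFCC} features and a small feed-forward network. The file
names, the 13-coefficient setting and the use of the \texttt{librosa} and
\texttt{scikit-learn} libraries are illustrative assumptions, not the exact
implementation used in this study.
\begin{verbatim}
# Minimal sketch: frame-wise singing-voice detection with MFCC features
# and a small feed-forward ANN. File names and labels are hypothetical.
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

def mfcc_frames(path, n_mfcc=13):
    """Return one MFCC vector per analysis frame of the audio file."""
    samples, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                        # shape: (n_frames, n_mfcc)

X = mfcc_frames("training_track.wav")    # hypothetical annotated track
y = np.load("frame_labels.npy")          # 1 = voice frame, 0 = no voice

clf = MLPClassifier(hidden_layer_sizes=(13,), max_iter=500)
clf.fit(X, y)
print("frame accuracy:", clf.score(X, y))
\end{verbatim}
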
From the results we conclude that the model generalizes well over the training
set, even with few hidden nodes. The models with 3 or 5 hidden nodes score
slightly worse than the larger ones, but there is hardly any difference in
performance between a model with 8 and one with 13 nodes. Moreover, contrary to
expectations, the window size does not seem to have much effect on performance.

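A rough illustration of how such a comparison could be run, reusing the
hypothetical \texttt{X} and \texttt{y} from the sketch above; the train/test
split, the sizes and the scoring call are assumptions, and the analysis-window
size would be varied analogously during feature extraction.
\begin{verbatim}
# Retrain the same classifier with different numbers of hidden nodes.
# X and y are the MFCC frames and labels from the previous sketch.
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

for n_hidden in (3, 5, 8, 13):
    clf = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=500)
    clf.fit(X_train, y_train)
    print(n_hidden, clf.score(X_test, y_test))
\end{verbatim}
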
\subsection{Future research}
\paragraph{Forced alignment: }
An interesting direction for future research is performing the actual forced
alignment. This probably requires entirely different models. The models used
for regular speech are probably not suitable because the acoustic properties of
a growling voice are very different from those of a regular singing voice, let
alone from those of speech.

\paragraph{Generalization: }
Secondly, it would be interesting to see whether a model can be trained that
detects singing voices across all styles of singing, including growling.
Moreover, one could investigate how well models trained on regular singing
voices detect growling, and vice versa.

\paragraph{Decorrelation: }
Another interesting continuation would be to investigate whether the
decorrelation step of the feature extraction is necessary. This transformation
might be inefficient or unnatural. The first layer of weights in the model can
be seen as a first processing step; if another layer is added, that layer could
take over the role of the decorrelation. The downside is that training the
model becomes harder because there are many more weights to train.

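Concretely, the decorrelation in \gls{MFCC} extraction is the discrete cosine
transform applied to the log-mel energies. A minimal sketch of the proposed
experiment, under the same hypothetical libraries and file names as above,
would feed the undecorrelated log-mel energies into a network with one extra
hidden layer.
\begin{verbatim}
# Skip the DCT (decorrelation) step: use log-mel energies directly and
# give the network an extra hidden layer that can learn a similar
# transform itself. Layer sizes and file names are illustrative.
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

def logmel_frames(path, n_mels=40):
    """Log-mel filterbank energies per frame, i.e. MFCCs without the DCT."""
    samples, sr = librosa.load(path, sr=None, mono=True)
    mel = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).T    # shape: (n_frames, n_mels)

X_mel = logmel_frames("training_track.wav")
y = np.load("frame_labels.npy")

# The first hidden layer gets the chance to take over the decorrelation;
# the price is a larger number of weights to train.
clf = MLPClassifier(hidden_layer_sizes=(40, 13), max_iter=500)
clf.fit(X_mel, y)
\end{verbatim}
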
\paragraph{Genre detection: }
\emph{Singing}-voice detection and \emph{singer}-voice detection can be seen as
a crude form of genre detection. It might therefore be interesting to
investigate whether this approach generalizes to genre recognition in general.
This requires adding more data from different genres to the dataset and
retraining the models.

\paragraph{\glspl{HMM}: }
Much similar research on singing-voice detection uses \glspl{HMM} and existing
phone models. It would be interesting to apply the same approach to extreme
singing styles to see whether the phone models can say anything meaningful
about a growling voice.

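As a very rough stand-in, and explicitly not the phone-model approach used in
that literature, a two-state Gaussian \gls{HMM} over \gls{MFCC} frames already
gives an HMM-style segmentation; the \texttt{hmmlearn} library and the
two-state setup are assumptions made only to illustrate the idea.
\begin{verbatim}
# Two-state Gaussian HMM over MFCC frames, decoded with Viterbi.
# Which state corresponds to "voice" has to be determined by inspection.
import librosa
from hmmlearn import hmm

samples, sr = librosa.load("unseen_track.wav", sr=None, mono=True)
X = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13).T

model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
model.fit(X)
states = model.predict(X)                # per-frame state sequence
\end{verbatim}
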
%Discussion section
\section{Discussion}
The dataset used is not very big: only three albums are annotated and used as
training data. The albums were chosen to represent the ends of the spectrum, so
the resulting model could be quite general. However, it could also mean that
the model only recognizes three islands in the entire space of grunting. This
does not seem to be the case, since the results show good performance on almost
all alien data as well. Still, the data was picked to represent the edges of
the spectrum, and the test on \emph{Catacombs} suggested that this coverage is
incomplete, since performance on that album was very poor. Adding
\emph{Catacombs} or a similar style to the training set can probably overcome
this limitation.