X-Git-Url: https://git.martlubbers.net/?a=blobdiff_plain;f=asr.tex;h=3f5eaaf8f02197197b5e58c1fec243ed80cdfb15;hb=0ada197a78af4323b8cf5efc508a5aab3d80e4b2;hp=f149b1d8c235c5b32edb97fddc711de83a3ab9df;hpb=ffa8517ae9d919b4da3ebeace34bc7897b56142b;p=asr1617.git

diff --git a/asr.tex b/asr.tex
index f149b1d..3f5eaaf 100644
--- a/asr.tex
+++ b/asr.tex
@@ -13,6 +13,14 @@
 \newglossaryentry{dm}{name={Death Metal},
     description={is an extreme heavy metal music style with growling vocals and
     pounding drums}}
+\newglossaryentry{dom}{name={Doom Metal},
+    description={is an extreme heavy metal music style with growling vocals and
+    pounding drums played very slowly}}
+\newglossaryentry{FT}{name={Fourier Transform},
+    description={is a technique for converting a signal from a time
+    representation to a frequency representation}}
+\newglossaryentry{MS}{name={Mel-Scale},
+    description={is a scale for spectral signals inspired by the human ear}}

 \begin{document}
 \frontmatter{}
@@ -141,12 +149,7 @@ To run the experiments data has been collected from several \gls{dm} albums. The
 exact data used is available in Appendix~\ref{app:data}. The albums are
 extracted from the audio CD and converted to a mono channel waveform with the
 correct samplerate \emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}.
-When the waveforms are finished they are converted to \glspl{MFCC} vectors
-using the \emph{python\_speech\_features}%
-\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
-All these steps combined results in thirteen tab separated features per line in
-a file for every source file. Technical info about the processing steps is
-given in the following sections. Every file is annotated using
+Every file is annotated using
 Praat\cite{boersma_praat_2002} where the utterances are manually aligned to
 the audio. Examples of utterances are shown in
 Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations} where the
@@ -168,62 +171,70 @@ visible.
     \emph{Enthroned Abominations}}\label{fig:abominations}
 \end{figure}

-The data is collected from two\todo{more in the future}\ studio albums. The first
-band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for almost
-25 years and have been creating the same type every album. The singer of
+The data is collected from three studio albums. The
+first band is called \emph{Cannibal Corpse} and has been producing \gls{dm} for
+almost 25 years, creating the same type of music on every album. The singer of
 \emph{Cannibal Corpse} has a very raspy growls and the lyrics are quite
-comprehensible. The second band is called \emph{Disgorge} and make even more
-violent music. The growls of the lead singer sound more like a coffee grinder
-and are more shallow. The lyrics are completely incomprehensible and therefore
-some parts are not annotated with lyrics because it was too difficult to hear
-what was being sung.
-
-\section{Methods}
-\todo{To remove in final thesis}
-The initial planning is still up to date. About one and a half album has been
-annotated and a framework for setting up experiments has been created.
-Moreover, the first exploratory experiments are already been executed and
-promising. In April the experimental dataset will be expanded and I will try to
-mimic some of the experiments done in the literature to see whether it performs
-similar on Death Metal
-\begin{table}[ht]
-    \centering
-    \begin{tabular}{cll}
-        \toprule
-        Month & Description\\
-        \midrule
-        March
-        & Preparing the data\\
-        & Preparing an experiment platform\\
-        & Literature research\\
-        April
-        & Running the experiments\\
-        & Fiddle with parameters\\
-        & Explore the possibilities for forced alignment\\
-        May
-        & Write up the thesis\\
-        & Possibly do forced alignment\\
-        June
-        & Finish up thesis\\
-        & Wrap up\\
-        \bottomrule
-    \end{tabular}
-    \caption{Outline}
-\end{table}
+comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
+regular shouting.
+
+The second band is called \emph{Disgorge} and makes even more violent-sounding
+music. The growls of the lead singer sound like a coffee grinder and are more
+shallow. In the spectrals it is clearly visible that overtones are produced
+during some parts of the growling. The lyrics are completely incomprehensible
+and therefore some parts were not annotated with the actual lyrics because it
+was not possible to hear what was being sung.
+
+Lastly, a band from Moscow was chosen, bearing the name \emph{Who Dies in
+Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
+bands because they create \gls{dom}. \gls{dom} is characterized by a very
+slow tempo and low-tuned guitars. The vocalist has a very characteristic growl
+and performs in several Muscovite bands. This band also stands out because it
+uses pianos and synthesizers. The droning synthesizers often operate in the
+same frequency range as the vocals.
+
+\section{\gls{MFCC} Features}
+The waveforms themselves are not very suitable to be used as features due to
+their high dimensionality and correlation. Therefore we use the commonly used
+\glspl{MFCC} feature vectors.\todo{cite which papers use this} The actual
+conversion is done using the \emph{python\_speech\_features}%
+\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

-\section{Features}
+\gls{MFCC} features are nature inspired and built incrementally in several
+steps.
+\begin{enumerate}
+    \item The first step in the process is converting the time representation
+        of the signal to a spectral representation using a sliding window with
+        overlap. The width of the window and the step size are two important
+        parameters in the system. In classical phonetic analysis window sizes
+        of $25$\,ms with a step of $10$\,ms are often chosen because they are
+        small enough to only contain subphone entities. Singing for $25$\,ms
+        is impossible, so the window size is arguably very small.
+    \item The standard \gls{FT} gives a spectral representation that has
+        linearly scaled frequencies. This scale is converted to the \gls{MS}
+        using triangular overlapping windows.
+    \item
+\end{enumerate}
 \todo{Explain why MFCC and which parameters}
+
+\section{\gls{ANN} Classifier}
 \todo{Spectrals might be enough, no decorrelation}

+\section{Model training}
+
 \section{Experiments}

 \section{Results}

 \chapter{Conclusion \& Discussion}
+\section{Conclusion}
 %Discussion section
+
+\section{Discussion}
+
 \todo{Novelty}
 \todo{Weaknesses}
 \todo{Dataset is not very varied but\ldots}
@@ -263,6 +274,15 @@ similar on Death Metal
         19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
         20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
         21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
+        22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
+        23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
+        24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
+        25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
+        26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
+        27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
+        28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
+        \midrule
+        & & & Total: & 02:13:40\\
         \bottomrule
     \end{tabular}
     \caption{Songs used in the experiments}
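
Note on the two enumerated steps in the new MFCC Features section: the sliding-window framing and the conversion to the mel scale can be illustrated with a small sketch. This is not the code used in the thesis, which relies on the python_speech_features package. The helper names below are hypothetical, the 16 kHz sample rate in the usage note is an assumed example value (the patch does not state one), and the mel formula is the common 2595 * log10(1 + f/700) variant.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (common 2595*log10 variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def frame_bounds(num_samples, sample_rate, win_s=0.025, step_s=0.010):
    """Return (start, end) sample indices of every full analysis window.

    Defaults follow the 25 ms window and 10 ms step mentioned in the text.
    Trailing samples that do not fill a whole window are dropped here;
    real toolkits may pad instead.
    """
    win = int(round(win_s * sample_rate))
    step = int(round(step_s * sample_rate))
    return [(s, s + win) for s in range(0, num_samples - win + 1, step)]


def mel_filter_centres(num_filters, low_hz, high_hz):
    """Centre frequencies in Hz of filters spaced evenly on the mel scale.

    These are the peaks of the triangular overlapping windows: evenly
    spaced in mels, increasingly far apart in Hz.
    """
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    width = (high - low) / (num_filters + 1)
    return [mel_to_hz(low + width * (i + 1)) for i in range(num_filters)]
```

With the assumed 16 kHz rate, frame_bounds(16000, 16000) splits one second of audio into overlapping 400-sample windows that start every 160 samples, and mel_filter_centres(26, 0, 8000) shows how the filter peaks spread out towards higher frequencies.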