The data is shuffled before being fed to the network to mitigate the risk of
overfitting on a single album. Every model was trained for $10$ epochs with a
batch size of $32$. The training and test sets are obtained by taking a
$90\%$ slice of all the data for training and holding out the remaining
$10\%$ for testing.
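The snippet below is a minimal sketch of this preprocessing step, assuming the
samples are held in NumPy arrays; the array shapes, the random seed, and the
Keras-style \texttt{fit} call in the final comment are illustrative assumptions
rather than the exact pipeline used here.
\begin{verbatim}
import numpy as np

# Hypothetical placeholder data standing in for the album features/labels.
X = np.random.rand(1000, 128).astype(np.float32)
y = np.random.randint(0, 10, size=1000)

# Shuffle before splitting so no single album dominates either set.
rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]

# First 90% of the shuffled data for training, the remaining 10% for testing.
split = int(0.9 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# The model itself is not shown; a Keras-style model would then be trained as:
# model.fit(X_train, y_train, epochs=10, batch_size=32)
\end{verbatim}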
\begin{equation}\label{eq:relu}
f(x) = \left\{\begin{array}{rcl}