From: Mart Lubbers
Date: Thu, 19 Jan 2017 13:57:12 +0000 (+0100)
Subject: final exam final
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=17acce5984b04c4c02ff808ab9b20dc9627d8c05;p=itlast1617.git

final exam final
---

diff --git a/.gitignore b/.gitignore
index d841b86..fbe078d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,3 +3,4 @@
 *.log
 *.fmt
 *.mlog
+*.bak
diff --git a/exam2/exam.tex b/exam2/exam.tex
index 8b0c6b7..02350af 100644
--- a/exam2/exam.tex
+++ b/exam2/exam.tex
@@ -1,13 +1,11 @@
 %&exam
 \begin{document}
-\maketitleru[%
-    course={Introduction to Language and Speech Technology},
+\maketitleru[course={Introduction to Language and Speech Technology},
     authorstext={Author:}]
-\begin{enumerate}
+\begin{enumerate}[label=\arabic*)]
     % Question 1
     \item\input{q1.tex}
-    \newpage

     % Question 2
     \item\input{q2.tex}
@@ -15,5 +13,4 @@
     % Question 3
     \item\input{q3.tex}
 \end{enumerate}
-
 \end{document}
diff --git a/exam2/preamble.tex b/exam2/preamble.tex
index 58c52bb..445dea7 100644
--- a/exam2/preamble.tex
+++ b/exam2/preamble.tex
@@ -3,8 +3,7 @@
 \usepackage{rutitlepage}
 \usepackage{geometry}
 \usepackage{enumitem}
-%\usepackage{listings}
-\title{Final exam}
+\title{Final exam, 1st chance}
 \author{Mart Lubbers\\s4109503}
 \date{\today}
diff --git a/exam2/q1.tex b/exam2/q1.tex
index a715220..73ee618 100644
--- a/exam2/q1.tex
+++ b/exam2/q1.tex
@@ -69,6 +69,7 @@
     little bit more illogical, construction to say the same. By doing this
     the translation might be a bit better and therefore easier to
     understand for the non-native speaker.
+    \newpage

     % 1c
     \item
diff --git a/exam2/q2.tex b/exam2/q2.tex
index c15c231..d98480d 100644
--- a/exam2/q2.tex
+++ b/exam2/q2.tex
@@ -35,8 +35,8 @@
     words in fishing was not there. Maybe this language only has one word
     for fish whereas English has many. In this way extra details can be
     inserted. Of course this also works the other way around. A popular,
-    dubious statement is often made that some Inu{\"\i}t language has over a
-    hundred words for snow. When such a specialised word is used it might
+    dubious statement is often made that some Inu{\"\i}t language has over
+    a hundred words for snow. When such a specialised word is used it might
     not be possible to correctly translate it at all to English and
     therefore we lose detail.
@@ -44,5 +44,9 @@
     \item
     The quality of the knowledge extraction depends heavily on the user's
     language because of the aforementioned lexical gaps. However, these
-    lexical gaps might be bridged with a suitable translation system.
+    lexical gaps might be bridged with a suitable translation system that
+    can add information that might not be available in the target
+    language. Some topics are simply harder to discuss, or to ask
+    questions about, in some languages.
+
 \end{enumerate}
diff --git a/exam2/q3.tex b/exam2/q3.tex
index e3b4f9e..085c892 100644
--- a/exam2/q3.tex
+++ b/exam2/q3.tex
@@ -7,22 +7,67 @@
     is a difference in script. Transliteration between scripts often
     introduces extra letters.

-    For example the russian form of
-    \emph{Muhammad} becomes \emph{Mukhammed}. The \emph{kh} is a
-    construction that is not used in the English language but it sound a
-    lot like the \emph{ch} in the Scottish \emph{loch}. Such added
-    characters can introduce higher edit distances. We can possibly
-    overcome this problem by using a broader notion of characters and look
-    at phonemes for example.
-
-    \emph{Viterbi} on the other hand
+    For example, the Russian form of \emph{Muhammad} becomes
+    \emph{Mukhammed}. The \emph{kh} is a construction that is not used in
+    the English language, but it sounds a lot like the \emph{ch} in the
+    Scottish \emph{loch}. Such added characters can introduce higher edit
+    distances. We can possibly overcome this problem by using a broader
+    notion of characters, looking at phonemes for example.
+
+    \emph{Viterbi}, on the other hand, is much less discrete. For a given
+    word it can calculate the most likely spelling variant, taking more
+    into account than the location of characters, such as the context of
+    the word. For example, \emph{Viterbi} will prefer a common mistake
+    over a very unlikely mistake, even though both might have a
+    \emph{Levenshtein} distance of one. This is because the transition
+    probability is higher for the common mistake.
+
+    The \emph{Levenshtein} distance will fail to detect spelling variants
+    that are strong abbreviations. Sometimes \emph{Muhammad} is
+    abbreviated as \emph{Mohd.} or even \emph{M.}, which gives a very
+    high edit distance and will therefore most likely not match.
+    \emph{Viterbi} will score better in these cases since the underlying
+    \emph{HMM} structure can incorporate this in the transition
+    probabilities.

     % 3b
     \item
-
+    Another technique for normalizing spelling variants might use the
+    properties of \emph{NGrams}.
+
+    By using \emph{NGrams}, the context of the word is also taken into
+    account. In the \emph{Muhammad} example, for instance, the Chinese
+    language has two forms of the name: one for the prophet and one for
+    the ordinary given name. From the context it might be very clear
+    which one is meant, and therefore \emph{NGrams} will probably achieve
+    higher precision in this case.
+
+    Of course, when there is enough data at hand one could also use a
+    neural network to normalize spelling variants. The advantage of
+    neural networks would be that they might even handle never-before-seen
+    spelling variants rather well.
+
+    Features that are usable for the learners are of course the characters
+    themselves. Contextual information is also very important in the
+    feature set. Besides those two, the cultural setting in which a
+    language resides can also be used. Some cultures use certain names
+    more than others, which can be very valuable information when
+    normalizing proper names.
+
+    Besides information about the word itself, it might also be fruitful,
+    especially in the neural network case, to include the language's
+    production rules in the feature set. These rules can help distinguish
+    a typo from an intentional spelling variant.
+    This could even be extended by including the errors typists often
+    make. Some letters are closer on the keyboard than others, and that
+    could be a feature that improves the performance of detecting typos.

     % 3c
     \item
-
+    When a statistical machine translation system has also been trained
+    on proper names, normalizing the proper names will be fairly easy. It
+    is not that difficult to recognize proper names, and their
+    translations are not very ambiguous. Because statistical machine
+    translation systems are often based on \emph{NGram} models, context
+    is taken into account, and therefore spelling variations that differ
+    across contexts are also accounted for.
 \end{enumerate}
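
---

A note on the 3a answer above: the contrast between plain edit distance and
likelihood-based matching is easy to make concrete. Below is a minimal
Python sketch, illustrative only and not part of the repository, of the
standard dynamic-programming Levenshtein distance applied to the spelling
variants the answer discusses (the variant list itself is hypothetical).

    # Illustrative sketch: Levenshtein distance on the spelling variants
    # discussed in q3.tex. The variant list is invented for illustration.

    def levenshtein(a: str, b: str) -> int:
        """Dynamic-programming edit distance (insert/delete/substitute)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    if __name__ == "__main__":
        target = "muhammad"
        for variant in ["mukhammed", "mohammed", "mohd", "m"]:
            print(f"{variant:10s} edit distance = {levenshtein(target, variant)}")

The abbreviation cases make the answer's point: "mukhammed" is only two
edits away, but no surface edit measure will recover "m" as a variant of
"muhammad"; that association has to come from transition probabilities over
known variants, as in the HMM/Viterbi approach.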
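The 3b point about NGrams supplying context can be sketched the same way.
Here is a toy example, again hypothetical: the bigram counts and the two
canonical forms are invented, standing in for the two Chinese forms of the
name mentioned in the answer; a single preceding word decides which
canonical form a mention is normalized to.

    # Toy sketch of the NGram idea from 3b: use word context (here one
    # preceding word, i.e. a bigram) to pick a canonical form.
    # All counts and form names are invented toy data.

    # (previous word, canonical form) -> corpus frequency (hypothetical)
    BIGRAM_COUNTS = {
        ("prophet", "muhammad_prophet"): 90,
        ("prophet", "muhammad_given"):    5,
        ("mr",      "muhammad_given"):   80,
        ("mr",      "muhammad_prophet"):  2,
    }

    def normalize(prev_word: str, candidates: list) -> str:
        """Pick the candidate form seen most often after this context
        word; unknown contexts fall back to the first candidate."""
        return max(candidates,
                   key=lambda c: BIGRAM_COUNTS.get((prev_word, c), 0))

    print(normalize("prophet", ["muhammad_given", "muhammad_prophet"]))
    # -> muhammad_prophet
    print(normalize("mr", ["muhammad_given", "muhammad_prophet"]))
    # -> muhammad_given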
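Finally, the keyboard-distance feature mentioned at the end of 3b can be
folded directly into the edit-distance recurrence above. The sketch below
is one possible reading of that idea, with an invented cost of 0.5 for
adjacent-key substitutions and a toy excerpt of QWERTY adjacency rather
than a full layout.

    # Hypothetical: weight substitutions by keyboard adjacency so likely
    # typos (e.g. 'n' for 'm') cost less than arbitrary substitutions.
    # NEIGHBORS is a toy QWERTY excerpt, not a complete layout.

    NEIGHBORS = {
        "m": set("nk"), "n": set("bm"), "o": set("ip"), "u": set("yi"),
    }

    def sub_cost(x: str, y: str) -> float:
        """0 for a match, 0.5 for adjacent-key substitutions, 1 otherwise."""
        if x == y:
            return 0.0
        if y in NEIGHBORS.get(x, set()) or x in NEIGHBORS.get(y, set()):
            return 0.5
        return 1.0

    def weighted_levenshtein(a: str, b: str) -> float:
        """Edit distance where keyboard-neighbor typos are cheaper."""
        prev = [float(j) for j in range(len(b) + 1)]
        for i, ca in enumerate(a, 1):
            cur = [float(i)]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,
                               cur[j - 1] + 1,
                               prev[j - 1] + sub_cost(ca, cb)))
            prev = cur
        return prev[-1]

    # 'muhannad' (n for m, an adjacent key) now scores below 'muhaxxad':
    print(weighted_levenshtein("muhammad", "muhannad"))  # 1.0
    print(weighted_levenshtein("muhammad", "muhaxxad"))  # 2.0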