words in fishing was not there. Maybe this language only has one word
for fish whereas English has many. In this way extra details can be
inserted. Of course, this also works the other way around. A popular
but dubious claim is often made that some Inu{\"\i}t language has over
a hundred words for snow. When such a specialised word is used it might
not be possible to correctly translate it at all to English and
therefore we lose detail.
\item
The quality of the knowledge extraction depends heavily on the user's
language because of the aforementioned lexical gaps. However, these
lexical gaps might be bridged with a suitable translation system that
can add information that might not be available in the target
language. Some topics are simply harder to discuss, or to ask
questions about, in some languages.

\end{enumerate}
is a difference in script. Transliteration between scripts often
introduces extra letters.
For example, the Russian form of \emph{Muhammad} becomes
\emph{Mukhammed}. The \emph{kh} is a construction that is not used in
the English language, but it sounds a lot like the \emph{ch} in the
Scottish \emph{loch}. Such added characters can introduce higher edit
distances. We can possibly overcome this problem by using a broader
notion of characters, looking at phonemes for example.
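
To make this concrete, the following minimal Python sketch computes
the \emph{Levenshtein} distance for the words discussed here; it is
the standard dynamic program, with nothing specific to our setting.

\begin{verbatim}
def levenshtein(a, b):
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[max(i, j) if i == 0 or j == 0 else 0
             for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(a)][len(b)]

print(levenshtein("muhammad", "mukhammed"))  # 2: insert k, a -> e
print(levenshtein("muhammad", "mohd."))      # 6: abbreviation blows up
\end{verbatim}
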
\emph{Viterbi} on the other hand is much less discrete. For a given
word it can calculate the most likely spelling variant, taking more
into account than just the position of characters, such as the context
of the word. For example, \emph{Viterbi} will prefer a common mistake
over a very unlikely one, even when both have a \emph{Levenshtein}
distance of one, because the transition probability is higher for the
common mistake.

The \emph{Levenshtein} distance will also fail to detect spelling
variants that are heavy abbreviations. Sometimes \emph{Muhammad} is
abbreviated as \emph{Mohd.} or even \emph{M.}, which gives a very high
edit distance and will therefore most likely not match. \emph{Viterbi}
will score better on these cases since the underlying \emph{HMM}
structure can incorporate this in the transition probabilities.
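
The following Python sketch illustrates the idea with a deliberately
tiny \emph{HMM}: hidden states are the intended characters,
observations are the written ones, and all probabilities are invented
for illustration; a real model would estimate them from data.

\begin{verbatim}
import math

STATES = sorted(set("muhammad") | {"e"})  # {a, d, e, h, m, u}
SMOOTH = 1e-4                             # prob. of unlisted events

# P(next intended char | previous): character bigrams of the language
TRANS = {("m","u"): .4, ("u","h"): .9, ("h","a"): .9, ("a","m"): .8,
         ("m","m"): .3, ("m","a"): .3, ("a","d"): .2}
# P(written char | intended char) for typical confusions
CONFUSE = {("a","e"): .2, ("u","o"): .3}

def emit(intended, written):
    if (intended, written) in CONFUSE:
        return CONFUSE[(intended, written)]
    return .8 if intended == written else SMOOTH

def trans(prev, nxt):
    return TRANS.get((prev, nxt), SMOOTH)

def viterbi(observed):
    # best[s]: log-prob of the best path ending in state s
    # paths[s]: the intended spelling along that path
    best = {s: math.log(emit(s, observed[0])) for s in STATES}
    paths = {s: s for s in STATES}
    for ch in observed[1:]:
        new_best, new_paths = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda r: best[r] + math.log(trans(r, s)))
            new_best[s] = (best[prev] + math.log(trans(prev, s))
                           + math.log(emit(s, ch)))
            new_paths[s] = paths[prev] + s
        best, paths = new_best, new_paths
    return paths[max(STATES, key=lambda s: best[s])]

print(viterbi("mohammed"))  # recovers "muhammad"
\end{verbatim}
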
% 3b
\item
Another technique for normalizing spelling variants might use the
properties of \emph{NGrams}.

By using \emph{NGrams} the context of the word is also taken into
account. For example, in the \emph{Muhammad} case the Chinese language
has two written forms for the name: one for the prophet and one for
the ordinary given name. From the context it might be very clear which
one is meant, so \emph{NGrams} will probably achieve a higher
precision in this case, as the sketch below illustrates.
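
The following toy Python sketch shows the bigram idea; the counts are
invented, and the two ``forms'' merely stand in for the two Chinese
written variants of the name.

\begin{verbatim}
from collections import Counter

BIGRAMS = Counter({          # invented corpus statistics
    ("prophet", "prophet_form"): 120,
    ("prophet", "name_form"):      3,
    ("mr",      "name_form"):     80,
    ("mr",      "prophet_form"):   1,
})

def normalize(previous_word, candidates):
    # pick the candidate seen most often after the preceding word
    return max(candidates, key=lambda c: BIGRAMS[(previous_word, c)])

print(normalize("prophet", ["prophet_form", "name_form"]))
print(normalize("mr",      ["prophet_form", "name_form"]))
\end{verbatim}
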
Of course, when there is enough data at hand one could also use a
neural network to normalize spelling variants. The advantage of neural
networks would be that they might handle never-before-seen spelling
variants pretty well.

Features that are usable for the learners are of course the characters
themselves. Contextual information is also very important in the
feature set. Besides those two, the cultural context in which a
language is used can also be exploited: some cultures use certain
names more than others, which can be very valuable information when
normalizing proper names.

Besides information about the word itself it might also be fruitful,
especially in the neural network case, to include the production rules
of the language in the feature set. These rules can help to
distinguish a typo from an intentional spelling variant. This could
even be extended by including the errors typists often make: some
letters are closer together on the keyboard than others, and that
could be a feature that improves the detection of typos, as in the
sketch below.
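
A possible feature extractor could look like the following Python
sketch; the keyboard layout and the particular features are
illustrative assumptions, not a fixed recipe.

\begin{verbatim}
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def key_pos(ch):
    for row, keys in enumerate(QWERTY_ROWS):
        if ch in keys:
            return row, keys.index(ch)
    return None

def keyboard_distance(a, b):
    # rough distance between two keys; typos tend to be close together
    pa, pb = key_pos(a), key_pos(b)
    if pa is None or pb is None:
        return 5  # arbitrary penalty for non-letter characters
    return abs(pa[0] - pb[0]) + abs(pa[1] - pb[1])

def features(word, candidate, context_words):
    return {
        "first_char_match": word[0] == candidate[0],
        "length_difference": abs(len(word) - len(candidate)),
        "shared_chars": len(set(word) & set(candidate)),
        # min keyboard distance over mismatched character pairs:
        # a crude signal for "this looks like a typo"
        "min_key_dist": min((keyboard_distance(a, b)
                             for a, b in zip(word, candidate)
                             if a != b), default=0),
        "context": tuple(context_words),
    }

print(features("muhamnad", "muhammad", ["the", "prophet"]))
\end{verbatim}
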
% 3c
\item
When a statistical machine translation system has also been trained on
proper names, normalizing the proper names will be pretty easy. It is
not that difficult to recognize proper names, and their translations
are not very ambiguous. Because statistical machine translation
systems are often based on \emph{NGram} models, context is taken into
account, and spelling variations that differ between contexts are
therefore also accounted for, as the sketch below illustrates.
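
The following toy Python sketch shows how a phrase table that carries
context could disambiguate the two forms; the entries and
probabilities are invented, and a real system would estimate them from
aligned corpora.

\begin{verbatim}
PHRASE_TABLE = {   # (source phrase) -> [(translation, probability)]
    ("the", "prophet", "muhammad"): [("prophet-form", 0.9)],
    ("mr", "muhammad"):             [("name-form", 0.85)],
    ("muhammad",):                  [("name-form", 0.6),
                                     ("prophet-form", 0.4)],
}

def translate_name(tokens):
    # prefer the longest matching phrase, so context wins when present
    for length in range(len(tokens), 0, -1):
        key = tuple(tokens[-length:])
        if key in PHRASE_TABLE:
            return max(PHRASE_TABLE[key], key=lambda t: t[1])[0]
    return tokens[-1]  # fall back to copying the name verbatim

print(translate_name(["the", "prophet", "muhammad"]))  # prophet-form
print(translate_name(["mr", "muhammad"]))              # name-form
\end{verbatim}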
\end{enumerate}