is a difference in script. Transliteration between scripts often
introduces extra letters.
For example, the Russian form of \emph{Muhammad} becomes
\emph{Mukhammed}. The \emph{kh} is a construction that is not used in
the English language, but it sounds a lot like the \emph{ch} in the
Scottish \emph{loch}. Such added characters can lead to higher edit
distances. We could possibly overcome this problem by using a broader
notion of characters, for example by looking at phonemes.
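As a concrete illustration (a minimal sketch, not tied to any
particular library), the standard dynamic-programming formulation of
the \emph{Levenshtein} distance shows how transliteration and
abbreviation behave differently: the transliterated pair is only two
edits apart, while the abbreviation is much further away.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions, each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("muhammad", "mukhammed"))  # 2: the inserted "k" plus the a/e swap
print(levenshtein("muhammad", "mohd."))      # abbreviation: a much larger distance
```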

\emph{Viterbi}, on the other hand, is much less discrete. For a given
word it can calculate the most likely spelling variant, taking more
into account than just the location of characters, such as the context
of the word. For example, \emph{Viterbi} will prefer a common mistake
over a very unlikely one, even though both might have a
\emph{Levenshtein} distance of one, because the transition probability
is higher for the common mistake.

The \emph{Levenshtein} distance will also fail to detect spelling
variants that are strong abbreviations. Sometimes \emph{Muhammad} is
abbreviated as \emph{Mohd.} or even \emph{M.}, which gives a very high
edit distance and will therefore most likely not match. \emph{Viterbi}
scores better in these cases, since the underlying \emph{HMM}
structure can incorporate such variants in its transition
probabilities.
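The \emph{Viterbi} recursion itself fits in a few lines. The sketch
below runs it over a toy two-state \emph{HMM} in which the hidden
states are intended characters and the observations are typed
characters; all states and probabilities are invented purely for
illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence (log-space)."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # Best predecessor state for s, given the new observation o.
            prob, best = max(
                (V[-2][p] + math.log(trans_p[p][s]) + math.log(emit_p[s][o]), p)
                for p in states)
            V[-1][s] = prob
            new_path[s] = path[best] + [s]
        path = new_path
    best_final = max(states, key=lambda s: V[-1][s])
    return path[best_final]

# Toy model: the writer intends "a" or "e" but sometimes types the other one.
states = ("a", "e")
start_p = {"a": 0.6, "e": 0.4}
trans_p = {"a": {"a": 0.7, "e": 0.3}, "e": {"a": 0.4, "e": 0.6}}
emit_p  = {"a": {"a": 0.8, "e": 0.2}, "e": {"a": 0.3, "e": 0.7}}

print(viterbi(("a", "e", "a"), states, start_p, trans_p, emit_p))  # ['a', 'a', 'a']
```

With these toy numbers the observed sequence a-e-a decodes to the
intended sequence a-a-a: the lone \emph{e} is explained as a likely
substitution rather than an intended character, which is exactly the
mechanism that lets \emph{Viterbi} prefer common mistakes over
unlikely ones.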
% 3b
\item
Another technique for normalizing spelling variants might use the
properties of \emph{NGrams}.

By using \emph{NGrams}, the context of the word is also taken into
account. For example, in the \emph{Muhammad} case, the Chinese
language has two forms for the name: one for the prophet and one for
the ordinary given name. From the context it might be very clear which
one is meant, and therefore \emph{NGrams} will probably achieve a
higher precision in this case.

Of course, when there is enough data at hand one could also use a
neural network to normalize spelling variants. The advantage of neural
networks would be that they might detect even never-before-seen
spelling variants pretty well.
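The \emph{NGram} idea above can be sketched very compactly: pick the
candidate normalization whose bigram with the preceding word is most
frequent in a corpus. The form labels and counts below are entirely
invented placeholders for the two Chinese forms of the name.

```python
from collections import Counter

# Invented bigram counts standing in for a real corpus: after "prophet"
# the religious form dominates, after a title the given-name form does.
bigram_counts = Counter({
    ("prophet", "muhammad_prophet_form"): 50,
    ("prophet", "muhammad_name_form"): 2,
    ("mr.", "muhammad_name_form"): 30,
    ("mr.", "muhammad_prophet_form"): 1,
})

def normalize(prev_word, candidates):
    """Pick the candidate form with the highest bigram count given the previous word."""
    return max(candidates, key=lambda c: bigram_counts[(prev_word, c)])

forms = ["muhammad_prophet_form", "muhammad_name_form"]
print(normalize("prophet", forms))  # -> muhammad_prophet_form
print(normalize("mr.", forms))      # -> muhammad_name_form
```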

Features that are usable for the learners are of course the characters
themselves. Contextual information is also very important in the
feature set. Besides those two, the cultural setting in which a
language resides can also be used: some cultures use certain names
more than others, which can be very valuable information when
normalizing proper names.

Besides information about the word itself, it might also be fruitful,
especially in the neural network case, to include the language's
production rules in the feature set. These rules can help in
distinguishing a typo from an intentional spelling variant. This could
even be extended by including the errors typists often make: some
letters are closer together on the keyboard than others, and that
could be a feature that improves the performance of detecting typos.
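The keyboard-based feature could be as simple as the physical distance
between two keys on a QWERTY layout (a sketch with our own helper
names): a substitution between adjacent keys is more plausibly a typo,
while a substitution between distant keys is more plausibly an
intentional variant.

```python
# Grid positions of the letter keys on a QWERTY layout (row, column).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS) for c, ch in enumerate(row)}

def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two letter keys; a cheap typo-likelihood feature."""
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

print(key_distance("m", "n"))  # adjacent keys: small distance, likely a typo
print(key_distance("m", "q"))  # far apart: more likely an intentional variant
```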
% 3c
\item
When a statistical machine translation system has also been trained on
proper names, normalizing the proper names will be fairly easy. It is
not that difficult to recognize proper names, and their translations
are not very ambiguous. Because statistical machine translation
systems are often based on \emph{NGram} models, context is included,
and therefore spelling variants that differ between contexts are also
accounted for.
\end{enumerate}