\item Disfluencies are annotated by surrounding them with square brackets.
The first part is the \emph{reparandum}, the second part, marked with
the \texttt{+}, is the \emph{editing phase}, and the last part is
the \emph{repair}. We want to keep only the repair, since it represents
the speech the speaker actually intended.

\verb#s/\[.*?\+\{.*?\}(.*?)\]/\1/g#
\item \verb#s/# Starts the substitution.
\item \verb#\[# Matches the opening square bracket. We escape it
because \verb#[# is a regular expression control character and
we want to match a literal bracket.
\item \verb#.*?\+# Non-greedily matches everything up to the plus
sign, that is, the \emph{reparandum}. Note that the
\emph{reparandum} can be empty (when the speaker immediately
starts editing). We escape the \verb#+# for the same reason as
the opening bracket.
\item \verb#\{.*?\}# Matches everything between the curly braces,
that is, the \emph{editing phase}. Note again that this match may
consist of just an empty pair of curly braces, since the
\emph{editing phase} can also be empty.
\item \verb#(.*?)# Non-greedily matches everything up to the
closing square bracket and captures it in a group, that is, the
\emph{repair}. Note that we do not require this group to be
identical to the \emph{reparandum}.
\item \verb#\]/# Matches the closing square bracket, after which the
replacement begins. We escape this for the same reason as
the opening bracket.
\item \verb#\1/g# Replaces the entire match with only the
captured \emph{repair} group, and does this globally since there
can be multiple repairs in an utterance.
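
The substitution can also be tried out in Python, whose \texttt{re} module
accepts the same pattern. The following is only a minimal sketch: the function
name \texttt{keep\_repair} and the example utterances are made up for
illustration, and the examples assume that no whitespace separates the
\texttt{+} from the opening curly brace, as the pattern requires.

```python
import re

# Same pattern as the substitution above:
# [ reparandum + { editing phase } repair ]
DISFLUENCY = re.compile(r"\[.*?\+\{.*?\}(.*?)\]")


def keep_repair(utterance: str) -> str:
    """Replace every annotated disfluency with its repair only.

    re.sub replaces all non-overlapping matches, which corresponds
    to the /g flag of the original substitution.
    """
    return DISFLUENCY.sub(r"\1", utterance)


# Hypothetical annotated utterances, for illustration only.
print(keep_repair("I want [a car+{uh}a plane]"))
print(keep_repair("[+{}hello] and [x+{um}y]"))
```

Because both the \emph{reparandum} and the \emph{editing phase} are matched
with \verb#.*?#, the second example shows that empty parts are handled too.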
\item \textsc{MEMM}s use features to add extra information to words.
\textsc{IOB} tagging is a partial parsing or chunking method that only
discriminates between \emph{Beginning} (\texttt{B}), \emph{Internal}
(\texttt{I}) and \emph{Outside} (\texttt{O}) categories.

Using the same segmentation as before, we should mark the
\emph{reparandum} and \emph{editing phase} as \emph{Outside}
(\texttt{O}) parts and parse the repair as usual. Note that
a chunk can then include \texttt{O}-marked segments. For example, in ``a
car uh plane'' the ``car uh'' part will be tagged as \texttt{O}, ``a''
as \texttt{B\_NP} and ``plane'' as \texttt{I\_NP}.

The algorithms may need a distinct tag to
denote \texttt{O} segments that are internal to a chunk. This can be done by adding a
suffix to the \texttt{O} tag. The previous example will
then be chunked as \texttt{B\_NP O\_NP I\_NP}.

Concerning the \textsc{MEMM} features, editing phase segments
should obviously be marked as such, but the reparandum should also be
marked so that it is not confused with a regular segment.
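
The tagging scheme described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the role labels
\texttt{regular}, \texttt{reparandum} and \texttt{editing}, the function
\texttt{tag\_np\_chunk} and its \texttt{suffix\_outside} switch are
hypothetical names, and the sketch tags one noun phrase chunk token by token,
so ``car'' and ``uh'' each receive their own tag.

```python
from typing import List, Tuple

# (text, role) pairs; the role names are hypothetical labels for this sketch.
Segment = Tuple[str, str]


def tag_np_chunk(segments: List[Segment], suffix_outside: bool = False) -> List[str]:
    """Tag the segments of a single NP chunk.

    Reparandum and editing phase segments are tagged O (or O_NP when
    suffix_outside is set); the remaining segments get B_NP / I_NP.
    """
    tags = []
    seen_regular = False
    for _text, role in segments:
        if role in ("reparandum", "editing"):
            tags.append("O_NP" if suffix_outside else "O")
        else:
            tags.append("I_NP" if seen_regular else "B_NP")
            seen_regular = True
    return tags


chunk = [("a", "regular"), ("car", "reparandum"),
         ("uh", "editing"), ("plane", "regular")]
print(tag_np_chunk(chunk))                       # plain O tags
print(tag_np_chunk(chunk, suffix_outside=True))  # suffixed O_NP tags
```

The role label carried by each segment is also the kind of information that
could be passed to the \textsc{MEMM} as a feature.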
\item Repairs are only noticed when you can look ahead to the
\emph{editing phase} markers. It might be necessary either to look ahead a little
or to work outwards from the identified \emph{editing phase}.
Right-to-left parsing has the same problem as left-to-right, in the sense that
it sees the repair first and also has to look ahead to know whether
it is part of a repair.
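
Working outwards from an identified \emph{editing phase} could look roughly
like the Python sketch below. Everything in it is an assumption made for
illustration: the filler set \texttt{FILLERS} stands in for whatever detects
editing phases, and \texttt{repair\_windows} with its \texttt{width} parameter
is a hypothetical helper that merely collects the neighbouring tokens as
candidate \emph{reparandum} and \emph{repair}.

```python
from typing import List, Tuple

# Hypothetical filler inventory standing in for an editing phase detector.
FILLERS = {"uh", "um"}


def repair_windows(tokens: List[str], width: int = 1) -> List[Tuple[List[str], str, List[str]]]:
    """For each filler, return (tokens before, filler, tokens after).

    The tokens before the filler are candidates for the reparandum,
    the tokens after it are candidates for the repair.
    """
    windows = []
    for i, tok in enumerate(tokens):
        if tok in FILLERS:
            before = tokens[max(0, i - width):i]
            after = tokens[i + 1:i + 1 + width]
            windows.append((before, tok, after))
    return windows


print(repair_windows(["a", "car", "uh", "plane"]))
```

Starting from the marker avoids the lookahead in either parsing direction, at
the cost of having to choose how far outwards to look.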