-processing. We chose RSS feeds because RSS feeds lack inherent structural
-information but are still very structured. This structure is because, as said
-above, the RSS feeds are generated and therefore almost always look the same.
-Also, in RSS feeds most venues use particular structural identifiers that are
-characters. They separate fields with vertical bars, commas, whitespace and
-more non text characters. These field separators and keywords can be hard for a
-computer to detect but people are very good in detecting these. With one look
-they can identify the characters and keywords and build a pattern in their
-head. Another reason we chose RSS is their temporal consistency, RSS feeds are
-almost always generated and because of that the structure of the entries is
-very unlikely to change. Basically the RSS feeds only change structure when the
-CMS that generates it changes the generation algorithm. This property is useful
-because the crawlers then do not have to be retrained very often. To detect
-the underlying structures a technique is used that exploits subword matching
-with graphs.
+processing and possibly OCR. We chose RSS feeds because RSS feeds lack inherent
+structural information but are still very structured. This structure is
+because, as said above, the RSS feeds are generated and therefore almost always
+look the same. Also, in RSS feeds most venues use particular structural
+identifiers that are characters. They separate fields with vertical bars,
+commas, whitespace and more non text characters. These field separators and
+keywords can be hard for a computer to detect but people are very good in
+detecting these. With one look they can identify the characters and keywords
+and build a pattern in their head. Another reason we chose RSS is their
+temporal consistency, RSS feeds are almost always generated and because of that
+the structure of the entries is very unlikely to change. Basically the RSS
+feeds only change structure when the CMS that generates it changes the
+generation algorithm. This property is useful because the crawlers then do not
+have to be retrained very often. To detect the underlying structures a
+technique is used that exploits subword matching with graphs.