Subject: HTML Support for jsoup-extractor in Nutch 2.x?
I'm trying to use the new jsoup-extractor in Nutch 2.x, but it fails with
the error "The markup in the document following the root element must be
well-formed" when I hand it HTML. I re-read the description of NUTCH-2389,
and it seems to be designed to parse XML only.
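To illustrate what I think is going on, here is a minimal sketch that uses only the JDK's own XML parser. It is not the jsoup-extractor code; the class name, the helper method, and the sample inputs are made up for illustration. My assumption is that the extractor hands the page to a strict XML parser somewhere, which would reject ordinary HTML in the same way:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of Nutch or jsoup-extractor.
public class WellFormedCheck {

    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormedXml(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Rethrow parse errors instead of letting the parser report them itself.
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignore warnings */ }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // A strict XML parser rejects anything that is not well-formed.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // A single root element with properly closed children.
        String xml = "<root><p>hello</p></root>";
        // Typical HTML fragment: two top-level elements and an unclosed <br>.
        String html = "<p>first</p>\n<p>second<br></p>";

        System.out.println("xml well-formed:  " + isWellFormedXml(xml));
        System.out.println("html well-formed: " + isWellFormedXml(html));
    }
}
```

If that assumption holds, HTML support would need a lenient HTML parser to build the DOM before the extraction rules run, rather than the XML parsing path.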
I'm still quite new to Nutch, so I wanted some opinions on this: should I
try to implement HTML DOM building for jsoup-extractor, or is it too much
work or not feasible in Nutch 2.x? Any suggestions would be greatly