Subject: Re: HTML Support for jsoup-extractor in Nutch 2.x?
Never mind, the problem doesn't exist. After reading the code, I realized
that the real issue is the out-of-the-box jsoup-extractor.xml, which is
missing an <extractor> root element. The example XML is correct, though.
So HTML is supported, since parsing is based on the jsoup HTML parser. I'm
not getting any extracted values yet, but I'll keep trying.
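
For anyone who hits the same error, here is a minimal sketch of why a config file without a single root element fails to parse. It uses only the JDK's built-in XML parser, not Nutch's actual config loader, and the <field> elements are hypothetical placeholders rather than the real jsoup-extractor.xml schema. I'm also assuming the shipped file had several top-level elements with no wrapper around them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class RootElementCheck {

    /** Returns null if the XML is well-formed, otherwise the parser's error message. */
    static String parseError(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical config with two top-level elements and no wrapper.
        String noRoot = "<field name=\"a\"/><field name=\"b\"/>";
        // Same content wrapped in a single <extractor> root element.
        String withRoot = "<extractor>" + noRoot + "</extractor>";

        System.out.println("without root: " + parseError(noRoot));
        System.out.println("with root:    " + parseError(withRoot));
    }
}
```

The first input should be rejected as not well-formed, while the wrapped version parses cleanly. In other words, the error comes from the config file itself, not from the HTML being crawled.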
On 08/02/2017 02:42 PM, Michael Chen wrote:
I'm trying to use the new jsoup-extractor in Nutch 2.x, but it gives a
"The markup in the document following the root element must be
well-formed" error when I hand it HTML. I re-read the description in
NUTCH-2389, and it seems that it's designed to parse XML only.
I'm still quite new to Nutch, so I wanted some opinions on this: should
I try to implement HTML DOM building for jsoup-extractor, or is it too
much work/not feasible in Nutch 2.x? Any suggestions would be greatly