Subject: Re: HTML Support for jsoup-extractor in Nutch 2.x?

Never mind, the problem I described doesn't exist. After reading the code, I realized that the real issue is that the out-of-the-box jsoup-extractor.xml is missing an <extractor> root element. The example XML is correct, though.
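
For anyone who hits the same message, here is a minimal sketch of why a missing root element produces it. It assumes the config file is read by a standard Java XML parser, which I have not confirmed is exactly what the plugin does. The <field> elements and the RootElementCheck class are hypothetical placeholders, not the real jsoup-extractor schema or Nutch code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class RootElementCheck {

    // Returns null if the XML is well-formed, otherwise the parser's error message.
    static String parseError(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical <field> elements; NOT the real jsoup-extractor schema.
        String noRoot = "<field name=\"a\"/><field name=\"b\"/>";
        // Same content wrapped in a single <extractor> root element.
        String withRoot = "<extractor>" + noRoot + "</extractor>";

        System.out.println("without root: " + parseError(noRoot));
        System.out.println("with root:    " + parseError(withRoot));
    }
}
```

An XML document may have only one top-level element, so the parser treats the first <field> as the root and rejects whatever follows it. Wrapping everything in <extractor> makes the same content parse cleanly, and the error has nothing to do with the HTML being crawled.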

So HTML is supported, via the jsoup HTML parser. I'm not getting any extracted values yet, but I'll keep trying.



On 08/02/2017 02:42 PM, Michael Chen wrote:

I'm trying to use the new jsoup-extractor in Nutch 2.x, but it gives a "The markup in the document following the root element must be well-formed" error when I hand it HTML. I re-read the description of NUTCH-2389, and it seems to be designed to parse XML only.

I'm still quite new to Nutch, so I wanted some opinions on this: should I try to implement HTML DOM building for jsoup-extractor, or is that too much work or not feasible in Nutch 2.x? Any suggestions would be greatly appreciated!

Go Nutch!

