Subject: HTML Support for jsoup-extractor in Nutch 2.x?


I'm trying to use the new jsoup-extractor in Nutch 2.x, but when I hand it HTML it fails with the error "The markup in the document following the root element must be well-formed". I re-read the description in NUTCH-2389, and it seems the plugin is designed to parse XML only.
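
For reference, here is a minimal sketch of how I think this kind of error arises. It uses only the JDK's own XML parser and no Nutch code at all; the class name StrictXmlDemo and the sample markup are made up for illustration, and I'm only assuming the extractor feeds its input to a strict XML DOM builder like this one. Markup with more than one top-level element is common in HTML fragments but is not well-formed XML, so a strict parser rejects it:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of Nutch or the jsoup-extractor plugin.
public class StrictXmlDemo {

    /** Returns the parser's error message, or null if the markup is well-formed XML. */
    static String parseError(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return null; // parsed cleanly as XML
        } catch (SAXParseException e) {
            return e.getMessage(); // strict XML parsing failed
        }
    }

    public static void main(String[] args) throws Exception {
        // A single root element: accepted by the XML parser.
        System.out.println(parseError("<root><p>one</p></root>"));
        // Two top-level elements, as HTML fragments often have: rejected.
        System.out.println(parseError("<p>one</p><p>two</p>"));
    }
}
```

If that assumption is right, the second call should report an error much like the one I'm seeing, which suggests the input would have to go through a lenient HTML parser before any DOM is built.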

I'm still quite new to Nutch, so I wanted some opinions on this: should I try to implement HTML DOM building for jsoup-extractor, or is that too much work or not feasible in Nutch 2.x? Any suggestions would be greatly appreciated!

Go Nutch!

