Subject: Re: Question on 2.x sitemap functionality

Hi Kenneth,

Thanks for following up! There is almost no Javadoc for the sitemap classes or for many of the main job classes, so I was mainly using the GSoC project page and the lifecycle PDF as references. The Nutch 2 lifecycle PDF says that sitemap detection is done during injection, but I found that it actually happens during fetching, enabled by the -stmDetect flag. Looking at the code confirms that fetch is the only job that uses the crawler-commons sitemap features. In addition, the sitemap feature wiki page contains only a link to the GSoC project for Nutch 2.x, which is what I'm using.

Specifically, I'm running Nutch 2.x on Ubuntu 16.04, after failing to get it working on Windows (problems related to the Hadoop binary files, which extensive troubleshooting did not resolve). Let me know if there's any additional information I can provide.

I completely understand that documenting a community project can be difficult, and I'll be more than happy to add or fix some of it if I can. But right now I'm still trying to verify or falsify some of the claims in the documentation...



On 08/01/2017 05:30 PM, kenneth mcfarland wrote:
Can you please be more specific about your environment and what you have found to be out of date?
On Aug 1, 2017 5:28 PM, "Michael Chen" wrote:
Problem resolved. The crawl script and the web documentation are out of date. The nutch script works fine.

It might be a good idea to update the sitemap-related documentation at some point... it takes quite a bit of speculation and experimentation right now...


On 07/31/2017 12:21 PM, Michael Chen wrote:
Dear fellow Nutch developers,

I've been trying to use the Nutch 2 sitemap function to crawl and index all pages listed in the sitemap indices. It seems that integration with the crawler-commons sitemap tools only exists in the 2.x branch. But after I got it to work with HBase 1.2.3, it didn't fetch, parse, or index the sitemap indices and sitemaps at all.

I also looked into the code a bit and everything seems to make sense, except that I couldn't trace the data flow beyond the FetchReducer. I'm testing it on Linux with the "crawl" script in /bin, so I'm not sure how I can debug this. Please let me know if there's any further information I can provide to help troubleshoot this issue. Thanks in advance!

Best regards,

