Subject: Re: Question on 2.x sitemap functionality
Please know that this inquiry is simply meant to help me and others document the code better. Thank you for your response.
On Aug 1, 2017 5:45 PM, "Michael Chen" <[email protected]> wrote:
Thanks for following up! Besides the fact that there is almost no
Javadoc available for the sitemap classes and many of the main
job classes, I was mainly using the GSoC project page and the
lifecycle PDF as references. The Nutch 2 lifecycle PDF says that
sitemap detection is done during injection, but I found it to
happen within fetching, via the -stmDetect flag. Reading the code
also confirms that fetch is the only process that uses the
crawler-commons sitemap features. In addition, the sitemap feature
wiki page contains only a link to the GSoC project for Nutch 2.x,
which is what I'm using.
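For reference, here is a minimal sketch of how I'm invoking the fetch phase with sitemap detection. The -stmDetect flag is the one I observed; the use of -all and the crawl id follow the usual Nutch 2.x fetch job arguments and are illustrative:

```shell
# Sketch: run the Nutch 2.x fetch job with sitemap detection enabled.
# -stmDetect is the flag observed in the code; "-all" (fetch all
# generated batches) and the crawl id are illustrative placeholders.
bin/nutch fetch -all -crawlId my_crawl -stmDetect
```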
Specifically, I'm running Nutch 2.x on Ubuntu 16.04 after failing
to get it working on Windows (Hadoop binary related problems; I
did extensive troubleshooting). Let me know if there's any
additional information I can provide.
I completely understand that documentation for a community
project can be difficult to maintain, and I'll be more than happy
to add or fix some if I can. But right now I'm still trying to
verify or falsify some of the claims in the documentation...
On 08/01/2017 05:30 PM, kenneth wrote:
Can you please be more specific about your
environment and what you have found to be out of date?
On Aug 1, 2017 5:28 PM, "Michael Chen" wrote:
resolved. The crawl script and web documentation are out of
date; the nutch script itself works fine.
It might be a good idea to update the sitemap-related
documentation at some point... it takes quite a bit of
speculation and experimentation right now...
On 07/31/2017 12:21 PM, Michael Chen wrote:
Dear fellow Nutch developers,
I've been trying to use Nutch 2's sitemap functionality to crawl
and index all pages listed in the sitemap indices. It seems that
the integration with the crawler-commons sitemap tools only
exists in the 2.x branch. But after I got it working with
HBase 1.2.3, it didn't fetch, parse, or index the sitemap
indices and sitemaps at all.
I also looked into the code a bit and everything seems to
make sense, except that I couldn't trace the data flow
beyond ToolRunner.run() in the FetcherReducer. I'm testing
it on Linux with the "crawl" script in /bin, so I'm not
sure how I can debug this. Please let me know if
there's any further information I can provide
to help troubleshoot this issue. Thanks in advance!
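One thing I've been considering for debugging (a sketch, under the assumption that the standard Nutch 2.x job signatures apply) is bypassing the bin/crawl wrapper and running each phase by hand, so the sitemap behavior of the fetch step can be isolated. The seed directory, -topN value, and crawl id below are placeholders:

```shell
# Sketch: run the Nutch 2.x lifecycle phase by phase instead of via
# the bin/crawl wrapper; "urls/" and "my_crawl" are placeholders.
bin/nutch inject urls/ -crawlId my_crawl           # seed the webtable
bin/nutch generate -topN 50 -crawlId my_crawl      # mark a batch for fetching
bin/nutch fetch -all -crawlId my_crawl -stmDetect  # fetch, with sitemap detection
bin/nutch parse -all -crawlId my_crawl             # parse fetched content
bin/nutch updatedb -all -crawlId my_crawl          # write new links back
```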