Subject: [jira] [Created] (NUTCH-2408) CrawlDb: allow update from unparsed segments

Sebastian Nagel created NUTCH-2408:

Summary: CrawlDb: allow update from unparsed segments
Key: NUTCH-2408
Project: Nutch
Issue Type: Improvement
Components: crawldb
Affects Versions: 1.13
Reporter: Sebastian Nagel
Priority: Minor
Fix For: 1.14

The command updatedb (class o.a.n.crawl.CrawlDb) does not allow updating the
CrawlDb with the fetch status alone (from the segment subdirectory crawl_fetch):
it always reads crawl_parse as well (which contains outlinks, but also scores,
signatures and metadata).

A workflow which does not require parsing of documents (e.g., because raw HTML
content is exported to WARC files) is then unable to update the CrawlDb to
store the fetch status.
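
The requested behavior could be sketched roughly as follows. This is a standalone illustration in Python, not Nutch code: the helper name select_update_inputs is hypothetical, and only the subdirectory names crawl_fetch and crawl_parse come from the description above. The idea is that crawl_fetch is required, while crawl_parse is read only if the segment actually contains it.

```python
from pathlib import Path
from typing import List

# Segment subdirectory names as given in the issue description.
FETCH_DIR = "crawl_fetch"
PARSE_DIR = "crawl_parse"


def select_update_inputs(segment: Path) -> List[Path]:
    """Hypothetical helper: pick the segment subdirectories an update would read.

    crawl_fetch is mandatory (it holds the fetch status); crawl_parse is
    optional and only included when the segment has been parsed.
    """
    fetch_dir = segment / FETCH_DIR
    if not fetch_dir.is_dir():
        raise FileNotFoundError(f"segment {segment} has no {FETCH_DIR}")

    inputs = [fetch_dir]

    parse_dir = segment / PARSE_DIR
    if parse_dir.is_dir():
        # Parsed segment: outlinks, scores, signatures and metadata are available.
        inputs.append(parse_dir)

    return inputs
```

With such a check, an unparsed segment would contribute only its fetch status, and a parsed segment would be handled as before.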

This message was sent by Atlassian JIRA

