Subject: [jira] [Created] (NUTCH-2407) Memory leak causing
Nutch Server to run out of memory

Vyacheslav Pascarel created NUTCH-2407:

Summary: Memory leak causing Nutch Server to run out of memory
Key: NUTCH-2407
Project: Nutch
Issue Type: Bug
Components: nutch server
Affects Versions: 2.3.1
Environment: Ubuntu 16.04 64-bit
Oracle Java 8 64-bit
Nutch 2.3.1 (standalone deployment)
MongoDB 3.4
Reporter: Vyacheslav Pascarel

My application performs continuous crawling through the Nutch REST services.
It injects a seed URL and then repeats the GENERATE/FETCH/PARSE/UPDATEDB
sequence a requested number of times (each step starts only after the previous
step completes successfully, and then the whole sequence is repeated). Here is
a brief description of the job:
* Number of GENERATE/FETCH/PARSE/UPDATEDB cycles per run: 50
* 'topN' parameter value of GENERATE step in each cycle: 10
* Seed URL:
* Regex URL filters for all jobs:
** *"-^.\{1000,\}$"* - exclude very long URLs
** *"+."* - include the rest
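
To illustrate how the two filter rules above interact, here is a minimal
sketch. It assumes the usual first-match-wins evaluation of regex URL filter
rules, where the leading '-' or '+' is the reject/accept sign and not part of
the pattern. The pattern is written without the wiki-markup backslashes. The
function name `accepts` is hypothetical, and this uses Python's `re` module,
not Nutch's own Java filter code.

```python
import re

# Rules in file order as (sign, pattern); '-' rejects, '+' accepts.
RULES = [
    ("-", re.compile(r"^.{1000,}$")),  # exclude URLs of 1000 or more characters
    ("+", re.compile(r".")),           # include the rest
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected
```

Because the exclusion rule comes first, a URL of 1000 or more characters is
rejected before the catch-all "+." rule is ever consulted.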

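The control flow of one run, as described above, can be sketched roughly as
follows. This is not the reporter's application code: `crawl_run`, the
`run_job` callable, and the argument names passed to it are hypothetical
stand-ins for whatever client submits a job to the Nutch REST services and
waits for it to finish.

```python
STEPS = ("GENERATE", "FETCH", "PARSE", "UPDATEDB")

def crawl_run(run_job, seed_url, cycles=50, top_n=10):
    """Inject the seed once, then repeat the four-step cycle `cycles` times.

    run_job(job_type, args) is a hypothetical callable that submits one job
    to the server, blocks until it finishes, and returns True on success.
    Returns the number of jobs that completed successfully.
    """
    completed = 0
    # The "url" argument name is a placeholder, not a documented parameter.
    if not run_job("INJECT", {"url": seed_url}):
        return completed
    completed += 1
    for _ in range(cycles):
        for step in STEPS:
            # Only GENERATE takes the topN limit in this sketch.
            args = {"topN": top_n} if step == "GENERATE" else {}
            if not run_job(step, args):
                return completed  # a step runs only after the previous one succeeds
            completed += 1
    return completed
```

With the parameters listed above, one run submits 1 INJECT job plus
50 x 4 = 200 cycle jobs, so 201 jobs pass through the server per run.
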
To monitor the Nutch server I use Java VisualVM, which ships with the JDK.
After each run (50 cycles of GENERATE/FETCH/PARSE/UPDATEDB) I trigger a
garbage collection from VisualVM and check memory usage. I observe that the
Nutch server leaks ~25MB per run.
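
For anyone repeating the measurement, the per-run figure can be derived from
the post-GC readings with simple arithmetic. The sketch below is only an
illustration: the function names are hypothetical, the maximum heap size is
an assumption (it is not stated in this report), and any numbers fed into it
should be your own VisualVM readings.

```python
def leak_per_run(heap_after_gc_mb):
    """Average growth in post-GC used heap between consecutive runs, in MB."""
    if len(heap_after_gc_mb) < 2:
        raise ValueError("need at least two post-GC readings")
    growth = heap_after_gc_mb[-1] - heap_after_gc_mb[0]
    return growth / (len(heap_after_gc_mb) - 1)

def runs_until_exhausted(current_mb, max_heap_mb, leak_mb):
    """Whole runs that still fit before used heap would exceed max_heap_mb."""
    if leak_mb <= 0:
        raise ValueError("leak_mb must be positive")
    return int((max_heap_mb - current_mb) // leak_mb)
```

Taking readings only after a forced garbage collection matters here:
otherwise collectable garbage would be counted as part of the leak.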

NOTE: I added custom HTTP DELETE services that clean the job history in
NutchServerPoolExecutor and remove all custom configurations from
RAMConfManager after each run, so the observed ~25MB leak remains even after
the job history and configurations are cleaned up.

This message was sent by Atlassian JIRA

