Random signal capture when using multiprocessing


On Sat, Jul 6, 2019 at 12:13 AM José María Mateos <chema at rinzewind.org> wrote:
>
> Hi,
>
> This is a minimal proof of concept for something that has been bugging me for a few days:
>
> ```
> $ cat signal_multiprocessing_poc.py
>
> import random
> import multiprocessing
> import signal
> import time
>
> def signal_handler(signum, frame):
>     raise Exception(f"Unexpected signal {signum}!")
>
> def process_message(args):
>     time.sleep(random.random() / 100.)
>
> if __name__ == "__main__":
>     signal.signal(signal.SIGTERM, signal_handler)
>     n_round = 1
>     while n_round < 10:
>         job_list = [x for x in range(random.randint(100, 400))]
>         print(f"Running round {n_round} with {len(job_list)} jobs")
>         with multiprocessing.Pool(8) as p1:
>             p1.map(process_message, job_list)
>         n_round += 1
>
> ```
>
> So basically I have some subprocesses that don't do anything, just sleep for a few milliseconds, and I capture SIGTERM signals. I don't expect that handler to ever fire.
>
> This is the output:
>
> ```
> $ python signal_multiprocessing_poc.py
> Running round 1 with 244 jobs
> Running round 2 with 151 jobs
> Running round 3 with 173 jobs
> Running round 4 with 124 jobs
> Running round 5 with 249 jobs
> Running round 6 with 359 jobs
> Process ForkPoolWorker-48:
> Traceback (most recent call last):
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
>     self.run()
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
>     self._target(*self._args, **self._kwargs)
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
>     task = get()
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/queues.py", line 352, in get
>     res = self._reader.recv_bytes()
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/synchronize.py", line 98, in __exit__
>     return self._semlock.__exit__(*args)
>   File "signal_multiprocessing_poc.py", line 7, in signal_handler
>     raise Exception(f"Unexpected signal {signum}!")
> Exception: Unexpected signal 15!
> Running round 7 with 185 jobs
> Running round 8 with 246 jobs
> Running round 9 with 217 jobs
> Process ForkPoolWorker-68:
> Traceback (most recent call last):
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
>     self.run()
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
>     self._target(*self._args, **self._kwargs)
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
>     task = get()
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/queues.py", line 352, in get
>     res = self._reader.recv_bytes()
>   File "/home/j_mariamateos/miniconda3/lib/python3.7/multiprocessing/synchronize.py", line 98, in __exit__
>     return self._semlock.__exit__(*args)
>   File "signal_multiprocessing_poc.py", line 7, in signal_handler
>     raise Exception(f"Unexpected signal {signum}!")
> Exception: Unexpected signal 15!
> ```
>
> Can anyone help me understand why this random SIGTERM gets captured? I guess some of the child processes receive the termination signal, and that's why I see it, but I don't understand the randomness of it.
>
> Of course, as a coworker just pointed out, this works (telling the child processes to ignore the signals):
>
> ```
> import random
> import multiprocessing
> import signal
> import time
>
> def signal_handler(signum, frame):
>     raise Exception(f"Unexpected signal {signum}!")
>
> def init_worker():
>     signal.signal(signal.SIGTERM, signal.SIG_IGN)
>     signal.signal(signal.SIGINT, signal.SIG_IGN)
>
> def process_message(args):
>     time.sleep(random.random() / 100.)
>
> if __name__ == "__main__":
>
>     signal.signal(signal.SIGTERM, signal_handler)
>     signal.signal(signal.SIGINT, signal_handler)
>     n_round = 1
>
>     while n_round < 20:
>         job_list = [x for x in range(random.randint(100, 400))]
>         print(f"Running round {n_round} with {len(job_list)} jobs")
>         with multiprocessing.Pool(8, init_worker) as p1:
>             p1.map(process_message, job_list)
>         n_round += 1
> ```
>

In the interests of test consistency, I removed all the random number
generation from the test, locking in some numbers instead:

import multiprocessing
import signal
import time

def signal_handler(signum, frame):
    raise Exception(f"Unexpected signal {signum}!")

def process_message(args):
    time.sleep(0.005)

if __name__ == "__main__":
    signal.signal(signal.SIGTERM, signal_handler)
    for n_round in range(1, 10):
        job_list = range(250)
        print(f"Running round {n_round} with {len(job_list)} jobs")
        with multiprocessing.Pool(8) as p1:
            p1.map(process_message, job_list)


The SIGTERMs are still happening, and it seems to be somewhat
unpredictable. I can have a couple of runs of the program with none,
then two come up in the same run. Once, I actually got a chained
exception.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "mpterm.py", line 16, in <module>
    with multiprocessing.Pool(8) as p1:
  File "/usr/local/lib/python3.9/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 218, in __init__
    p.join()
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 147, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process

(Yes, that says python3.9 there, but I don't think this is
version-dependent; the same thing happens on 3.8, 3.7, and 3.6. On
3.5, of course, f-strings aren't available.)

But if I comment out the signal.signal line, there seem to be no ill
effects. I suspect that what you're seeing here is the multiprocessing
module managing its own subprocesses, telling some of them to shut
down. I added a print call to multiprocessing/popen_fork.py inside
_send_signal (line 53 or thereabouts depending on Python version) and
saw a *lot* of termination signals being sent; only a few actually
triggered the exception message. My guess is that most of the time
the SIGTERM is handled smoothly as part of the Pool's __exit__ method,
but sometimes a child process is blocked on something and has to be
told to shut down. Normally that signal gets caught and handled just
fine; since you're explicitly hooking it, you get to see it.

ChrisA