More CPUs doesn't equal more speed


Thanks all! The sound you are hearing is my head smacking against my hand!
Or is it my hand against my head?

Anyway, yes the problem is that I was naively using commands.getoutput()
which blocks until the command is finished. So, of course, only one process
was being run at one time! Bad me!

I guess I should be looking at subprocess.Popen(). Now, a more relevant
question ... if I do it this way I then need to poll through a list of saved
process IDs to see which have finished? Right? My initial thought is to
batch them up in small groups (say CPU_COUNT-1) and wait for that batch to
finish, etc. Would it be foolish to send a large number (1200 in this
case since this is the number of files) and let the OS worry about
scheduling and have my program poll 1200 IDs?
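
For what it's worth, here is a minimal sketch of the polling idea, assuming
the jobs are plain argv lists. The names run_bounded, max_running and
poll_interval are made up for illustration, and the demo jobs are stand-in
Python one-liners rather than the real command. Instead of waiting for a
whole batch, it starts a new child as soon as any slot frees up, so there is
never a need to poll 1200 IDs at once.

```python
import subprocess
import sys
import time


def run_bounded(commands, max_running, poll_interval=0.05):
    """Run each command as a child process, never more than max_running at once.

    Returns the exit codes in the same order as `commands`.
    """
    pending = list(enumerate(commands))   # (index, argv) pairs still to start
    pending.reverse()                     # so pop() hands them out in input order
    running = {}                          # index -> Popen object
    exit_codes = [None] * len(commands)

    while pending or running:
        # Top up to the limit; Popen() returns at once, it does not wait.
        while pending and len(running) < max_running:
            index, argv = pending.pop()
            running[index] = subprocess.Popen(argv)

        # poll() returns None while the child is still running.
        for index, proc in list(running.items()):
            code = proc.poll()
            if code is not None:
                exit_codes[index] = code
                del running[index]

        if running:
            time.sleep(poll_interval)     # avoid a busy loop

    return exit_codes


if __name__ == "__main__":
    # Stand-in jobs: four short sleeps, two allowed to run at a time.
    jobs = [[sys.executable, "-c", "import time; time.sleep(0.2)"]
            for _ in range(4)]
    print(run_bounded(jobs, max_running=2))
```

Keeping a Popen object per child (rather than a bare process ID) means
poll() also reaps the finished process, so no zombies are left behind.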

Someone mentioned the GIL. If I launch separate processes then I don't
encounter this issue? Right?
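
A sketch of one way around hand-written polling, under the assumption that
each job is an external command: threads in the parent only sit waiting on
their child process, and that wait releases the GIL, so the children run in
parallel. The helper names run_one and run_all are hypothetical, and the
default of CPU count minus one just mirrors the CPU_COUNT-1 idea above.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(argv):
    """Run one external command to completion; return (returncode, stdout)."""
    done = subprocess.run(argv, stdout=subprocess.PIPE, universal_newlines=True)
    return done.returncode, done.stdout


def run_all(commands, workers=None):
    """Run every command, at most `workers` at a time; results keep input order."""
    if workers is None:
        workers = max(1, (os.cpu_count() or 2) - 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Stand-in jobs: each child prints a square.
    jobs = [[sys.executable, "-c", "print(%d * %d)" % (n, n)] for n in range(5)]
    for returncode, output in run_all(jobs):
        print(returncode, output.strip())
```

All 1200 commands can be handed to run_all() in one go; the pool queues
them and only max_workers children exist at any moment.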


On Thu, May 23, 2019 at 4:24 PM MRAB <python at mrabarnett.plus.com> wrote:

> On 2019-05-23 22:41, Avi Gross via Python-list wrote:
> > Bob,
> >
> > As others have noted, you have not made it clear how what you are doing is
> > running "in parallel."
> >
> > I have a similar need where I have thousands of folders and need to do an
> > analysis based on the contents of one at a time and have 8 cores available
> > but the process may run for months if run linearly. The results are placed
> > within the same folder so each part can run independently as long as shared
> > resources like memory are not abused.
> >
> > Your need is conceptually simple. Break up the list of filenames into N
> > batches of about equal length. A simple approach might be to open N terminal
> > or command windows and in each one start a python interpreter by hand
> > running the same program which gets one of the file lists and works on it.
> > Some may finish way ahead of others, of course. If anything they do writes
> > to shared resources such as log files, you may want to be careful. And there
> > is no guarantee that several will not run on the same CPU. There is also
> > plenty of overhead associated with running full processes. I am not
> > suggesting this but it is fairly easy to do and may get you enough speedup.
> > But since you only seem to need a few minutes, this won't be much.
> >
> > Quite a few other solutions involve using some form of threads running
> > within a process perhaps using a queue manager. Python has multiple ways to
> > do this. You would simply feed all the info needed (file names in your case)
> > to a thread that manages a queue. It would allow up to N threads to be
> > started and whenever one finishes, would be woken to start a replacement
> > till done. Unless one such thread takes very long, they should all finish
> > reasonably close to each other. Again, lots of details to make sure the
> > threads do not conflict with each other. But, no guarantee which core they
> > get unless you use an underlying package that manages that.
> >
> [snip]
>
> Because of the GIL, only 1 Python thread will actually be running at any
> time, so if it's processor-intensive, it's better to use multiprocessing.
>
> Of course, if it's already maxing out the disk, then using more cores
> won't make it faster.
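
To illustrate the multiprocessing suggestion quoted above, here is a minimal
sketch for the case where the heavy work is Python code rather than an
external command. The function count_primes is a made-up CPU-bound stand-in,
and the limits are arbitrary; nothing here comes from the original program.

```python
import multiprocessing


def count_primes(limit):
    """CPU-bound stand-in: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def main():
    limits = [20000, 30000, 40000, 50000]
    # Pool() starts one worker process per CPU by default; each worker has
    # its own interpreter and its own GIL.
    with multiprocessing.Pool() as pool:
        for limit, primes in zip(limits, pool.map(count_primes, limits)):
            print(limit, primes)


if __name__ == "__main__":
    main()
```

The __main__ guard matters on platforms where worker processes re-import the
main module instead of being forked.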


-- 

**** Listen to my FREE CD at http://www.mellowood.ca/music/cedars ****
Bob van der Poel ** Wynndel, British Columbia, CANADA **
EMAIL: bob at mellowood.ca
WWW:   http://www.mellowood.ca