pw.x and cp.x can in principle run on any number of processors. The effectiveness of parallelization is ultimately judged by the ''scaling'', i.e. how the time needed to perform a job scales with the number of processors, and depends upon:
- the size and type of the system under study;
- the judicious choice of the various levels of parallelization;
- the availability of fast interprocess communications (or lack thereof).
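In practice, scaling is quantified by the parallel speedup and efficiency (the standard definitions, not specific to QUANTUM ESPRESSO):

   S(N) = T(1) / T(N),      E(N) = S(N) / N,

where T(N) is the wall-clock time of the job on N processors. E(N) close to 1 indicates good scaling; the superlinear scaling mentioned below corresponds to E(N) > 1.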
As a general rule, image parallelization:
- may give good scaling, but the slowest image will determine the overall performance (''load balancing'' may be a problem);
- requires very little communication (it is suitable for ethernet communication);
- does not reduce the memory required per processor (it is unsuitable for large-memory jobs).
A typical launch line is shown below.
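A minimal sketch of such a launch line, assuming a multi-image (e.g. NEB) input file named neb.in; the processor counts are illustrative:

  # 4 images, 64/4 = 16 MPI processes per image
  mpirun -np 64 pw.x -nimage 4 < neb.in > neb.out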
A note on scaling: optimal serial performance is achieved when the data are kept as much as possible in the cache. As a side effect, PW parallelization may yield superlinear (better than linear) scaling, thanks to the increase in serial speed coming from the reduction of the data size per processor (making it easier for the machine to keep data in the cache).
VERY IMPORTANT: For each system there is an optimal range of number of processors on which to run the job. A too large number of processors will yield performance degradation. The size of pools is especially delicate: the number of processors in each pool, Np, performs the PW parallelization and should not exceed N3, the number of planes of the FFT grid along the z axis, since each processor must hold at least one plane; ideally Np should be a divisor of N3, so that the planes are evenly distributed.
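As a hypothetical illustration, suppose a job has 8 k-points and an FFT grid with N3 = nr3 = 64 planes; the numbers below are illustrative, not a recommendation:

  # 64 MPI processes split into 8 pools, one k-point each;
  # each pool has Np = 8 processors, a divisor of nr3 = 64
  mpirun -np 64 pw.x -npool 8 < pw.in > pw.out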
The optimal number of processors for ''linear-algebra'' parallelization, taking care of multiplication and diagonalization of M x M matrices, should be determined empirically: observe the time spent in matrix diagonalization (pw.x) or orthonormalization (cp.x) for different sizes of the linear-algebra group. Note that this number must be a square integer.
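The size of the linear-algebra group is set on the command line with the -ndiag flag. A sketch, with illustrative numbers:

  # 64 MPI processes in 4 pools of 16; the M x M linear algebra
  # is distributed on a 4 x 4 grid (-ndiag must be a square,
  # here chosen not to exceed the 16 processors of a pool)
  mpirun -np 64 pw.x -npool 4 -ndiag 16 < pw.in > pw.out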
Actual parallel performance will also depend on the available software (MPI libraries) and on the available communication hardware. For PC clusters, OpenMPI (http://www.openmpi.org/) seems to yield better performance than other implementations (info by Konstantin Kudin). Note however that you need decent communication hardware (at least Gigabit ethernet) in order to have acceptable performance with PW parallelization. Do not expect good scaling with cheap hardware: PW calculations are by no means an ''embarrassingly parallel'' problem.
Also note that multiprocessor motherboards for Intel Pentium CPUs typically have just one memory bus for all processors. This dramatically slows down any code that makes heavy use of memory (as most codes in the QUANTUM ESPRESSO distribution do) when it runs on several processors of the same motherboard.