LUMI GPU / Nomad CoE Hackathon¶
Sept 4-6, 2023
CSC, Espoo, Finland
GPAW-team:
Mikael Kusima and Jens Jørgen Mortensen
Department of Physics
Technical University of Denmark
very quick introduction to DFT
quick introduction to GPAW
GPU-history of GPAW
our CUPY implementation
profiling
problems and questions
Density functional theory (DFT)¶
Kohn-Sham equation:
Electron density:
Solved iteratively: \(n(\mathbf{r}) \rightarrow v_{\text{eff}}(\mathbf{r})\). Most algorithms scale cubicly with system size.
What is GPAW?¶
PAW-based electronic structure code
Written in
Python (numpy, scipy)
C (BLAS, libxc, Scalapack, MPI, FFTW, ELPA, …)
Wave function representations:
plane waves (PW-mode)
numerical atomic orbitals (LCAO-mode)
uniform real-space grids (FD-mode)
GPAW-GPU timeline¶
early FD-implementation
… many years passes …
most of the FD-work merged to master (build-system, kernels, …)
ongoing work on rewrite of core parts of GPAW (not yet ready for production)
PW-implementaion based on rewrite (parallel over k-points only)
Used for production: bulk structure relaxation
Gitlab-CI (NVIDIA only)
How¶
Use
cupy.ndarray
objects instead ofnumpy.ndarray
objects:numpy
->cupy
scipy
->cupyx.scipy
We have GPU-enabled our Python interface to MPI: it can work with both types of arrays.
For testing without a GPU, we have fake Cupy modules:
gpaw.gpu.cpupy
gpaw.gpu.cpupyx.scipy
Plane-wave calculations¶
The (pseudo) wave functions can be expressed as:
In Python code they will be represented by a gpaw.core.PWArray
instance as sketched here:
import cupy as cp
import numpy as np
from gpaw.core import PWDesc
a = 4.0 # side-lenght of unit cell (bohr)
cell = np.eye(3) * a
pw = PWDesc(ecut=15.0, # plane-wave cutoff in hartree
cell=cell)
N = 5 # number of electrons
kpt = [0.5, 0.5, 0.25]
wavefunctions = pw.empty(N, kpt=kpt, xp=cp) # use a cupy array
Electron density¶
The electron density is represented by a gpaw.core.UGArray
instance:
from gpaw.core import UGDesc
gpts = 20 # number of grid-points
grid = UGDesc(cell=cell, size=(gpts, gpts, gpts))
density = grid.zeros(xp=cp) # use a cupy array
Matrix elements¶
Matrix elements like the overlap matrix:
are calculated like this:
overlap = wavefunctions.matrix_elements(wavefunctions)
and the overlap object will be an instance of the
gpaw.core.matrix.Matrix
object and since the wave function expansion
coefficients live on the GPU, the matrix object will also use its CuPy
backend for storage.
PAW projections¶
Wavefunction-projector overlaps:
Work done by
gpaw.core.atom_centered_functions.AtomCenteredFunctions
object:
projectors = pw.atom_centered_functions(
splines,
positions,
...,
xp=cp)
Profiling Python code¶
from ase.build import mx2
from gpaw import PW, GPAW
import cProfile
atoms = mx2('MoS2',
kind='2H',
a=3.184,
thickness=3.13,
size=(1, 1, 1))
atoms.center(vacuum=3.0, axis=2)
atoms *= (4, 4, 1)
atoms.calc = GPAW(mode=PW(400),
parallel={'gpu': not True})
with cProfile.Profile() as p:
atoms.get_potential_energy()
p.dump_stats('cpu.prof')
CPU-profile¶
Mon Aug 28 09:35:48 2023 cpu.prof
8853403 function calls (8842082 primitive calls) in 173.769 seconds
Ordered by: internal time
List reduced from 1679 to 30 due to restriction <30>
ncalls tottime percall cumtime percall filename:lineno(function)
59567 52.018 0.001 52.018 0.001 {built-in method _gpaw.FFTWExecute}
3583 36.444 0.010 36.444 0.010 {built-in method _gpaw.mmm}
3085 20.712 0.007 20.712 0.007 {built-in method _gpaw.pwlfc_expand}
254 9.502 0.037 9.502 0.037 {method 'lcao_to_grid' of 'LocalizedFunctionsCollection' objects}
124826 6.181 0.000 6.181 0.000 {method 'calculate' of 'XCFunctional' objects}
210 4.374 0.021 4.374 0.021 {built-in method _gpaw.r2k}
458190/448757 4.334 0.000 7.540 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
104 4.172 0.040 63.092 0.607 /home/jensj/gpaw/gpaw/new/pw/hamiltonian.py:20(apply_local_potential)
26670 3.812 0.000 28.633 0.001 /home/jensj/gpaw/gpaw/fftw.py:152(fft_sphere)
32844 3.406 0.000 3.421 0.000 /home/jensj/gpaw/gpaw/core/uniform_grid.py:365(scatter_from)
28356 2.902 0.000 2.902 0.000 {built-in method numpy.core._multiarray_umath.c_einsum}
79 2.100 0.027 2.103 0.027 /home/jensj/.local/lib/python3.10/site-packages/scipy/linalg/_decomp.py:270(eigh)
2496 2.054 0.001 12.665 0.005 /home/jensj/gpaw/gpaw/xc/lda.py:14(__call__)
32793 2.028 0.000 2.028 0.000 {built-in method _gpaw.pw_insert}
300 1.816 0.006 1.816 0.006 {built-in method _gpaw.symmetrize_ft}
53 1.385 0.026 1.385 0.026 {built-in method _gpaw.rk}
1 1.283 1.283 1.283 1.283 {method 'calculate_potential_matrix' of 'LocalizedFunctionsCollection' objects}
26670 1.196 0.000 1.249 0.000 /home/jensj/gpaw/gpaw/core/plane_waves.py:161(cut)
6 0.811 0.135 0.811 0.135 {built-in method _gpaw.FFTWPlan}
6350 0.772 0.000 0.772 0.000 {built-in method _gpaw.add_to_density}
4 0.694 0.174 1.338 0.335 /home/jensj/gpaw/gpaw/core/pwacf.py:174(set_positions)
32793 0.636 0.000 34.539 0.001 /home/jensj/gpaw/gpaw/fftw.py:137(ifft_sphere)
124867 0.582 0.000 0.945 0.000 /home/jensj/gpaw/gpaw/atom/radialgd.py:95(integrate)
24478 0.458 0.000 0.469 0.000 /home/jensj/gpaw/gpaw/utilities/blas.py:229(axpy)
13208 0.450 0.000 0.450 0.000 {built-in method _gpaw.pw_precond}
26416 0.440 0.000 0.443 0.000 /home/jensj/gpaw/gpaw/core/plane_waves.py:406(gather_all)
134488 0.436 0.000 0.436 0.000 {method 'conj' of 'numpy.ndarray' objects}
73558 0.401 0.000 0.401 0.000 {method 'reduce' of 'numpy.ufunc' objects}
80 0.325 0.004 25.289 0.316 /home/jensj/gpaw/gpaw/core/pwacf.py:272(add)
382032 0.321 0.000 4.463 0.000 <__array_function__ internals>:177(dot)
GPU-profile¶
Mon Aug 28 09:35:48 2023 gpu.prof
26393027 function calls (26381882 primitive calls) in 70.977 seconds
Ordered by: internal time
List reduced from 1895 to 30 due to restriction <30>
ncalls tottime percall cumtime percall filename:lineno(function)
254 9.600 0.038 9.600 0.038 {method 'lcao_to_grid' of 'LocalizedFunctionsCollection' objects}
124826 6.466 0.000 6.466 0.000 {method 'calculate' of 'XCFunctional' objects}
32793 4.496 0.000 9.178 0.000 /home/jensj/gpaw/gpaw/fftw.py:234(ifft_sphere)
500697/491472 4.335 0.000 4.622 0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
3085 4.204 0.001 4.204 0.001 {built-in method _gpaw.pwlfc_expand_gpu}
42701 2.414 0.000 2.533 0.000 {built-in method cupy._core._routines_linalg.matmul}
2496 1.998 0.001 12.367 0.005 /home/jensj/gpaw/gpaw/xc/lda.py:14(__call__)
300 1.816 0.006 1.816 0.006 {built-in method _gpaw.symmetrize_ft}
26416 1.810 0.000 3.817 0.000 /home/jensj/gpaw/gpaw/fftw.py:271(fft_sphere)
108 1.558 0.014 1.567 0.015 /home/jensj/gpaw/gpaw/core/matrix.py:513(tril2full)
136325 1.445 0.000 1.446 0.000 {method 'conj' of 'cupy._core.core._ndarray_base' objects}
264 1.284 0.005 1.285 0.005 {method 'get' of 'cupy._core.core._ndarray_base' objects}
1 1.279 1.279 1.279 1.279 {method 'calculate_potential_matrix' of 'LocalizedFunctionsCollection' objects}
104 1.152 0.011 15.861 0.153 /home/jensj/gpaw/gpaw/new/pw/hamiltonian.py:20(apply_local_potential)
28353 1.014 0.000 10.784 0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/linalg/_einsum.py:447(einsum)
4 0.943 0.236 1.613 0.403 /home/jensj/gpaw/gpaw/core/pwacf.py:174(set_positions)
106978 0.768 0.000 0.768 0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/_creation/basic.py:7(empty)
42701 0.718 0.000 2.000 0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/_core/_gufuncs.py:422(_get_args_transposed)
110775 0.670 0.000 0.670 0.000 {built-in method cupy._core.core.array}
59313 0.616 0.000 0.616 0.000 {method 'fft' of 'cupy.cuda.cufft.PlanNd' objects}
42701 0.611 0.000 6.444 0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/_core/_gufuncs.py:539(__call__)
124867 0.559 0.000 0.908 0.000 /home/jensj/gpaw/gpaw/atom/radialgd.py:95(integrate)
25 0.534 0.021 2.694 0.108 /home/jensj/gpaw/gpaw/core/plane_waves.py:558(abs_square)
74010 0.487 0.000 0.487 0.000 {method 'reduce' of 'numpy.ufunc' objects}
44005 0.450 0.000 8.260 0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/linalg/_einsum.py:359(reduced_binary_einsum)
244466 0.432 0.000 0.432 0.000 {method 'transpose' of 'cupy._core.core._ndarray_base' objects}
26416 0.414 0.000 0.419 0.000 /home/jensj/gpaw/gpaw/core/plane_waves.py:450(scatter_from_all)
26670 0.402 0.000 0.484 0.000 /home/jensj/gpaw/gpaw/core/plane_waves.py:432(scatter_from)
247711 0.381 0.000 0.540 0.000 {built-in method builtins.next}
201917 0.365 0.000 0.365 0.000 {method 'reshape' of 'cupy._core.core._ndarray_base' objects}
Problems and questions¶
Missing
cupy.cublas.syr2k
(cupy.cublas.syrk
is there)Missing generalized eigenvalue solver (
cupy.linalg.eigh(a, b)
)Standard eigenvalue solver (
cupy.linalg.eigh(a)
) very slow on HIPBand-parallelization:
synchronize before MPI-send: where, when?
Scalapack replacement: ELPA?
More bottlenecks:
LCAO initialization not on GPU (use random wave functions?)
Libxc?
Your questions?