LUMI GPU / Nomad CoE Hackathon

Sept 4-6, 2023

CSC, Espoo, Finland

GPAW-team:

Mikael Kusima and Jens Jørgen Mortensen
Department of Physics
Technical University of Denmark
  1. very quick introduction to DFT

  2. quick introduction to GPAW

  3. GPU-history of GPAW

  4. our CUPY implementation

  5. profiling

  6. problems and questions

Density functional theory (DFT)

Kohn-Sham equation:

\[-\frac12 \nabla^2 \psi_n(\mathbf{r}) - v_{\text{eff}}(\mathbf{r}) \psi_n(\mathbf{r}) = \epsilon_n \psi_n(\mathbf{r})\]

Electron density:

\[n(\mathbf{r}) = \sum_n |\psi_n(\mathbf{r})|^2\]

Solved iteratively: \(n(\mathbf{r}) \rightarrow v_{\text{eff}}(\mathbf{r})\). Most algorithms scale cubicly with system size.

What is GPAW?

  • PAW-based electronic structure code

  • Written in

    • Python (numpy, scipy)

    • C (BLAS, libxc, Scalapack, MPI, FFTW, ELPA, …)

  • Wave function representations:

    • plane waves (PW-mode)

    • numerical atomic orbitals (LCAO-mode)

    • uniform real-space grids (FD-mode)

GPAW-GPU timeline

  • early FD-implementation

  • … many years passes …

  • most of the FD-work merged to master (build-system, kernels, …)

  • ongoing work on rewrite of core parts of GPAW (not yet ready for production)

  • PW-implementaion based on rewrite (parallel over k-points only)

  • Used for production: bulk structure relaxation

  • Gitlab-CI (NVIDIA only)

How

  • Use cupy.ndarray objects instead of numpy.ndarray objects:

    • numpy -> cupy

    • scipy -> cupyx.scipy

  • We have GPU-enabled our Python interface to MPI: it can work with both types of arrays.

  • For testing without a GPU, we have fake Cupy modules:

    • gpaw.gpu.cpupy

    • gpaw.gpu.cpupyx.scipy

Plane-wave calculations

The (pseudo) wave functions can be expressed as:

\[\tilde\psi_n(\mathbf{r}) = \sum_{\mathbf{G}} c_{n\mathbf{G}} e^{i\mathbf{G}\cdot\mathbf{r}}\]

In Python code they will be represented by a gpaw.core.PWArray instance as sketched here:

import cupy as cp
import numpy as np
from gpaw.core import PWDesc

a = 4.0  # side-lenght of unit cell (bohr)
cell = np.eye(3) * a
pw = PWDesc(ecut=15.0,  # plane-wave cutoff in hartree
            cell=cell)
N = 5  # number of electrons
kpt = [0.5, 0.5, 0.25]
wavefunctions = pw.empty(N, kpt=kpt, xp=cp)  # use a cupy array

Electron density

The electron density is represented by a gpaw.core.UGArray instance:

from gpaw.core import UGDesc
gpts = 20  # number of grid-points
grid = UGDesc(cell=cell, size=(gpts, gpts, gpts))
density = grid.zeros(xp=cp)  # use a cupy array

Matrix elements

Matrix elements like the overlap matrix:

\[\int d\mathbf{r} \tilde\psi_n^*(\mathbf{r}) \tilde\psi_{n'}(\mathbf{r})\]

are calculated like this:

overlap = wavefunctions.matrix_elements(wavefunctions)

and the overlap object will be an instance of the gpaw.core.matrix.Matrix object and since the wave function expansion coefficients live on the GPU, the matrix object will also use its CuPy backend for storage.

PAW projections

Wavefunction-projector overlaps:

\[\langle\tilde p_i^a | \tilde \psi_n \rangle = \sum_{\textbf{G}} ...\]

Work done by gpaw.core.atom_centered_functions.AtomCenteredFunctions object:

projectors = pw.atom_centered_functions(
    splines,
    positions,
    ...,
    xp=cp)

Profiling Python code

from ase.build import mx2
from gpaw import PW, GPAW
import cProfile
atoms = mx2('MoS2',
            kind='2H',
            a=3.184,
            thickness=3.13,
            size=(1, 1, 1))
atoms.center(vacuum=3.0, axis=2)
atoms *= (4, 4, 1)
atoms.calc = GPAW(mode=PW(400),
                  parallel={'gpu': not True})
with cProfile.Profile() as p:
        atoms.get_potential_energy()
p.dump_stats('cpu.prof')

CPU-profile

Mon Aug 28 09:35:48 2023    cpu.prof

         8853403 function calls (8842082 primitive calls) in 173.769 seconds

   Ordered by: internal time
   List reduced from 1679 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    59567   52.018    0.001   52.018    0.001 {built-in method _gpaw.FFTWExecute}
     3583   36.444    0.010   36.444    0.010 {built-in method _gpaw.mmm}
     3085   20.712    0.007   20.712    0.007 {built-in method _gpaw.pwlfc_expand}
      254    9.502    0.037    9.502    0.037 {method 'lcao_to_grid' of 'LocalizedFunctionsCollection' objects}
   124826    6.181    0.000    6.181    0.000 {method 'calculate' of 'XCFunctional' objects}
      210    4.374    0.021    4.374    0.021 {built-in method _gpaw.r2k}
458190/448757    4.334    0.000    7.540    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
      104    4.172    0.040   63.092    0.607 /home/jensj/gpaw/gpaw/new/pw/hamiltonian.py:20(apply_local_potential)
    26670    3.812    0.000   28.633    0.001 /home/jensj/gpaw/gpaw/fftw.py:152(fft_sphere)
    32844    3.406    0.000    3.421    0.000 /home/jensj/gpaw/gpaw/core/uniform_grid.py:365(scatter_from)
    28356    2.902    0.000    2.902    0.000 {built-in method numpy.core._multiarray_umath.c_einsum}
       79    2.100    0.027    2.103    0.027 /home/jensj/.local/lib/python3.10/site-packages/scipy/linalg/_decomp.py:270(eigh)
     2496    2.054    0.001   12.665    0.005 /home/jensj/gpaw/gpaw/xc/lda.py:14(__call__)
    32793    2.028    0.000    2.028    0.000 {built-in method _gpaw.pw_insert}
      300    1.816    0.006    1.816    0.006 {built-in method _gpaw.symmetrize_ft}
       53    1.385    0.026    1.385    0.026 {built-in method _gpaw.rk}
        1    1.283    1.283    1.283    1.283 {method 'calculate_potential_matrix' of 'LocalizedFunctionsCollection' objects}
    26670    1.196    0.000    1.249    0.000 /home/jensj/gpaw/gpaw/core/plane_waves.py:161(cut)
        6    0.811    0.135    0.811    0.135 {built-in method _gpaw.FFTWPlan}
     6350    0.772    0.000    0.772    0.000 {built-in method _gpaw.add_to_density}
        4    0.694    0.174    1.338    0.335 /home/jensj/gpaw/gpaw/core/pwacf.py:174(set_positions)
    32793    0.636    0.000   34.539    0.001 /home/jensj/gpaw/gpaw/fftw.py:137(ifft_sphere)
   124867    0.582    0.000    0.945    0.000 /home/jensj/gpaw/gpaw/atom/radialgd.py:95(integrate)
    24478    0.458    0.000    0.469    0.000 /home/jensj/gpaw/gpaw/utilities/blas.py:229(axpy)
    13208    0.450    0.000    0.450    0.000 {built-in method _gpaw.pw_precond}
    26416    0.440    0.000    0.443    0.000 /home/jensj/gpaw/gpaw/core/plane_waves.py:406(gather_all)
   134488    0.436    0.000    0.436    0.000 {method 'conj' of 'numpy.ndarray' objects}
    73558    0.401    0.000    0.401    0.000 {method 'reduce' of 'numpy.ufunc' objects}
       80    0.325    0.004   25.289    0.316 /home/jensj/gpaw/gpaw/core/pwacf.py:272(add)
   382032    0.321    0.000    4.463    0.000 <__array_function__ internals>:177(dot)

GPU-profile

Mon Aug 28 09:35:48 2023    gpu.prof

         26393027 function calls (26381882 primitive calls) in 70.977 seconds

   Ordered by: internal time
   List reduced from 1895 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      254    9.600    0.038    9.600    0.038 {method 'lcao_to_grid' of 'LocalizedFunctionsCollection' objects}
   124826    6.466    0.000    6.466    0.000 {method 'calculate' of 'XCFunctional' objects}
    32793    4.496    0.000    9.178    0.000 /home/jensj/gpaw/gpaw/fftw.py:234(ifft_sphere)
500697/491472    4.335    0.000    4.622    0.000 {built-in method numpy.core._multiarray_umath.implement_array_function}
     3085    4.204    0.001    4.204    0.001 {built-in method _gpaw.pwlfc_expand_gpu}
    42701    2.414    0.000    2.533    0.000 {built-in method cupy._core._routines_linalg.matmul}
     2496    1.998    0.001   12.367    0.005 /home/jensj/gpaw/gpaw/xc/lda.py:14(__call__)
      300    1.816    0.006    1.816    0.006 {built-in method _gpaw.symmetrize_ft}
    26416    1.810    0.000    3.817    0.000 /home/jensj/gpaw/gpaw/fftw.py:271(fft_sphere)
      108    1.558    0.014    1.567    0.015 /home/jensj/gpaw/gpaw/core/matrix.py:513(tril2full)
   136325    1.445    0.000    1.446    0.000 {method 'conj' of 'cupy._core.core._ndarray_base' objects}
      264    1.284    0.005    1.285    0.005 {method 'get' of 'cupy._core.core._ndarray_base' objects}
        1    1.279    1.279    1.279    1.279 {method 'calculate_potential_matrix' of 'LocalizedFunctionsCollection' objects}
      104    1.152    0.011   15.861    0.153 /home/jensj/gpaw/gpaw/new/pw/hamiltonian.py:20(apply_local_potential)
    28353    1.014    0.000   10.784    0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/linalg/_einsum.py:447(einsum)
        4    0.943    0.236    1.613    0.403 /home/jensj/gpaw/gpaw/core/pwacf.py:174(set_positions)
   106978    0.768    0.000    0.768    0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/_creation/basic.py:7(empty)
    42701    0.718    0.000    2.000    0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/_core/_gufuncs.py:422(_get_args_transposed)
   110775    0.670    0.000    0.670    0.000 {built-in method cupy._core.core.array}
    59313    0.616    0.000    0.616    0.000 {method 'fft' of 'cupy.cuda.cufft.PlanNd' objects}
    42701    0.611    0.000    6.444    0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/_core/_gufuncs.py:539(__call__)
   124867    0.559    0.000    0.908    0.000 /home/jensj/gpaw/gpaw/atom/radialgd.py:95(integrate)
       25    0.534    0.021    2.694    0.108 /home/jensj/gpaw/gpaw/core/plane_waves.py:558(abs_square)
    74010    0.487    0.000    0.487    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    44005    0.450    0.000    8.260    0.000 /home/jensj/.local/lib/python3.10/site-packages/cupy/linalg/_einsum.py:359(reduced_binary_einsum)
   244466    0.432    0.000    0.432    0.000 {method 'transpose' of 'cupy._core.core._ndarray_base' objects}
    26416    0.414    0.000    0.419    0.000 /home/jensj/gpaw/gpaw/core/plane_waves.py:450(scatter_from_all)
    26670    0.402    0.000    0.484    0.000 /home/jensj/gpaw/gpaw/core/plane_waves.py:432(scatter_from)
   247711    0.381    0.000    0.540    0.000 {built-in method builtins.next}
   201917    0.365    0.000    0.365    0.000 {method 'reshape' of 'cupy._core.core._ndarray_base' objects}

Problems and questions

  • Missing cupy.cublas.syr2k (cupy.cublas.syrk is there)

  • Missing generalized eigenvalue solver (cupy.linalg.eigh(a, b))

  • Standard eigenvalue solver (cupy.linalg.eigh(a)) very slow on HIP

  • Band-parallelization:

    • synchronize before MPI-send: where, when?

    • Scalapack replacement: ELPA?

  • More bottlenecks:

    • LCAO initialization not on GPU (use random wave functions?)

    • Libxc?

  • Your questions?