Lattice GW

As in the discussion of the Hubbard chain, we will consider the Hubbard model, but we will assume here a translationally invariant system with periodic boundary conditions. The translational invariance implies that all observables and propagators are diagonal in momentum space. Hence, all quantities can be labeled by the reciprocal lattice vector \(\vec k\) within the first Brillouin zone (BZ). This reduces the computational and storage complexity from \(\mathcal{O}(N_k^2)\) for the real space formalism introduced in Hubbard chain to \(\mathcal{O}(N_k)\). Moreover, the Dyson equation is diagonal in momentum space and can be efficiently parallelized using a distributed-memory parallelization based on MPI.

Synopsis 

We will consider a 1D chain described by the Hubbard model, see Eq. (43). The single particle part of the Hamiltonian can be diagonalized as

\[\hat{H}_0 = \sum_{\vec k,\sigma} \epsilon(\vec k) c_{\vec{k}\sigma}^{\dagger} c_{\vec{k}\sigma},\]

where we have introduced the free-electron dispersion \(\epsilon(\vec k)=-2 J \cos(\vec k_x)\). The system is excited via an electromagnetic excitation, which in translationally invariant systems is conveniently introduced using the Peierls substitution. The latter involves the vector potential \(\vec A(\vec r,t)\) as a phase factor in the hopping

\[J\rightarrow J \exp\Bigg[-\frac{i q}{\hbar} \int_{\vec R_i}^{\vec R_j} d\vec r \vec A(\vec r,t)\Bigg],\]

leading to a time-dependent single-particle dispersion

\[\epsilon(\vec k,t)=\epsilon(\vec k- q \vec A(t)/\hbar).\]

In this example, we treat the dynamics within the \(GW\) approximation. The numerical evaluation of the respective self-energy expressions is implemented in the C++ module gw_selfen_impl.cpp. In momentum space, the correlation part of the \(GW\) self-energy can be written as

(64)\[\Sigma_{\vec k}(t,t^\prime)=\frac{i}{N_k} \sum_{\vec q} G_{\vec k-\vec q}(t,t^\prime) \delta W_{\vec q}(t,t^\prime),\]

where we have introduced the Fourier transform of the propagator \(X_{\vec q}(t,t^\prime)=(1/N_k)\sum_{i} \exp(i (\vec r_i-\vec r_j) \vec q) X_{ij}(t,t^\prime)\), see also the previous Hubbard chain discussion for the definition of the propagators. In line with the previous notation, we have introduced the dynamical part of the effective interaction \(\delta W_{\vec q}\) via \(W_{\vec q}(t,t^\prime)=U\delta_\mathcal{C}(t,t^\prime)+\delta W_{\vec q}(t,t^\prime)\). The retarded part of the interaction is obtained as a solution of the Dyson-like equation

(65)\[W_{\vec k}(t,t')=U\delta_\mathcal{C}(t,t^\prime)+ U [ \Pi_{\vec k} \ast W_{\vec k} ](t,t^\prime),\]

the Fourier transform of the polarization is given by

(66)\[\Pi_{\vec k}(t,t')= \frac{-i}{N_k} \sum_{\vec q} G_{\vec k+\vec q}(t,t') G_{\vec q}(t',t).\]

MPI parallelization 

The main advantage of the translational invariance is that the propagators and corresponding Dyson equations are diagonal in momentum space. This leads to a significant speed-up of calculations since the most complex operation, the solutions of the VIE, can be performed in parallel. Moreover, the main bottleneck to reach longer propagation times is the memory restriction imposed by the hardware. The usage of the MPI parallelization scheme over momentum points alleviates this problem because memory can be distributed among different nodes.

The Dyson equation is diagonal in the momentum space and it is convenient to introduce a class called kpoint that includes all propagators and corresponding methods for a given momentum point \(\vec q\). In terms of data, the kpoint class includes single and two particle propagators for a given momentum point \(\vec q\), namely \(G_{\vec q}\) and \(W_{\vec q}\), as well as the self-energy \(\Sigma_{\vec q}\) and the polarization \(\Pi_{\vec q}\). The routines include wrappers around VIE2 solvers to solve the single-particle and two-particle Dyson equations.

While the Dyson equation is diagonal in momentum space, the evaluation of the self-energy diagrams mixes information between different momentum points and requires communication between different ranks. The communication between different ranks is performed using the cntr::distributed_timestep_array as described in the MPI section of the manual. Users should be careful that the simple example code presented there includes a global object of type cntr::herm_matrix_timestep_view<T>, which only allocates memory if a given momentum is owned by that rank. In the following implementation we will rather work with a local copy of this object on every rank. The latter produces only a minimal overhead in the memory usage and simplifies the implementation. The strategy to compute the \(GW\) self-energy \(\mathcal{T}[\Sigma_{\vec{k}}]_n\) at time step \(n\) thus consists of two steps:

At time \(t_n\), communicate the latest time slice of the Green’s functions \(\mathcal{T}[G_{\vec k}]_n\) and retarded interactions \(\mathcal{T}[W_{\vec k}]_n\) for all momentum points among all MPI ranks.
Evaluate the self-energy diagram \(\mathcal{T}[\Sigma_{\vec k_{\text{rank}}}]_n\) in Eq. (64) for a subset of momentum points \(\vec k_{\text{rank}}\) present on a given rank using the routine cntr::Bubble2.

Step 1 is implemented as

void gather_gk_timestep(int tstp,int Nk_rank,DIST_TIMESTEP &gk_all_timesteps,std::vector<kpoint> &corrK_rank,std::vector<int> &kindex_rank){
    gk_all_timesteps.reset_tstp(tstp);
    for(int k=0;k<Nk_rank;k++){
        gk_all_timesteps.G()[kindex_rank[k]].get_data(corrK_rank[k].G_);
    }
    gk_all_timesteps.mpi_bcast_all();
}

and an analogous routine is used for the bosonic counterpart. We gather the information from all ranks into an object of type cntr::distributed_timestep_array named gk_all_timesteps. We have introduced a map kindex_rank which maps an index of momenta from a subset on a given rank to the full BZ, like \(\mathbf{k}_{\text{rank}}\rightarrow \mathbf{k}\). The variable corrK_rank is just a collection of kpoints on a given rank. mpi_bcast_all() is a wrapper around the MPI routine Allgather adjusted to the type cntr::distributed_timestep_array.

Step 2 is implemented as

void sigma_GW(int tstp,int kk,GREEN &S,DIST_TIMESTEP &gk_all_timesteps,DIST_TIMESTEP &wk_all_timesteps,lattice_1d_1b &lattice,int Ntau,int Norb){
    assert(tstp==gk_all_timesteps.tstp());
    assert(tstp==wk_all_timesteps.tstp());

    GREEN_TSTP stmp(tstp,Ntau,Norb,FERMION);
    S.set_timestep_zero(tstp);

    for(int q=0;q<lattice.nk_;q++){
        double wk=lattice.kweight_[q];
        int kq=lattice.add_kpoints(kk,1,q,-1);
        stmp.clear();
        for(int i1=0;i1<Norb;i1++){
            for(int i2=0;i2<Norb;i2++){
                cntr::Bubble2(tstp,stmp,i1,i2,gk_all_timesteps.G()[kq],gk_all_timesteps.G()[kq],i1,i2,wk_all_timesteps.G()[q],wk_all_timesteps.G()[q],i1,i2);
            }
        }
    S.incr_timestep(tstp,stmp,wk);
    }
}

As each rank includes only a subset of momentum points \(\mathbf{k}_{\text{rank}}\) we only evaluate the self-energy diagrams \(\Sigma_{\mathbf{k}_{\text{rank}}}\) for this subset of momentum points. All ranks carry information about the latest timestep for all momentum points and the internal sum over momentum \(\vec q\) in Eq. (64) is performed on each rank.

Generic structure of the example program 

As the generic structure is similar to the above examples, we will focus on the peculiarities connected to the usage of MPI. First, we need to initialize the MPI session

MPI::Init(argc,argv);
ntasks=MPI::COMM_WORLD.Get_size();
tid=MPI::COMM_WORLD.Get_rank();
tid_root=0;

and the cntr::distributed_timestep_array for the electronic and bosonic propagators

DIST_TIMESTEP gk_all_timesteps(Nk,Nt,Ntau,Norb,FERMION,true);
DIST_TIMESTEP wk_all_timesteps(Nk,Nt,Ntau,Norb,BOSON,true);

To simplify the mapping between a subset of points on a given rank and the full BZ we introduce a map kindex_rank which at position \(j=0,\ldots,N_{\text{rank}}-1\) includes an index of the corresponding momenta in the full BZ.

The program consists of three main parts, namely solving the Matsubara Dyson equation, bootstrapping (tstp \(\leq\) SolverOrder) and time propagation for tstp \(>\) SolverOrder. The selfconsistency iterations include the communication of all fermionic and bosonic propagators between different ranks using the routine gather_gk_timestep (introduced above) and the determination of the local propagators using a routine get_loc

for(int iter=0;iter<=MatsMaxIter;iter++){
    // update propagators via MPI
    diag::gather_gk_timestep(tstp,Nk_rank,gk_all_timesteps,corrK_rank,kindex_rank);
    diag::gather_wk_timestep(tstp,Nk_rank,wk_all_timesteps,corrK_rank,kindex_rank);

    diag::set_density_k(tstp,Norb,Nk,gk_all_timesteps,lattice,density_k,kindex_rank,rho_loc);
    diag::get_loc(tstp,Ntau,Norb,Nk,lattice,Gloc,gk_all_timesteps);
    diag::get_loc(tstp,Ntau,Norb,Nk,lattice,Wloc,wk_all_timesteps);
}

As on each MPI rank, the momentum-dependent single-particle density matrix \(\rho(\vec k)\) is known for the whole BZ, the evaluation of the HF contribution is done as in Hubbard chain. The self-energies \(\Sigma_{k_{\text{rank}}}\) for the momentum points \(k_{\text{rank}}=0,\ldots,N_{\text{rank}}-1\) on each rank are obtained by the routine sigma_GW (introduced above).

// update mean field and self-energy
for(int k=0;k<Nk_rank;k++){
    diag::sigma_Hartree(tstp,Norb,corrK_rank[k].SHartree_,lattice,density_k,vertex);
    diag::sigma_Fock(tstp,Norb,kindex_rank[k],corrK_rank[k].SFock_,lattice,density_k,vertex);
    diag::sigma_GW(tstp,kindex_rank[k],corrK_rank[k].Sigma_,gk_all_timesteps,wk_all_timesteps,lattice,Ntau,Norb);
    }

Similarly, the solution of the Dyson equation for the fermionic (bosonic) propagators for each momentum point is obtained by step_dyson_with_error (step_W_with_error) which is just a wrapper around the Dyson solver which returns an error corresponding to the difference between the propagators at the previous and current iterations. The momentum-dependent error for the fermionic propagators is stored in err_ele and at the end we use MPI_Allreduce to communicate among the ranks.

// solve Dyson equation
    double err_ele=0.0,err_bos=0.0;
    for(int k=0;k<Nk_rank;k++){
      err_ele += corrK_rank[k].step_dyson_with_error(tstp,iter,SolverOrder,lattice);
      diag::get_Polarization_Bubble(tstp,Norb,Ntau,kindex_rank[k],corrK_rank[k].P_,gk_all_timesteps,lattice);
      err_bos += corrK_rank[k].step_W_with_error(tstp,iter,tstp,SolverOrder,lattice);
    }
    MPI::COMM_WORLD.Allreduce(MPI::IN_PLACE,&err_ele,1,MPI::DOUBLE_PRECISION,MPI::SUM);
    MPI::COMM_WORLD.Allreduce(MPI::IN_PLACE,&err_bos,1,MPI::DOUBLE_PRECISION,MPI::SUM);

The bootstrapping and the real-time propagation have a similar structure as the Matsubara solver. The main difference lies in the predictor-corrector scheme as explained in Hubbard chain. At the beginning of each time step, we extrapolate the momentum-dependent Green’s function \(G_{k}\) and the retarded interactions \(W_{k}\), which works as a predictor

// Predictor: extrapolation
diag::extrapolate_timestep_G(tstp-1,Nk_rank,SolverOrder,Nt,corrK_rank);
diag::extrapolate_timestep_W(tstp-1,Nk_rank,SolverOrder,Nt,corrK_rank);

Then we make several iterations at a given time step until convergence, which acts as a corrector.

After the non-equilibrium Green’s functions are obtained, we evaluate the kinetic energy \(E_{\text{kin}}(t)=\frac{1}{2 N_k}\sum_{\vec k}\text{Tr}[ \rho_{\vec k}(t) \epsilon_{\vec k}(t)]\) using the routine diag::KineticEnergy. The interaction energy is obtained from the Galitskii-Migdal formula

\[E_{\text{int}}(t)=\frac{1}{2 N_k} \sum_{\vec k} \left ( \mathrm{Tr}\left[\rho_{\vec k}(t) h^{\mathrm{MF}}_{\vec k}(t)\right] + \mathrm{Im}\mathrm{Tr} \left[\Sigma_{\vec k}\ast G_{\vec k}\right]^<(t,t) \right),\]

using the routine diag::CorrelationEnergy. The two operations include the MPI reduction as the momentum sum is performed over the whole BZ.

Running the example program 

There is one program for the \(GW\) calculation, called gw.x. The driver script demo_gw.py located in the utils/ directory provides a simple interface to this program. Similar to the previous cases, the script creates an input file and launches the program. The user can specify the shape of the electric field pulse, but by default, we use a single-frequency pulse with a Gaussian envelope

(67)\[E(t)=E_0 \exp(-4.6(t-t_0)^2/t_0^2) \sin(\omega (t-t_0)),\]

where \(t_0=2\pi/\omega N_{p}\) is determined by the number of cycles \(N_p\). After the simulation, the time evolution of the kinetic energy and potential energy are plotted.

The relevant programs for the lattice \(GW\) example are

`programs/gw.cpp`	Main program
`utils/demo_gw.py`	Running script
`programs/gw_latt_impl.cpp`	Implementation of lattice and single-particle properties
`programs/gw_selfene_impl.cpp`	Evaluation of self-energy contributions including MPI communication
`programs/gw_latt_impl.cpp`	Solution of the Dyson equation for each momentum point

Corresponding declarations of the routines are in the .hpp files with the same name as the .cpp files.

Discussion 

Here, we will briefly illustrate the dynamics after a photo-excitation. We excite the system with a short oscillating electric field pulse, see Eq. (67). The amplitude of the excitation \(E_0\) determines the absorbed energy. In Fig. 14, we present the time evolution of the kinetic energy for the two excitation strengths \(E_0=3\) and \(E_0=5\). As the energy is increased (during the pulse) and the system heats up, the kinetic energy increases. The observed behavior is consistent with thermalization at a higher temperature, but the transient evolution is complicated by the energy exchange between the electronic and bosonic subsystems (plasmon emission). For the strongest excitations, there is a clear relaxation dynamics to the final state, see inset of Time evolution of the kinetic energy for the two excitation strengths E_0=3.0, 5.0, respectively. The dashed lines show the shape of the electric field pulse scaled down by 100 to fit on the scale. The inset presents a zoom into the relaxation dynamics by subtracting the long-time limit E_\text{kin}(t)-E_\text{kin}(t_\text{fin}). Both simulations have been performed with N_k=256, time step h=0.01 and for inverse temperature \beta =20., accompanied with strongly damped oscillations.

../../_images/ekinPulse.svg — Fig. 14 Time evolution of the kinetic energy for the two excitation strengths \(E_0=3.0\), \(5.0\), respectively. The dashed lines show the shape of the electric field pulse scaled down by 100 to fit on the scale. The inset presents a zoom into the relaxation dynamics by subtracting the long-time limit \(E_\text{kin}(t)-E_\text{kin}(t_\text{fin})\). Both simulations have been performed with \(N_k=256\), time step \(h=0.01\) and for inverse temperature \(\beta =20\).

As discussed in the introduction, the usage of the MPI parallelization scheme over momentum points reduces the memory restrictions and also speeds up the execution of the program. To demonstrate this, we have performed a scaling analysis for a system with fixed number of momentum points \(N_k=128\), and parallelized up to 128 processors, see Fig. 15. Moreover, for all tests we have fixed the number of tasks per node to one, since in a real-world scenario we want to maximally distribute the memory. We can see that the scaling is almost perfect up to 128 processors, with only a slight deviation from optimal scaling. The main reason for the deviation is the communication overhead, since a substantial amount of data, namely timesteps of propagators for all momentum points, has to be communicated among all nodes.

../../_images/scaling.svg — Fig. 15 MPI scaling analysis. Speed-up of the total calculation time as a function of the MPI processes for systems with \(N_k=128\) momentum points, where we fixed one task per node. The maximum number of time steps used is \(N_t=2500\). These calculations have been performed on the Popeye cluster at the Flatiron Institute.

Lattice GW

Synopsis

MPI parallelization

Generic structure of the example program

Running the example program

Discussion

Synopsis 

MPI parallelization 

Generic structure of the example program 

Running the example program 

Discussion 