## **Kinetic Monte Carlo Simulations** at Spatiotemporal Scales of Experiments

#### Karl-Heinz Heinig

#### Helmholtz-Zentrum Dresden-Rossendorf, Germany

#### **Outline**

1. Motivation
2. Bit-coded, cellular automaton-based kMC code
3. Applications, need for further acceleration of simulations
4. MPP with GPUs of conservative (Kawasaki) Ising models
5. Tests and Applications
6. Outlook

Supported by DFG, BMBF, DAAD

Coworkers:

Bartosz Liedke, Satoshi Numazawa, Jeffrey Kelling, Lars Röntzsch, Torsten Müller, Andreas Kranz, Matthias Strobel, Geza Odor, Henrik Schulz



HELMHOLTZ | ZENTRUM DRESDEN | ROSSENDORF

BEDMOD Workshop, March 26 - 29, 2012, Dresden

Member of the Helmholtz Association

Karl-Heinz Heinig | Institute of Ion-Beam Physics and Materials Research | http://www.hzdr.de



Random walk of a self-interstitial in Si



MD, SW potential, 1200 °C, ~10,000 atoms



Stephen Wolfram: kinetic lattice MC = probabilistic (stochastic) cellular automaton





wire consisting of "phase 1" embedded in "phase 0" (e.g. vacuum)







double book keeping (no NN-list needed)
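A minimal sketch of how a bit-coded lattice with double bookkeeping might be organised, assuming that "double bookkeeping" means the occupancy is stored twice: once as one bit per lattice site and once as a list of occupied sites. The class `BitLattice`, its method names, and the simple cubic indexing are illustrative assumptions, not the data layout of the actual code.

```python
class BitLattice:
    """Periodic n x n x n lattice, one bit per site (1 = phase 1, 0 = phase 0)."""

    STEPS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

    def __init__(self, size):
        self.size = size
        self.bits = bytearray((size ** 3 + 7) // 8)  # bookkeeping 1: bit per site
        self.atoms = []                              # bookkeeping 2: occupied sites
        self.slot = {}                               # site index -> position in atoms

    def index(self, x, y, z):
        n = self.size
        return ((x % n) * n + (y % n)) * n + (z % n)

    def occupied(self, x, y, z):
        i = self.index(x, y, z)
        return (self.bits[i >> 3] >> (i & 7)) & 1

    def add(self, x, y, z):
        i = self.index(x, y, z)
        if self.occupied(x, y, z):
            return
        self.bits[i >> 3] |= 1 << (i & 7)
        self.slot[i] = len(self.atoms)
        self.atoms.append(i)

    def move(self, src, dst):
        """Move an atom from site index src to the empty site index dst."""
        self.bits[src >> 3] &= ~(1 << (src & 7)) & 0xFF
        self.bits[dst >> 3] |= 1 << (dst & 7)
        k = self.slot.pop(src)
        self.atoms[k] = dst
        self.slot[dst] = k

    def count_neighbours(self, x, y, z):
        """Occupied nearest neighbours, found by index arithmetic (no NN list)."""
        return sum(self.occupied(x + dx, y + dy, z + dz) for dx, dy, dz in self.STEPS)


# Usage: a one-site-thick "wire" of phase 1 along z, embedded in phase 0.
lattice = BitLattice(8)
for z in range(8):
    lattice.add(4, 4, z)
```

The atom list allows a random atom to be picked in constant time, while the bit array answers occupancy queries for any neighbouring site directly from its coordinates.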







Jump-pair i $\rightarrow$ f has 18 NNs, i.e. there are 2<sup>18</sup> = 262,144 configurations.

Totalistic CA like Ising model or RGL potential: only ~ 100 different jump configurations.

Look-up table, i.e. energies calculated once.
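The neighbour count and the look-up table can be sketched as follows, assuming an fcc lattice (consistent with the 18 neighbours quoted, but not stated on the slide) and a Metropolis acceptance rule for a totalistic Ising-type energy. The function names and the choice of Metropolis probabilities are assumptions made for illustration.

```python
import itertools
import math

# The 12 nearest-neighbour vectors of an fcc lattice, in units of a/2.
FCC_NN = [v for v in itertools.product((-1, 0, 1), repeat=3)
          if sum(abs(c) for c in v) == 2]


def jump_environment(jump):
    """Sites adjacent to the initial site (origin) or the final site, excluding both."""
    i_site = (0, 0, 0)
    f_site = tuple(jump)
    near_i = set(FCC_NN)
    near_f = {tuple(a + b for a, b in zip(f_site, v)) for v in FCC_NN}
    return (near_i | near_f) - {i_site, f_site}


def build_table(eb_over_kt, max_n=11):
    """Acceptance probability for every pair (n_i, n_f) of occupied-neighbour counts.

    n_i: occupied neighbours of the initial site (final site excluded),
    n_f: occupied neighbours of the final site (initial site excluded).
    """
    table = {}
    for n_i in range(max_n + 1):
        for n_f in range(max_n + 1):
            delta_e = eb_over_kt * (n_i - n_f)  # bonds broken minus bonds formed
            table[(n_i, n_f)] = min(1.0, math.exp(-delta_e))
    return table


environment = jump_environment((1, 1, 0))
table = build_table(eb_over_kt=2.0)
print(len(environment), 2 ** len(environment), len(table))
```

In this sketch a totalistic rule needs only the two neighbour counts, so the table has 12 x 12 = 144 entries instead of one entry per bit pattern of the 18 neighbours, which is the same order of magnitude as the ~100 configurations quoted above.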



non-volatile nanodot memory (FLASH)



Si dot formation by Si<sup>+</sup> ion implantation into SiO<sub>2</sub> followed by phase separation















#### Control of NC size by annealing conditions

Simulation snapshots for $E_{b}/kT = 2.0$, $3.0$ and $4.0$ after 10, 100, 1000 and 10000 kMCS (increasing annealing time).





## <u>Application #1</u> (European FP6 project): Energy-Filtered TEM (EFTEM) vs kMC

1 keV Si<sup>+</sup> into 8 nm SiO<sub>2</sub>





Good agreement? No! Experiment and theory differ by a factor of 5!

Theory is predictive; the experiment is "wrong" (humidity)!





kMC-based process simulations were able to identify a parasitic oxidation of ion-implanted silicon

APL 85 (2004) 2373



## <u>Application #2</u> (DFG priority programme): Synthesis of functional nanowires



<u>FIB</u>: 10<sup>17</sup> Co<sup>2+</sup> cm<sup>-2</sup> at 70 keV and 415 °C into (001) Si along <110>



Implantation of Co lines of 50 nm width

Expected CoSi<sub>2</sub> nanowire after annealing: ~10 nm





Annealing in N<sub>2</sub> for 60 min at 600 °C and for 30 min at 1000 °C



## <u>Application #2</u> (DFG priority programme): Synthesis of functional nanowires

Many interesting applications of nanocapillarity (tube = inverse wire)







simulation parameters:

- Gaussian beam profile, width = 50 nm
- $E_{Co} = 60$ keV
- $F_{Co}(r=0) = 8 \times 10^{16} \text{ ions/cm}^2$
- Cell = $(512 \times 512 \times 512)\, a^3$
- Number of atoms = 4,743,197

Almost 2 years of CPU time!

The decay takes 100x longer!






# **Beyond standard lattice kMC?**

- a) Massively Parallel Programming
- b) Exploitation of the Cellular Automaton Concept





#### 3 atomic layers along a <110> direction





Location of point defects is similar to MD results (even relaxation)

- atoms of the (110) lattice plane having self-interstitials
- (110) lattice plane above this defective plane
- (110) lattice plane below this defective plane
- hexagonal self-interstitial between two (110) lattice planes



Amorphization of Si and Solid Phase Epitaxial (SPE) regrowth







Nanocrystal nucleation and growth from dissolved silicon





t=1000

t=10000



# **Massively Parallel Programming (MPP) of kMC**

- a) with CPU cores (multi-core CPUs, Linux clusters, ...)
- b) with graphics cards (NVIDIA's Tesla and Fermi cards, ATI, ...)



# Massively Parallel Programming (MPP) of kMC with CPU cores

#### "Dead borders" concept:

- Break the simulation cell (periodic BCs) into sub-domains
- Run kMC simulations of the sub-domains in parallel
- Avoid "talking" between sub-domains by "dead borders" (small statistical error)
- After a "short" time, shift the global origin randomly $\rightarrow$ other dead borders
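The dead-border decomposition can be sketched in one dimension as follows. This is a simplified illustration under stated assumptions: nearest-neighbour jumps of one site, a border width of one site, and hypothetical helper names (`subdomain_of`, `is_dead`, `active_sites_by_domain`). It only guarantees that a jump started from an active site ends inside the same sub-domain; how the actual code treats reads across borders is not given in the slides.

```python
import random


def subdomain_of(x, origin, domain_len, cell_len):
    """Index of the sub-domain containing site x in a periodic cell."""
    return ((x - origin) % cell_len) // domain_len


def is_dead(x, origin, domain_len, cell_len, border=1):
    """True if x lies within `border` sites of a sub-domain boundary."""
    local = ((x - origin) % cell_len) % domain_len
    return local < border or local >= domain_len - border


def active_sites_by_domain(cell_len, domain_len, origin, border=1):
    """Group all non-dead sites by sub-domain; each group can be run in parallel."""
    assert cell_len % domain_len == 0
    domains = {}
    for x in range(cell_len):
        if not is_dead(x, origin, domain_len, cell_len, border):
            key = subdomain_of(x, origin, domain_len, cell_len)
            domains.setdefault(key, []).append(x)
    return domains


# One parallel phase: pick a random global origin, then let every sub-domain
# attempt jumps only from its own active sites.
random.seed(0)
origin = random.randrange(32)
domains = active_sites_by_domain(cell_len=32, domain_len=8, origin=origin)
for key in sorted(domains):
    print(key, domains[key])
```

Because the origin is redrawn after each short phase, the sites that were frozen in one phase become active in a later one, which is what keeps the statistical error of the frozen borders small.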



# Massively Parallel Programming (MPP) of kMC with graphics cards (GPUs)

Schematic representation of the architecture of current GPUs.



Weigel, Int. J. of Mod. Phys. C (2011)





**Registers**: each multiprocessor is equipped with several thousand registers with local, zero-latency access;

**Shared memory**: the processors of a multiprocessor share on-chip, low-latency memory;

**L1 and L2 caches**: 16/48 kB L1 cache and 768 kB L2 cache;

**Global memory**: large amount (currently up to 6 GB) of memory on separate DRAM chips with access from every thread on each multiprocessor with a latency of several hundred clock cycles;

**Constant and texture memory**: read-only memories of the same speed as global memory, but cached;

**Host memory**: cannot be accessed from inside GPU functions; transfers are relatively slow.



Parallel execution of a GPU program ("kernel") in a grid of thread blocks. Threads within a block work synchronously on the same data set. Different blocks are scheduled for execution independent of each other.











#### Block:

- Threads (max. 512 for Tesla; 1024 for Fermi)
- Its threads access the same bank of shared memory concurrently
- Each thread should execute (exactly) the same instructions

#### Grid:

- The blocks of a grid (up to 65535 x 65535) are scheduled independently of each other and can only communicate via global memory accesses



### hierarchical level 1: parallelisation of threads





#### Example:

- 512 threads/block
- 32 sites/thread cell
- $\rightarrow$ 16,384 (~16,000) sites/block

e.g. thread cells #1 are treated in parallel (no overlap), then cells #2, ...
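One possible reading of this schedule, reduced to one dimension: each thread owns one thread cell, the cell is cut into numbered sub-cells, and in pass k all threads work on their sub-cell k at the same time, so concurrently updated regions never touch. The function names and the choice of two sub-cells per thread cell are assumptions made for illustration; the slide does not specify the tiling.

```python
def subcell_schedule(n_threads, cell_len, n_sub):
    """For each pass k, list the site ranges that the threads update concurrently.

    Thread t owns sites [t * cell_len, (t + 1) * cell_len). Its cell is cut
    into n_sub equal sub-cells; in pass k every thread works on sub-cell k.
    """
    assert cell_len % n_sub == 0
    sub_len = cell_len // n_sub
    schedule = []
    for k in range(n_sub):
        ranges = [range(t * cell_len + k * sub_len,
                        t * cell_len + (k + 1) * sub_len)
                  for t in range(n_threads)]
        schedule.append(ranges)
    return schedule


def min_gap(ranges):
    """Smallest number of untouched sites between neighbouring active ranges."""
    return min(b.start - a.stop for a, b in zip(ranges, ranges[1:]))


schedule = subcell_schedule(n_threads=512, cell_len=32, n_sub=2)
sites_per_block = sum(len(r) for ranges in schedule for r in ranges)
print(sites_per_block, [min_gap(ranges) for ranges in schedule])
```

With these numbers every site of the block is visited once per sweep, and the regions updated concurrently in a pass are separated by a full sub-cell of untouched sites.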



### hierarchical level 2: parallelisation of blocks





#### Example:

- 16,384 (~16,000) sites/block
- 64 blocks/grid
- $\rightarrow$ ~1 million sites

e.g. thread blocks #1 are treated in parallel (no overlap), then blocks #2, ...
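The system sizes of the two hierarchy levels follow from simple multiplication. The short calculation below uses the numbers from the two examples; the memory estimate additionally assumes one bit per lattice site, as in a bit-coded lattice, which is not stated for the GPU code.

```python
THREADS_PER_BLOCK = 512
SITES_PER_THREAD_CELL = 32
BLOCKS_PER_GRID = 64

# Level 1: all thread cells of one block.
sites_per_block = THREADS_PER_BLOCK * SITES_PER_THREAD_CELL

# Level 2: all blocks of the grid.
sites_per_grid = sites_per_block * BLOCKS_PER_GRID

# Assumption: one bit per site, so eight sites fit into one byte.
bytes_per_block = sites_per_block // 8

print(sites_per_block)   # 16384
print(sites_per_grid)    # 1048576
print(bytes_per_block)   # 2048
```

Under the one-bit-per-site assumption, the occupancy data of a block takes 2 KiB, which is small compared with the 16/48 kB L1 cache quoted earlier.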



## Application #3 (int. BMBF project): Synthesis of Si/SiO<sub>2</sub> nanocomposite (sponge) for photovoltaics



#### Fabrication:

- SiO deposition
- heating (laser)
- phase separation: $2\,\mathrm{SiO} \rightarrow \mathrm{Si} + \mathrm{SiO_2}$

#### **Process optimisation:**

- Which time/temperature is needed for laser annealing?
- How can the structure size be tuned (band gap engineering)?
- Phase separation at interfaces?
- ...



# Morphology






# Application #3 (int. BMBF project): Synthesis of Si/SiO<sub>2</sub> nanocomposite (sponge) for photovoltaics

Scaling of the structure size = coarsening during phase separation: <u>comparison CPU vs GPU</u>



## Comparison Intel Xeon E5530 @ 2.4 GHz (CPU) vs NVIDIA C2070 (GPU): ~70x speedup





- CA-based, bit-coded lattice kMC is now an established method for process development/optimization in nanotechnologies.
- Using the CA concept and a fine-grained lattice, a method in between MD and kMC can be established.
- Using Massively Parallel Programming (MPP) on GPUs, kMC can be accelerated by about 2 orders of magnitude.

