Colin DP
Table of Contents
- 1. DONE Review/get up to speed on changes [4/4]
- 2. DONE Merge in MPI code [10/10]
- 3. DONE Add GPU (and CPU) profiling [8/8]
- 4. DONE Profile analysis system [9/9]
- 5. DONE Profile discovered optimizations/fixes [5/5]
- 6. DONE Free up more concurrency [4/4]
- 7. DONE Change ordering of array in GPU version [12/12]
- 8. DONE Resync/compare with reference code/LAMMPS [8/8]
- 9. DONE Selection of GPU to use [2/2]
- 10. DONE Implement new two pass interpolation [4/4]
- 11. TODO Resync with reference CPU code [2/4]
- 12. DONE Discovered code fixes [7/7]
- 13. TODO Code cleanups [15/31]
- 14. TODO Documentation updates [0/4]
- 15. TODO Investigate potential GPU options before switch to HIP [5/12]
- 16. Items for future
Scratch space for tracking progress and items on Colin's DP.
1. DONE Review/get up to speed on changes [4/4]
- [X] Added variety of items to cleanups todo
- [X] Reviewed the new boundary items and gridgeo structure (excluding the pit setup code)
  - type
    - 0 - bounce backs (bulk fluid nodes)
    - 1 - bounce backs (pit geometry boundary fluid nodes)
    - 2 - not in fluid
  - orientation (number of bounce backs)
    - 0 - no bounce back (couldn't this be used for bulk fluid?)
    - 3 - not sure – all the diagonals on a face minus 1?
    - 4 - not sure – two faces (inside edge) minus all furthest diagonals?
    - 5 - against a face
    - 6 - not sure – seems to be a face and one of the opposite diagonals?
    - 8 - against two faces (inside edge)
    - 10 - against three faces (inside corner)
- [X] Make tyson branch master
  - Master branch (frances/final-fixes tag/frances branch + one commit from Colin) is a subset of the tyson branch.
- [X] Figure out where MPI branch is at (sync with tyson)
  - MPI branch only uncomments the transfer and exchange calls.
2. DONE Merge in MPI code [10/10]
- [X] Verify that the MPI_ORDER_C/MPI_ORDER_FORTRAN stuff is okay (per comment there may be an issue): pass
- [X] Check if we wound up with both GPU and CPU border exchange active: pass
- [X] Fix platform initialization on later CUDAs (CPU platform won't iterate)
- [X] Check that boundary exchange works with at least two processors along each dimension: fail [3/3]
  - Need to set comm_modify cutoff to 2.5*dx to have the required particles
  - [X] Periodic MPI deadlocks in first steps
  - [X] Crash at end-of-simulation
  - [X] Split along z is okay, split along y has agreement issues, and split along x loses atoms
- [X] Fixed bug causing periodic MPI boundary exchange lockups
  - fixviscouslb wasn't initialized if unused, so the fluid force could occur at different times
- [X] Fix end-of-simulation crash
  - comm was being destructed before fixes (modify)
  - Upstream pull request to fix – from discussion should switch to the post_run hook
- [X] Double checked no other local state variables are unset: pass
- [X] Switch boundary exchange to be GPU if only one processor and MPI otherwise
- [X] Look into potential issues with fluid distribution exchange being in flight: pass
  - Constructor sets a bogus fluid distribution exchange in flight
  - setup (called at the start of a run) finishes the one in flight, throws it away, computes the proper one for the run, and starts it in flight
  - Step routines expect a fluid distribution exchange in flight at start and put one in flight at end
  - Destructor finishes the fluid distribution exchange in flight
- [X] Debug and fix disagreement between simulation results for MPI and non-MPI runs (two processor splits)
  - [X] Get particle dumping working to visualize in ParaView
  - [X] Particles are lost when splitting x
    - Code for pressurebcx looks like it incorrectly applies to the internal side too for boundary processes
  - [X] Fix pressurebcx applying to the internal side for boundary processes
  - [X] Forces are off when splitting x or y [9/9]
    - [X] Test with individual atoms, bodyforce, and no pressurebcx: fail
      - Issue when split along the fixed boundary side, otherwise perfect agreement
    - [X] Test with rigid sphere, bodyforce, and no pressurebcx: pass
    - [X] Test with rigid sphere, bodyforce, no pressurebcx, and pits: pass
    - [X] Test without particle interactions: fail (not caused by particle interaction code)
    - [X] Figure out how to do ParaView visualization of differences
      - Append attributes filter on multiple inputs and then a calculator for relative error
      - Minimal size test example
      - Error starts in the split and then jumps to the ends as well
    - [X] Double checked EDGE_Z{0,1} usage for acting on both sides
    - [X] Test with all edge code disabled: failed
    - [X] Test with 3 processors along z: error on both boundaries, starting on the left outer one; error on both sides of ends
    - [X] Looked into the old end extra point adjustment code
  - [X] Figure out why splitting along the fixed boundary gives different results for MPI [3/3]
    - [X] Add boundary dump code after each GPU routine call
    - [X] Create a serial vs parallel dump comparison program
    - [X] Extend boundary dump to entire field
  - [X] Fix discovered geogrid issue (requires 3 boundary points and not 2)
  - [X] Add additional boundary point to calculation
  - [X] sublattice initialization code uses wholelattice wrap of Nbz: correct as it only applies to the boundary
  - [X] Update dump routines
  - [X] Update gpu routines to use the 3 boundary offsets value and names
3. DONE Add GPU (and CPU) profiling [8/8]
- [X] Read up on profiling options
- [X] Turn profiling on in clCreateCommandQueue (see the sketch at the end of this section)
- [X] Add time tracking to kernels [18/18]
  - [X] fluid_dist_eq_initial_3_kernel
  - [X] fluid_dist_eq_next_0_kernel
  - [X] fluid_dist_eqn_next_3_kernel
  - [X] fluid_dist_initial_2_kernel
  - [X] fluid_dist_next_2_kernel
  - [X] fluid_param_initial_2_kernel
  - [X] fluid_param_next_2_kernel
  - [X] fluid_correctu_next_2_kernel
  - [X] xboundaries_fluid_dist_eq_kernel
  - [X] yboundaries_fluid_dist_eq_kernel
  - [X] zboundaries_fluid_dist_eq_kernel
  - [X] xboundaries_fluid_dist_kernel
  - [X] yboundaries_fluid_dist_kernel
  - [X] zboundaries_fluid_dist_kernel
  - [X] xboundaries_fluid_force_kernel
  - [X] yboundaries_fluid_force_kernel
  - [X] zboundaries_fluid_force_kernel
  - [X] remove_momentum_kernel
- [X] Print profiling at end
- [X] Add time tracking to memory read/writes [7/7]
  - [X] fluid_dist_3_1_interior_read
  - [X] fluid_dist_eq_3_3_interior_read
  - [X] fluid_force_2_exterior_read
  - [X] fluid_force_2_accumulate_read
  - [X] fluid_dist_3_1_exterior_write
  - [X] fluid_dist_eq_3_3_exterior_write
  - [X] fluid_force_2_accumulate_write
- [X] Debug segfault introduced in unrelated OpenCL call [1/1]
  - [X] Break apart commits and bisect
    - Accidentally removed queue assignment in clCreateCommandQueue call
- [X] Revamp profiling to fix leaks and reduce boilerplate [4/4]
  - [X] C++ template magic class to progressively push location information
    - Dreadful failure
  - [X] Abstract with profile records: location, rank, info, value
  - [X] Resolve race conditions/locking by dumping different ranks to different files
  - [X] Switch to binary format for space
    - Compact everything into a location field by combining bits
    - Reduce data required to pass between functions
    - Replace strings with enums
- [X] Add CPU profile points for start and end of fix callouts and expensive CPU operations [8/8]
  - [X] GPU timer requires OpenCL 2.1 (no NVIDIA), use CPU timer instead
  - [X] initial_integrate
  - [X] pre_force
  - [X] post_force
  - [X] final_integrate
  - [X] atom_read
  - [X] atom_write
  - [X] fluid_force_accumulate
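A minimal sketch of the OpenCL event-based kernel timing referred to above: the queue is created with CL_QUEUE_PROFILING_ENABLE and the device timestamps are read back once the kernel's event completes. The context/device/kernel/size variables are placeholders, not the project's actual names.

```cpp
#include <CL/cl.h>
#include <cstdio>

// Sketch: time one kernel launch with OpenCL event profiling.
void profile_one_launch(cl_context context, cl_device_id device,
                        cl_kernel kernel, size_t global_size) {
  cl_int err;
  // Profiling must be requested when the queue is created.
  cl_command_queue queue =
      clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

  cl_event ev;
  err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                               0, NULL, &ev);
  clWaitForEvents(1, &ev);

  cl_ulong start = 0, end = 0;  // device timestamps in nanoseconds
  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
  std::printf("kernel time: %.3f us\n", (end - start) * 1e-3);

  clReleaseEvent(ev);            // events must be released (see section 12)
  clReleaseCommandQueue(queue);
}
```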
4. DONE Profile analysis system [9/9]
- [X] Initial analysis for intra-timestep details
- [X] Python code to extract binary data format
- [X] Switch to R as python pandas apply functionality too slow
  - nearly 8 minutes in python pandas vs 2 seconds in R tidyverse
  - don't have a good way to load data: use python to save as a feather file
- [X] Initial plot of timestep
  - transfer and atom related routines seem to be taking most time
- [X] Fix issues revealed in profile code
  - [X] Distinguish between two uses of fluid_dist_read/write routines
  - [X] Fix non-unique fluid_dist_read/write profile points (eq vs non-eq)
- [X] Add group brackets to step breakout
- [X] Distinguish between executing and non-executing state in intra analysis
- [X] Add cross-timestep analysis [2/2]
  - [X] Plot of walltime for each step (group analysis)
    - Neighbour calculation is very expensive (x100 over a regular step)
  - [X] Distributions of walltime for each group of GPU calls (inter analysis)
- [X] Add reference to distribution to group analysis
- [X] Integrate addition of CPU clocks
  - [X] Redo implicit ordering calculation code (no longer simple)
    - Add substep calculation as CPU and GPU are out of order wrt substep
  - [X] Synchronize CPU and GPU clocks from data (see the sketch at the end of this section)
    - Have to correct for slight clock skew as well as offset (linear model)
  - [X] Add special cases to handle missing events in CPU-only profiles
- [X] Mark run periods of all LAMMPS fix calls in analysis
- [X] Drop python code required for initial loading of data
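A minimal sketch of the offset-plus-skew correction noted above: an ordinary least-squares fit of GPU timestamps against CPU timestamps taken for (approximately) the same events. The real analysis was done in R; this standalone C++ version just illustrates the linear model and assumes at least two distinct sample points.

```cpp
#include <vector>
#include <cstddef>

// Fit gpu ~ offset + skew * cpu over paired timestamps (ordinary least squares).
struct ClockMap { double offset, skew; };

ClockMap fit_clock_map(const std::vector<double> &cpu, const std::vector<double> &gpu) {
  const std::size_t n = cpu.size();
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  for (std::size_t i = 0; i < n; ++i) {
    sx += cpu[i]; sy += gpu[i];
    sxx += cpu[i] * cpu[i]; sxy += cpu[i] * gpu[i];
  }
  const double skew = (n * sxy - sx * sy) / (n * sxx - sx * sx);
  const double offset = (sy - skew * sx) / n;
  return {offset, skew};
}

// Map a CPU time onto the GPU clock so both profiles share one timeline.
inline double cpu_to_gpu(const ClockMap &m, double cpu_t) {
  return m.offset + m.skew * cpu_t;
}
```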
5. DONE Profile discovered optimizations/fixes [5/5]
- [X] Switch atom_sort_index to write to the recorded index position instead of resorting [10/10] (see the scatter sketch at the end of this section)
  - 2 nodes x 1 cpu & gpu/processor (p100): 1000 steps of in.polymersphere loop time 96.5 -> 84.3 (-12.6%)
  - [X] Add memory for copy out space (put in end)
  - [X] Adjust memory resize routines
  - [X] Create new GPU function atom_unsort_position2_z and remove old ones
  - [X] Add kernel variables and initialize and deinitialize
  - [X] Add local and global size variables and initialize
  - [X] Calls for initial arguments
  - [X] Update all arguments (need to include force and index too!) on atom resize
  - [X] Add profile enums and remove old ones
  - [X] Call to invoke with final arguments
  - [X] Update atom_read
- [X] Switch second atom_sort_position2_z to just gather new velocity values instead of resorting everything [10/10]
  - 2 nodes x 1 cpu & gpu/processor (p100): 1000 steps of in.polymersphere loop time 84.3 -> 73.5 (-12.8%)
  - [X] Add memory for copy out space (put in end)
  - [X] Adjust memory resize routines
  - [X] Add new GPU function atom_resort_position2_z
  - [X] Add kernel variable and initialize and de-initialize
  - [X] Add local and global size variables and initialize
  - [X] Calls for initial arguments
  - [X] Update all arguments (need to include index too!) and local and global size on atom resize
  - [X] Add profile enum
  - [X] Call to invoke with final arguments
  - [X] Add atom_rewrite
- [X] Investigate and fix duplicate running of atom related routines
  - 2 nodes x 1 cpu & gpu/processor (p100): 1000 steps of in.polymersphere loop time 73.5 -> 69.0 (-6.1%)
  - If no forces are added via fix_lb_viscous_gpu and/or fix_lb_rigid_pc_sphere_gpu, could skip a bunch of atom routines including correct_u
  - The positions don't change, so the sort mapping is fixed throughout
  - The 1/2 step velocity is the correct one to be using, not the 1 step as is being done
  - Push atom_force down to post_force (more lammps standard and would like to be able to add forces into the GPU routine)
  - Technically correct_u is a final_integrate sort of thing, so put it there

Steps overview
| Routine | Operation |
|---|---|
| fluid_dist_exchange_finish-1 | Receive fluid_dist_3_new exterior [G+S+F] (for previous step) |
| fluid_dist_eq_exchange_finish | Receive fluid_dist_eq_3_new exterior [1..G+S+F] |
| fluid_dist_eqn_next_3 | fluid_dist_eq_3_new [0..G+S+F] -> fluid_dist_eqn_3_new [0..G+S+F] |
| fluid_dist_next_2 | fluid_dist_{eq,eqn}_3_{new,old} [0..G+S+F], fluid_dist_3_old [0..G+S+F], fluid_density_2 [..F] -> fluid_dist_3_new [0..G+S] |
| restartWrite | fluid_dist_eq_3_new [0..G+S+F], fluid_dist_3_new [0..G+S] -> restart file (if requested) |
| fluid_dist_exchange_start | Send fluid_dist_3_new interior [-G-S-F] |
| fluid_param_next_2_start | Start fluid_dist_3_new [0..G+S] -> fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S] |
| fluid_param_next_2_finish | End fluid_dist_3_new [0..G+S] -> fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S] |
| atom_write | Atom [G] -> atom_position2 [G], atom_velocity [G], atom_type [G], atom_mass [G] |
| atom_position2_z_index | atom_position2 [G] -> atom_position2_z [G], atom_index [G] |
| atom_sort_position2_z | Sort by z: atom_position2 [G], atom_velocity [G], atom_mass [G], atom_type [G], atom_position2_z [G], atom_index [G] |
| atom_force | fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S], atom_position2 [G], atom_velocity [G], atom_mass [G], atom_type [G] -> atom_force [G] |
| fluid_force_next_2 | atom_position2 [G], atom_position2_z [G], atom_force [G] -> local fluid_force_2 [0..G+S] |
| fluid_force_exchange_start | Send local fluid_force_2 exterior [0..G+S] |
| fluid_force_exchange_finish | Receive remote fluid_force_2 interior [-G-S..0]; local fluid_force_2 [0], remote fluid_force_2 [0] -> fluid_force_2 [0] |
| fluid_correctu_next_2 | fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S], fluid_force_2 [0..G+S] -> fluid_velocity_2 [0..G+S] |
| fluid_dist_eq_next_0+1 | fluid_velocity_2 [0], fluid_density_2 [0..D], fluid_force_2 [0] -> fluid_dist_eq_3_new+ [0] (for next step) |
| fluid_dist_eq_exchange_start+1 | Send fluid_dist_eq_3_new+ interior [-G-S-F..-1] (for next step) |
| atom_sort_index/atom_unsort_position2_z | Sort by index: atom_force [G], atom_index [G] |
| atom_read | atom_force [G] -> hydroF [G] |

Reorganization
Pre- fix_lb_rigid_pc_sphere_gpu
Post- fix_lb_rigid_pc_sphere_gpu
Post- fix_lb_rigid_pc_sphere_gpu
New Step Lammps (with fix_lb_viscous_gpu
)(without fix_lb_viscous_gpu
)initial_integrate
coordinates 1 fluid_dist_exchange_finish-1
fluid_dist_exchange_finish-1
fluid_dist_exchange_finish-1
fluid_dist_exchange_finish-1
velocities 1/2 fluid_dist_eq_exchange_finish
fluid_dist_eq_exchange_finish
fluid_dist_eq_exchange_finish
fluid_dist_eq_exchange_finish
fluid_dist_eqn_next_3
fluid_dist_eqn_next_3
fluid_dist_eqn_next_3
fluid_dist_eqn_next_3
fluid_dist_next_2
fluid_dist_next_2
fluid_dist_next_2
fluid_dist_next_2
restartWrite
restartWrite
restartWrite
restartWrite
fluid_dist_exchange_start
fluid_dist_exchange_start
fluid_dist_exchange_start
fluid_dist_exchange_start
fluid_param_next_2_start
fluid_param_next_2_start
fluid_param_next_2_start
fluid_param_next_2_start
post_integrate
pre_exchange
pre_neighbour
post_neighbour
pre_force
fluid_param_next_2_finish
fluid_param_next_2_finish
fluid_param_next_2_finish
fluid_param_next_2_finish
atom_write
atom_write
atom_write
atom_write
atom_position2_z_index
atom_position2_z_index
atom_position2_z_index
atom_position2_z_index
atom_sort_position2_z
atom_sort_position2_z
atom_sort_position2_z
atom_sort_position2_z
atom_force
atom_force
atom_force
fluid_force_next_2
fluid_force_next_2
fluid_force_exchange_start
fluid_force_exchange_start
fluid_force_exchange_finish
fluid_force_exchange_finish
fluid_correctu_next_2
fluid_dist_eq_next_0+1
fluid_dist_eq_next_0+1
fluid_dist_eq_exchange_start+1
fluid_dist_eq_exchange_start+1
pre_reverse
post_force
additional forces atom_sort_index
atom_sort_index
atom_sort_index
atom_force
atom_read
atom_read
atom_read
fluid_force_next_2
fluid_force_exchange_start
fluid_force_exchange_finish
atom_unsort_position2_z
atom_read
final_integrate
velocity 1 fluid_correctu_next_2
fluid_dist_eq_next_0+1
fluid_dist_eq_exchange_start+1
end_of_step
atom_write
atom_position2_z_index
atom_sort_position2_z
atom_force
fluid_force_next_2
fluid_force_exchange_start
fluid_force_exchange_finish
fluid_correctu_next_2
fluid_dist_eq_next_0+1
fluid_dist_eq_exchange_start+1
- [X] Queuing multiple jobs appears to sometimes slow down unrelated computation routines
  - Defective riser card on GPU asserting power brake and cutting clocks to 1/3rd
- [X] Duplicate rectangle read in profile code (code wasn't currently used)
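The scatter sketch referenced in the first item above: instead of running a second sort, an unsort kernel scatters each sorted force straight back to the original slot recorded in the index array during the forward sort. The kernel name, float type, and 3-component layout are illustrative, not the project's actual code.

```cpp
// OpenCL C kernel source, embedded as a C++ raw string for context.
static const char *atom_unsort_src = R"CLC(
__kernel void atom_unsort_by_index(__global const float *force_sorted,  // 3*n values, sorted order
                                   __global const int   *atom_index,    // original slot per sorted atom
                                   __global float       *force_out,     // 3*n values, original order
                                   const int n) {
  const int i = get_global_id(0);
  if (i >= n) return;
  const int j = atom_index[i];             // where this atom lived before the z-sort
  force_out[3 * j + 0] = force_sorted[3 * i + 0];
  force_out[3 * j + 1] = force_sorted[3 * i + 1];
  force_out[3 * j + 2] = force_sorted[3 * i + 2];
}
)CLC";
```

Because atom_index is a permutation, each output slot is written exactly once, so no synchronization between work items is needed.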
6. DONE Free up more concurrency [4/4]
- [X] Remove blocking OpenCL calls (all explicit dependencies; see the event-chaining sketch at the end of this section)
Done Routine Input Buffers Source Output Buffers INITIAL INTEGRATE X fluid_dist_sync_13_finish
(prev)fluid_dist_3
[..0] (prev)fluid_vcm_remove_0
[..0] (prev)fluid_dist_3
[1..G+S+F] (prev)X fluid_dist_eq_sync_13_finish
(prev)fluid_dist_eq_3
[..0] (prev)fluid_vcm_remove_0
[..0] (prev)fluid_dist_eq_3
[1..G+S+F] (prev)X fluid_dist_eqn_sync_13_finish
(prev)fluid_dist_eqn_3
[..0] (prev)fluid_vcm_remove_0
[..0] (prev)fluid_dist_eqn_3
[1..G+S+F] (prev)X fluid_dist_eq_sync_13_finish
(vcm)fluid_dist_eq_3
[..0] (vcm)fluid_dist_eq_next_0
[..0]fluid_dist_eq_3
[1..G+S+F] (vcm)X fluid_dist_eqn_next_3
fluid_dist_eq_3
[..G+S+F] (vcm)fluid_dist_eq_sync_13_finish
[1..G+S+F] (vcm)fluid_dist_eqn_3
[..G+S+F] (vcm)X fluid_dist_next_2
fluid_dist_3
[..G+S+F] (prev)fluid_dist_sync_13_finish
[1..G+S+F] (prev)fluid_dist_3
[..G+S] (vcm)fluid_dist_eq_3
[..G+S+F] (prev)fluid_dist_eq_sync_13_finish
[1..G+S+F] (prev)fluid_dist_eqn_3
[..G+S+F] (prev)fluid_dist_eqn_sync_13_finish
[1..G+S+F] (prev)fluid_dist_eq_3
[..G+S+F] (vcm)fluid_dist_eq_sync_13_finish
[1..G+S+F]fluid_dist_eqn_3
[..G+S+F] (vcm)fluid_dist_eqn_next_3
[..G+S+F]X fluid_param_next_2
fluid_dist_3
[..G+S] (vcm)fluid_dist_next_2
[..G+S] (vcm)fluid_density_2
[..G+S]fluid_velocity_2
[..G+S] (half) (vcm)PRE FORCE X atom_nonforce_write_1
atom
[..G]atom_position2
[..G]atom_velocity
[..G]atom_type
[..G]atom_mass
[..G]X atom_orders_compute_1
atom_position2
[..G]atom_nonforce_write_1
[..G]atom_position2_z
[..G]atom_index
[..G]X atom_nonforce_sort_1
atom_position2
[..G]atom_nonforce_write_1
[..G]atom_position2
[..G] (sorted)atom_velocity
[..G]atom_nonforce_write_1
[..G]atom_velocity
[..G] (sorted)atom_mass
[..G]atom_nonforce_write_1
[..G]atom_mass
[..G] (sorted)atom_type
[..G]atom_nonforce_write_1
[..G]atom_type
[..G] (sorted)atom_position2_z
[..G]atom_orders_compute_1
[..G]atom_position2_z
[..G] (sorted)atom_index
[..G]atom_orders_compute_1
[..G]atom_index
[..G] (sorted)POST FORCE X atom_force_write_1
atom
[..G]atom_force
[..G]X atom_force_next_1
fluid_density_2
[..G+S]fluid_param_next_2
[..G+S]atom_force
[..G] (sorted)fluid_velocity_2
[..G+S] (half) (vcm)fluid_param_next_2
[..G+S] (half) (vcm)atom_position2
[..G] (sorted)atom_inputs_sort_1
[..G] (sorted)atom_velocity
[..G] (sorted)atom_inputs_sort_1
[..G] (sorted)atom_mass
[..G] (sorted)atom_inputs_sort_1
[..G] (sorted)atom_type
[..G] (sorted)atom_inputs_sort_1
[..G] (sorted)X fluid_force_next_2
atom_position2
[..G] (sorted)atom_inputs_sort_1
[..G] (sorted)fluid_force_2
[..G+S] (local)atom_position2_z
[..G] (sorted)atom_inputs_sort_1
[..G] (sorted)atom_force
[..G] (sorted)atom_force_next_1
[..G] (sorted)X fluid_force_sync_20_start
fluid_force_2
[0..G+S] (local)fluid_force_next_2
[..G+S] (local)X fluid_force_sync_20_finish
fluid_force_2
[-G-S..0] (local)fluid_force_next_2
[..G+S] (local)fluid_force_2
[-G-S..0]X fluid_param_u_correct_0
fluid_velocity_2
[..0] (half) (vcm)fluid_param_next_2
[..G+S] (half) (vcm)fluid_velocity_2
[..0] (vcm)fluid_density_2
[..0]fluid_param_next_2
[..G+S]fluid_force_2
[..0]fluid_force_sync_20_finish
[..0]X atom_force_unsort_1
atom_force
[..G] (sorted)atom_force_next_1
[..G] (sorted)atom_force
[..G]atom_index
[..G] (sorted)atom_inputs_sort_1
[..G] (sorted)X atom_force_read_1
atom_force
[..G]atom_force_unsort_1
[..G]atom
[..G]FINAL INTEGRATE END OF STEP X vcm_total_calc_0
atom
[..G]vcm_total
fluid_density_2
[..0]fluid_param_next_2
[..G+S]fluid_velocity_2
[..0] (vcm)fluid_param_u_correct_0
[..0] (vcm)X fluid_vcm_remove_0
vcm_total
vcm_total_calc_0
fluid_velocity_2
[..0]fluid_velocity_2
[..0] (vcm)fluid_param_u_correct_0
[..0] (vcm)fluid_dist_3
[..0]fluid_dist_3
[..0] (vcm)fluid_dist_next_2
[..G+S] (vcm)fluid_dist_eq_3
[..0]fluid_dist_eq_3
[..0] (vcm)fluid_dist_eq_next_0
[..0] (vcm)fluid_dist_eqn_3
[..0]fluid_dist_eqn_3
[..0] (vcm)fluid_dist_eqn_next_3
[..G+S+F] (vcm)X restartWrite
fluid_force_2
[..0]fluid_force_sync_20_finish
[..0]fluid_dist_3
[..0]fluid_vcm_remove_0
[..0]fluid_dist_eq_3
[..0]fluid_vcm_remove_0
[..0]fluid_dist_eqn_3
[..0]fluid_vcm_remove_0
[..0]X fluid_dist_sync_13_start
fluid_dist_3
[-G-S-F..-1]fluid_vcm_remove_0
[..0]X fluid_dist_eq_sync_13_start
fluid_dist_eq_3
[-G-S-F..-1]fluid_vcm_remove_0
[..0]X fluid_dist_eqn_sync_13_start
fluid_dist_eqn_3
[-G-S-F..-1]fluid_vcm_remove_0
[..0]X fluid_dist_eq_next_0
(next)fluid_density_2
[..D]fluid_param_next_2
[..G+S]fluid_dist_eq_3
[..0] (next) (vcm)fluid_velocity_2
[..0]fluid_vcm_remove_0
[..0]fluid_force_2
[..0]fluid_force_sync_20_finish
[..0]X fluid_dist_eq_sync_13_start
(next) (vcm)fluid_dist_eq_3
[-G-S-F..-1] (next) (vcm)fluid_dist_eq_next_0
[..0] (vcm) (next)
- fluid_correctu_next_0 (now fluid_param_u_correct_0) adjusts fluid_velocity_2 [0]
  - doesn't adjust any of fluid_dist*_3 [0] (this is correct)
- fluid_momentum_remove (now fluid_vcm_remove_0) adjusts fluid_dist_3 [0] and fluid_dist_eq_3 [0]
  - should use corrected fluid_velocity_2 [0] instead of computing the uncorrected version from fluid_dist*_3 [0]
  - need to also correct fluid_dist_eqn_3 [0..G+S+F] for fluid_dist_next_2 if using the exponential integrator
    - can't compute border points [...G+S+F] (only have fluid_velocity_2 [0]) so need to transfer these
    - need to delay fluid_dist_exchange_3_start until after this
    - need to send borders on other outputs too if using the exponential integrator
- fluid_dist_eq_next_0 computes fluid_dist_eq_3 [0] (next)
  - from fluid_velocity_2 [0] (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
- fluid_dist_eqn_next_3 computes fluid_dist_eqn_3 [0..G+S+F] (next)
  - from fluid_dist_eq_3 [0] (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
  - could run earlier if fluid_vcm_remove_0 isn't being run this timestep
- fluid_dist_next_2 computes fluid_dist_3 [0..G+S] (next)
  - from fluid_dist_3 [0..G+S] (not affected by fluid_correctu_next_0 and not affected by fluid_momentum_remove)
  - from fluid_dist_{eq,eqn}_3 [0..G+S+F] (not affected by fluid_correctu_next_0 but affected by fluid_momentum_remove)
  - from fluid_dist_{eq,eqn}_3 [0..G+S+F] (next) (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
INITIAL INTEGRATE fluid_dist_exchange_13_finish
(prev)fluid_dist_3_action
(impl)fluid_dist_eq_exchange_13_finish
(prev)fluid_dist_eq_3_action
(impl)fluid_dist_eqn_exchange_13_finish
(prev)fluid_dist_eqn_3_action
(impl)fluid_dist_eq_exchange_13_vcm_finish
fluid_dist_eq_3_vcm_action
(impl)fluid_dist_eqn_next_3
fluid_dist_eqn_3_vcm_action
fluid_dist_next_2
fluid_dist_2_vcm_action
restartWrite
fluid_param_next_2
fluid_velocity_2_half_vcm_action
fluid_density_2_action
PRE FORCE atom_lammps_write_1
atom_position1_action
atom_velocity_1_action
atom_type_1_action
atom_mass_1_action
atom_orders_compute_1
atom_position2_z_1_action
atom_index_1_action
atom_inputs_sort_1
atom_position2_1_sorted_action
atom_velocity_1_sorted_action
atom_mass_1_sorted_action
atom_type_1_sorted_action
atom_position2_z_1_sorted_action
atom_index_1_sorted_action
POST FORCE atom_force_next_1
atom_force_1_sorted_action
fluid_force_next_2
fluid_force_2_local_action
fluid_force_exchange_02_start
fluid_force_exchange_20_finish
fluid_force_20_remote_action
(impl)fluid_force_combine_20
fluid_force_0_action
atom_force_unsort_1
atom_force_1_action
atom_force_read_1
FINAL INTEGRATE (impl) FINAL INTEGRATE fluid_param_u_correct_0
fluid_velocity_0_vcm_action
(formally fluid_correctu_next_0
)END OF STEP vcm_total_calc_0
vcm_total_action
(impl)fluid_vmc_remove_0
fluid_velocity_0_action
(formally fluid_momentum_remove
)fluid_dist_0_action
fluid_dist_eq_0_action
fluid_dist_eqn_0_action
fluid_dist_exchange_13_start
fluid_dist_eq_exchange_13_start
fluid_dist_eqn_exchange_13_start
fluid_dist_eq_next_0
(next)fluid_dist_eq_0_vcm_action
fluid_dist_eq_exchange_13_vcm_start
(next)
- [X] Print routines should wait on actions/verify range
  - have to change pointers to values
- [X] Update profile analysis code for changes
- [X] Can float fluid_param_u_correct_0 up to post_force
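A minimal sketch of what removing the blocking calls looks like in practice: each enqueue returns an event, the next enqueue lists only the events it actually depends on, and nothing calls clFinish in between. Queue, kernel, and buffer variables are placeholders, not the fix's real names.

```cpp
#include <CL/cl.h>

// Chain kernels/transfers through explicit events instead of blocking waits.
void enqueue_step(cl_command_queue queue, cl_kernel dist_kernel, cl_kernel param_kernel,
                  cl_mem velocity_buf, void *host_velocity, size_t bytes, size_t gsize) {
  cl_event dist_done, param_done;

  // First kernel: dependencies (if any) would be earlier events, not a clFinish.
  clEnqueueNDRangeKernel(queue, dist_kernel, 1, NULL, &gsize, NULL, 0, NULL, &dist_done);

  // Second kernel runs only after the first completes.
  clEnqueueNDRangeKernel(queue, param_kernel, 1, NULL, &gsize, NULL, 1, &dist_done, &param_done);

  // Non-blocking read that waits only on the kernel it really needs.
  clEnqueueReadBuffer(queue, velocity_buf, CL_FALSE, 0, bytes, host_velocity,
                      1, &param_done, NULL);

  clReleaseEvent(dist_done);   // avoid the event leaks noted in section 12
  clReleaseEvent(param_done);
}
```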
7. DONE Change ordering of array in GPU version [12/12]
- Can convert routines one at a time by wrapping with transpose code
- [X] Have a look at paraview dump code
- [X] Switch order of array in kernels [7/7] (see the indexing sketch at the end of this section)
  - As we are manually calculating from a 1D index, it is technically aligned now
  - [X] cl_mem fluid_gridgeo_3_mem
  - [X] cl_mem fluid_dist_3_mem
  - [X] cl_mem fluid_dist_eq_3_mem
  - [X] cl_mem fluid_dist_eqn_3_mem
  - [X] cl_mem fluid_density_2_mem
  - [X] cl_mem fluid_velocity_2_mem
  - [X] cl_mem fluid_force_2_mem
- [X] Fix offsetting in dump routines [7/7]
  - [X] print_fluid
  - [X] print_internal
  - [X] buffer_read_rectangle
  - [X] buffer_write_rectangle
  - [X] fluid_force_accumulate_rectangle
  - [X] calc_mass_momentum
  - [X] calc_MPT
- [X] Switch order of fluid_grid_geo_3_mem
  - Map memory, directly initialize with strides, and drop sublattice
- [X] Reverse i,j,k loops for optimal stepping
- [X] Test new order [2/2]
  - [X] Test MPI boundary exchange
  - [X] Test GPU boundary exchange
- [X] Fix initialization of fluid_grid_geo_3_mem
  - Copied from sublattice via buffer_create with CL_MEM_COPY_HOST_PTR
  - Wrong type in sizeof for memory map
- [X] Fix MPI boundary exchange
  - Hadn't updated even/odd offset to now be on k instead of i (largest step)
- [X] Switch to 3D threads
- [X] Investigate Z-curve coordinate for atom positions (current is C order)
- [X] Test restart file
  - Are old restart files still valid: no, now in Fortran order
- [X] Fix restart files under MPI
  - Wrong structure element count in MPI file view type declaration
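For reference, a sketch of the two linearizations involved in the reordering above (axis naming is illustrative): in the original C-style order i carries the largest stride, while in the new Fortran-style order k carries the largest stride and i is the unit-stride index, which is what the even/odd-offset fix and the restart-file incompatibility refer to.

```cpp
#include <cstddef>

// Original layout: C order, i slowest / k fastest (unit stride on k).
inline std::size_t idx_c_order(int i, int j, int k, int Ny, int Nz) {
  return (static_cast<std::size_t>(i) * Ny + j) * Nz + k;
}

// New layout: Fortran order, k slowest / i fastest (unit stride on i),
// so neighbouring i values are adjacent in memory for coalesced GPU access.
inline std::size_t idx_f_order(int i, int j, int k, int Nx, int Ny) {
  return (static_cast<std::size_t>(k) * Ny + j) * Nx + i;
}
```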
8. DONE Resync/compare with reference code/LAMMPS [8/8]
- [X] Comparison with reference code
  - Reference code recomputes forces after the full step when not used with fix_lb_viscous_gpu
    - this was introduced with fix_lb_rigid_pc_sphere
    - recomputed fluid force used to recompute the equilibrium distribution too
  - Leaky borders
    - force stencils can extend into walls
    - pits derivative isn't one-sided along walls (is in CPU code, but doesn't matter as kappa_lb=0)
  - pressurebcx treats the duplicate end point differently, causing ghost point divergence
- [X] Update pressurebcx to be symmetrical
  - Use the same point on both sides with the adjustment applied on both left and right
  - Update to the newer density adjustment method
- [X] Save restart file at end point like particles are saved
  - Need to save velocity as well, as it is updated by fluid_param_u_correct_0
  - Need to update distribution velocities to give a final step restart (see cpu code)
- [X] Leaky edges with pits (shows as a dependency on the number of ghost points transferred)
  - GPU bbf only bounces back diagonals on edges if both components are wall normal
    - orientation 7 isn't bouncing back directions 9-13 (likely reversed with 6)
    - orientation 20 isn't bouncing back directions 7-8, 13-14 (likely reversed with 19)
| ORI | Wall Normal Constructive Solid Geometry | Bounce Back Components | Original (if different) |
|---|---|---|---|
| 0 | - | - | |
| 1 | (+1, 0, 0) | 1, 7, 10, 11, 14 | |
| 2 | (0, +1, 0) | 2, 7, 8, 11, 12 | |
| 3 | (0, 0, +1) | 5, 7, 8, 9, 10 | |
| 4 | (+1, 0, 0) and (0, 0, +1) | 7, 10 | 1, 5, 7, 10 |
| 5 | (+1, 0, 0) or (0, 0, +1) | 1, 5, 7, 8, 9, 10, 11, 14 | |
| 6 | (0, +1, 0) and (0, 0, +1) | 7, 8 | 2, 5, 7, 8, 9, 10, 11, 12 |
| 7 | (0, +1, 0) or (0, 0, +1) | 2, 5, 7, 8, 9, 10, 11, 12 | 2, 5, 7, 8 |
| 8 | (-1, 0, 0) or (0, -1, 0) | 3, 4, 8, 9, 10, 12, 13, 14 | |
| 9 | (-1, 0, 0) or (0, +1, 0) | 2, 3, 7, 8, 9, 11, 12, 13 | |
| 10 | (+1, 0, 0) and (0, 0, +1) or (0, +1, 0) and (0, 0, +1) | 7, 8, 10 | |
| 11 | (+1, 0, 0) and (0, 0, +1) or (0, -1, 0) and (0, 0, +1) | 7, 9, 10 | |
| 12 | (-1, 0, 0) or (0, +1, 0) or (0, 0, +1) | 2, 3, 5, 7, 8, 9, 10, 11, 12, 13 | |
| 13 | (-1, 0, 0) or (0, -1, 0) or (0, 0, +1) | 3, 4, 5, 7, 8, 9, 10, 12, 13, 14 | |
| 14 | (-1, 0, 0) | 3, 8, 9, 12, 13 | |
| 15 | (0, -1, 0) | 4, 9, 10, 13, 14 | |
| 16 | (0, 0, -1) | 6, 11, 12, 13, 14 | |
| 17 | (-1, 0, 0) and (0, 0, +1) | 8, 9 | 3, 5, 8, 9 |
| 18 | (-1, 0, 0) or (0, 0, +1) | 3, 5, 7, 8, 9, 10, 12, 13 | |
| 19 | (0, -1, 0) and (0, 0, +1) | 9, 10 | 4, 5, 7, 8, 9, 10, 13, 14 |
| 20 | (0, -1, 0) or (0, 0, +1) | 4, 5, 7, 8, 9, 10, 13, 14 | 4, 5, 9, 10 |
| 21 | (+1, 0, 0) or (0, -1, 0) | 1, 4, 7, 9, 10, 11, 13, 14 | |
| 22 | (+1, 0, 0) or (0, +1, 0) | 1, 2, 7, 8, 10, 11, 12, 14 | |
| 23 | (-1, 0, 0) and (0, 0, +1) or (0, +1, 0) and (0, 0, +1) | 7, 8, 9 | |
| 24 | (-1, 0, 0) and (0, 0, +1) or (0, -1, 0) and (0, 0, +1) | 8, 9, 10 | |
| 25 | (+1, 0, 0) or (0, +1, 0) or (0, 0, +1) | 1, 2, 5, 7, 8, 9, 10, 11, 12, 14 | |
| 26 | (+1, 0, 0) or (0, -1, 0) or (0, 0, +1) | 1, 4, 5, 7, 8, 9, 10, 11, 13, 14 | |
| 27 | (0, +1, 0) or (0, 0, -1) | 2, 6, 7, 8, 11, 12, 13, 14 | |
| 28 | (0, -1, 0) or (0, 0, -1) | 4, 6, 9, 10, 11, 12, 13, 14 | |
| 29 | (+1, 0, 0) and (0, 0, +1) or (0, +1, 0) | 2, 7, 8, 10, 11, 12 | |
| 30 | (+1, 0, 0) and (0, 0, +1) or (0, -1, 0) | 4, 7, 9, 10, 13, 14 | |
| 31 | (-1, 0, 0) and (0, 0, +1) or (0, +1, 0) | 2, 7, 8, 9, 11, 12 | |
| 32 | (-1, 0, 0) and (0, 0, +1) or (0, -1, 0) | 4, 8, 9, 10, 13, 14 | |
- [X] Clean up old branches (just master from main tyson)
- [X] Add vector outputs as in the current non-GPU code [2/2]
  - See CPU compute_vector and as documented at the end of the initial comments
  - [X] Scalar is temperature; isn't working 100%, can skip
    - [X] Add atom_kinetic (temperature) mirroring atom_spread (allocation, etc.)
    - [X] Need to unsort these (don't actually, as just need to sum)
    - [X] Compute degrees of freedom for atom_spread
  - [X] Vector (length 4) is total mass and total momentum
- [X] Resync with upstream LAMMPS (make some notes on this)
  - need to build the include file for the kernel code (could include it)
- [X] Update IO, parsing, and string handling
9. DONE Selection of GPU to use [2/2]
- [X] Specify which OpenCL devices to use (see the enumeration sketch at the end of this section)
  - Default selection
  - Hostname based overrides
- [X] Print OpenCL devices used
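A minimal sketch of the kind of enumeration behind the default selection and the device printout; taking the first GPU found stands in for whatever hostname-based override logic the fix actually uses.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Enumerate platforms/GPU devices and pick one (here: simply the first found).
cl_device_id pick_gpu_device() {
  cl_uint nplat = 0;
  clGetPlatformIDs(0, NULL, &nplat);
  std::vector<cl_platform_id> platforms(nplat);
  clGetPlatformIDs(nplat, platforms.data(), NULL);

  for (cl_platform_id plat : platforms) {
    cl_uint ndev = 0;
    if (clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 0, NULL, &ndev) != CL_SUCCESS || ndev == 0)
      continue;
    std::vector<cl_device_id> devices(ndev);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, ndev, devices.data(), NULL);

    char name[256] = {0};
    clGetDeviceInfo(devices[0], CL_DEVICE_NAME, sizeof(name), name, NULL);
    std::printf("using OpenCL device: %s\n", name);   // "Print OpenCL devices used"
    return devices[0];
  }
  return NULL;  // no GPU found; caller should fall back or abort
}
```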
10. DONE Implement new two pass interpolation [4/4]
- Allows exact node mass calculation (non-set gamma)
- Mass ratio requires particle and fluid mass
- Can re-weight forces on particle
- [X] Compute the normalization weights
  - Can do the weight interpolation in initial_integrate if we ensure nve goes first (need to warn the user if not the case)
  - Or just put it in post_integrate (put it in pre_force for now as haven't hooked post_integrate)
- [X] Get forces onto GPU for spreading between fluid and particles to tie velocities together
  - Force on particle k = -hydroforce + m_particle/(m_particle + m_stencil_fluid) * F_particle-particle,k
  - Force on fluid from particle k = hydroforce + m_stencil_fluid/(m_particle + m_stencil_fluid) * F_particle-particle,k
  - Can just store the first, as the difference works out to F_particle-particle,k
  - stencil_density*area*dm_lb is m_stencil_fluid (at end of the compute gamma interaction factor comment)
  - The lbviscous routine then would just overwrite the force
  - https://www.sciencedirect.com/science/article/pii/S0010465522000364?via%3Dihub
  - Need interpolated fluid mass for CPU calculation (both for force calculation and for temperature)
  - Need to send LAMMPS forces to GPU for force calculation
- [X] Redo initialization and restart to maintain momentum
- [X] Implement gamma scaling and negative value hack
11. TODO Resync with reference CPU code [2/4]
- [X] Linear initialization
- [X] Stencils [2/2]
  - [X] Add IBM3
  - [X] Replace Peskin with Keys
- [-] Remove higher order variant [1/2]
  - [X] Remove explicit code
  - [ ] Remove extra transfers
    - fluid_dist_next_2: fluid_dist_eq_3_{old,new}, fluid_dist_eqn_3_old
    - fluid_vcm_remove_0: depends on what is needed
Routine Input Buffers Source Output Buffers RESTART InitializeFirstRun
fluid_force_2
[..0]fluid_dist_3
[..0]fluid_dist_eq_3
[..0]fluid_dist_eqn_3
[..0]fluid_dist_sync_13_start
fluid_dist_3
[-G-S-F..-1]InitializeFirstRun
[..0]fluid_dist_3
[-G-S-F..-1]fluid_dist_eq_sync_13_start
fluid_dist_eq_3
[-G-S-F..-1]InitializeFirstRun
[..0]fluid_dist_eq_3
[-G-S-F..-1]fluid_dist_eqn_sync_13_start
fluid_dist_eqn_3
[-G-S-F..-1]InitializeFirstRun
[..0]fluid_dist_eqn_3
[-G-S-F..-1]SETUP fluid_dist_eq_sync_13_finish
(next) (vcm)fluid_dist_eq_3
[..0] (next) (vcm)fluid_dist_eq_3
[..G+S+F] (next) (vcm)fluid_dist_eq_3
[1..G+S+F] (next) (vcm)fluid_dist_sync_13_finish
fluid_dist_3
[..0]fluid_dist_3
[..G+S+F]fluid_dist_3
[1..G+S+F]manual reset fluid_dist_3
[..G+S+F]X fluid_dist_3
[..G+S+F] (vcm)fluid_param_next_2
fluid_dist_3
[..G+S] (vcm)X fluid_density_2
[..G+S/D]fluid_velocity_2
[..G+S] (half) (vcm)atom_nonforce_write_1
atom
[..G]atom_position2
[..G]atom_velocity
[..G]atom_type
[..G]atom_mass
[..G]atom_orders_compute_1
atom_position2
[..G]atom_position2_z
[..G]atom_index
[..G]atom_nonforce_sort_1
atom_position2
[..G]atom_position2
[..G] (sorted)atom_velocity
[..G]atom_velocity
[..G] (sorted)atom_mass
[..G]atom_mass
[..G] (sorted)atom_type
[..G]atom_type
[..G] (sorted)atom_position2_z
[..G]atom_position2_z
[..G] (sorted)atom_index
[..G]X atom_index
[..G] (sorted)fluid_weight_sum_2
atom_position2
[..G] (sorted)fluid_weight_2
[..G+S] (local)atom_position2_z
[G..] (sorted)fluid_weight_sync_22_start
fluid_weight_2
[-G-S..G+S] (local)fluid_weight_2
[-G-S..G+S] (local)atom_force_write_1
atom
[..G]atom_force
[..G]atom_force_sort_1
atom_force
[..G]atom_force
[..G] (sorted)atom_index
[..G] (sorted)fluid_weight_sync_22_finish
fluid_weight_2
[..G+S] (local)fluidweight2 [-G-S..G+S] fluid_weight_2
[-G-S..G+S] (remote)atom_force_next_1
fluid_weight_2
[..G+S]atom_spread
[..G] (sorted)fluid_density_2
[..G+S]atom_kinetic
[..G] (sorted)fluid_velocity_2
[..G+S] (half) (vcm)atom_force
[..G] (fluid) (sorted)atom_position2
[..G] (sorted),atom_velocity
[..G] (sorted)atom_mass
[..G] (sorted)atom_type
[..G] (sorted)atom_force
[..G] (sorted)fluid_force_next_2
(half)atom_position2
[..G] (sorted)fluid_force_2
[..G+S] (local)atom_position2_z
[..G] (sorted)atom_spread
[..G] (sorted),atom_force
[..G] (fluid) (sorted)fluid_weight_2
[..G+S]fluid_force_2
[..0]fluid_force_sync_20_start
fluid_force_2
[0..G+S] (local)fluid_force_2
[0..G+S] (local)fluid_force_sync_20_finish
fluid_force_2
[0..G+S] (local)X fluid_force_2
[..0]fluid_force_2
[-G-S..0] (remote)fluid_param_u_correct_0
fluid_density_2
[..0]X fluid_velocity_2
[..0] (vcm)fluid_velocity_2
[..0] (half) (vcm)fluid_force_2
[..0]atom_force_unsort_1
atom_force
[..G] (fluid) (sorted)X atom_force
[..G] (fluid)X atom_index
[..G] (sorted)atom_force_read_1
X atom_force
[..G] (fluid)atom_force_unsort_1
[..G] (fluid)X atom
[..G]manual reset X fluid_velocity_2
[..0] (vcm)fluid_param_next_2
[..G+S] (half) (vcm) -half-?X fluid_velocity_2
[..0]manual reset X fluid_dist_3
[..G+S+F] (vcm)manual reset [..G+S+F] (vcm) X fluid_dist_3
[..G+S+F]fluid_dist_eq_next_0
X fluid_density_2
[..D]fluid_param_next_2
[..G+S/D]X fluid_dist_eq_3
[..0] (next) (vcm)X fluid_velocity_2
[..0]manual reset [..0] X fluid_force_2
[..0]fluid_force_sync_20_finish
[..0]fluid_dist_eq_sync_13_start
X fluid_dist_eq_3
[-G-S-F..-1] (next) (vcm)fluid_dist_eq_next_0
[..0] (next) (vcm)X fluid_dist_eq_3
[-G-S-F..-1] (next) (vcm)INITIAL INTEGRATE fluid_dist_sync_13_finish
(prev)X fluid_dist_3
[..0] (prev)fluid_vcm_remove_0
[..0] (prev)X fluid_dist_3
[..G+S+F] (prev)X fluid_dist_3
[1..G+S+F] (prev)fluid_dist_sync_13_start
[-G-S-F..1] (prev)fluid_dist_eq_sync_13_finish
(vcm)X fluid_dist_eq_3
[..0] (vcm)fluid_dist_eq_next_0
[..0]X fluid_dist_eq_3
[..G+S+F] (vcm)X fluid_dist_eq_3
[1..G+S+F] (vcm)fluid_dist_eq_sync_13_start
[-G-S-F..1] (vcm)fluid_dist_eqn_next_3
X fluid_dist_eq_3
[..G+S+F] (vcm)fluid_dist_eq_sync_13_finish
[1..G+S+F] (vcm)X fluid_dist_eqn_3
[..G+S+F] (vcm)fluid_dist_next_2
X fluid_dist_3
[..G+S+F] (prev)fluid_dist_sync_13_finish
[0..G+S+F] (prev)X fluid_dist_3
[..G+S] (vcm)X fluid_dist_eqn_3
[..G+S+F] (vcm)fluid_dist_eqn_next_3
[..G+S+F]fluid_param_next_2
X fluid_dist_3
[..G+S] (vcm)fluid_dist_next_2
[..G+S] (vcm)X fluid_density_2
[..G+S/D]X fluid_velocity_2
[..G+S] (half) (vcm)PRE FORCE atom_nonforce_write_1
X atom
[..G]PRE FORCE X atom_position2
[..G]X atom_velocity
[..G]X atom_type
[..G]X atom_mass
[..G]atom_orders_compute_1
X atom_position2
[..G]atom_nonforce_write_1
[..G]X atom_position2_z
[..G]X atom_index
[..G]atom_nonforce_sort_1
X atom_position2
[..G]atom_nonforce_write_1
[..G]X atom_position2
[..G] (sorted)X atom_velocity
[..G]atom_nonforce_write_1
[..G]X atom_velocity
[..G] (sorted)X atom_mass
[..G]atom_nonforce_write_1
[..G]X atom_mass
[..G] (sorted)X atom_type
[..G]atom_nonforce_write_1
[..G]X atom_type
[..G] (sorted)X atom_position2_z
[..G]atom_orders_compute_1
[..G]X atom_position2_z
[..G] (sorted)X atom_index
[..G]atom_orders_compute_1
[..G]X atom_index
[..G] (sorted)fluid_weight_sum_2
X atom_position2
[..G] (sorted)atom_nonforce_sort_1
[..G]X fluid_weight_2
[..G+S] (local)X atom_position2_z
[G..] (sorted)atom_nonforce_sort_1
[..G]fluid_weight_sync_22_start
X fluid_weight_2
[-G-S..G+S] (local)fluid_weight_sum_2
[..G+S] (local)X fluid_weight_2
[-G-S..G+S] (local)POST FORCE atom_force_write_1
X atom
[..G]POST FORCE X atom_force
[..G]atom_force_sort_1
X atom_force
[..G]atom_force_write_1
[..G]X atom_force
[..G] (sorted)X atom_index
[..G] (sorted)atom_nonforce_sort_1
[..G]fluid_weight_sync_22_finish
X fluid_weight_2
[..G+S] (local)fluid_weight_sum_2
[..G+S] (local)X fluid_weight_2
[..G+S]X fluid_weight_2
[-G-S..G+S] (remote)fluid_weight_sync_22_start
[-G-S..G+S] (local)atom_force_next_1
X fluid_weight_2
[..G+S]fluid_weight_sync_22_finish
[..G+S]X atom_spread
[..G] (sorted)X fluid_density_2
[..G+S]fluid_param_next_2
[..G+S/D]atom_kinetic
[..G] (sorted)X fluid_velocity_2
[..G+S] (half) (vcm)fluid_param_next_2
[..G+S] (half) (vcm)X atom_force
[..G] (fluid) (sorted)X atom_position2
[..G] (sorted)atom_nonforce_sort_1
[..G] (sorted)X atom_velocity
[..G] (sorted)atom_nonforce_sort_1
[..G] (sorted)X atom_mass
[..G] (sorted)atom_nonforce_sort_1
[..G] (sorted)X atom_type
[..G] (sorted)atom_nonforce_sort_1
[..G] (sorted)X atom_force
[..G] (sorted)atom_force
[..G] (sorted)fluid_force_next_2
X atom_position2
[..G] (sorted)atom_nonforce_sort_1
[..G] (sorted)X fluid_force_2
[..G+S] (local)X atom_position2_z
[..G] (sorted)atom_nonforce_sort_1
[..G] (sorted)X atom_spread
[..G] (sorted)atom_force_next_1
[..G] (sorted)X atom_force
[..G] (fluid) (sorted)atom_force_next_1
[..G] (fluid) (sorted)X fluid_weight_2
[..G+S]fluid_weight_sync_22_finish
[..G+S]fluid_force_sync_20_start
X fluid_force_2
[0..G+S] (local)fluid_force_next_2
[..G+S] (local)X fluid_force_2
[0..G+S] (local)fluid_force_sync_20_finish
X fluid_force_2
[..0] (local)fluid_force_next_2
[..G+S] (local)X fluid_force_2
[..0]X fluid_force_2
[-G-S..0] (remote)fluid_force_sync_20_start
[0..G+S] (local)fluid_param_u_correct_0
X fluid_density_2
[..0]fluid_param_next_2
[..G+S/D]X fluid_velocity_2
[..0] (vcm)X fluid_velocity_2
[..0] (half) (vcm)fluid_param_next_2
[..G+S] (half) (vcm)X fluid_force_2
[..0]fluid_force_sync_20_finish
[..0]atom_force_unsort_1
X atom_force
[..G] (fluid) (sorted)atom_force_next_1
[..G] (fluid) (sorted)X atom_force
[..G] (fluid)X atom_index
[..G] (sorted)atom_nonforce_sort_1
[..G] (sorted)atom_force_read_1
X atom_force
[..G] (fluid)atom_force_unsort_1
[..G] (fluid)X atom
[..G]FINAL INTEGRATE END OF STEP vcm_total_calc_0
X atom
[..G]ENDOFSTEP X vcm_total
X fluid_density_2
[..0]fluid_param_next_2
[..G+S/D]X fluid_velocity_2
[..0] (vcm)fluid_param_u_correct_0
[..0] (vcm)fluid_vcm_remove_0
X vcm_total
vcm_total_calc_0
X fluid_velocity_2
[..0]X fluid_velocity_2
[..0] (vcm)fluid_param_u_correct_0
[..0] (vcm)X fluid_dist_3
[..0]X fluid_dist_3
[..0] (vcm)fluid_dist_next_2
[..G+S] (vcm)fluid_dist_eq_3
[..0]fluid_dist_eq_3
[..0] (vcm)fluid_dist_eq_next_0
[..0] (vcm)fluid_dist_eqn_3
[..0]fluid_dist_eqn_3
[..0] (vcm)fluid_dist_eqn_next_3
[..G+S+F] (vcm)restartWrite
fluid_force_2
[..0]fluid_force_sync_20_finish
[..0]fluid_dist_3
[..0]fluid_vcm_remove_0
[..0]fluid_dist_eq_3
[..0]fluid_vcm_remove_0
[..0]fluid_dist_eqn_3
[..0]fluid_vcm_remove_0
[..0]fluid_dist_sync_13_start
X fluid_dist_3
[-G-S-F..-1]fluid_vcm_remove_0
[..0]X fluid_dist_3
[-G-S-F..-1]fluid_dist_eq_next_0
(next)X fluid_density_2
[..D]fluid_param_next_2
[..G+S/D]X fluid_dist_eq_3
[..0] (next) (vcm)X fluid_velocity_2
[..0]fluid_vcm_remove_0
[..0]X fluid_force_2
[..0]fluid_force_sync_20_finish
[..0]fluid_dist_eq_sync_13_start
(next) (vcm)X fluid_dist_eq_3
[-G-S-F..-1] (next) (vcm)fluid_dist_eq_next_0
[..0] (next) (vcm)X fluid_dist_eq_3
[-G-S-F..-1] (next) (vcm)
- [ ] Remove explicit equilibrium distribution
12. DONE Discovered code fixes [7/7]
- [X] Missed one-processor GPU boundary exchange optimization for fluid_force
- [X] Missing OpenCL event free on most kernel calls
- [X] Remove momentum code is broken
  - [X] Only waits on OpenCL events (needs to correctly handle MPI)
  - [X] fluid_momentum_remove should be using fluid_velocity_2 and fluid_density_2 so it sees the results of fluid_correctu_next_0
  - [X] Fix should just set a flag to enable and not call directly (to get in the correct spot)
  - [X] Verify step passed in is correct (should it be step or step+1 – former is correct)
  - [X] Kernel defines dist_3_new = ... (step & 0x01) ... where new should actually be (step+1) & 0x01 (see the sketch at the end of this section)
  - [X] Test periodic boundary conditions
    - call fix momentum, should leave it zero
    - add small body force, remove momentum every 10 steps, should see it grow and then drop
- [X] Fix lb/fluid/rigid/pc/sphere `omega` calculation diverging a bit on restart
  - likely an issue with the `setup` `omega` calculation vs the `initial/final_integrate` one
  - replaced by the standard rigid fix (gone in next iteration)
- [X] fluid_correctu_next_2 is called when we only have fluid_force_2 [0] boundaries
  - verified with Colin: should only be calculating on the interior (fluid_correctu_next_0)
- [X] Likely shifts momentum in first step due to InitializeFirstRun computing fluid force
  - this would be applied in the first 1/2 step
  - for consistency with other lammps integrators, should be zero for this
  - first calculation should occur in the post force routine
- [X] Not getting good local size from kernel_local_box
  - Working correctly, was limited by the 64 work groups per compute unit requirement
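A tiny sketch of the parity fix noted above for the double-buffered distributions: the "old" half is selected with step and the "new" half with step + 1. The helper and buffer-pair layout are hypothetical, not the fix's actual code.

```cpp
#include <CL/cl.h>

// Hypothetical helper: pick the old/new halves of a double-buffered pair.
// Per the fix above, "new" must use (step + 1) & 0x01, not step & 0x01.
inline void select_dist_buffers(const cl_mem pair[2], long step,
                                cl_mem *dist_old, cl_mem *dist_new) {
  *dist_old = pair[ step      & 0x01];
  *dist_new = pair[(step + 1) & 0x01];
}
```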
13. TODO Code cleanups [15/31]
- [ ] fluid_dist_eqn_next_3 has duplicate non-local prefixed n3, stride3, and offsets3 variables
- [X] Could use some whitespace cleanup as there is some trailing whitespace and a mix of tabs vs spaces
- [ ] Should make the offsetting in the ?boundaries_* gpu routines more readable (same style as the others)
- [X] Expect the (__global realType*) casts in new gpu code are not required
  - They actually are, as it is loading a single component from variable locations
- [ ] See if the gpu bounce back code could now be written in a nicer vectorized way
- [ ] Checking queue properties of devices should use existing pre-wrapped error check versions
- [X] Drop local me parameter (currently a mix of local me and comm->me)
- [ ] Proper setting of typeLB is not verified (required for things like initialization of K0)
- [ ] gridgeo_2 memory should be const in all (most?) GPU kernels
- [X] Backtrace cleanup is a mess and printing garbage [3/3] (see the sketch at the end of this section)
  - Need to compile with the -rdynamic option to not strip out static function names
  - [X] Bug due to backtrace allocating one chunk for strings and pointers
  - [X] Exception safety via C++11 unique_ptr rewrite
  - [X] Figure out why function names are included
- [X] Fix differences between fluid dumping and atom dumping (via dump routines)
  - Dump code (ntimesteps based) and fluid code (step based) are off by one in their counting:
    - update->ntimestep is 0 in setup and then 1 for the first step
    - step is -1 in setup and then 0 for the first step
  - Dump code dumps in both setup and regular steps while fluid code only does regular steps
- [ ] Check with Colin about treating the sw (y sidewalls) boundary the same as the z one (currently one point short)
- [ ] Rename lattice variables to match GPU geogrid name
- [ ] Base lattice size calculations on gpu size variables and not subNb{x,y,z} ones
- [X] Make code cleaner by using post_run (can get rid of everything but the send bit in setup)
  - not doing, as it isn't cleaner due to the setup (or init) mismatch with post_run
    - setup is always run the first time and then only if the run pre flag is set
    - post_run is always run
  - how to handle calculation of forces in setup (1/2 step velocity is no longer available)
    - setup computes the forces
      - no need for momentum removal (user will have disabled it if they have introduced net momentum)
    - setup minimal, we are okay, no recomputations required
constructor setup
initial_integrate
… final_integrate
post_run
destructor init_dist_*
newsend_dist_*
newrecv_dist_* old
send_dist_eq
newdrop_dist_eq
new+calc_force
xchg_force
calc dist_eq
new+send_dist_eq
new+recv_dist_eq new
… send dist_*
newrecv_dist_*
newcalc dist_eq
new+send dist_eq
new+recv_dist_eq
new+init_dist_*
newif not ~post_run
send dist_*
new: recv_dist_*
oldcalc_force
xchg_force
calc dist_eq
new+send_dist_eq
new+: recv_dist_eq
new… send dist_*
newrecv_dist_*
newcalc dist_eq
new+send dist_eq
new+recv_dist_eq
new+
- [ ] Should be able to disable profiling as it may reduce performance
- [ ] Can probably drop explicit std::string constructors
- [X] Add ghost points suffix to fluid_momentum_remove_kernel
  - Was renamed to fluid_vcm_remove_0
- [X] Clean up gpu boundary exchange naming to match others
- [ ] Cleanup compilation warnings
- [ ] Fix size_t related narrowing conversion warnings
- [X] Combine profile averages and only output from first rank
- [X] fluid_correctu_next_2 should be just fluid_correctu_next_0 as only fluid_force_2 [0] is available
- [X] Clarify components acted upon in write, rewrite, sort, resort, and read in names
  - Made in final version of action code
- [ ] Deduplicate buffer specific inner syncing with macros like the outer ones
- [X] sync_inner_* routines' inner target buffers are larger than they need to be (outside_offset_*[1] = mem_border)
- [X] atom_force_next should clip stencil to avoid past-end access if an atom goes out of bounds
- [X] Floating point constants without an f suffix will cause the calculation to be done in double
- [X] Consistent handling and naming of boundary conditions
  - Prefix with effect (wall, pressure)
  - Switch to global location condition inside of GPU
- [ ] Could skip atom_force_read_1 if gamma scaling is negative for all atom types
- [ ] Should switch atom_force (combined) from fluid to atom units to avoid confusion
- [ ] Replace long with bigint
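A sketch of the unique_ptr rewrite described in the backtrace item above. backtrace_symbols() returns a single malloc'd block holding both the pointer array and the strings, so one free() releases everything (the earlier bug freed pieces separately); per the note above, linking with -rdynamic is needed to keep function names.

```cpp
#include <execinfo.h>   // backtrace, backtrace_symbols (glibc)
#include <cstdio>
#include <cstdlib>
#include <memory>

// Print the current call stack; exception-safe via unique_ptr with a free() deleter.
void print_backtrace() {
  void *frames[64];
  const int n = backtrace(frames, 64);

  // One allocation owns both the char* array and the strings it points at,
  // so a single free() (via the deleter) releases it all.
  std::unique_ptr<char *[], decltype(&std::free)>
      names(backtrace_symbols(frames, n), &std::free);
  if (!names) return;

  for (int i = 0; i < n; ++i)
    std::fprintf(stderr, "  %s\n", names[i]);   // symbol names require -rdynamic at link time
}
```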
14. TODO Documentation updates [0/4]
- [ ] need documentation as to why the routines
- [ ] and a description of the comments
- [ ] add tables to documentation (describe the columns and the bracketed terms)
- [ ] note about the HEX file needing to be there for the build
15. TODO Investigate potential GPU options before switch to HIP [5/12]
- [X] Full profile information dump [2/2]
  - [X] Switch from total runtimes to individual time points
  - [X] Output profile information
- [X] Remove waits from sorting routines [1/1]
  - should we stick all the event handlers in a queue
  - [X] Try on atom_sort_position2_z_* routines

    | operation | wait between | wait at end | speedup |
    |---|---|---|---|
    | sort_..._bb | 7.57 | 5.98 | 21.0% |
    | sort_..._bm | 6.82 | 5.32 | 22.0% |
    | combined wall | 7.71 | 5.29 | 31.4% |

- [ ] Test run with weight set to 1 and with trilinear
  - Is the atom searching or the stencil costing the most
- [ ] Can optimize complete map writing with CL_MAP_WRITE_INVALIDATE_REGION instead of CL_MAP_WRITE (see the sketch at the end of this section)
- [ ] Mapping pinned memory should give the fastest access
  - NVIDIA OpenCL optimization guide explains how to suggest pinning
- [ ] Use memory mapping to copy buffers
  - fluid distribution reads and force array
  - may be able to work with unified memory
- [X] Record starting and stopping times for profiling to determine key path holdups
- [ ] Non-blocking GPU operations do not necessarily execute unless there is a flush
  - Likely okay under NVIDIA except for Windows (from the NVIDIA OpenCL optimization guide)
  - Could be a reason for the slowdown under AMD
- [ ] Where would it be worthwhile to use local memory
- [X] Sometimes fluid_force_next_2 is 3.7x slower (1.03 to 3.81 ms)
  - tracked down to broken riser card on gra986 asserting hardware power brake on GPU card
- [ ] Check regular code on thread-safe MPI
  - Cluster OpenMPI supports threads if initialized with MPI_Init_thread instead of MPI_Init as in LAMMPS
  - Doesn't seem to be any significant slowdown from replacing MPI_Init with MPI_Init_thread
- [X] Use events between groups
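A minimal sketch of the mapping idea in the items above: when a buffer is about to be fully overwritten, mapping with CL_MAP_WRITE_INVALIDATE_REGION lets the runtime skip transferring the stale contents back to the host first. The queue/buffer variables are placeholders.

```cpp
#include <CL/cl.h>
#include <cstring>

// Overwrite an entire device buffer via mapping, discarding its old contents.
void overwrite_buffer(cl_command_queue queue, cl_mem buffer,
                      const void *src, size_t bytes) {
  cl_int err;
  void *dst = clEnqueueMapBuffer(queue, buffer, CL_TRUE /* blocking map */,
                                 CL_MAP_WRITE_INVALIDATE_REGION, 0, bytes,
                                 0, NULL, NULL, &err);
  if (err != CL_SUCCESS || dst == NULL) return;

  std::memcpy(dst, src, bytes);                       // fill with the new data

  clEnqueueUnmapMemObject(queue, buffer, dst, 0, NULL, NULL);
}
```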
16. Items for future
- Explore some ideas around geogrid code
  - Use bit vector to track bounce back status for each cell
    - Could store in texture memory as constant for entire simulation
    - Arbitrary geometry support by loading a mask of fluid vs non-fluid
    - Could the new gridgeo system be used to subsume the wall bounce back code (moving walls likely an issue)
- Resync with upstream LAMMPS
  - now includes a cmake build system
- Cleanup old subNbx initialization calculations
  - had intended to in the prior DP, but never got around to it
- Changes in force to handle multiple particles contributing to a grid point better
  - Can likely incorporate into the current one pass on GPU
- Better CPU/GPU border exchange code (currently all or nothing)
- Verify GPU options work correctly