Colin DP
Table of Contents
- 1. DONE Review/get up to speed on changes [4/4]
- 2. DONE Merge in MPI code [10/10]
- 3. DONE Add GPU (and CPU) profiling [8/8]
- 4. DONE Profile analysis system [9/9]
- 5. DONE Profile discovered optimizations/fixes [5/5]
- 6. DONE Free up more concurrency [4/4]
- 7. DONE Change ordering of array in GPU version [12/12]
- 8. DONE Resync/compare with reference code/LAMMPS [8/8]
- 9. DONE Selection of GPU to use [2/2]
- 10. DONE Implement new two pass interpolation [4/4]
- 11. TODO Resync with reference CPU code [2/4]
- 12. DONE Discovered code fixes [7/7]
- 13. TODO Code cleanups [15/31]
- 14. TODO Documentation updates [0/4]
- 15. TODO Investigate potential GPU options before switch to HIP [5/12]
- 16. Items for future
Scratch space for tracking progress and keeping track of items on Colin's DP.
1. DONE Review/get up to speed on changes [4/4]
- [X] Added a variety of items to the cleanups todo
- [X] Reviewed the new boundary items and gridgeo structure (excluding the pit setup code)
  - type
    - 0 - bounce backs (bulk fluid nodes)
    - 1 - bounce backs (pit geometry boundary fluid nodes)
    - 2 - not in fluid
  - orientation (number of bounce backs)
    - 0 - no bounce back (couldn't this be used for bulk fluid?)
    - 3 - not sure: all the diagonals on a face minus 1?
    - 4 - not sure: two faces (inside edge) minus all furthest diagonals?
    - 5 - against a face
    - 6 - not sure: seems to be a face and one of the opposite diagonals?
    - 8 - against two faces (inside edge)
    - 10 - against three faces (inside corner)
- [X] Make tyson branch master
  - Master branch (frances/final-fixes tag/frances branch + one commit from Colin) is a subset of the tyson branch.
- [X] Figure out where the MPI branch is at (sync with tyson)
  - MPI branch only uncomments the transfer and exchange calls.
2. DONE Merge in MPI code [10/10]
- [X] Verify that the MPI_ORDER_C/MPI_ORDER_FORTRAN stuff is okay (per comment there may be an issue): pass
- [X] Check if wound up with both GPU and CPU border exchange active: pass
- [X] Fix platform initialization on later CUDAs (CPU platform won't iterate)
- [X] Check that boundary exchange works with at least two processors along each dimension: fail [3/3]
  - Need to set comm_modify cutoff to 2.5x dx to have the required particles
  - [X] Periodic MPI deadlocks in first steps
  - [X] Crash at end-of-simulation
  - [X] Split along z is okay, split along y has agreement issues, and split along x loses atoms
- [X] Fixed bug causing periodic MPI boundary exchange lockups
  - fixviscouslb wasn't initialized if unused, so the fluid force could occur at different times
- [X] Fix end-of-simulation crash
  - comm was being destructed before fixes (modify)
  - Upstream pull request to fix; from discussion should switch to post_run hook
- [X] Double checked no other local state variables are unset: pass
- [X] Switch boundary exchange to be GPU if only one processor and MPI otherwise
- [X] Look into potential issues with fluid distribution exchange being in flight: pass (see the sketch at the end of this section)
  - Constructor sets a bogus fluid distribution exchange in flight
  - setup (called at the start of a run) finishes the one in flight, throws it away, computes the proper one for the run, and starts it in flight
  - Step routines expect a fluid distribution exchange in flight at start and put one in flight at end
  - Destructor finishes the fluid distribution exchange in flight
- [X] Debug and fix disagreement between simulation results for MPI and non-MPI runs (two processor splits)
- [X] Get particle dumping working to visualize in ParaView
- [X] Particles are lost when splitting x
  - Code for pressurebcx looks like it incorrectly applies to the internal side too for boundary processes
- [X] Fix pressurebcx applying to the internal side for boundary processes
- [X] Forces are off when splitting x or y [9/9]
  - [X] Test with individual atoms, bodyforce, and no pressurebcx: fail
    - Issue when split along the fixed boundary side, otherwise perfect agreement
  - [X] Test with rigid sphere, bodyforce, and no pressurebcx: pass
  - [X] Test with rigid sphere, bodyforce, no pressurebcx, and pits: pass
  - [X] Test without particle interactions: fail (not caused by particle interaction code)
  - [X] Figure out how to do ParaView visualization of differences
    - Append attributes filter on multiple inputs and then calculator for relative error
  - Minimal size test example
    - Error starts in the split and then jumps to the ends as well
  - [X] Double checked EDGE_Z{0,1} usage for acting on both sides
  - [X] Test with all edge code disabled: failed
  - [X] Test with 3 processors along z: error on both boundaries, starting on the left outer one; error on both sides of the ends
  - [X] Look into end extra point old adjustment code
- [X] Figure out why splitting along the fixed boundary gives different results for MPI [3/3]
  - [X] Add boundary dump code after each GPU routine call
  - [X] Create a serial vs parallel dump comparison program
  - [X] Extend boundary dump to the entire field
- [X] Fix discovered geogrid issue (requires 3 boundary points and not 2)
  - [X] Add additional boundary point to calculation
  - [X] sublattice initialization code uses wholelattice wrap of Nbz: correct as it only applies to the boundary
  - [X] Update dump routines
  - [X] Update GPU routines to use 3 boundary offsets value and names
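For reference, a minimal sketch (hypothetical buffer and rank names, not the actual fix code) of the start/finish ("in flight") MPI exchange pattern described above:

```cpp
#include <mpi.h>
#include <vector>

// Start/finish exchange of boundary planes with the lower/upper neighbour
// ranks; the real code exchanges fluid distribution borders, here the buffers
// are just plain double vectors for illustration.
struct Exchange {
    MPI_Request reqs[4];

    void start(const std::vector<double>& send_lo, const std::vector<double>& send_hi,
               std::vector<double>& recv_lo, std::vector<double>& recv_hi,
               int rank_lo, int rank_hi, MPI_Comm comm) {
        MPI_Irecv(recv_lo.data(), (int)recv_lo.size(), MPI_DOUBLE, rank_lo, 0, comm, &reqs[0]);
        MPI_Irecv(recv_hi.data(), (int)recv_hi.size(), MPI_DOUBLE, rank_hi, 1, comm, &reqs[1]);
        MPI_Isend(send_hi.data(), (int)send_hi.size(), MPI_DOUBLE, rank_hi, 0, comm, &reqs[2]);
        MPI_Isend(send_lo.data(), (int)send_lo.size(), MPI_DOUBLE, rank_lo, 1, comm, &reqs[3]);
    }

    // Each step (and the destructor) finishes whatever exchange was left in flight.
    void finish() { MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE); }
};
```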
3. DONE Add GPU (and CPU) profiling [8/8]
- [X] Read up on profiling options
- [X] Turn profiling on in clCreateCommandQueue (see the sketch at the end of this section)
- [X] Add time tracking to kernels [18/18]
  - [X] fluid_dist_eq_initial_3_kernel
  - [X] fluid_dist_eq_next_0_kernel
  - [X] fluid_dist_eqn_next_3_kernel
  - [X] fluid_dist_initial_2_kernel
  - [X] fluid_dist_next_2_kernel
  - [X] fluid_param_initial_2_kernel
  - [X] fluid_param_next_2_kernel
  - [X] fluid_correctu_next_2_kernel
  - [X] xboundaries_fluid_dist_eq_kernel
  - [X] yboundaries_fluid_dist_eq_kernel
  - [X] zboundaries_fluid_dist_eq_kernel
  - [X] xboundaries_fluid_dist_kernel
  - [X] yboundaries_fluid_dist_kernel
  - [X] zboundaries_fluid_dist_kernel
  - [X] xboundaries_fluid_force_kernel
  - [X] yboundaries_fluid_force_kernel
  - [X] zboundaries_fluid_force_kernel
  - [X] remove_momentum_kernel
- [X] Print profiling at end
- [X] Add time tracking to memory read/writes [7/7]
  - [X] fluid_dist_3_1_interior_read
  - [X] fluid_dist_eq_3_3_interior_read
  - [X] fluid_force_2_exterior_read
  - [X] fluid_force_2_accumulate_read
  - [X] fluid_dist_3_1_exterior_write
  - [X] fluid_dist_eq_3_3_exterior_write
  - [X] fluid_force_2_accumulate_write
- [X] Debug segfault introduced in unrelated OpenCL call [1/1]
  - [X] Break apart commits and bisect
    - Accidentally removed queue assignment in clCreateCommandQueue call
- [X] Revamp profiling to fix leaks and reduce boilerplate [4/4]
  - [X] C++ template magic class to progressively push location information
    - Dreadful failure
  - [X] Abstract with profile records: location, rank, info, value
  - [X] Resolve race conditions/locking by dumping different ranks to different files
  - [X] Switch to binary format for space
    - Compact everything into a location field by combining bits
    - Reduce data required to pass between functions
    - Replace strings with enums
- [X] Add CPU profile points for start and end of fix callouts and expensive CPU operations [8/8]
  - [X] GPU timer requires OpenCL 2.1 (not available on NVIDIA), use CPU timer instead
  - [X] initial_integrate
  - [X] pre_force
  - [X] post_force
  - [X] final_integrate
  - [X] atom_read
  - [X] atom_write
  - [X] fluid_force_accumulate
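Sketch of the OpenCL profiling hookup described above (standard clCreateCommandQueue/clGetEventProfilingInfo calls; the context, device, kernel, and work-size arguments are placeholders for whatever the fix already has):

```cpp
#include <CL/cl.h>
#include <cstdio>

// Time one kernel launch via event profiling.
void profile_kernel_once(cl_context context, cl_device_id device,
                         cl_kernel kernel, const size_t* global, const size_t* local) {
    cl_int err;
    // Profiling has to be requested when the queue is created.
    cl_command_queue queue =
        clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 3, nullptr, global, local, 0, nullptr, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong t_start = 0, t_end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, nullptr);
    std::printf("kernel time: %.3f ms\n", (t_end - t_start) * 1e-6);

    clReleaseEvent(evt);
    clReleaseCommandQueue(queue);
}
```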
4. DONE Profile analysis system [9/9]
- [X] Initial analysis for intra-timestep details
- [X] Python code to extract the binary data format
- [X] Switch to R as the Python pandas apply functionality is too slow
  - nearly 8 minutes in Python pandas vs 2 seconds in R tidyverse
  - don't have a good way to load the data: use Python to save it as a feather file
- [X] Initial plot of timestep
  - transfer and atom related routines seem to be taking the most time
- [X] Fix issues revealed in profile code
  - [X] Distinguish between two uses of fluid_dist_read/write routines
  - [X] Fix non-unique fluid_dist_read/write profile points (eq vs non-eq)
- [X] Add group brackets to step breakout
- [X] Distinguish between executing and non-executing state in intra analysis
- [X] Add cross-timestep analysis [2/2]
  - [X] Plot of walltime for each step (group analysis)
    - Neighbour calculation is very expensive (x100 over a regular step)
  - [X] Distributions of walltime for each group of GPU calls (inter analysis)
- [X] Add reference to distribution to group analysis
- [X] Integrate addition of CPU clocks
- [X] Redo implicit ordering calculation code (no longer simple)
  - Add substep calculation as CPU and GPU are out of order wrt substep
- [X] Synchronize CPU and GPU clocks from data (see the sketch at the end of this section)
  - Have to correct for slight clock skew as well as offset (linear model)
- [X] Add special cases to handle missing events in CPU-only profiles
- [X] Mark run periods of all LAMMPS fix calls in analysis
- [X] Drop Python code required for initial loading of data
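Sketch of the linear clock model noted above (illustrative only, not the actual analysis code; fits gpu ~= offset + skew * cpu by least squares over events that carry both timestamps, then maps GPU times onto the CPU clock):

```cpp
#include <vector>
#include <utility>

struct ClockMap { double offset, skew; };

// pairs holds (cpu_time, gpu_time) for events recorded on both clocks.
ClockMap fit_clock_map(const std::vector<std::pair<double, double>>& pairs) {
    double n = (double)pairs.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (const auto& [cpu, gpu] : pairs) {
        sx += cpu; sy += gpu; sxx += cpu * cpu; sxy += cpu * gpu;
    }
    double skew = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double offset = (sy - skew * sx) / n;
    return {offset, skew};
}

// Convert a GPU timestamp onto the CPU clock using the fitted model.
double gpu_to_cpu(double gpu, const ClockMap& m) { return (gpu - m.offset) / m.skew; }
```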
5. DONE Profile discovered optimizations/fixes [5/5]
- [X] Switch atom_sort_index to write to the recorded index position instead of resorting [10/10] (see the sketch at the end of this section)
  - 2 nodes x 1 CPU & GPU/processor (P100): 1000 steps of in.polymersphere, loop time 96.5 -> 84.3 (-12.6%)
  - [X] Add memory for copy out space (put at end)
  - [X] Adjust memory resize routines
  - [X] Create new GPU function atom_unsort_position2_z and remove old ones
  - [X] Add kernel variables and initialize and deinitialize
  - [X] Add local and global size variables and initialize
  - [X] Calls for initial arguments
  - [X] Update all arguments (need to include force and index too!) on atom resize
  - [X] Add profile enums and remove old ones
  - [X] Call to invoke with final arguments
  - [X] Update atom_read
- [X] Switch the second atom_sort_position2_z to just gather new velocity values instead of resorting everything [10/10]
  - 2 nodes x 1 CPU & GPU/processor (P100): 1000 steps of in.polymersphere, loop time 84.3 -> 73.5 (-12.8%)
  - [X] Add memory for copy out space (put at end)
  - [X] Adjust memory resize routines
  - [X] Add new GPU function atom_resort_position2_z
  - [X] Add kernel variable and initialize and de-initialize
  - [X] Add local and global size variables and initialize
  - [X] Calls for initial arguments
  - [X] Update all arguments (need to include index too!) and local and global size on atom resize
  - [X] Add profile enum
  - [X] Call to invoke with final arguments
  - [X] Add atom_rewrite
- [X] Investigate and fix duplicate running of atom related routines
  - 2 nodes x 1 CPU & GPU/processor (P100): 1000 steps of in.polymersphere, loop time 73.5 -> 69.0 (-6.1%)
  - If no forces are added via fix_lb_viscous_gpu and/or fix_lb_rigid_pc_sphere_gpu, could skip a bunch of atom routines including correct_u
  - The positions don't change so the sort mapping is fixed throughout
  - The 1/2 step velocity is the correct one to be using and not the 1 step as is being done
  - Push atom_force down to post_force (more LAMMPS standard and would like to be able to add forces into the GPU routine)
  - Technically correct_u is a final_integrate sort of thing, so put it there

Steps overview
| Routine | Operation |
|---|---|
| fluid_dist_exchange_finish-1 | Receive fluid_dist_3_new exterior [G+S+F] (for previous step) |
| fluid_dist_eq_exchange_finish | Receive fluid_dist_eq_3_new exterior [1..G+S+F] |
| fluid_dist_eqn_next_3 | fluid_dist_eq_3_new [0..G+S+F] -> fluid_dist_eqn_3_new [0..G+S+F] |
| fluid_dist_next_2 | fluid_dist_{eq,eqn}_3_{new,old} [0..G+S+F], fluid_dist_3_old [0..G+S+F], fluid_density_2 [..F] -> fluid_dist_3_new [0..G+S] |
| restartWrite | fluid_dist_eq_3_new [0..G+S+F], fluid_dist_3_new [0..G+S] -> restart file (if requested) |
| fluid_dist_exchange_start | Send fluid_dist_3_new interior [-G-S-F] |
| fluid_param_next_2_start | Start fluid_dist_3_new [0..G+S] -> fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S] |
| fluid_param_next_2_finish | End fluid_dist_3_new [0..G+S] -> fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S] |
| atom_write | Atom [G] -> atom_position2 [G], atom_velocity [G], atom_type [G], atom_mass [G] |
| atom_position2_z_index | atom_position2 [G] -> atom_position2_z [G], atom_index [G] |
| atom_sort_position2_z | Sort by z: atom_position2 [G], atom_velocity [G], atom_mass [G], atom_type [G], atom_position2_z [G], atom_index [G] |
| atom_force | fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S], atom_position2 [G], atom_velocity [G], atom_mass [G], atom_type [G] -> atom_force [G] |
| fluid_force_next_2 | atom_position2 [G], atom_position2_z [G], atom_force [G] -> local fluid_force_2 [0..G+S] |
| fluid_force_exchange_start | Send local fluid_force_2 exterior [0..G+S] |
| fluid_force_exchange_finish | Receive remote fluid_force_2 interior [-G-S..0]; local fluid_force_2 [0], remote fluid_force_2 [0] -> fluid_force_2 [0] |
| fluid_correctu_next_2 | fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S], fluid_force_2 [0..G+S] -> fluid_velocity_2 [0..G+S] |
| fluid_dist_eq_next_0+1 | fluid_velocity_2 [0], fluid_density_2 [0..D], fluid_force_2 [0] -> fluid_dist_eq_3_new+ [0] (for next step) |
| fluid_dist_eq_exchange_start+1 | Send fluid_dist_eq_3_new+ interior [-G-S-F..-1] (for next step) |
| atom_sort_index / atom_unsort_position2_z | Sort by index: atom_force [G], atom_index [G] |
| atom_read | atom_force [G] -> hydroF [G] |

Reorganization
Pre- fix_lb_rigid_pc_sphere_gpuPost- fix_lb_rigid_pc_sphere_gpuPost- fix_lb_rigid_pc_sphere_gpuNew Step Lammps (with fix_lb_viscous_gpu)(without fix_lb_viscous_gpu)initial_integratecoordinates 1 fluid_dist_exchange_finish-1fluid_dist_exchange_finish-1fluid_dist_exchange_finish-1fluid_dist_exchange_finish-1velocities 1/2 fluid_dist_eq_exchange_finishfluid_dist_eq_exchange_finishfluid_dist_eq_exchange_finishfluid_dist_eq_exchange_finishfluid_dist_eqn_next_3fluid_dist_eqn_next_3fluid_dist_eqn_next_3fluid_dist_eqn_next_3fluid_dist_next_2fluid_dist_next_2fluid_dist_next_2fluid_dist_next_2restartWriterestartWriterestartWriterestartWritefluid_dist_exchange_startfluid_dist_exchange_startfluid_dist_exchange_startfluid_dist_exchange_startfluid_param_next_2_startfluid_param_next_2_startfluid_param_next_2_startfluid_param_next_2_startpost_integratepre_exchangepre_neighbourpost_neighbourpre_forcefluid_param_next_2_finishfluid_param_next_2_finishfluid_param_next_2_finishfluid_param_next_2_finishatom_writeatom_writeatom_writeatom_writeatom_position2_z_indexatom_position2_z_indexatom_position2_z_indexatom_position2_z_indexatom_sort_position2_zatom_sort_position2_zatom_sort_position2_zatom_sort_position2_zatom_forceatom_forceatom_forcefluid_force_next_2fluid_force_next_2fluid_force_exchange_startfluid_force_exchange_startfluid_force_exchange_finishfluid_force_exchange_finishfluid_correctu_next_2fluid_dist_eq_next_0+1fluid_dist_eq_next_0+1fluid_dist_eq_exchange_start+1fluid_dist_eq_exchange_start+1pre_reversepost_forceadditional forces atom_sort_indexatom_sort_indexatom_sort_indexatom_forceatom_readatom_readatom_readfluid_force_next_2fluid_force_exchange_startfluid_force_exchange_finishatom_unsort_position2_zatom_readfinal_integratevelocity 1 fluid_correctu_next_2fluid_dist_eq_next_0+1fluid_dist_eq_exchange_start+1end_of_stepatom_writeatom_position2_z_indexatom_sort_position2_zatom_forcefluid_force_next_2fluid_force_exchange_startfluid_force_exchange_finishfluid_correctu_next_2fluid_dist_eq_next_0+1fluid_dist_eq_exchange_start+1
- [X] Queuing multiple jobs appears to sometimes slow down unrelated computation routines
  - Defective riser card on GPU asserting power brake and cutting clocks to 1/3rd
- [X] Duplicate rectangle read in profile code (code wasn't currently used)
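Sketch of the "write to the recorded index position" idea from the first item in this section (hypothetical names, CPU version for clarity); on the GPU this would be one work-item per atom doing the same scatter:

```cpp
#include <array>
#include <vector>

using Force = std::array<double, 3>;

// Instead of re-sorting the force array back into LAMMPS order, each sorted
// slot scatters its value to the original position saved in atom_index.
void unsort_forces(const std::vector<Force>& force_sorted,
                   const std::vector<int>& atom_index,   // original position of each sorted atom
                   std::vector<Force>& force_out) {
    for (std::size_t i = 0; i < force_sorted.size(); ++i)
        force_out[atom_index[i]] = force_sorted[i];       // scatter, no comparison sort needed
}
```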
6. DONE Free up more concurrency [4/4]
- [X] Remove blocking OpenCL calls (all dependencies made explicit; see the sketch at the end of this section)
Done Routine Input Buffers Source Output Buffers INITIAL INTEGRATE X fluid_dist_sync_13_finish(prev)fluid_dist_3[..0] (prev)fluid_vcm_remove_0[..0] (prev)fluid_dist_3[1..G+S+F] (prev)X fluid_dist_eq_sync_13_finish(prev)fluid_dist_eq_3[..0] (prev)fluid_vcm_remove_0[..0] (prev)fluid_dist_eq_3[1..G+S+F] (prev)X fluid_dist_eqn_sync_13_finish(prev)fluid_dist_eqn_3[..0] (prev)fluid_vcm_remove_0[..0] (prev)fluid_dist_eqn_3[1..G+S+F] (prev)X fluid_dist_eq_sync_13_finish(vcm)fluid_dist_eq_3[..0] (vcm)fluid_dist_eq_next_0[..0]fluid_dist_eq_3[1..G+S+F] (vcm)X fluid_dist_eqn_next_3fluid_dist_eq_3[..G+S+F] (vcm)fluid_dist_eq_sync_13_finish[1..G+S+F] (vcm)fluid_dist_eqn_3[..G+S+F] (vcm)X fluid_dist_next_2fluid_dist_3[..G+S+F] (prev)fluid_dist_sync_13_finish[1..G+S+F] (prev)fluid_dist_3[..G+S] (vcm)fluid_dist_eq_3[..G+S+F] (prev)fluid_dist_eq_sync_13_finish[1..G+S+F] (prev)fluid_dist_eqn_3[..G+S+F] (prev)fluid_dist_eqn_sync_13_finish[1..G+S+F] (prev)fluid_dist_eq_3[..G+S+F] (vcm)fluid_dist_eq_sync_13_finish[1..G+S+F]fluid_dist_eqn_3[..G+S+F] (vcm)fluid_dist_eqn_next_3[..G+S+F]X fluid_param_next_2fluid_dist_3[..G+S] (vcm)fluid_dist_next_2[..G+S] (vcm)fluid_density_2[..G+S]fluid_velocity_2[..G+S] (half) (vcm)PRE FORCE X atom_nonforce_write_1atom[..G]atom_position2[..G]atom_velocity[..G]atom_type[..G]atom_mass[..G]X atom_orders_compute_1atom_position2[..G]atom_nonforce_write_1[..G]atom_position2_z[..G]atom_index[..G]X atom_nonforce_sort_1atom_position2[..G]atom_nonforce_write_1[..G]atom_position2[..G] (sorted)atom_velocity[..G]atom_nonforce_write_1[..G]atom_velocity[..G] (sorted)atom_mass[..G]atom_nonforce_write_1[..G]atom_mass[..G] (sorted)atom_type[..G]atom_nonforce_write_1[..G]atom_type[..G] (sorted)atom_position2_z[..G]atom_orders_compute_1[..G]atom_position2_z[..G] (sorted)atom_index[..G]atom_orders_compute_1[..G]atom_index[..G] (sorted)POST FORCE X atom_force_write_1atom[..G]atom_force[..G]X atom_force_next_1fluid_density_2[..G+S]fluid_param_next_2[..G+S]atom_force[..G] (sorted)fluid_velocity_2[..G+S] (half) (vcm)fluid_param_next_2[..G+S] (half) (vcm)atom_position2[..G] (sorted)atom_inputs_sort_1[..G] (sorted)atom_velocity[..G] (sorted)atom_inputs_sort_1[..G] (sorted)atom_mass[..G] (sorted)atom_inputs_sort_1[..G] (sorted)atom_type[..G] (sorted)atom_inputs_sort_1[..G] (sorted)X fluid_force_next_2atom_position2[..G] (sorted)atom_inputs_sort_1[..G] (sorted)fluid_force_2[..G+S] (local)atom_position2_z[..G] (sorted)atom_inputs_sort_1[..G] (sorted)atom_force[..G] (sorted)atom_force_next_1[..G] (sorted)X fluid_force_sync_20_startfluid_force_2[0..G+S] (local)fluid_force_next_2[..G+S] (local)X fluid_force_sync_20_finishfluid_force_2[-G-S..0] (local)fluid_force_next_2[..G+S] (local)fluid_force_2[-G-S..0]X fluid_param_u_correct_0fluid_velocity_2[..0] (half) (vcm)fluid_param_next_2[..G+S] (half) (vcm)fluid_velocity_2[..0] (vcm)fluid_density_2[..0]fluid_param_next_2[..G+S]fluid_force_2[..0]fluid_force_sync_20_finish[..0]X atom_force_unsort_1atom_force[..G] (sorted)atom_force_next_1[..G] (sorted)atom_force[..G]atom_index[..G] (sorted)atom_inputs_sort_1[..G] (sorted)X atom_force_read_1atom_force[..G]atom_force_unsort_1[..G]atom[..G]FINAL INTEGRATE END OF STEP X vcm_total_calc_0atom[..G]vcm_totalfluid_density_2[..0]fluid_param_next_2[..G+S]fluid_velocity_2[..0] (vcm)fluid_param_u_correct_0[..0] (vcm)X fluid_vcm_remove_0vcm_totalvcm_total_calc_0fluid_velocity_2[..0]fluid_velocity_2[..0] (vcm)fluid_param_u_correct_0[..0] (vcm)fluid_dist_3[..0]fluid_dist_3[..0] (vcm)fluid_dist_next_2[..G+S] 
(vcm)fluid_dist_eq_3[..0]fluid_dist_eq_3[..0] (vcm)fluid_dist_eq_next_0[..0] (vcm)fluid_dist_eqn_3[..0]fluid_dist_eqn_3[..0] (vcm)fluid_dist_eqn_next_3[..G+S+F] (vcm)X restartWritefluid_force_2[..0]fluid_force_sync_20_finish[..0]fluid_dist_3[..0]fluid_vcm_remove_0[..0]fluid_dist_eq_3[..0]fluid_vcm_remove_0[..0]fluid_dist_eqn_3[..0]fluid_vcm_remove_0[..0]X fluid_dist_sync_13_startfluid_dist_3[-G-S-F..-1]fluid_vcm_remove_0[..0]X fluid_dist_eq_sync_13_startfluid_dist_eq_3[-G-S-F..-1]fluid_vcm_remove_0[..0]X fluid_dist_eqn_sync_13_startfluid_dist_eqn_3[-G-S-F..-1]fluid_vcm_remove_0[..0]X fluid_dist_eq_next_0(next)fluid_density_2[..D]fluid_param_next_2[..G+S]fluid_dist_eq_3[..0] (next) (vcm)fluid_velocity_2[..0]fluid_vcm_remove_0[..0]fluid_force_2[..0]fluid_force_sync_20_finish[..0]X fluid_dist_eq_sync_13_start(next) (vcm)fluid_dist_eq_3[-G-S-F..-1] (next) (vcm)fluid_dist_eq_next_0[..0] (vcm) (next)fluid_correctu_next_0(nowfluid_param_u_correct_0) adjustsfluid_velocity_2 [0]- doesn't adjust any of
fluid_dist*_3 [0] (this is correct)
- fluid_momentum_remove (now fluid_vcm_remove_0) adjusts fluid_dist_3 [0] and fluid_dist_eq_3 [0]
  - should use corrected fluid_velocity_2 [0] instead of computing the uncorrected version from fluid_dist*_3 [0]
  - need to also correct fluid_dist_eqn_3 [0..G+S+F] for fluid_dist_next_2 if using the exponential integrator
  - can't compute border points [..G+S+F] (only have fluid_velocity_2 [0]) so need to transfer these
    - need to delay fluid_dist_exchange_3_start until after this
    - need to send borders on other outputs too if using the exponential integrator
- fluid_dist_eq_next_0 computes fluid_dist_eq_3 [0] (next)
  - from fluid_velocity_2 [0] (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
- fluid_dist_eqn_next_3 computes fluid_dist_eqn_3 [0..G+S+F] (next)
  - from fluid_dist_eq_3 [0] (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
  - could run earlier if fluid_vcm_remove_0 isn't being run this timestep
- fluid_dist_next_2 computes fluid_dist_3 [0..G+S] (next)
  - from fluid_dist_3 [0..G+S] (not affected by fluid_correctu_next_0 and not affected by fluid_momentum_remove)
  - from fluid_dist_{eq,eqn}_3 [0..G+S+F] (not affected by fluid_correctu_next_0 but affected by fluid_momentum_remove)
  - from fluid_dist_{eq,eqn}_3 [0..G+S+F] (next) (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
INITIAL INTEGRATE fluid_dist_exchange_13_finish(prev)fluid_dist_3_action(impl)fluid_dist_eq_exchange_13_finish(prev)fluid_dist_eq_3_action(impl)fluid_dist_eqn_exchange_13_finish(prev)fluid_dist_eqn_3_action(impl)fluid_dist_eq_exchange_13_vcm_finishfluid_dist_eq_3_vcm_action(impl)fluid_dist_eqn_next_3fluid_dist_eqn_3_vcm_actionfluid_dist_next_2fluid_dist_2_vcm_actionrestartWritefluid_param_next_2fluid_velocity_2_half_vcm_actionfluid_density_2_actionPRE FORCE atom_lammps_write_1atom_position1_actionatom_velocity_1_actionatom_type_1_actionatom_mass_1_actionatom_orders_compute_1atom_position2_z_1_actionatom_index_1_actionatom_inputs_sort_1atom_position2_1_sorted_actionatom_velocity_1_sorted_actionatom_mass_1_sorted_actionatom_type_1_sorted_actionatom_position2_z_1_sorted_actionatom_index_1_sorted_actionPOST FORCE atom_force_next_1atom_force_1_sorted_actionfluid_force_next_2fluid_force_2_local_actionfluid_force_exchange_02_startfluid_force_exchange_20_finishfluid_force_20_remote_action(impl)fluid_force_combine_20fluid_force_0_actionatom_force_unsort_1atom_force_1_actionatom_force_read_1FINAL INTEGRATE (impl) FINAL INTEGRATE fluid_param_u_correct_0fluid_velocity_0_vcm_action(formally fluid_correctu_next_0)END OF STEP vcm_total_calc_0vcm_total_action(impl)fluid_vmc_remove_0fluid_velocity_0_action(formally fluid_momentum_remove)fluid_dist_0_actionfluid_dist_eq_0_actionfluid_dist_eqn_0_actionfluid_dist_exchange_13_startfluid_dist_eq_exchange_13_startfluid_dist_eqn_exchange_13_startfluid_dist_eq_next_0(next)fluid_dist_eq_0_vcm_actionfluid_dist_eq_exchange_13_vcm_start(next)[X]Print routines should wait on actions/verify range- have to change pointers to values
- [X] Update profile analysis code for changes
- [X] Can float fluid_param_u_correct_0 up to post_force
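Sketch of the event-based dependency chaining referred to at the start of this section (standard OpenCL calls; the queue, kernels, and buffers are placeholders for the existing fix objects):

```cpp
#include <CL/cl.h>

// Chain two kernels and a non-blocking read through events instead of using
// clFinish()/blocking calls.
void run_step_nonblocking(cl_command_queue queue, cl_kernel kernel_a, cl_kernel kernel_b,
                          cl_mem buf, void* host_ptr, size_t nbytes, const size_t* global) {
    cl_event ev_a, ev_b, ev_read;

    // First kernel: no prerequisites.
    clEnqueueNDRangeKernel(queue, kernel_a, 1, nullptr, global, nullptr, 0, nullptr, &ev_a);

    // Second kernel: explicitly depends on the first via its event.
    clEnqueueNDRangeKernel(queue, kernel_b, 1, nullptr, global, nullptr, 1, &ev_a, &ev_b);

    // Non-blocking read (CL_FALSE) that depends on the second kernel.
    clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, nbytes, host_ptr, 1, &ev_b, &ev_read);

    // Only wait when the host actually needs the data (or hand ev_read to the
    // next consumer instead).
    clWaitForEvents(1, &ev_read);

    // Release the events to avoid the leak noted in section 12.
    clReleaseEvent(ev_a); clReleaseEvent(ev_b); clReleaseEvent(ev_read);
}
```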
7. DONE Change ordering of array in GPU version [12/12]
- Can convert routines one at a time by wrapping them with transpose code
- [X] Have a look at ParaView dump code
- [X] Switch order of array in kernels [7/7] (see the index sketch at the end of this section)
  - As we are manually calculating from a 1D index, it is technically aligned now
  - [X] cl_mem fluid_gridgeo_3_mem
  - [X] cl_mem fluid_dist_3_mem
  - [X] cl_mem fluid_dist_eq_3_mem
  - [X] cl_mem fluid_dist_eqn_3_mem
  - [X] cl_mem fluid_density_2_mem
  - [X] cl_mem fluid_velocity_2_mem
  - [X] cl_mem fluid_force_2_mem
- [X] Fix offsetting in dump routines [7/7]
  - [X] print_fluid
  - [X] print_internal
  - [X] buffer_read_rectangle
  - [X] buffer_write_rectangle
  - [X] fluid_force_accumulate_rectangle
  - [X] calc_mass_momentum
  - [X] calc_MPT
- [X] Switch order of fluid_grid_geo_3_mem
  - Map memory, directly initialize with strides, and drop sublattice
- [X] Reverse i,j,k loops for optimal stepping
- [X] Test new order [2/2]
  - [X] Test MPI boundary exchange
  - [X] Test GPU boundary exchange
- [X] Fix initialization of fluid_grid_geo_3_mem
  - Copied from sublattice via buffer_create with CL_MEM_COPY_HOST_PTR
  - Wrong type in sizeof for memory map
- [X] Fix MPI boundary exchange
  - Hadn't updated even/odd offset to now be on k instead of i (largest step)
- [X] Switch to 3D threads
- [X] Investigate Z-curve coordinate for atom positions (current is C order)
- [X] Test restart file
  - Are old restart files still valid: no, now in Fortran order
- [X] Fix restart files under MPI
  - Wrong structure element count in MPI file view type declaration
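For reference, a small sketch of the two 1D index conventions involved in the reordering (illustrative helpers, not the actual kernel code, under the assumption that the new layout makes i the fastest index and k the largest stride, as noted in the MPI fix above):

```cpp
// i, j, k index x, y, z; Nbx/Nby/Nbz are the grid dimensions.

// Old layout (C order): k varies fastest.
inline int index_c_order(int i, int j, int k, int Nby, int Nbz) {
    return (i * Nby + j) * Nbz + k;
}

// New layout (Fortran order): i varies fastest, k has the largest stride.
inline int index_fortran_order(int i, int j, int k, int Nbx, int Nby) {
    return (k * Nby + j) * Nbx + i;
}
```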
8. DONE Resync/compare with reference code/LAMMPS [8/8]
- [X] Comparison with reference code
  - Reference code recomputes forces after the full step when not used with fix_lb_viscous_gpu
    - this was introduced with fix_lb_rigid_pc_sphere
    - the recomputed fluid force is used to recompute the equilibrium distribution too
  - Leaky borders
    - force stencils can extend into walls
    - pits derivative isn't one-sided along walls (it is in the CPU code, but doesn't matter as kappa_lb=0)
  - pressurebcx treats the duplicate end point differently, causing ghost point divergence
- [X] Update pressurebcx to be symmetrical
  - Use the same point on both sides with the adjustment applied on both left and right
  - Update to the newer density adjustment method
- [X] Save restart file at end point like particles are saved
  - Need to save velocity as well, as it is updated by fluid_param_u_correct_0
  - Need to update distribution velocities to give a final step restart (see CPU code)
- [X] Leaky edges with pits (shows as a dependency on the number of ghost points transferred)
  - GPU bbf only bounces back diagonals on edges if both components are wall normal
    - orientation 7 isn't bouncing back directions 9-13 (likely reversed with 6)
    - orientation 20 isn't bouncing back directions 7-8, 13-14 (likely reversed with 19)
| ORI | Wall Normal (Constructive Solid Geometry) | Bounce Back Components | Original (if different) |
|---|---|---|---|
| 0 | -- | | |
| 1 | (+1, 0, 0) | 1, 7, 10, 11, 14 | |
| 2 | ( 0,+1, 0) | 2, 7, 8, 11, 12 | |
| 3 | ( 0, 0,+1) | 5, 7, 8, 9, 10 | |
| 4 | (+1, 0, 0) and ( 0, 0,+1) | 7, 10 | 1, 5, 7, 10 |
| 5 | (+1, 0, 0) or ( 0, 0,+1) | 1, 5, 7, 8, 9, 10, 11, 14 | |
| 6 | ( 0,+1, 0) and ( 0, 0,+1) | 7, 8 | 2, 5, 7, 8, 9, 10, 11, 12 |
| 7 | ( 0,+1, 0) or ( 0, 0,+1) | 2, 5, 7, 8, 9, 10, 11, 12 | 2, 5, 7, 8 |
| 8 | (-1, 0, 0) or ( 0,-1, 0) | 3, 4, 8, 9, 10, 12, 13, 14 | |
| 9 | (-1, 0, 0) or ( 0,+1, 0) | 2, 3, 7, 8, 9, 11, 12, 13 | |
| 10 | (+1, 0, 0) and ( 0, 0,+1) or ( 0,+1, 0) and ( 0, 0,+1) | 7, 8, 10 | |
| 11 | (+1, 0, 0) and ( 0, 0,+1) or ( 0,-1, 0) and ( 0, 0,+1) | 7, 9, 10 | |
| 12 | (-1, 0, 0) or ( 0,+1, 0) or ( 0, 0,+1) | 2, 3, 5, 7, 8, 9, 10, 11, 12, 13 | |
| 13 | (-1, 0, 0) or ( 0,-1, 0) or ( 0, 0,+1) | 3, 4, 5, 7, 8, 9, 10, 12, 13, 14 | |
| 14 | (-1, 0, 0) | 3, 8, 9, 12, 13 | |
| 15 | ( 0,-1, 0) | 4, 9, 10, 13, 14 | |
| 16 | ( 0, 0,-1) | 6, 11, 12, 13, 14 | |
| 17 | (-1, 0, 0) and ( 0, 0, 1) | 8, 9 | 3, 5, 8, 9 |
| 18 | (-1, 0, 0) or ( 0, 0, 1) | 3, 5, 7, 8, 9, 10, 12, 13 | |
| 19 | ( 0,-1, 0) and ( 0, 0, 1) | 9, 10 | 4, 5, 7, 8, 9, 10, 13, 14 |
| 20 | ( 0,-1, 0) or ( 0, 0, 1) | 4, 5, 7, 8, 9, 10, 13, 14 | 4, 5, 9, 10 |
| 21 | ( 1, 0, 0) or ( 0,-1, 0) | 1, 4, 7, 9, 10, 11, 13, 14 | |
| 22 | ( 1, 0, 0) or ( 0, 1, 0) | 1, 2, 7, 8, 10, 11, 12, 14 | |
| 23 | (-1, 0, 0) and ( 0, 0, 1) or ( 0, 1, 0) and ( 0, 0, 1) | 7, 8, 9 | |
| 24 | (-1, 0, 0) and ( 0, 0, 1) or ( 0,-1, 0) and ( 0, 0, 1) | 8, 9, 10 | |
| 25 | ( 1, 0, 0) or ( 0, 1, 0) or ( 0, 0, 1) | 1, 2, 5, 7, 8, 9, 10, 11, 12, 14 | |
| 26 | ( 1, 0, 0) or ( 0,-1, 0) or ( 0, 0, 1) | 1, 4, 5, 7, 8, 9, 10, 11, 13, 14 | |
| 27 | ( 0, 1, 0) or ( 0, 0,-1) | 2, 6, 7, 8, 11, 12, 13, 14 | |
| 28 | ( 0,-1, 0) or ( 0, 0,-1) | 4, 6, 9, 10, 11, 12, 13, 14 | |
| 29 | ( 1, 0, 0) and ( 0, 0, 1) or ( 0, 1, 0) | 2, 7, 8, 10, 11, 12 | |
| 30 | ( 1, 0, 0) and ( 0, 0, 1) or ( 0,-1, 0) | 4, 7, 9, 10, 13, 14 | |
| 31 | (-1, 0, 0) and ( 0, 0, 1) or ( 0, 1, 0) | 2, 7, 8, 9, 11, 12 | |
| 32 | (-1, 0, 0) and ( 0, 0, 1) or ( 0,-1, 0) | 4, 8, 9, 10, 13, 14 | |
- [X] Clean up old branches (just master from main tyson)
- [X] Add vector outputs as in current non-GPU code [2/2]
  - See CPU compute_vector and the documentation at the end of the initial comments
  - [X] Scalar is temperature; isn't working 100%, can skip
    - [X] Add atom_kinetic (temperature) mirroring atom_spread (allocation, etc.)
    - [X] Need to unsort these (don't actually, as we just need to sum)
    - [X] Compute degrees of freedom for atom_spread
  - [X] Vector (length 4) is total mass and total momentum
- [X] Resync with upstream LAMMPS (make some notes on this)
  - need to build the include file for the kernel code (could include it)
- [X] Update IO, parsing, and string handling
9. DONE Selection of GPU to use [2/2]
- [X] Specify which OpenCL devices to use (see the sketch below)
  - Default selection
  - Hostname based overrides
- [X] Print OpenCL devices used
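A minimal sketch of the device enumeration/printing (standard OpenCL platform/device queries; the selection/override logic itself is not shown):

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// List every OpenCL device so the chosen one can be printed and, e.g.,
// overridden per hostname.
void print_opencl_devices() {
    cl_uint nplat = 0;
    clGetPlatformIDs(0, nullptr, &nplat);
    std::vector<cl_platform_id> plats(nplat);
    clGetPlatformIDs(nplat, plats.data(), nullptr);
    for (cl_platform_id p : plats) {
        cl_uint ndev = 0;
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &ndev);
        std::vector<cl_device_id> devs(ndev);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, ndev, devs.data(), nullptr);
        for (cl_device_id d : devs) {
            char name[256] = {0};
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            std::printf("device: %s\n", name);
        }
    }
}
```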
10. DONE Implement new two pass interpolation [4/4]
- Allows exact node mass calculation (non-set gamma)
- Mass ratio requires particle and fluid mass
- Can re-weight forces on particle
- [X] Compute the normalization weights
  - Can do the weight interpolation in initial_integrate if we ensure nve goes first (need to warn the user if that is not the case)
  - Or just put it in post_integrate (put it in pre_force for now as we haven't hooked post_integrate)
- [X] Get forces onto the GPU for spreading between fluid and particles to tie velocities together (see the sketch at the end of this section)
  - Force on particle k = -hydroforce + m_particle/(m_particle + m_stencil_fluid) * F_particle-particle,k
  - Force on fluid from particle k = hydroforce + m_stencil_fluid/(m_particle + m_stencil_fluid) * F_particle-particle,k
  - Can just store the first, as the difference works out to F_particle-particle,k
  - stencil_density*area * dm_lb is m_stencil_fluid (at the end of the compute gamma interaction factor comment)
  - The lbviscous routine then would just overwrite the force
  - https://www.sciencedirect.com/science/article/pii/S0010465522000364?via%3Dihub
  - Need interpolated fluid mass for the CPU calculation (both for force calculation and for temperature)
  - Need to send LAMMPS forces to the GPU for the force calculation
- [X] Redo initialization and restart to maintain momentum
- [X] Implement gamma scaling and negative value hack
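A sketch of the force split written out above (illustrative names, not the actual fix variables; hydro_force is the interpolated hydrodynamic force, f_pp the particle-particle force on particle k):

```cpp
struct Vec3 { double x, y, z; };

struct ForceSplit { Vec3 on_particle, on_fluid; };

ForceSplit split_force(Vec3 hydro_force, Vec3 f_pp,
                       double m_particle, double m_stencil_fluid) {
    const double wp = m_particle / (m_particle + m_stencil_fluid);
    const double wf = m_stencil_fluid / (m_particle + m_stencil_fluid);
    ForceSplit out;
    out.on_particle = { -hydro_force.x + wp * f_pp.x,
                        -hydro_force.y + wp * f_pp.y,
                        -hydro_force.z + wp * f_pp.z };
    // on_particle + on_fluid == f_pp, so only one of the two needs storing.
    out.on_fluid    = {  hydro_force.x + wf * f_pp.x,
                         hydro_force.y + wf * f_pp.y,
                         hydro_force.z + wf * f_pp.z };
    return out;
}
```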
11. TODO Resync with reference CPU code [2/4]
- [X] Linear initialization
- [X] Stencils [2/2]
  - [X] Add IBM3
  - [X] Replace Peskin with Keys
- [-] Remove higher order variant [1/2]
  - [X] Remove explicit code
  - [ ] Remove extra transfers
    - fluid_dist_next_2: fluid_dist_eq_3_{old,new}, fluid_dist_eqn_3_old
    - fluid_vcm_remove_0: depends on what is needed
Routine Input Buffers Source Output Buffers RESTART InitializeFirstRunfluid_force_2[..0]fluid_dist_3[..0]fluid_dist_eq_3[..0]fluid_dist_eqn_3[..0]fluid_dist_sync_13_startfluid_dist_3[-G-S-F..-1]InitializeFirstRun[..0]fluid_dist_3[-G-S-F..-1]fluid_dist_eq_sync_13_startfluid_dist_eq_3[-G-S-F..-1]InitializeFirstRun[..0]fluid_dist_eq_3[-G-S-F..-1]fluid_dist_eqn_sync_13_startfluid_dist_eqn_3[-G-S-F..-1]InitializeFirstRun[..0]fluid_dist_eqn_3[-G-S-F..-1]SETUP fluid_dist_eq_sync_13_finish(next) (vcm)fluid_dist_eq_3[..0] (next) (vcm)fluid_dist_eq_3[..G+S+F] (next) (vcm)fluid_dist_eq_3[1..G+S+F] (next) (vcm)fluid_dist_sync_13_finishfluid_dist_3[..0]fluid_dist_3[..G+S+F]fluid_dist_3[1..G+S+F]manual reset fluid_dist_3[..G+S+F]X fluid_dist_3[..G+S+F] (vcm)fluid_param_next_2fluid_dist_3[..G+S] (vcm)X fluid_density_2[..G+S/D]fluid_velocity_2[..G+S] (half) (vcm)atom_nonforce_write_1atom[..G]atom_position2[..G]atom_velocity[..G]atom_type[..G]atom_mass[..G]atom_orders_compute_1atom_position2[..G]atom_position2_z[..G]atom_index[..G]atom_nonforce_sort_1atom_position2[..G]atom_position2[..G] (sorted)atom_velocity[..G]atom_velocity[..G] (sorted)atom_mass[..G]atom_mass[..G] (sorted)atom_type[..G]atom_type[..G] (sorted)atom_position2_z[..G]atom_position2_z[..G] (sorted)atom_index[..G]X atom_index[..G] (sorted)fluid_weight_sum_2atom_position2[..G] (sorted)fluid_weight_2[..G+S] (local)atom_position2_z[G..] (sorted)fluid_weight_sync_22_startfluid_weight_2[-G-S..G+S] (local)fluid_weight_2[-G-S..G+S] (local)atom_force_write_1atom[..G]atom_force[..G]atom_force_sort_1atom_force[..G]atom_force[..G] (sorted)atom_index[..G] (sorted)fluid_weight_sync_22_finishfluid_weight_2[..G+S] (local)fluidweight2 [-G-S..G+S] fluid_weight_2[-G-S..G+S] (remote)atom_force_next_1fluid_weight_2[..G+S]atom_spread[..G] (sorted)fluid_density_2[..G+S]atom_kinetic[..G] (sorted)fluid_velocity_2[..G+S] (half) (vcm)atom_force[..G] (fluid) (sorted)atom_position2[..G] (sorted),atom_velocity[..G] (sorted)atom_mass[..G] (sorted)atom_type[..G] (sorted)atom_force[..G] (sorted)fluid_force_next_2(half)atom_position2[..G] (sorted)fluid_force_2[..G+S] (local)atom_position2_z[..G] (sorted)atom_spread[..G] (sorted),atom_force[..G] (fluid) (sorted)fluid_weight_2[..G+S]fluid_force_2[..0]fluid_force_sync_20_startfluid_force_2[0..G+S] (local)fluid_force_2[0..G+S] (local)fluid_force_sync_20_finishfluid_force_2[0..G+S] (local)X fluid_force_2[..0]fluid_force_2[-G-S..0] (remote)fluid_param_u_correct_0fluid_density_2[..0]X fluid_velocity_2[..0] (vcm)fluid_velocity_2[..0] (half) (vcm)fluid_force_2[..0]atom_force_unsort_1atom_force[..G] (fluid) (sorted)X atom_force[..G] (fluid)X atom_index[..G] (sorted)atom_force_read_1X atom_force[..G] (fluid)atom_force_unsort_1[..G] (fluid)X atom[..G]manual reset X fluid_velocity_2[..0] (vcm)fluid_param_next_2[..G+S] (half) (vcm) -half-?X fluid_velocity_2[..0]manual reset X fluid_dist_3[..G+S+F] (vcm)manual reset [..G+S+F] (vcm) X fluid_dist_3[..G+S+F]fluid_dist_eq_next_0X fluid_density_2[..D]fluid_param_next_2[..G+S/D]X fluid_dist_eq_3[..0] (next) (vcm)X fluid_velocity_2[..0]manual reset [..0] X fluid_force_2[..0]fluid_force_sync_20_finish[..0]fluid_dist_eq_sync_13_startX fluid_dist_eq_3[-G-S-F..-1] (next) (vcm)fluid_dist_eq_next_0[..0] (next) (vcm)X fluid_dist_eq_3[-G-S-F..-1] (next) (vcm)INITIAL INTEGRATE fluid_dist_sync_13_finish(prev)X fluid_dist_3[..0] (prev)fluid_vcm_remove_0[..0] (prev)X fluid_dist_3[..G+S+F] (prev)X fluid_dist_3[1..G+S+F] (prev)fluid_dist_sync_13_start[-G-S-F..1] (prev)fluid_dist_eq_sync_13_finish(vcm)X 
fluid_dist_eq_3[..0] (vcm)fluid_dist_eq_next_0[..0]X fluid_dist_eq_3[..G+S+F] (vcm)X fluid_dist_eq_3[1..G+S+F] (vcm)fluid_dist_eq_sync_13_start[-G-S-F..1] (vcm)fluid_dist_eqn_next_3X fluid_dist_eq_3[..G+S+F] (vcm)fluid_dist_eq_sync_13_finish[1..G+S+F] (vcm)X fluid_dist_eqn_3[..G+S+F] (vcm)fluid_dist_next_2X fluid_dist_3[..G+S+F] (prev)fluid_dist_sync_13_finish[0..G+S+F] (prev)X fluid_dist_3[..G+S] (vcm)X fluid_dist_eqn_3[..G+S+F] (vcm)fluid_dist_eqn_next_3[..G+S+F]fluid_param_next_2X fluid_dist_3[..G+S] (vcm)fluid_dist_next_2[..G+S] (vcm)X fluid_density_2[..G+S/D]X fluid_velocity_2[..G+S] (half) (vcm)PRE FORCE atom_nonforce_write_1X atom[..G]PRE FORCE X atom_position2[..G]X atom_velocity[..G]X atom_type[..G]X atom_mass[..G]atom_orders_compute_1X atom_position2[..G]atom_nonforce_write_1[..G]X atom_position2_z[..G]X atom_index[..G]atom_nonforce_sort_1X atom_position2[..G]atom_nonforce_write_1[..G]X atom_position2[..G] (sorted)X atom_velocity[..G]atom_nonforce_write_1[..G]X atom_velocity[..G] (sorted)X atom_mass[..G]atom_nonforce_write_1[..G]X atom_mass[..G] (sorted)X atom_type[..G]atom_nonforce_write_1[..G]X atom_type[..G] (sorted)X atom_position2_z[..G]atom_orders_compute_1[..G]X atom_position2_z[..G] (sorted)X atom_index[..G]atom_orders_compute_1[..G]X atom_index[..G] (sorted)fluid_weight_sum_2X atom_position2[..G] (sorted)atom_nonforce_sort_1[..G]X fluid_weight_2[..G+S] (local)X atom_position2_z[G..] (sorted)atom_nonforce_sort_1[..G]fluid_weight_sync_22_startX fluid_weight_2[-G-S..G+S] (local)fluid_weight_sum_2[..G+S] (local)X fluid_weight_2[-G-S..G+S] (local)POST FORCE atom_force_write_1X atom[..G]POST FORCE X atom_force[..G]atom_force_sort_1X atom_force[..G]atom_force_write_1[..G]X atom_force[..G] (sorted)X atom_index[..G] (sorted)atom_nonforce_sort_1[..G]fluid_weight_sync_22_finishX fluid_weight_2[..G+S] (local)fluid_weight_sum_2[..G+S] (local)X fluid_weight_2[..G+S]X fluid_weight_2[-G-S..G+S] (remote)fluid_weight_sync_22_start[-G-S..G+S] (local)atom_force_next_1X fluid_weight_2[..G+S]fluid_weight_sync_22_finish[..G+S]X atom_spread[..G] (sorted)X fluid_density_2[..G+S]fluid_param_next_2[..G+S/D]atom_kinetic[..G] (sorted)X fluid_velocity_2[..G+S] (half) (vcm)fluid_param_next_2[..G+S] (half) (vcm)X atom_force[..G] (fluid) (sorted)X atom_position2[..G] (sorted)atom_nonforce_sort_1[..G] (sorted)X atom_velocity[..G] (sorted)atom_nonforce_sort_1[..G] (sorted)X atom_mass[..G] (sorted)atom_nonforce_sort_1[..G] (sorted)X atom_type[..G] (sorted)atom_nonforce_sort_1[..G] (sorted)X atom_force[..G] (sorted)atom_force[..G] (sorted)fluid_force_next_2X atom_position2[..G] (sorted)atom_nonforce_sort_1[..G] (sorted)X fluid_force_2[..G+S] (local)X atom_position2_z[..G] (sorted)atom_nonforce_sort_1[..G] (sorted)X atom_spread[..G] (sorted)atom_force_next_1[..G] (sorted)X atom_force[..G] (fluid) (sorted)atom_force_next_1[..G] (fluid) (sorted)X fluid_weight_2[..G+S]fluid_weight_sync_22_finish[..G+S]fluid_force_sync_20_startX fluid_force_2[0..G+S] (local)fluid_force_next_2[..G+S] (local)X fluid_force_2[0..G+S] (local)fluid_force_sync_20_finishX fluid_force_2[..0] (local)fluid_force_next_2[..G+S] (local)X fluid_force_2[..0]X fluid_force_2[-G-S..0] (remote)fluid_force_sync_20_start[0..G+S] (local)fluid_param_u_correct_0X fluid_density_2[..0]fluid_param_next_2[..G+S/D]X fluid_velocity_2[..0] (vcm)X fluid_velocity_2[..0] (half) (vcm)fluid_param_next_2[..G+S] (half) (vcm)X fluid_force_2[..0]fluid_force_sync_20_finish[..0]atom_force_unsort_1X atom_force[..G] (fluid) (sorted)atom_force_next_1[..G] (fluid) (sorted)X 
atom_force[..G] (fluid)X atom_index[..G] (sorted)atom_nonforce_sort_1[..G] (sorted)atom_force_read_1X atom_force[..G] (fluid)atom_force_unsort_1[..G] (fluid)X atom[..G]FINAL INTEGRATE END OF STEP vcm_total_calc_0X atom[..G]ENDOFSTEP X vcm_totalX fluid_density_2[..0]fluid_param_next_2[..G+S/D]X fluid_velocity_2[..0] (vcm)fluid_param_u_correct_0[..0] (vcm)fluid_vcm_remove_0X vcm_totalvcm_total_calc_0X fluid_velocity_2[..0]X fluid_velocity_2[..0] (vcm)fluid_param_u_correct_0[..0] (vcm)X fluid_dist_3[..0]X fluid_dist_3[..0] (vcm)fluid_dist_next_2[..G+S] (vcm)fluid_dist_eq_3[..0]fluid_dist_eq_3[..0] (vcm)fluid_dist_eq_next_0[..0] (vcm)fluid_dist_eqn_3[..0]fluid_dist_eqn_3[..0] (vcm)fluid_dist_eqn_next_3[..G+S+F] (vcm)restartWritefluid_force_2[..0]fluid_force_sync_20_finish[..0]fluid_dist_3[..0]fluid_vcm_remove_0[..0]fluid_dist_eq_3[..0]fluid_vcm_remove_0[..0]fluid_dist_eqn_3[..0]fluid_vcm_remove_0[..0]fluid_dist_sync_13_startX fluid_dist_3[-G-S-F..-1]fluid_vcm_remove_0[..0]X fluid_dist_3[-G-S-F..-1]fluid_dist_eq_next_0(next)X fluid_density_2[..D]fluid_param_next_2[..G+S/D]X fluid_dist_eq_3[..0] (next) (vcm)X fluid_velocity_2[..0]fluid_vcm_remove_0[..0]X fluid_force_2[..0]fluid_force_sync_20_finish[..0]fluid_dist_eq_sync_13_start(next) (vcm)X fluid_dist_eq_3[-G-S-F..-1] (next) (vcm)fluid_dist_eq_next_0[..0] (next) (vcm)X fluid_dist_eq_3[-G-S-F..-1] (next) (vcm)
- [ ] Remove explicit equilibrium distribution
12. DONE Discovered code fixes [7/7]
- [X] Missed one-processor GPU boundary exchange optimization for fluid_force
- [X] Missing OpenCL event free on most kernel calls
- [X] Remove momentum code is broken
  - [X] Only waits on OpenCL events (needs to correctly handle MPI)
  - [X] fluid_momentum_remove should be using fluid_velocity_2 and fluid_density_2 so it sees the results of fluid_correctu_next_0
  - [X] Fix should just set a flag to enable and not call directly (to get it in the correct spot)
  - [X] Verify the step passed in is correct (should it be step or step+1; the former is correct)
  - [X] Kernel defines dist_3_new = ... (step & 0x01) ... where new should actually be step+1 & 0x01 (see the parity sketch at the end of this section)
  - [X] Test periodic boundary conditions
    - call fix momentum, should leave it zero
    - add a small body force, remove momentum every 10 steps, should see it grow and then drop
- [X] Fix lb/fluid/rigid/pc/sphere `omega` calculation diverging a bit on restart
  - likely an issue with the `setup` `omega` calculation vs the `initial/final_integrate` one
  - replaced by the standard rigid fix (gone in next iteration)
- [X] fluid_correctu_next_2 is called when we only have fluid_force_2 [0] boundaries
  - verified with Colin; it should only be calculating on the interior (fluid_correctu_next_0)
- [X] Likely shifts momentum in the first step due to InitializeFirstRun computing the fluid force
  - this would be applied in the first 1/2 step
  - for consistency with other LAMMPS integrators, it should be zero for this
  - first calculation should occur in the post force routine
- [X] Not getting a good local size from kernel_local_box
  - Working correctly; was limited by the 64 work groups per compute unit requirement
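Illustration of the even/odd double-buffer parity fix noted in the "Remove momentum code" item above (names are illustrative, not the actual kernel defines):

```cpp
// Two copies of the distribution are kept and their roles swap every step:
// "old" is the buffer written on the previous step, "new" the one written now.
inline int dist_old_index(long step) { return static_cast<int>(step & 0x01); }
inline int dist_new_index(long step) { return static_cast<int>((step + 1) & 0x01); }
// Using (step & 0x01) for "new" as well would make both point at the same
// buffer, which was the bug being fixed.
```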
13. TODO Code cleanups [15/31]
- [ ] fluid_dist_eqn_next_3 has duplicate non-local prefixed n3, stride3, and offsets3 variables
- [X] Could use some whitespace cleanup as there is some trailing whitespace and a mix of tabs vs spaces
- [ ] Should make the offsetting in the ?boundaries_* GPU routines more readable (same style as the others)
- [X] Expect the (__global realType*) casts in new GPU code are not required
  - They actually are, as it is loading a single component from variable locations
- [ ] See if the new GPU bounce back code could be written in a nicer vectorized way
- [ ] Checking queue properties of devices should use the existing pre-wrapped error check versions
- [X] Drop local me parameter (currently a mix of local me and comm->me)
- [ ] Proper setting of typeLB is not verified (required for things like initialization of K0)
- [ ] gridgeo_2 memory should be const in all (most?) GPU kernels
- [X] Backtrace cleanup is a mess and printing garbage [3/3] (see the sketch at the end of this section)
  - Need to compile with the -rdynamic option to not strip out static function names
  - [X] Bug due to backtrace allocating one chunk for strings and pointers
  - [X] Exception safety via C++11 unique_ptr rewrite
  - [X] Figure out why function names are included
- [X] Fix differences between fluid dumping and atom dumping (via dump routines)
  - Dump code (ntimesteps based) and fluid code (step based) are off by one in their counters
    - update->ntimestep is 0 in setup and then 1 for the first step
    - step is -1 in setup and then 0 for the first step
  - Dump code dumps in both setup and regular steps while fluid code only does regular steps
- [ ] Check with Colin about treating the sw (y sidewalls) boundary the same as the z one (currently one point short)
- [ ] Rename lattice variables to match the GPU geogrid name
- [ ] Base lattice size calculations on GPU size variables and not subNb{x,y,z} ones
- [X] Make code cleaner by using post_run (can get rid of everything but the send bit in setup)
  - not doing, as it isn't cleaner due to the setup (or init) mismatch with post_run
    - setup is always run the first time and then only if the run pre flag is set
    - post_run is always run
  - how to handle calculation of forces in setup (1/2 step velocity is no longer available)
    - setup computes the forces
      - no need for momentum removal (user will have disabled it if they have introduced net momentum)
    - setup minimal, we are okay, no recomputations required
constructor setupinitial_integrate… final_integratepost_rundestructor init_dist_*newsend_dist_*newrecv_dist_* oldsend_dist_eqnewdrop_dist_eqnew+calc_forcexchg_forcecalc dist_eqnew+send_dist_eqnew+recv_dist_eq new… send dist_*newrecv_dist_*newcalc dist_eqnew+send dist_eqnew+recv_dist_eqnew+init_dist_*newif not ~post_runsend dist_*new: recv_dist_*oldcalc_forcexchg_forcecalc dist_eqnew+send_dist_eqnew+: recv_dist_eqnew… send dist_*newrecv_dist_*newcalc dist_eqnew+send dist_eqnew+recv_dist_eqnew+- not doing as isn't cleaner due to
[ ]Should be able to disable profiling as may reduce performance[ ]Can probably drop explicitstd::stringconstructors[X]Add ghost points suffix tofluid_momentum_remove_kernel- Was renamed to
fluid_vcm_remove_0
- Was renamed to
- [X] Clean up GPU boundary exchange naming to match others
- [ ] Clean up compilation warnings
- [ ] Fix size_t related narrowing conversion warnings
- [X] Combine profile averages and only output from the first rank
- [X] fluid_correctu_next_2 should be just fluid_correctu_next_0 as only fluid_force_2 [0] is available
- [X] Clarify components acted upon in write, rewrite, sort, resort, and read names
  - Made in the final version of the action code
- [ ] Deduplicate buffer specific inner syncing with macros like the outer ones
- [X] sync_inner_* routines' inner target buffers are larger than they need to be (outside_offset_*[1] = mem_border)
- [X] atom_force_next should clip the stencil to avoid past-end access if an atom goes out of bounds
- [X] Floating point constants without an f suffix will cause the calculation to be in double
- [X] Consistent handling and naming of boundary conditions
  - Prefix with effect (wall, pressure)
  - Switch to global location condition inside of GPU
- [ ] Could skip atom_force_read_1 if gamma scaling is negative for all atom types
- [ ] Should switch atom_force (combined) from fluid to atom units to avoid confusion
- [ ] Replace long with bigint
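Sketch of the unique_ptr-based backtrace cleanup mentioned above (glibc backtrace/backtrace_symbols; illustrative only, not the actual cleanup code):

```cpp
#include <execinfo.h>   // backtrace, backtrace_symbols (glibc)
#include <cstdio>
#include <cstdlib>
#include <memory>

// backtrace_symbols() returns a single malloc'd block holding both the pointer
// array and the strings, so one free() releases everything; wrapping it in a
// unique_ptr keeps the printing code exception safe.
void print_backtrace() {
    void* frames[64];
    int n = backtrace(frames, 64);
    std::unique_ptr<char*[], decltype(&std::free)>
        symbols(backtrace_symbols(frames, n), &std::free);
    if (!symbols) return;
    for (int i = 0; i < n; ++i)
        std::fprintf(stderr, "%s\n", symbols[i]);
}
// Build with -rdynamic so static/internal function names are not stripped.
```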
14. TODO Documentation updates [0/4]
- [ ] need documentation as to why the routines are organized as they are
- [ ] and a description of the comments
- [ ] add tables to documentation (describe the columns and the bracketed terms)
- [ ] note about the HEX file needing to be there for the build
15. TODO Investigate potential GPU options before switch to HIP [5/12]
- [X] Full profile information dump [2/2]
  - [X] Switch from total runtimes to individual time points
  - [X] Output profile information
- [X] Remove waits from sorting routines [1/1]
  - should we stick all the event handlers in a queue
  - [X] Try on atom_sort_position2_z_* routines

    | operation     | wait between | wait at end | speedup |
    |---------------|--------------|-------------|---------|
    | sort_..._bb   | 7.57         | 5.98        | 21.0%   |
    | sort_..._bm   | 6.82         | 5.32        | 22.0%   |
    | combined wall | 7.71         | 5.29        | 31.4%   |
- [ ] Test run with weight set to 1 and with trilinear
  - Is the atom searching or the stencil costing the most?
- [ ] Can optimize complete map writing with CL_MAP_WRITE_INVALIDATE_REGION instead of CL_MAP_WRITE
- [ ] Mapping pinned memory should give the fastest access (see the sketch at the end of this section)
  - NVIDIA OpenCL optimization guide explains how to suggest pinning
- [ ] Use memory mapping to copy buffers
  - fluid distribution reads and force array
  - may be able to work with unified memory
- [X] Record starting and stopping times for profiling to determine key path holdups
- [ ] Non-blocking GPU operations do not necessarily execute unless there is a flush
  - Likely okay under NVIDIA except for Windows (from NVIDIA OpenCL optimization guide)
  - Could be a reason for the slowdown under AMD
- [ ] Where would it be worthwhile to use local memory
- [X] Sometimes fluid_force_next_2 is 3.7x slower (1.03 to 3.81 ms)
  - tracked down to a broken riser card on gra986 asserting hardware power brake on the GPU card
- [ ] Check regular code on thread safe MPI
  - Cluster OpenMPI supports threads if initialized with MPI_Init_thread instead of MPI_Init as in LAMMPS
  - Doesn't seem to be any significant slowdown from replacing MPI_Init with MPI_Init_thread
- [X] Use events between groups
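Sketch of the pinned-memory map/unmap pattern referenced above (standard OpenCL calls; context, queue, and the buffer size are placeholders):

```cpp
#include <CL/cl.h>

// Allocate a pinnable buffer and fill it through map/unmap rather than a
// separate host staging copy.
cl_mem create_and_fill_pinned(cl_context context, cl_command_queue queue, size_t nbytes) {
    cl_int err;
    // CL_MEM_ALLOC_HOST_PTR lets the runtime pin the backing host memory.
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                nbytes, nullptr, &err);

    // CL_MAP_WRITE_INVALIDATE_REGION skips a device->host copy when the whole
    // region is about to be overwritten.
    void* host = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION,
                                    0, nbytes, 0, nullptr, nullptr, &err);
    // ... fill host here ...
    clEnqueueUnmapMemObject(queue, buf, host, 0, nullptr, nullptr);
    return buf;
}
```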
16. Items for future
- Explore some ideas around geogrid code
  - Use a bit vector to track bounce back status for each cell (see the sketch at the end of this section)
    - Could store in texture memory as a constant for the entire simulation
  - Arbitrary geometry support by loading a mask of fluid vs non-fluid
  - Could the new gridgeo system be used to subsume the wall bounce back code (moving walls likely an issue)
- Resync with upstream LAMMPS
  - now includes a cmake build system
- Clean up old subNbx initialization calculations
  - had intended to in a prior DP, but never got around to it
- Changes in force to handle multiple particles contributing to a grid point better
  - Can likely incorporate into current one pass on GPU
- Better CPU/GPU border exchange code (current is all or nothing)
- Verify GPU options work correctly
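A small sketch of the bit-vector bounce back idea above (illustrative only; assumes the 15 lattice directions, 0-14, used in the orientation table of section 8):

```cpp
#include <cstdint>
#include <vector>

// One bit per lattice direction packed into a 16-bit word per cell; a kernel
// would test the bit instead of decoding the type/orientation pair.
using BounceMask = std::uint16_t;

inline void set_bounce_back(BounceMask& m, int dir) { m |= BounceMask(1u << dir); }
inline bool bounces_back(BounceMask m, int dir)     { return (m >> dir) & 1u; }

// Per-cell mask array for an Nbx x Nby x Nbz grid (hypothetical sizes);
// constant for the whole simulation, so it could live in texture/constant memory.
std::vector<BounceMask> make_mask(int Nbx, int Nby, int Nbz) {
    return std::vector<BounceMask>(std::size_t(Nbx) * Nby * Nbz, 0);
}
```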