Colin DP

Table of Contents

Scratch space for tracking progress and items on Colin's DP.

1. DONE Review/get up to speed on changes [4/4]

  • [X] Added variety of items to cleanups todo
  • [X] Reviewed the new boundary items and gridgeo structure (excluding the pit setup code)
    • type
      • 0- bounce backs (bulk fluids nodes)
      • 1- bounce backs (pit geometry boundary fluid nodes)
      • 2- not in fluid
    • orientation (number of bounce backs)
      • 0- no bounce back (couldn't this be used for bulk fluid?)
      • 3- not sure – all the diagonals on a face minus 1?
      • 4- not sure – two faces (inside edge) minus all furthest diagonals?
      • 5- against a face
      • 6- not sure – seems to be face and one of the opposite diagonals?
      • 8- against two faces (inside edge)
      • 10 - against three faces (inside corner)
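
The face/edge/corner counts above (5, 8, 10) are consistent with counting, on a standard D3Q15 lattice (6 axis directions plus 8 corner diagonals), the directions that head into at least one wall; a quick sketch, assuming D3Q15 (which the bounce-back tables elsewhere imply):

```python
from itertools import product

# Standard D3Q15 non-rest directions: 6 axis vectors + 8 corner diagonals.
DIRS = ([(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
        + [d for d in product((-1, 1), repeat=3)])

def bounce_back_count(normals):
    # Count directions heading into at least one wall (positive
    # component along some inward-pointing wall normal).
    return sum(1 for c in DIRS
               if any(sum(a * b for a, b in zip(c, n)) > 0 for n in normals))

face = bounce_back_count([(0, 0, 1)])                          # -> 5
edge = bounce_back_count([(0, 0, 1), (1, 0, 0)])               # -> 8
corner = bounce_back_count([(0, 0, 1), (1, 0, 0), (0, 1, 0)])  # -> 10
```

This reproduces 5/8/10 but doesn't explain the 3, 4, and 6 cases, which remain open above.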
  • [X] Make tyson branch master
    • Master branch (frances/final-fixes tag/frances branch + one commit from Colin) is a subset of the tyson branch.
  • [X] Figure out where MPI branch is at (sync with tyson)
    • MPI branch only uncomments the transfer and exchange calls.

2. DONE Merge in MPI code [10/10]

  • [X] Verify that the MPI_ORDER_C/MPI_ORDER_FORTRAN stuff is okay (per the comment there may be an issue): pass
  • [X] Check if we wound up with both GPU and CPU border exchange active: pass
  • [X] Fix platform initialization on later CUDAs (CPU platform won't iterate)
  • [X] Check that boundary exchange works with at least two processors along each dimension: fail [3/3]
    • Need to set comm_modify cutoff to 2.5x dx to have required particles
    • [X] Periodic MPI deadlocks in first steps
    • [X] Crash at end-of-simulation
    • [X] Split along z is okay, split along y has agreement issues, and split along x loses atoms
  • [X] Fixed bug causing periodic MPI boundary exchange lockups
    • fixviscouslb wasn't initialized if unused so fluid force could occur at different times
  • [X] Fix end-of-simulation crash
  • [X] Double checked no other local state variables are unset: pass
  • [X] Switch boundary exchange to be GPU if only one processor and MPI otherwise
  • [X] Look into potential issues with fluid distribution exchange being in flight: pass
    • Constructor sets bogus fluid distribution exchange in flight
    • setup (called at the start of a run) finishes the one in flight, throws it away, computes the proper one for the run, and starts it in flight
    • Step routines expect a fluid distribution exchange in flight at start and put one in flight at end
    • Destructor finishes fluid distribution exchange in flight
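
The lifecycle above amounts to an invariant that exactly one exchange is always in flight between calls; a toy Python model of that protocol (class and method names hypothetical):

```python
class FluidDistExchanger:
    """Toy model of the 'exactly one exchange always in flight' protocol."""

    def __init__(self):
        # Constructor puts a bogus exchange in flight so the invariant
        # holds even before the first setup().
        self.in_flight = self._start(bogus=True)

    def _start(self, bogus=False):
        return {"bogus": bogus}  # stand-in for starting an async exchange

    def _finish(self):
        done, self.in_flight = self.in_flight, None
        return done

    def setup(self):
        # Finish whatever is pending, throw it away, and start the
        # proper exchange for this run.
        self._finish()
        self.in_flight = self._start()

    def step(self):
        # A step expects an exchange in flight on entry ...
        assert self.in_flight is not None and not self.in_flight["bogus"]
        self._finish()
        # ... and leaves a fresh one in flight on exit.
        self.in_flight = self._start()

    def close(self):
        # Destructor finishes the last pending exchange.
        self._finish()
```

The invariant is what lets each step overlap its exchange with the previous step's compute.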
  • [X] Debug and fix disagreement between simulation results for MPI and non-MPI runs (two processor splits)
    • [X] Get particle dumping working to visualize in paraview
    • [X] Particles are lost when splitting x
      • Code for pressurebcx looks like it incorrectly applies to the internal side too for boundary processes
    • [X] Fix pressurebcx applying to internal side for boundary processes
    • [X] Forces are off when splitting x or y [9/9]
      • [X] Test with individual atoms, bodyforce, and no pressurebcx: fail
        • Issue when split along fixed boundary side, otherwise perfect agreement
      • [X] Test with rigid sphere, bodyforce, and no pressurebcx: pass
      • [X] Test with rigid sphere, bodyforce, no pressurebcx, and pits: pass
      • [X] Test without particle interactions: fail (not caused by particle interaction code)
      • [X] Figure out how to do paraview visualization of differences
        • Append attributes filter on multiple inputs and then calculator for relative error
        • Minimal size test example
        • Error starts in split and then jumps to ends as well
      • [X] Double checked EDGE_Z{0,1} usage for acting on both sides
      • [X] Test with all edge code disabled: failed
      • [X] Test with 3 processors along z: error on both boundaries, starting on left outer one: error on both sides of ends
      • [X] Look into the old end extra-point adjustment code
    • [X] Figure out why splitting along fixed boundary gives different results for MPI [3/3]
      • [X] Add boundary dump code after each GPU routine call
      • [X] Create a serial vs parallel dump comparison program
      • [X] Extend boundary dump to entire field
    • [X] Fix discovered geogrid issue (requires 3 boundary points and not 2)
      • [X] Add additional boundary point to calculation
        • [X] sublattice initialization code uses wholelattice wrap of Nbz: correct as only applies to boundary
        • [X] Update dump routines
        • [X] Update gpu routines to use 3 boundary offsets value and names

3. DONE Add GPU (and CPU) profiling [8/8]

  • [X] Read up on profiling options
  • [X] Turn profiling on in clCreateCommandQueue
  • [X] Add time tracking to kernels [18/18]
    • [X] fluid_dist_eq_initial_3_kernel
    • [X] fluid_dist_eq_next_0_kernel
    • [X] fluid_dist_eqn_next_3_kernel
    • [X] fluid_dist_initial_2_kernel
    • [X] fluid_dist_next_2_kernel
    • [X] fluid_param_initial_2_kernel
    • [X] fluid_param_next_2_kernel
    • [X] fluid_correctu_next_2_kernel
    • [X] xboundaries_fluid_dist_eq_kernel
    • [X] yboundaries_fluid_dist_eq_kernel
    • [X] zboundaries_fluid_dist_eq_kernel
    • [X] xboundaries_fluid_dist_kernel
    • [X] yboundaries_fluid_dist_kernel
    • [X] zboundaries_fluid_dist_kernel
    • [X] xboundaries_fluid_force_kernel
    • [X] yboundaries_fluid_force_kernel
    • [X] zboundaries_fluid_force_kernel
    • [X] remove_momentum_kernel
  • [X] Print profiling at end
  • [X] Add time tracking to memory read/writes [7/7]
    • [X] fluid_dist_3_1_interior_read
    • [X] fluid_dist_eq_3_3_interior_read
    • [X] fluid_force_2_exterior_read
    • [X] fluid_force_2_accumulate_read
    • [X] fluid_dist_3_1_exterior_write
    • [X] fluid_dist_eq_3_3_exterior_write
    • [X] fluid_force_2_accumulate_write
  • [X] Debug segfault introduced in unrelated OpenCL call [1/1]
    • [X] Break apart commits and bisect
      • Accidentally removed queue assignment in clCreateCommandQueue call
  • [X] Revamp profiling to fix leaks and reduce boilerplate [4/4]
    • [X] C++ template magic class to progressively push location information
      • Dreadful failure
    • [X] Abstract with profile records: location, rank, info, value
    • [X] Resolve race conditions/locking by dumping different ranks to different files
    • [X] Switch to binary format for space
      • Compact everything into a location field by combining bits
      • Reduce data required to pass between functions
      • Replace strings with enums
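
One way the compact binary record could look, combining rank and an enum-coded profile point into a single bit-packed location field (field widths and enum names here are hypothetical):

```python
import struct
from enum import IntEnum

class Point(IntEnum):
    # Hypothetical profile-point enums replacing the old strings.
    FLUID_DIST_NEXT_2 = 0
    ATOM_READ = 1
    ATOM_WRITE = 2

def pack_record(rank, point, info, value_ns):
    # Hypothetical layout: 16-bit rank | 10-bit point | 6-bit info,
    # packed with the 64-bit timing value into one little-endian record.
    location = (rank << 16) | (int(point) << 6) | info
    return struct.pack("<Iq", location, value_ns)

def unpack_record(buf):
    location, value_ns = struct.unpack("<Iq", buf)
    return (location >> 16, Point((location >> 6) & 0x3FF),
            location & 0x3F, value_ns)
```

Twelve bytes per event, and one file per rank sidesteps the locking problem noted above.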
  • [X] Add CPU profile points for start and end of fix callouts and expensive CPU operations [8/8]
    • [X] GPU timer requires OpenCL 2.1 (not available on NVIDIA), use CPU timer instead
    • [X] initial_integrate
    • [X] pre_force
    • [X] post_force
    • [X] final_integrate
    • [X] atom_read
    • [X] atom_write
    • [X] fluid_force_accumulate

4. DONE Profile analysis system [9/9]

  • [X] Initial analysis for intra-timestep details
    • [X] Python code to extract the binary data format
    • [X] Switch to R as the Python pandas apply functionality is too slow
      • nearly 8 minutes in Python pandas vs 2 seconds in R tidyverse
      • don't have a good way to load the data directly: use Python to save it as a feather file
    • [X] Initial plot of timestep
      • transfer and atom related routines seem to be taking most time
  • [X] Fix issues revealed in profile code
    • [X] Distinguish between two uses of fluid_dist_read/write routines
    • [X] Fix non-unique fluid_dist_read/write profile points (eq vs non-eq)
  • [X] Add group brackets to step breakout
  • [X] Distinguish between executing and non-executing state in intra analysis
  • [X] Add cross-timestep analysis [2/2]
    • [X] Plot of walltime for each step (group analysis)
      • Neighbour calculation is very expensive (x100 over regular step)
    • [X] Distributions of walltime for each group of GPU calls (inter analysis)
  • [X] Add reference to distribution to group analysis
  • [X] Integrate addition of CPU clocks
    • [X] Redo implicit ordering calculation code (no longer simple)
      • Add substep calculation as CPU and GPU out of order wrt substep
    • [X] Synchronize CPU and GPU clocks from data
      • Have to correct for slight clock skew as well as offset (linear model)
    • [X] Add special cases to handle missing events in CPU only profiles
  • [X] Mark run periods of all LAMMPS fix calls in analysis
  • [X] Drop python code required for initial loading of data
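
The clock synchronization step above (correcting GPU timestamps for skew and offset against the CPU clock) is just a linear fit over matched events; a sketch with synthetic data:

```python
def fit_clock_map(gpu_t, cpu_t):
    # Ordinary least squares for cpu ~= slope*gpu + offset: the slope
    # absorbs clock skew, the offset the difference in start times.
    n = len(gpu_t)
    mg, mc = sum(gpu_t) / n, sum(cpu_t) / n
    slope = (sum((g - mg) * (c - mc) for g, c in zip(gpu_t, cpu_t))
             / sum((g - mg) ** 2 for g in gpu_t))
    return slope, mc - slope * mg

# Synthetic matched events: GPU clock 50 ppm fast, started 123.0 behind.
gpu = [float(t) for t in range(0, 1000, 10)]
cpu = [1.00005 * t + 123.0 for t in gpu]
slope, offset = fit_clock_map(gpu, cpu)
```

Mapped GPU times are then `slope * t + offset`, putting both profiles on the CPU timeline.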

5. DONE Profile discovered optimizations/fixes [5/5]

  • [X] Switch atom_sort_index to write to recorded index position instead of resorting [10/10]
    • 2 nodes x 1 cpu & gpu/processor (p100): 1000 steps of in.polymersphere loop time 96.5 -> 84.3 (-12.6%)
    • [X] Add memory for copy out space (put in end)
    • [X] Adjust memory resize routines
    • [X] Create new GPU function atom_unsort_position2_z and remove old ones
    • [X] Add kernel variables and initialize and deinitialize
    • [X] Add local and global size variables and initialize
    • [X] Calls for initial arguments
    • [X] Update all arguments (need to include force and index too!) on atom resize
    • [X] Add profile enums and remove old ones
    • [X] Call to invoke with final arguments
    • [X] Update atom_read
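
The idea behind atom_unsort_position2_z is that the z-sort already recorded each atom's original index, so restoring order is a single scatter rather than a second sort; in sketch form:

```python
def unsort_scatter(sorted_forces, index):
    # index[i] records where sorted slot i came from, so one scatter
    # restores original atom order -- no second sort needed.
    out = [None] * len(sorted_forces)
    for i, f in enumerate(sorted_forces):
        out[index[i]] = f
    return out

# Forces for atoms 2, 0, 1 after the z-sort, with recorded indices.
restored = unsort_scatter(["f2", "f0", "f1"], [2, 0, 1])
```

On the GPU this is one kernel writing to `out[index[i]]` per thread instead of a full sort pass.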
  • [X] Switch second atom_sort_position2_z to just gather new velocity values instead of resorting everything [10/10]
    • 2 nodes x 1 cpu & gpu/processor (p100): 1000 steps of in.polymersphere loop time 84.3 -> 73.5 (-12.8%)
    • [X] Add memory for copy out space (put in end)
    • [X] Adjust memory resize routines
    • [X] Add new GPU function atom_resort_position2_z
    • [X] Add kernel variable and initialize and de-initialize
    • [X] Add local and global size variables and initialize
    • [X] Calls for initial arguments
    • [X] Update all arguments (need to include index too!) and local and global size on atom resize
    • [X] Add profile enum
    • [X] Call to invoke with final arguments
    • [X] Add atom_rewrite
  • [X] Investigate and fix duplicate running of atom related routines
    • 2 nodes x 1 cpu & gpu/processor (p100): 1000 steps of in.polymersphere loop time 73.5 -> 69.0 (-6.1%)
    • If no forces are added via fix_lb_viscous_gpu and/or fix_lb_rigid_pc_sphere_gpu, a bunch of atom routines could be skipped, including correct_u
    • The positions don't change so the sort mapping is fixed throughout
    • The 1/2-step velocity is the correct one to be using, not the 1-step velocity as is currently done
    • Push atom_force down to post_force (more LAMMPS standard and would like to be able to add forces into the GPU routine)
    • Technically correct_u is a final_integrate sort of thing so put it there
    • Steps overview

      Routine Operation
      fluid_dist_exchange_finish-1 Receive fluid_dist_3_new- exterior [G+S+F] (for previous step)
      fluid_dist_eq_exchange_finish Receive fluid_dist_eq_3_new exterior [1..G+S+F]
      fluid_dist_eqn_next_3 fluid_dist_eq_3_new [0..G+S+F] -> fluid_dist_eqn_3_new [0..G+S+F]
      fluid_dist_next_2 fluid_dist_~{eq,eqn}_3_{new,old} [0..G+S+F], ~fluid_dist_3_old [0..G+S+F], fluid_density_2 [..F] -> fluid_dist_3_new [0..G+S]
      restartWrite fluid_dist_eq_3_new [0..G+S+F], fluid_dist_3_new [0..G+S] -> restart file (if requested)
      fluid_dist_exchange_start Send fluid_dist_3_new interior [-G-S-F]
      fluid_param_next_2_start Start fluid_dist_3_new [0..G+S] -> fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S]
      fluid_param_next_2_finish End fluid_dist_3_new [0..G+S] -> fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S]
      atom_write Atom [G] -> atom_position2 [G], atom_velocity [G], atom_type [G], atom_mass [G]
      atom_position2_z_index atom_position2 [G] -> atom_position2_z [G], atom_index [G]
      atom_sort_position2_z Sort by z atom_position2 [G], atom_velocity [G], atom_mass [G], atom_type [G], atom_position2_z [G], atom_index [G]
      atom_force fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S], atom_position2 [G], atom_velocity [G], atom_mass [G], atom_type [G] -> atom_force [G]
      fluid_force_next_2 atom_position2 [G], atom_position2_z [G], atom_force [G] -> local fluid_force_2 [0..G+S]
      fluid_force_exchange_start Send local fluid_force_2 exterior [0..G+S]
      fluid_force_exchange_finish Receive remote fluid_force_2 interior [-G-S..0]
        Local fluid_force_2 [0], remote fluid_force_2 [0] -> fluid_force_2 [0]
      fluid_correctu_next_2 fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S], fluid_force_2 [0..G+S] -> fluid_velocity_2 [0..G+S]
      fluid_dist_eq_next_0+1 fluid_velocity_2 [0], fluid_density_2 [0..D], fluid_force_2 [0] -> fluid_dist_eq_3_new [0] (for next step)
      fluid_dist_eq_exchange_start+1 Send fluid_dist_eq_3_new interior [-G-S-F..-1] (for next step)
      atom_sort_index/atom_unsort_position2_z Sort by index atom_force [G], atom_index [G]
      atom_read atom_force [G] -> hydroF [G]
    • Reorganization

          Pre-fix_lb_rigid_pc_sphere_gpu Post-fix_lb_rigid_pc_sphere_gpu Post-fix_lb_rigid_pc_sphere_gpu New
      Step Lammps   (with fix_lb_viscous_gpu) (without fix_lb_viscous_gpu)  
      initial_integrate coordinates 1 fluid_dist_exchange_finish-1 fluid_dist_exchange_finish-1 fluid_dist_exchange_finish-1 fluid_dist_exchange_finish-1
        velocities 1/2 fluid_dist_eq_exchange_finish fluid_dist_eq_exchange_finish fluid_dist_eq_exchange_finish fluid_dist_eq_exchange_finish
          fluid_dist_eqn_next_3 fluid_dist_eqn_next_3 fluid_dist_eqn_next_3 fluid_dist_eqn_next_3
          fluid_dist_next_2 fluid_dist_next_2 fluid_dist_next_2 fluid_dist_next_2
          restartWrite restartWrite restartWrite restartWrite
          fluid_dist_exchange_start fluid_dist_exchange_start fluid_dist_exchange_start fluid_dist_exchange_start
          fluid_param_next_2_start fluid_param_next_2_start fluid_param_next_2_start fluid_param_next_2_start
      post_integrate          
      pre_exchange          
      pre_neighbour          
      post_neighbour          
      pre_force   fluid_param_next_2_finish fluid_param_next_2_finish fluid_param_next_2_finish fluid_param_next_2_finish
          atom_write atom_write atom_write atom_write
          atom_position2_z_index atom_position2_z_index atom_position2_z_index atom_position2_z_index
          atom_sort_position2_z atom_sort_position2_z atom_sort_position2_z atom_sort_position2_z
          atom_force atom_force atom_force  
          fluid_force_next_2 fluid_force_next_2    
          fluid_force_exchange_start fluid_force_exchange_start    
          fluid_force_exchange_finish fluid_force_exchange_finish    
            fluid_correctu_next_2    
          fluid_dist_eq_next_0+1 fluid_dist_eq_next_0+1    
          fluid_dist_eq_exchange_start+1 fluid_dist_eq_exchange_start+1    
      pre_reverse          
      post_force additional forces atom_sort_index atom_sort_index atom_sort_index atom_force
          atom_read atom_read atom_read fluid_force_next_2
                fluid_force_exchange_start
                fluid_force_exchange_finish
                atom_unsort_position2_z
                atom_read
      final_integrate velocity 1       fluid_correctu_next_2
                fluid_dist_eq_next_0+1
                fluid_dist_eq_exchange_start+1
      end_of_step       atom_write  
              atom_position2_z_index  
              atom_sort_position2_z  
              atom_force  
              fluid_force_next_2  
              fluid_force_exchange_start  
              fluid_force_exchange_finish  
              fluid_correctu_next_2  
              fluid_dist_eq_next_0+1  
              fluid_dist_eq_exchange_start+1  
  • [X] Queuing multiple jobs appears to sometimes slow down unrelated computation routines
    • Defective riser card on GPU asserting power brake and cutting clocks to 1/3rd
  • [X] Duplicate rectangle read in profile code (the code wasn't currently used)

6. DONE Free up more concurrency [4/4]

  • [X]

    Remove blocking OpenCL calls (all explicit dependencies)

    Done Routine Input Buffers Source Output Buffers
      INITIAL INTEGRATE      
    X fluid_dist_sync_13_finish (prev) fluid_dist_3 [..0] (prev) fluid_vcm_remove_0 [..0] (prev) fluid_dist_3 [1..G+S+F] (prev)
    X fluid_dist_eq_sync_13_finish (prev) fluid_dist_eq_3 [..0] (prev) fluid_vcm_remove_0 [..0] (prev) fluid_dist_eq_3 [1..G+S+F] (prev)
    X fluid_dist_eqn_sync_13_finish (prev) fluid_dist_eqn_3 [..0] (prev) fluid_vcm_remove_0 [..0] (prev) fluid_dist_eqn_3 [1..G+S+F] (prev)
    X fluid_dist_eq_sync_13_finish (vcm) fluid_dist_eq_3 [..0] (vcm) fluid_dist_eq_next_0 [..0] fluid_dist_eq_3 [1..G+S+F] (vcm)
    X fluid_dist_eqn_next_3 fluid_dist_eq_3 [..G+S+F] (vcm) fluid_dist_eq_sync_13_finish [1..G+S+F] (vcm) fluid_dist_eqn_3 [..G+S+F] (vcm)
    X fluid_dist_next_2 fluid_dist_3 [..G+S+F] (prev) fluid_dist_sync_13_finish [1..G+S+F] (prev) fluid_dist_3 [..G+S] (vcm)
        fluid_dist_eq_3 [..G+S+F] (prev) fluid_dist_eq_sync_13_finish [1..G+S+F] (prev)  
        fluid_dist_eqn_3 [..G+S+F] (prev) fluid_dist_eqn_sync_13_finish [1..G+S+F] (prev)  
        fluid_dist_eq_3 [..G+S+F] (vcm) fluid_dist_eq_sync_13_finish [1..G+S+F]  
        fluid_dist_eqn_3 [..G+S+F] (vcm) fluid_dist_eqn_next_3 [..G+S+F]  
    X fluid_param_next_2 fluid_dist_3 [..G+S] (vcm) fluid_dist_next_2 [..G+S] (vcm) fluid_density_2 [..G+S]
            fluid_velocity_2 [..G+S] (half) (vcm)
      PRE FORCE      
    X atom_nonforce_write_1 atom [..G]   atom_position2 [..G]
            atom_velocity [..G]
            atom_type [..G]
            atom_mass [..G]
    X atom_orders_compute_1 atom_position2 [..G] atom_nonforce_write_1 [..G] atom_position2_z [..G]
            atom_index [..G]
    X atom_nonforce_sort_1 atom_position2 [..G] atom_nonforce_write_1 [..G] atom_position2 [..G] (sorted)
        atom_velocity [..G] atom_nonforce_write_1 [..G] atom_velocity [..G] (sorted)
        atom_mass [..G] atom_nonforce_write_1 [..G] atom_mass [..G] (sorted)
        atom_type [..G] atom_nonforce_write_1 [..G] atom_type [..G] (sorted)
        atom_position2_z [..G] atom_orders_compute_1 [..G] atom_position2_z [..G] (sorted)
        atom_index [..G] atom_orders_compute_1 [..G] atom_index [..G] (sorted)
      POST FORCE      
    X atom_force_write_1 atom [..G]   atom_force [..G]
    X atom_force_next_1 fluid_density_2 [..G+S] fluid_param_next_2 [..G+S] atom_force [..G] (sorted)
        fluid_velocity_2 [..G+S] (half) (vcm) fluid_param_next_2 [..G+S] (half) (vcm)  
        atom_position2 [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
        atom_velocity [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
        atom_mass [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
        atom_type [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
    X fluid_force_next_2 atom_position2 [..G] (sorted) atom_inputs_sort_1 [..G] (sorted) fluid_force_2 [..G+S] (local)
        atom_position2_z [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
        atom_force [..G] (sorted) atom_force_next_1 [..G] (sorted)  
    X fluid_force_sync_20_start fluid_force_2 [0..G+S] (local) fluid_force_next_2 [..G+S] (local)  
    X fluid_force_sync_20_finish fluid_force_2 [-G-S..0] (local) fluid_force_next_2 [..G+S] (local) fluid_force_2 [-G-S..0]
    X fluid_param_u_correct_0 fluid_velocity_2 [..0] (half) (vcm) fluid_param_next_2 [..G+S] (half) (vcm) fluid_velocity_2 [..0] (vcm)
        fluid_density_2 [..0] fluid_param_next_2 [..G+S]  
        fluid_force_2 [..0] fluid_force_sync_20_finish [..0]  
    X atom_force_unsort_1 atom_force [..G] (sorted) atom_force_next_1 [..G] (sorted) atom_force [..G]
        atom_index [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
    X atom_force_read_1 atom_force [..G] atom_force_unsort_1 [..G] atom [..G]
      FINAL INTEGRATE      
      END OF STEP      
    X vcm_total_calc_0 atom [..G]   vcm_total
        fluid_density_2 [..0] fluid_param_next_2 [..G+S]  
        fluid_velocity_2 [..0] (vcm) fluid_param_u_correct_0 [..0] (vcm)  
    X fluid_vcm_remove_0 vcm_total vcm_total_calc_0 fluid_velocity_2 [..0]
        fluid_velocity_2 [..0] (vcm) fluid_param_u_correct_0 [..0] (vcm) fluid_dist_3 [..0]
        fluid_dist_3 [..0] (vcm) fluid_dist_next_2 [..G+S] (vcm) fluid_dist_eq_3 [..0]
        fluid_dist_eq_3 [..0] (vcm) fluid_dist_eq_next_0 [..0] (vcm) fluid_dist_eqn_3 [..0]
        fluid_dist_eqn_3 [..0] (vcm) fluid_dist_eqn_next_3 [..G+S+F] (vcm)  
    X restartWrite fluid_force_2 [..0] fluid_force_sync_20_finish [..0]  
        fluid_dist_3 [..0] fluid_vcm_remove_0 [..0]  
        fluid_dist_eq_3 [..0] fluid_vcm_remove_0 [..0]  
        fluid_dist_eqn_3 [..0] fluid_vcm_remove_0 [..0]  
    X fluid_dist_sync_13_start fluid_dist_3 [-G-S-F..-1] fluid_vcm_remove_0 [..0]  
    X fluid_dist_eq_sync_13_start fluid_dist_eq_3 [-G-S-F..-1] fluid_vcm_remove_0 [..0]  
    X fluid_dist_eqn_sync_13_start fluid_dist_eqn_3 [-G-S-F..-1] fluid_vcm_remove_0 [..0]  
    X fluid_dist_eq_next_0 (next) fluid_density_2 [..D] fluid_param_next_2 [..G+S] fluid_dist_eq_3 [..0] (next) (vcm)
        fluid_velocity_2 [..0] fluid_vcm_remove_0 [..0]  
        fluid_force_2 [..0] fluid_force_sync_20_finish [..0]  
    X fluid_dist_eq_sync_13_start (next) (vcm) fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm) fluid_dist_eq_next_0 [..0] (vcm) (next)  
    • fluid_correctu_next_0 (now fluid_param_u_correct_0) adjusts fluid_velocity_2 [0]
      • doesn't adjust any of fluid_dist*_3 [0] (this is correct)
    • fluid_momentum_remove (now fluid_vcm_remove_0) adjusts fluid_dist_3 [0] and fluid_dist_eq_3 [0]
      • should use corrected fluid_velocity_2 [0] instead of computing uncorrected version from fluid_dist*_3 [0]
      • need to also correct fluid_dist_eqn_3 [0..G+S+F] for fluid_dist_next_2 if using exponential integrator
      • can't compute border points [..G+S+F] (only have fluid_velocity_2 [0]) so need to transfer these
      • need to delay fluid_dist_exchange_3_start until after this
      • need to send borders on other outputs too if using exponential integrator
    • fluid_dist_eq_next_0 computes fluid_dist_eq_3 [0] (next)
      • from fluid_velocity_2 [0] (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
    • fluid_dist_eqn_next_3 computes fluid_dist_eqn_3 [0..G+S+F] (next)
      • from fluid_dist_eq_3 [0] (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
      • could run earlier if fluid_vcm_remove_0 isn't being run this timestep
    • fluid_dist_next_2 computes fluid_dist_3 [0..G+S] (next)
      • from fluid_dist_3 [0..G+S] (not affected by fluid_correctu_next_0 and not affected by fluid_momentum_remove)
      • from fluid_dist_{eq,eqn}_3 [0..G+S+F] (not affected by fluid_correctu_next_0 but affected by fluid_momentum_remove)
      • from fluid_dist_{eq,eqn}_3 [0..G+S+F] (next) (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
    INITIAL INTEGRATE  
    fluid_dist_exchange_13_finish (prev) fluid_dist_3_action (impl)
    fluid_dist_eq_exchange_13_finish (prev) fluid_dist_eq_3_action (impl)
    fluid_dist_eqn_exchange_13_finish (prev) fluid_dist_eqn_3_action (impl)
    fluid_dist_eq_exchange_13_vcm_finish fluid_dist_eq_3_vcm_action (impl)
    fluid_dist_eqn_next_3 fluid_dist_eqn_3_vcm_action
    fluid_dist_next_2 fluid_dist_2_vcm_action
    restartWrite  
    fluid_param_next_2 fluid_velocity_2_half_vcm_action
      fluid_density_2_action
    PRE FORCE  
    atom_lammps_write_1 atom_position1_action
      atom_velocity_1_action
      atom_type_1_action
      atom_mass_1_action
    atom_orders_compute_1 atom_position2_z_1_action
      atom_index_1_action
    atom_inputs_sort_1 atom_position2_1_sorted_action
      atom_velocity_1_sorted_action
      atom_mass_1_sorted_action
      atom_type_1_sorted_action
      atom_position2_z_1_sorted_action
      atom_index_1_sorted_action
    POST FORCE  
    atom_force_next_1 atom_force_1_sorted_action
    fluid_force_next_2 fluid_force_2_local_action
    fluid_force_exchange_02_start  
    fluid_force_exchange_20_finish fluid_force_20_remote_action (impl)
    fluid_force_combine_20 fluid_force_0_action
    atom_force_unsort_1 atom_force_1_action
    atom_force_read_1 FINAL INTEGRATE (impl)
    FINAL INTEGRATE  
    fluid_param_u_correct_0 fluid_velocity_0_vcm_action
    (formerly fluid_correctu_next_0)
    END OF STEP  
    vcm_total_calc_0 vcm_total_action (impl)
    fluid_vcm_remove_0 fluid_velocity_0_action
    (formerly fluid_momentum_remove) fluid_dist_0_action
      fluid_dist_eq_0_action
      fluid_dist_eqn_0_action
    fluid_dist_exchange_13_start  
    fluid_dist_eq_exchange_13_start  
    fluid_dist_eqn_exchange_13_start  
    fluid_dist_eq_next_0 (next) fluid_dist_eq_0_vcm_action
    fluid_dist_eq_exchange_13_vcm_start (next)  

    dependencies.svg
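
With the blocking calls removed, execution order is carried purely by the input-buffer/source dependencies tabulated above (as OpenCL event wait lists); a toy topological ordering over a few of those edges:

```python
from graphlib import TopologicalSorter

# A few dependency edges from the table: routine -> routines producing
# its input buffers (illustrative subset only).
deps = {
    "fluid_param_next_2": {"fluid_dist_next_2"},
    "atom_force_next_1": {"fluid_param_next_2", "atom_inputs_sort_1"},
    "fluid_force_next_2": {"atom_inputs_sort_1", "atom_force_next_1"},
    "fluid_force_sync_20_start": {"fluid_force_next_2"},
    "fluid_force_sync_20_finish": {"fluid_force_sync_20_start"},
    "fluid_param_u_correct_0": {"fluid_param_next_2",
                                "fluid_force_sync_20_finish"},
}

# Any order respecting these edges is a valid non-blocking schedule;
# the runtime is free to overlap everything else.
order = list(TopologicalSorter(deps).static_order())
```

In the real code each edge is an event in a kernel's wait list rather than an explicit sort.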

  • [X] Print routines should wait on actions/verify range
    • have to change pointers to values
  • [X] Update profile analysis code for changes
  • [X] Can float fluid_param_u_correct_0 up to post_force

7. DONE Change ordering of array in GPU version [12/12]

  • Can convert routines one at a time by wrapping with transpose code
  • [X] Have a look at paraview dump code
  • [X] Switch order of array in kernels [7/7]
    • As we are manually calculating from a 1D index, it is technically aligned now
    • [X] cl_mem fluid_gridgeo_3_mem
    • [X] cl_mem fluid_dist_3_mem
    • [X] cl_mem fluid_dist_eq_3_mem
    • [X] cl_mem fluid_dist_eqn_3_mem
    • [X] cl_mem fluid_density_2_mem
    • [X] cl_mem fluid_velocity_2_mem
    • [X] cl_mem fluid_force_2_mem
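
The order switch changes which index is fastest-varying in the manual 1D offset calculation; a sketch of the two layouts (grid sizes hypothetical):

```python
# Hypothetical local grid sizes (including ghost layers).
NX, NY, NZ = 8, 6, 4

def idx_c_order(i, j, k):
    # Old layout: k varies fastest (C order), i takes the largest stride.
    return (i * NY + j) * NZ + k

def idx_fortran_order(i, j, k):
    # New layout: i varies fastest (Fortran order), k takes the largest
    # stride -- hence the even/odd exchange offset moving from i to k.
    return (k * NY + j) * NX + i
```

This is also why old restart files stop being valid: the same bytes decode to a transposed grid.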
  • [X] Fix offsetting in dump routines [7/7]
    • [X] print_fluid
    • [X] print_internal
    • [X] buffer_read_rectangle
    • [X] buffer_write_rectangle
    • [X] fluid_force_accumulate_rectangle
    • [X] calc_mass_momentum
    • [X] calc_MPT
  • [X] Switch order of fluid_grid_geo_3_mem
    • Map memory, directly initialize with strides, and drop sublattice
  • [X] Reverse i,j,k loops for optimal stepping
  • [X] Test new order [2/2]
    • [X] Test MPI boundary exchange
    • [X] Test GPU boundary exchange
  • [X] Fix initialization of fluid_grid_geo_3_mem
    • Copied from sublattice via buffer_create with CL_MEM_COPY_HOST_PTR
    • Wrong type in sizeof for memory map
  • [X] Fix MPI boundary exchange
    • Hadn't updated even/odd offset to now be on k instead of i (largest step)
  • [X] Switch to 3D threads
  • [X] Investigate Z-curve coordinate for atom positions (current is C order)
  • [X] Test restart file
    • Are old restart files still valid: no, now in Fortran order
  • [X] Fix restart files under MPI
    • Wrong structure element count in MPI file view type declaration
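
For the Z-curve investigation: a Morton key interleaves the coordinate bits so atoms close in space tend to get close keys, unlike plain C order; a minimal 10-bit-per-axis encoder:

```python
def part1by2(x):
    # Spread the low 10 bits of x so bit b lands at bit 3*b.
    x &= 0x3FF
    x = (x ^ (x << 16)) & 0xFF0000FF
    x = (x ^ (x << 8)) & 0x0300F00F
    x = (x ^ (x << 4)) & 0x030C30C3
    x = (x ^ (x << 2)) & 0x09249249
    return x

def morton3(i, j, k):
    # Interleave 10-bit i, j, k into one 30-bit Z-curve key.
    return part1by2(i) | (part1by2(j) << 1) | (part1by2(k) << 2)
```

Sorting atoms by `morton3` instead of z alone would improve locality of the stencil gathers, at the cost of a slightly more expensive key.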

8. DONE Resync/compare with reference code/LAMMPS [8/8]

  • [X] Comparison with reference code
    • Reference code recomputes forces after the full step when not used with fix_lb_viscous_gpu
      • this was introduced with fix_lb_rigid_pc_sphere
      • recomputed fluid force used to recompute equilibrium distribution too
    • Leaky borders
      • force stencils can extend into walls
      • pits derivative isn't one-sided along walls (is in CPU code, but doesn't matter as kappa_lb=0)
    • pressurebcx treats duplicate end point differently causing ghost point divergence
  • [X] Update pressurebcx to be symmetrical
    • Use the same point on both sides, with the adjustment applied on both left and right
    • Update to the newer density adjustment method
  • [X] Save restart file at end point like particles are saved
    • Need to save the velocity as well, as it is updated by fluid_param_u_correct_0
    • Need to update distribution velocities to give final step restart (see cpu code)
  • [X] Leaky edges with pits (shows as dependency on number of ghost points transferred)
    • GPU bbf only bounces back diagonals on edges if both components are wall normal

      • orientation 7 isn't bouncing back directions: 9-13 (likely reversed with 6)
      • orientation 20 isn't bouncing back directions: 7-8, 13-14 (likely reversed with 19)
      ORI Wall Normal Constructive Solid Geometry Bounce Back Components Original (if different)
      0 - -  
      1 (+1, 0, 0) 1, 7,10,11,14  
      2 ( 0,+1, 0) 2, 7, 8,11,12  
      3 ( 0, 0,+1) 5, 7, 8, 9,10  
      4 (+1, 0, 0) and ( 0, 0,+1) 7,10 1, 5, 7,10
      5 (+1, 0, 0) or ( 0, 0,+1) 1, 5, 7, 8, 9,10,11,14  
      6 ( 0,+1, 0) and ( 0, 0,+1) 7, 8 2, 5, 7, 8, 9,10,11,12
      7 ( 0,+1, 0) or ( 0, 0,+1) 2, 5, 7, 8, 9,10,11,12 2, 5, 7, 8
      8 (-1, 0, 0) or ( 0,-1, 0) 3, 4, 8, 9,10,12,13,14  
      9 (-1, 0, 0) or ( 0,+1, 0) 2, 3, 7, 8, 9,11,12,13  
      10 (+1, 0, 0) and ( 0, 0,+1) or ( 0,+1, 0) and ( 0, 0,+1) 7, 8,10  
      11 (+1, 0, 0) and ( 0, 0,+1) or ( 0,-1, 0) and ( 0, 0,+1) 7, 9,10  
      12 (-1, 0, 0) or ( 0,+1, 0) or ( 0, 0, +1) 2, 3, 5, 7, 8, 9,10,11,12,13  
      13 (-1, 0, 0) or ( 0,-1, 0) or ( 0, 0, +1) 3, 4, 5, 7, 8, 9,10,12,13,14  
      14 (-1, 0, 0) 3, 8, 9,12,13  
      15 ( 0,-1, 0) 4, 9,10,13,14  
      16 ( 0, 0,-1) 6,11,12,13,14  
      17 (-1, 0, 0) and ( 0, 0, 1) 8, 9 3, 5, 8, 9
      18 (-1, 0, 0) or ( 0, 0, 1) 3, 5, 7, 8, 9,10,12,13  
      19 ( 0,-1, 0) and ( 0, 0, 1) 9,10 4, 5, 7, 8, 9,10,13,14
      20 ( 0,-1, 0) or ( 0, 0, 1) 4, 5, 7, 8, 9,10,13,14 4, 5, 9,10
      21 ( 1, 0, 0) or ( 0,-1, 0) 1, 4, 7, 9,10,11,13,14  
      22 ( 1, 0, 0) or ( 0, 1, 0) 1, 2, 7, 8,10,11,12,14  
      23 (-1, 0, 0) and ( 0, 0, 1) or ( 0, 1, 0) and ( 0, 0, 1) 7, 8, 9  
      24 (-1, 0, 0) and ( 0, 0, 1) or ( 0,-1, 0) and ( 0, 0, 1) 8, 9,10  
      25 ( 1, 0, 0) or ( 0, 1, 0) or ( 0, 0, 1) 1, 2, 5, 7, 8, 9,10,11,12,14  
      26 ( 1, 0, 0) or ( 0,-1, 0) or ( 0, 0, 1) 1, 4, 5, 7, 8, 9,10,11,13,14  
      27 ( 0, 1, 0) or ( 0, 0,-1) 2, 6, 7, 8,11,12,13,14  
      28 ( 0,-1, 0) or ( 0, 0,-1) 4, 6, 9,10,11,12,13,14  
      29 ( 1, 0, 0) and ( 0, 0, 1) or ( 0, 1, 0) 2, 7, 8,10,11,12  
      30 ( 1, 0, 0) and ( 0, 0, 1) or ( 0,-1, 0) 4, 7, 9,10,13,14  
      31 (-1, 0, 0) and ( 0, 0, 1) or ( 0, 1, 0) 2, 7, 8, 9,11,12  
      32 (-1, 0, 0) and ( 0, 0, 1) or ( 0,-1, 0) 4, 8, 9,10,13,14  
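
Assuming the D3Q15 numbering the face rows imply (1-6 = +x, +y, -x, -y, +z, -z; 7-14 the corner diagonals), the Bounce Back Components column is reproducible as set algebra: each wall contributes the directions with a positive component along its normal, CSG "or" is union and "and" is intersection. A sketch:

```python
# D3Q15 direction vectors, numbered as the face rows imply:
# 1-6 axis directions, 7-14 corner diagonals.
C = {1: (1, 0, 0), 2: (0, 1, 0), 3: (-1, 0, 0), 4: (0, -1, 0),
     5: (0, 0, 1), 6: (0, 0, -1),
     7: (1, 1, 1), 8: (-1, 1, 1), 9: (-1, -1, 1), 10: (1, -1, 1),
     11: (1, 1, -1), 12: (-1, 1, -1), 13: (-1, -1, -1), 14: (1, -1, -1)}

def into_wall(normal):
    # Directions with a positive component along the wall normal,
    # i.e. the populations that must be bounced back at that wall.
    return {d for d, c in C.items()
            if sum(a * b for a, b in zip(c, normal)) > 0}

face = into_wall((1, 0, 0))                          # ORI 1
edge = into_wall((1, 0, 0)) & into_wall((0, 0, 1))   # ORI 4: "and"
union = into_wall((1, 0, 0)) | into_wall((0, 0, 1))  # ORI 5: "or"
```

This also makes the suspected 6/7 and 19/20 swaps above mechanically checkable against the table.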
  • [X] Clean up old branches (just master from main tyson)
  • [X] Add vector outputs as in current non-GPU code [2/2]
    • See CPU compute_vector, documented at the end of the initial comments
    • [X] Scalar is temperature; isn't working 100%, can skip
      • [X] Add atom_kinetic (temperature) mirroring atom_spread (allocation, etc.)
      • [X] Need to unsort these (actually don't, as we just need to sum)
      • [X] Compute degrees of freedom for atom_spread
    • [X] Vector (length 4) is total mass and total momentum
  • [X] Resync with upstream LAMMPS (make some notes on this)
    • need to build the include file for the kernel code (could include it)
  • [X] Update IO, parsing, and string handling

9. DONE Selection of GPU to use [2/2]

  • [X] Specify which OpenCL devices to use
    • Default selection
    • Hostname based overrides
  • [X] Print OpenCL devices used

10. DONE Implement new two pass interpolation [4/4]

  • Allows exact node mass calculation (non-set gamma)
  • Mass ratio requires particle and fluid mass
  • Can re-weight forces on particle
  • [X] Compute the normalization weights
    • Can do the weight interpolation in initial_integrate if we ensure nve goes first (need to warn the user if that's not the case)
    • Or just put it in post_integrate (put it in pre_force for now as we haven't hooked post_integrate)
  • [X]

    Get forces onto the GPU for spreading between fluid and particles to tie velocities together

    • Force on particle k = -hydro_force + m_particle/(m_particle + m_stencil_fluid) * F_particle-particle,k
    • Force on fluid from particle k = hydro_force + m_stencil_fluid/(m_particle + m_stencil_fluid) * F_particle-particle,k
    • Can just store the first as the difference works out to F_particle-particle,k
    • stencil_density*area * dm_lb is m_stencil_fluid (at end of compute gamma interaction factor comment)
    • The lbviscous routine then would just overwrite the force

    https://www.sciencedirect.com/science/article/pii/S0010465522000364?via%3Dihub

    • Need interpolated fluid mass for CPU calculation (both for force calculation and for temperature)
    • Need to send LAMMPS forces to GPU for force calculation
  • [X] Redo initialization and restart to maintain momentum
  • [X] Implement gamma scaling and negative value hack
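The force split above can be sketched as follows (a minimal illustration with hypothetical names; `f_pp` is the particle-particle force on particle k, `hydro` the hydrodynamic force, `m_p` the particle mass, and `m_sf` the interpolated stencil fluid mass):

```python
def split_force(f_pp, hydro, m_p, m_sf):
    """Split the particle-particle force between particle and fluid."""
    frac_p = m_p / (m_p + m_sf)              # particle's share of the pair force
    f_particle = -hydro + frac_p * f_pp      # force applied to particle k
    f_fluid = hydro + (1.0 - frac_p) * f_pp  # force spread onto the fluid
    return f_particle, f_fluid

# Only the first needs storing: subtracting it from f_pp recovers f_fluid,
# since the hydrodynamic terms cancel and the mass fractions sum to one.
fp, ff = split_force(f_pp=2.0, hydro=0.5, m_p=3.0, m_sf=1.0)
assert abs((2.0 - fp) - ff) < 1e-12
```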

11. TODO Resync with reference CPU code [2/4]

  • [X] Linear initialization
  • [X] Stencils [2/2]
    • [X] Add IBM3
    • [X] Replace Peskin with Keys
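For reference, a sketch of the Keys cubic kernel that replaces the Peskin stencil (this is the textbook a = -1/2 form of the Keys kernel, not lifted from the code):

```python
def keys(x):
    """Keys cubic interpolation kernel (a = -1/2), support |x| < 2."""
    x = abs(x)
    if x < 1.0:
        return 1.5 * x**3 - 2.5 * x**2 + 1.0
    if x < 2.0:
        return -0.5 * x**3 + 2.5 * x**2 - 4.0 * x + 2.0
    return 0.0

# The weights over the 4-point stencil form a partition of unity.
r = 0.3  # fractional offset of the particle within a cell
w = [keys(r - i) for i in (-1, 0, 1, 2)]
assert abs(sum(w) - 1.0) < 1e-12
```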
  • [-] Remove higher order variant [1/2]
    • [X] Remove explicit code
    • [ ]

      Remove extra transfers

      • fluid_dist_next_2: fluid_dist_eq_3_{old,new}, fluid_dist_eqn_3_old
      • fluid_vcm_remove_0: depends on what is needed
      Routine   Input Buffers Source   Output Buffers
      RESTART          
      InitializeFirstRun         fluid_force_2 [..0]
                fluid_dist_3 [..0]
                fluid_dist_eq_3 [..0]
                fluid_dist_eqn_3 [..0]
      fluid_dist_sync_13_start   fluid_dist_3 [-G-S-F..-1] InitializeFirstRun [..0]   fluid_dist_3 [-G-S-F..-1]
      fluid_dist_eq_sync_13_start   fluid_dist_eq_3 [-G-S-F..-1] InitializeFirstRun [..0]   fluid_dist_eq_3 [-G-S-F..-1]
      fluid_dist_eqn_sync_13_start   fluid_dist_eqn_3 [-G-S-F..-1] InitializeFirstRun [..0]   fluid_dist_eqn_3 [-G-S-F..-1]
      SETUP          
      fluid_dist_eq_sync_13_finish (next) (vcm)   fluid_dist_eq_3 [..0] (next) (vcm)     fluid_dist_eq_3 [..G+S+F] (next) (vcm)
          fluid_dist_eq_3 [1..G+S+F] (next) (vcm)      
      fluid_dist_sync_13_finish   fluid_dist_3 [..0]     fluid_dist_3 [..G+S+F]
          fluid_dist_3 [1..G+S+F]      
      manual reset   fluid_dist_3 [..G+S+F]   X fluid_dist_3 [..G+S+F] (vcm)
      fluid_param_next_2   fluid_dist_3 [..G+S] (vcm)   X fluid_density_2 [..G+S/D]
                fluid_velocity_2 [..G+S] (half) (vcm)
      atom_nonforce_write_1   atom [..G]     atom_position2 [..G]
                atom_velocity [..G]
                atom_type [..G]
                atom_mass [..G]
      atom_orders_compute_1   atom_position2 [..G]     atom_position2_z [..G]
                atom_index [..G]
      atom_nonforce_sort_1   atom_position2 [..G]     atom_position2 [..G] (sorted)
          atom_velocity [..G]     atom_velocity [..G] (sorted)
          atom_mass [..G]     atom_mass [..G] (sorted)
          atom_type [..G]     atom_type [..G] (sorted)
          atom_position2_z [..G]     atom_position2_z [..G] (sorted)
          atom_index [..G]   X atom_index [..G] (sorted)
      fluid_weight_sum_2   atom_position2 [..G] (sorted)     fluid_weight_2 [..G+S] (local)
          atom_position2_z [G..] (sorted)      
      fluid_weight_sync_22_start   fluid_weight_2 [-G-S..G+S] (local)     fluid_weight_2 [-G-S..G+S] (local)
      atom_force_write_1   atom [..G]     atom_force [..G]
      atom_force_sort_1   atom_force [..G]     atom_force [..G] (sorted)
          atom_index [..G] (sorted)      
      fluid_weight_sync_22_finish   fluid_weight_2 [..G+S] (local)     fluid_weight_2 [-G-S..G+S]
          fluid_weight_2 [-G-S..G+S] (remote)      
      atom_force_next_1   fluid_weight_2 [..G+S]     atom_spread [..G] (sorted)
          fluid_density_2 [..G+S]     atom_kinetic [..G] (sorted)
          fluid_velocity_2 [..G+S] (half) (vcm)     atom_force [..G] (fluid) (sorted)
          atom_position2 [..G] (sorted),      
          atom_velocity [..G] (sorted)      
          atom_mass [..G] (sorted)      
          atom_type [..G] (sorted)      
          atom_force [..G] (sorted)      
      fluid_force_next_2 (half)   atom_position2 [..G] (sorted)     fluid_force_2 [..G+S] (local)
          atom_position2_z [..G] (sorted)      
          atom_spread [..G] (sorted),      
          atom_force [..G] (fluid) (sorted)      
          fluid_weight_2 [..G+S]      
          fluid_force_2 [..0]      
      fluid_force_sync_20_start   fluid_force_2 [0..G+S] (local)     fluid_force_2 [0..G+S] (local)
      fluid_force_sync_20_finish   fluid_force_2 [0..G+S] (local)   X fluid_force_2 [..0]
          fluid_force_2 [-G-S..0] (remote)      
      fluid_param_u_correct_0   fluid_density_2 [..0]   X fluid_velocity_2 [..0] (vcm)
          fluid_velocity_2 [..0] (half) (vcm)      
          fluid_force_2 [..0]      
      atom_force_unsort_1   atom_force [..G] (fluid) (sorted)   X atom_force [..G] (fluid)
        X atom_index [..G] (sorted)      
      atom_force_read_1 X atom_force [..G] (fluid) atom_force_unsort_1 [..G] (fluid) X atom [..G]
      manual reset X fluid_velocity_2 [..0] (vcm) fluid_param_next_2 [..G+S] (half) (vcm) -half-? X fluid_velocity_2 [..0]
      manual reset X fluid_dist_3 [..G+S+F] (vcm) manual reset [..G+S+F] (vcm) X fluid_dist_3 [..G+S+F]
      fluid_dist_eq_next_0 X fluid_density_2 [..D] fluid_param_next_2 [..G+S/D] X fluid_dist_eq_3 [..0] (next) (vcm)
        X fluid_velocity_2 [..0] manual reset [..0]    
        X fluid_force_2 [..0] fluid_force_sync_20_finish [..0]    
      fluid_dist_eq_sync_13_start X fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm) fluid_dist_eq_next_0 [..0] (next) (vcm) X fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm)
      INITIAL INTEGRATE          
      fluid_dist_sync_13_finish (prev) X fluid_dist_3 [..0] (prev) fluid_vcm_remove_0 [..0] (prev) X fluid_dist_3 [..G+S+F] (prev)
        X fluid_dist_3 [1..G+S+F] (prev) fluid_dist_sync_13_start [-G-S-F..1] (prev)    
      fluid_dist_eq_sync_13_finish (vcm) X fluid_dist_eq_3 [..0] (vcm) fluid_dist_eq_next_0 [..0] X fluid_dist_eq_3 [..G+S+F] (vcm)
        X fluid_dist_eq_3 [1..G+S+F] (vcm) fluid_dist_eq_sync_13_start [-G-S-F..1] (vcm)    
      fluid_dist_eqn_next_3 X fluid_dist_eq_3 [..G+S+F] (vcm) fluid_dist_eq_sync_13_finish [1..G+S+F] (vcm) X fluid_dist_eqn_3 [..G+S+F] (vcm)
      fluid_dist_next_2 X fluid_dist_3 [..G+S+F] (prev) fluid_dist_sync_13_finish [0..G+S+F] (prev) X fluid_dist_3 [..G+S] (vcm)
        X fluid_dist_eqn_3 [..G+S+F] (vcm) fluid_dist_eqn_next_3 [..G+S+F]    
      fluid_param_next_2 X fluid_dist_3 [..G+S] (vcm) fluid_dist_next_2 [..G+S] (vcm) X fluid_density_2 [..G+S/D]
              X fluid_velocity_2 [..G+S] (half) (vcm)
      PRE FORCE          
      atom_nonforce_write_1 X atom [..G] PRE FORCE X atom_position2 [..G]
              X atom_velocity [..G]
              X atom_type [..G]
              X atom_mass [..G]
      atom_orders_compute_1 X atom_position2 [..G] atom_nonforce_write_1 [..G] X atom_position2_z [..G]
              X atom_index [..G]
      atom_nonforce_sort_1 X atom_position2 [..G] atom_nonforce_write_1 [..G] X atom_position2 [..G] (sorted)
        X atom_velocity [..G] atom_nonforce_write_1 [..G] X atom_velocity [..G] (sorted)
        X atom_mass [..G] atom_nonforce_write_1 [..G] X atom_mass [..G] (sorted)
        X atom_type [..G] atom_nonforce_write_1 [..G] X atom_type [..G] (sorted)
        X atom_position2_z [..G] atom_orders_compute_1 [..G] X atom_position2_z [..G] (sorted)
        X atom_index [..G] atom_orders_compute_1 [..G] X atom_index [..G] (sorted)
      fluid_weight_sum_2 X atom_position2 [..G] (sorted) atom_nonforce_sort_1 [..G] X fluid_weight_2 [..G+S] (local)
        X atom_position2_z [G..] (sorted) atom_nonforce_sort_1 [..G]    
      fluid_weight_sync_22_start X fluid_weight_2 [-G-S..G+S] (local) fluid_weight_sum_2 [..G+S] (local) X fluid_weight_2 [-G-S..G+S] (local)
      POST FORCE          
      atom_force_write_1 X atom [..G] POST FORCE X atom_force [..G]
      atom_force_sort_1 X atom_force [..G] atom_force_write_1 [..G] X atom_force [..G] (sorted)
        X atom_index [..G] (sorted) atom_nonforce_sort_1 [..G]    
      fluid_weight_sync_22_finish X fluid_weight_2 [..G+S] (local) fluid_weight_sum_2 [..G+S] (local) X fluid_weight_2 [..G+S]
        X fluid_weight_2 [-G-S..G+S] (remote) fluid_weight_sync_22_start [-G-S..G+S] (local)    
      atom_force_next_1 X fluid_weight_2 [..G+S] fluid_weight_sync_22_finish [..G+S] X atom_spread [..G] (sorted)
        X fluid_density_2 [..G+S] fluid_param_next_2 [..G+S/D]   atom_kinetic [..G] (sorted)
        X fluid_velocity_2 [..G+S] (half) (vcm) fluid_param_next_2 [..G+S] (half) (vcm) X atom_force [..G] (fluid) (sorted)
        X atom_position2 [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_velocity [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_mass [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_type [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_force [..G] (sorted) atom_force [..G] (sorted)    
      fluid_force_next_2 X atom_position2 [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted) X fluid_force_2 [..G+S] (local)
        X atom_position2_z [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_spread [..G] (sorted) atom_force_next_1 [..G] (sorted)    
        X atom_force [..G] (fluid) (sorted) atom_force_next_1 [..G] (fluid) (sorted)    
        X fluid_weight_2 [..G+S] fluid_weight_sync_22_finish [..G+S]    
      fluid_force_sync_20_start X fluid_force_2 [0..G+S] (local) fluid_force_next_2 [..G+S] (local) X fluid_force_2 [0..G+S] (local)
      fluid_force_sync_20_finish X fluid_force_2 [..0] (local) fluid_force_next_2 [..G+S] (local) X fluid_force_2 [..0]
        X fluid_force_2 [-G-S..0] (remote) fluid_force_sync_20_start [0..G+S] (local)    
      fluid_param_u_correct_0 X fluid_density_2 [..0] fluid_param_next_2 [..G+S/D] X fluid_velocity_2 [..0] (vcm)
        X fluid_velocity_2 [..0] (half) (vcm) fluid_param_next_2 [..G+S] (half) (vcm)    
        X fluid_force_2 [..0] fluid_force_sync_20_finish [..0]    
      atom_force_unsort_1 X atom_force [..G] (fluid) (sorted) atom_force_next_1 [..G] (fluid) (sorted) X atom_force [..G] (fluid)
        X atom_index [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
      atom_force_read_1 X atom_force [..G] (fluid) atom_force_unsort_1 [..G] (fluid) X atom [..G]
      FINAL INTEGRATE          
      END OF STEP          
      vcm_total_calc_0 X atom [..G] ENDOFSTEP X vcm_total
        X fluid_density_2 [..0] fluid_param_next_2 [..G+S/D]    
        X fluid_velocity_2 [..0] (vcm) fluid_param_u_correct_0 [..0] (vcm)    
      fluid_vcm_remove_0 X vcm_total vcm_total_calc_0 X fluid_velocity_2 [..0]
        X fluid_velocity_2 [..0] (vcm) fluid_param_u_correct_0 [..0] (vcm) X fluid_dist_3 [..0]
        X fluid_dist_3 [..0] (vcm) fluid_dist_next_2 [..G+S] (vcm)   fluid_dist_eq_3 [..0]
          fluid_dist_eq_3 [..0] (vcm) fluid_dist_eq_next_0 [..0] (vcm)   fluid_dist_eqn_3 [..0]
          fluid_dist_eqn_3 [..0] (vcm) fluid_dist_eqn_next_3 [..G+S+F] (vcm)    
      restartWrite   fluid_force_2 [..0] fluid_force_sync_20_finish [..0]    
          fluid_dist_3 [..0] fluid_vcm_remove_0 [..0]    
          fluid_dist_eq_3 [..0] fluid_vcm_remove_0 [..0]    
          fluid_dist_eqn_3 [..0] fluid_vcm_remove_0 [..0]    
      fluid_dist_sync_13_start X fluid_dist_3 [-G-S-F..-1] fluid_vcm_remove_0 [..0] X fluid_dist_3 [-G-S-F..-1]
      fluid_dist_eq_next_0 (next) X fluid_density_2 [..D] fluid_param_next_2 [..G+S/D] X fluid_dist_eq_3 [..0] (next) (vcm)
        X fluid_velocity_2 [..0] fluid_vcm_remove_0 [..0]    
        X fluid_force_2 [..0] fluid_force_sync_20_finish [..0]    
      fluid_dist_eq_sync_13_start (next) (vcm) X fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm) fluid_dist_eq_next_0 [..0] (next) (vcm) X fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm)
  • [ ] Remove explicit equilibrium distribution

12. DONE Discovered code fixes [7/7]

  • [X] Missed one-processor GPU boundary exchange optimization for fluid_force
  • [X] Missing OpenCL event free on most kernel calls
  • [X] Remove momentum code is broken
    • [X] Only waits on OpenCL events (needs to correctly handle MPI)
    • [X] fluid_momentum_remove should be using fluid_velocity_2 and fluid_density_2 so it sees the results of fluid_correctu_next_0
    • [X] Fix should just set a flag to enable and not call directly (to get it in the correct spot)
    • [X] Verify step passed in is correct (should it be step or step+1 – former is correct)
    • [X] Kernel defines dist_3_new = ... (step & 0x01) ... where new should actually be (step+1) & 0x01
    • [X] Test periodic boundary conditions
      • call fix momentum, should leave it zero
      • add a small body force, remove momentum every 10 steps; should see it grow and then drop
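The dist_3_new parity bug above comes down to double-buffer index selection; a minimal sketch of why "new" must use step+1:

```python
# With two distribution buffers, the buffer written at step s must be the
# one read at step s+1, so the write ("new") index is (step + 1) & 0x01.
def buffers(step):
    old = step & 0x01        # buffer holding the current distribution
    new = (step + 1) & 0x01  # buffer the kernel writes into
    return old, new

for s in range(4):
    old, new = buffers(s)
    assert old != new                # never read and write the same buffer
    assert buffers(s + 1)[0] == new  # next step reads what this step wrote
```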
  • [X] Fix lb/fluid/rigid/pc/sphere: `omega` calculation diverges a bit on restart
    • likely an issue with the `setup` `omega` calculation vs the `initial_integrate`/`final_integrate` one
    • replaced by the standard rigid fix (gone in next iteration)
  • [X] fluid_correctu_next_2 is called when we only have fluid_force_2 [0] boundaries
    • verified with Colin and should only be calculating on interior fluid_correctu_next_0
  • [X] Likely shifts momentum in first step due to InitializeFirstRun computing fluid force
    • this would be applied in the first 1/2 step
    • for consistency with other LAMMPS integrators, should be zero for this
    • first calculation should occur in post force routine
  • [X] Not getting good local size from kernel_local_box
    • Working correctly, was limited by the 64 work groups per compute unit requirement

13. TODO Code cleanups [15/31]

  • [ ] fluid_dist_eqn_next_3 has duplicate non-local prefixed n3, stride3, and offsets3 variables
  • [X] Could use some whitespace cleanup as there is some trailing whitespace and a mix of tabs vs spaces
  • [ ] Should make the offsetting in the ?boundaries_* gpu routines more readable (same style as others)
  • [X] Expect the (__global realType*) casts in new gpu code are not required
    • They actually are required, as it is loading a single component from variable locations
  • [ ] See if the gpu bounce back code could now be written in a nicer vectorized way
  • [ ] Checking queue properties of devices should use existing pre-wrapped error check versions
  • [X] Drop local me parameter (currently a mix of local me and comm->me)
  • [ ] Proper setting of typeLB is not verified (required for things like initialization of K0)
  • [ ] gridgeo_2 memory should be const in all (most?) GPU kernels
  • [X] Backtrace cleanup is a mess and printing garbage [3/3]
    • Need to compile with the -rdynamic option to not strip out static function names
    • [X] Bug due to backtrace allocating one chunk for strings and pointers
    • [X] Exception safety via C++11 unique_ptr rewrite
    • [X] Figure out why function names are included
  • [X] Fix differences between fluid dumping and atom dumping (via dump routines)
    • Dump code (ntimestep based) and fluid code (step based) are off by one in their counters:
      • update->ntimestep is 0 in setup and then 1 for the first step
      • step is -1 in setup and then 0 for first step
    • Dump code dumps in both setup and regular steps while fluid code only does regular steps
  • [ ] Check with Colin about treating sw (y sidewalls) boundary the same as z one (currently one point short)
  • [ ] Rename lattice variables to match the GPU gridgeo name
  • [ ] Base lattice size calculations on gpu size variables and not subNb{x,y,z} ones
  • [X]

    Make code cleaner by using post_run (can get rid of everything but send bit in setup)

    • not doing this, as it isn't cleaner due to the setup (or init) mismatch with post_run
      • setup is always run the first time and then only if the run pre flag is set
      • post_run is always run
    • how to handle calculation of forces in setup (1/2 step velocity is no longer available)
      • if setup computes the forces
        • no need for momentum removal (user will have disabled it if they have introduced net momentum)
      • if setup is minimal, we are okay; no recomputations required
    constructor setup initial_integrate final_integrate post_run destructor
    init_dist_* new            
    send_dist_* new   recv_dist_* old        
    send_dist_eq new drop_dist_eq new+          
      calc_force          
      xchg_force          
      calc dist_eq new+          
      send_dist_eq new+ recv_dist_eq new        
               
            send dist_* new   recv_dist_* new
            calc dist_eq new+    
            send dist_eq new+   recv_dist_eq new+
    init_dist_* new   if not ~post_run        
      send dist_* new : recv_dist_* old        
      calc_force          
      xchg_force          
      calc dist_eq new+          
      send_dist_eq new+ : recv_dist_eq new        
               
            send dist_* new recv_dist_* new  
            calc dist_eq new+    
            send dist_eq new+ recv_dist_eq new+  
  • [ ] Should be able to disable profiling as may reduce performance
  • [ ] Can probably drop explicit std::string constructors
  • [X] Add ghost points suffix to fluid_momentum_remove_kernel
    • Was renamed to fluid_vcm_remove_0
  • [X] Clean up gpu boundary exchange naming to match others
  • [ ] Cleanup compilation warnings
    • [ ] Fix size_t related narrowing conversion warnings
  • [X] Combine profile averages and only output from first rank
  • [X] fluid_correctu_next_2 should be just fluid_correctu_next_0 as only fluid_force_2 [0] is available
  • [X] Clarify components acted upon in write, rewrite, sort, resort, and read in names
    • Made in final version of action code
  • [ ] Deduplicate buffer specific inner syncing with macros like outer ones
  • [X] sync_inner_* routines' inner target buffers are larger than they need to be (outside_offset_*[1] = mem_border)
  • [X] atom_force_next should clip stencil to avoid past end access if atom goes out of bounds
  • [X] Floating point constants without the f suffix will cause calculations to be done in double
  • [X] Consistent handling and naming of boundary conditions
    • Prefix with effect (wall, pressure) and
    • Switch to global location condition inside of GPU
  • [ ] Could skip atom_force_read_1 if gamma scaling is negative for all atom types
  • [ ] Should switch atom_force (combined) from fluid to atom units to avoid confusion
  • [ ] Replace long with bigint
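As an illustration of the stencil clipping item above (hypothetical names; `i0` is the first stencil grid index, `width` the stencil size, and `n` the local grid extent), clamping the index range keeps an out-of-bounds atom from indexing past the end of the fluid arrays:

```python
def clipped_stencil(i0, width, n):
    """Clamp a stencil's grid-index range to [0, n).

    An atom that drifts out of bounds yields a partially (or fully)
    empty range instead of an out-of-range array access.
    """
    lo = max(i0, 0)
    hi = min(i0 + width, n)
    return range(lo, hi)

assert list(clipped_stencil(2, 4, 8)) == [2, 3, 4, 5]   # fully interior
assert list(clipped_stencil(-1, 4, 8)) == [0, 1, 2]     # clipped at low end
assert list(clipped_stencil(6, 4, 8)) == [6, 7]         # clipped at high end
```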

14. TODO Documentation updates [0/4]

  • [ ] need documentation as to why the routines
  • [ ] and a description of the comments
  • [ ] add tables to documentation (describe the columns and the bracketed terms)
  • [ ] note about the HEX file needing to be there for the build

15. TODO Investigate potential GPU options before switch to HIP [5/12]

  • [X] Full profile information dump [2/2]
    • [X] Switch from total runtimes to individual time points
    • [X] Output profile information
  • [X] Remove waits from sorting routines [1/1]
    • should we stick all the event handlers in a queue
    • [X]

      Try on atom_sort_position2_z_* routines

      Operation      Wait between  Wait at end  Speedup
      sort_..._bb    7.57          5.98         21.0%
      sort_..._bm    6.82          5.32         22.0%
      combined wall  7.71          5.29         31.4%
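The speedup column is consistent with (before - after)/before; a quick check of the numbers above:

```python
# Timings from the table above (operation labels kept as written there).
timings = {
    "sort_..._bb": (7.57, 5.98),
    "sort_..._bm": (6.82, 5.32),
    "combined wall": (7.71, 5.29),
}
for op, (before, after) in timings.items():
    speedup = 100.0 * (before - after) / before
    print(f"{op}: {speedup:.1f}%")  # matches the table's speedup column
```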
  • [ ] Test run with weight set to 1 and with trilinear
    • Is the atom searching or the stencil costing the most
  • [ ] Can optimize complete map writing with CL_MAP_WRITE_INVALIDATE_REGION instead of CL_MAP_WRITE
  • [ ] Mapping pinned memory should give the fastest access
    • NVIDIA OpenCL optimization guide explains how to suggest pinning
  • [ ] Use memory mapping to copy buffers
    • fluid distribution reads and force array
    • may be able to work with unified memory
  • [X] Record starting and stopping times for profiling to determine key path holdups
  • [ ] Non-blocking GPU operations do not necessarily execute unless there is a flush
    • Likely okay under NVIDIA except for Windows (from NVIDIA OpenCL optimization guide)
    • Could be a reason for the slow down under AMD
  • [ ] Where would it be worthwhile to use local memory
  • [X] Sometimes fluid_force_next_2 is 3.7x slower (1.03 to 3.81ms)
    • tracked down to broken riser card on gra986 asserting hardware power brake on GPU card
  • [ ] Check regular code on thread safe MPI
    • Cluster OpenMPI supports threads if initialized with MPI_Init_thread instead of MPI_Init as in LAMMPS
    • Doesn't seem to be any significant slowdown from replacing MPI_Init with MPI_Init_thread
  • [X] Use events between groups

16. Item for future

  • Explore some ideas around the gridgeo code
    • Use bit vector to track bounce back status for each cell
      • Could store in texture memory as constant for entire simulation
    • Arbitrary geometry support by loading mask of fluid vs non-fluid
      • could the new gridgeo system be used to subsume the wall bounce back code (moving walls likely an issue)
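The bit-vector idea can be sketched as follows (one bounce-back flag per cell, 32 cells packed per word; names are illustrative, not from the code):

```python
# Pack per-cell bounce-back status into a bit vector: 1/32nd the memory of
# a byte flag per cell, and constant for the whole simulation, so it could
# live in texture/constant memory on the GPU.
def make_mask(n_cells):
    return [0] * ((n_cells + 31) // 32)

def set_bb(mask, cell):
    mask[cell >> 5] |= 1 << (cell & 31)

def is_bb(mask, cell):
    return (mask[cell >> 5] >> (cell & 31)) & 1 == 1

mask = make_mask(100)
set_bb(mask, 0)
set_bb(mask, 33)
assert is_bb(mask, 33) and not is_bb(mask, 34)
```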
  • Resync with upstream lammps
    • now includes a cmake build system
  • Cleanup old subNbx initialization calculations
    • had intended to in prior DP, but never got around to it
  • Changes in force to handle multiple particles contributing to a grid point better
    • Can likely incorporate into current one pass on GPU
  • Better CPU/GPU border exchange code (currently all or nothing)
  • verify GPU options work correctly

Author: Tyson Whitehead

Created: 2023-11-28 Tue 15:21
