Colin DP

Table of Contents

Scratch space for tracking progress and items on Colin's DP.

1. DONE Review/get up to speed on changes [4/4]

  • [X] Added variety of items to cleanups todo
  • [X] Reviewed the new boundary items and gridgeo structure (excluding the pit setup code)
    • type
      • 0- bounce backs (bulk fluids nodes)
      • 1- bounce backs (pit geometry boundary fluid nodes)
      • 2- not in fluid
    • orientation (number of bounce backs)
      • 0- no bounce back (couldn't this be used for bulk fluid?)
      • 3- not sure – all the diagonals on a face minus 1?
      • 4- not sure – two faces (inside edge) minus all furthest diagonals?
      • 5- against a face
      • 6- not sure – seems to be face and one of the opposite diagonals?
      • 8- against two faces (inside edge)
      • 10 - against three faces (inside corner)
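
The face/edge/corner counts above (5, 8, 10) are consistent with counting, on a standard D3Q15 lattice (6 axis directions plus 8 corner diagonals), the directions that head into at least one wall; a quick sketch, assuming D3Q15 (which the bounce-back tables elsewhere imply):

```python
from itertools import product

# Standard D3Q15 non-rest directions: 6 axis vectors + 8 corner diagonals.
DIRS = ([(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
        + [d for d in product((-1, 1), repeat=3)])

def bounce_back_count(normals):
    # Count directions heading into at least one wall (positive
    # component along some inward-pointing wall normal).
    return sum(1 for c in DIRS
               if any(sum(a * b for a, b in zip(c, n)) > 0 for n in normals))

face = bounce_back_count([(0, 0, 1)])                          # -> 5
edge = bounce_back_count([(0, 0, 1), (1, 0, 0)])               # -> 8
corner = bounce_back_count([(0, 0, 1), (1, 0, 0), (0, 1, 0)])  # -> 10
```

This reproduces 5/8/10 but doesn't explain the 3, 4, and 6 cases, which remain open above.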
  • [X] Make tyson branch master
    • Master branch (frances/final-fixes tag/frances branch + one commit from Colin) is a subset of the tyson branch.
  • [X] Figure out where MPI branch is at (sync with tyson)
    • MPI branch only uncomments the transfer and exchange calls.

2. DONE Merge in MPI code [10/10]

  • [X] Verify that the MPI_ORDER_C/MPI_ORDER_FORTRAN stuff is okay (per the comment there may be an issue): pass
  • [X] Check if we wound up with both GPU and CPU border exchange active: pass
  • [X] Fix platform initialization on later CUDAs (CPU platform won't iterate)
  • [X] Check that boundary exchange works with at least two processors along each dimension: fail [3/3]
    • Need to set comm_modify cutoff to 2.5x dx to have required particles
    • [X] Periodic MPI deadlocks in first steps
    • [X] Crash at end-of-simulation
    • [X] Split along z is okay, split along y has agreement issues, and split along x loses atoms
  • [X] Fixed bug causing periodic MPI boundary exchange lockups
    • fixviscouslb wasn't initialized if unused so fluid force could occur at different times
  • [X] Fix end-of-simulation crash
  • [X] Double checked no other local state variables are unset: pass
  • [X] Switch boundary exchange to be GPU if only one processor and MPI otherwise
  • [X] Look into potential issues with fluid distribution exchange being in flight: pass
    • Constructor sets bogus fluid distribution exchange in flight
    • setup (called at the start of a run) finishes the one in flight, throws it away, computes the proper one for the run, and starts it in flight
    • Step routines expect a fluid distribution exchange in flight at start and put one in flight at end
    • Destructor finishes fluid distribution exchange in flight
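
The lifecycle above amounts to an invariant that exactly one exchange is always in flight between calls; a toy Python model of that protocol (class and method names hypothetical):

```python
class FluidDistExchanger:
    """Toy model of the 'exactly one exchange always in flight' protocol."""

    def __init__(self):
        # Constructor puts a bogus exchange in flight so the invariant
        # holds even before the first setup().
        self.in_flight = self._start(bogus=True)

    def _start(self, bogus=False):
        return {"bogus": bogus}  # stand-in for starting an async exchange

    def _finish(self):
        done, self.in_flight = self.in_flight, None
        return done

    def setup(self):
        # Finish whatever is pending, throw it away, and start the
        # proper exchange for this run.
        self._finish()
        self.in_flight = self._start()

    def step(self):
        # A step expects an exchange in flight on entry ...
        assert self.in_flight is not None and not self.in_flight["bogus"]
        self._finish()
        # ... and leaves a fresh one in flight on exit.
        self.in_flight = self._start()

    def close(self):
        # Destructor finishes the last pending exchange.
        self._finish()
```

The invariant is what lets each step overlap its exchange with the previous step's compute.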
  • [X] Debug and fix disagreement between simulation results for MPI and non-MPI runs (two processor splits)
    • [X] Get particle dumping working to visualize in paraview
    • [X] Particles are lost when splitting x
      • Code for pressurebcx looks like it incorrectly applies to the internal side too for boundary processes
    • [X] Fix pressurebcx applying to internal side for boundary processes
    • [X] Forces are off when splitting x or y [9/9]
      • [X] Test with individual atoms, bodyforce, and no pressurebcx: fail
        • Issue when split along fixed boundary side, otherwise perfect agreement
      • [X] Test with rigid sphere, bodyforce, and no pressurebcx: pass
      • [X] Test with rigid sphere, bodyforce, no pressurebcx, and pits: pass
      • [X] Test without particle interactions: fail (not caused by particle interaction code)
      • [X] Figure out how to do paraview visualization of differences
        • Append attributes filter on multiple inputs and then calculator for relative error
        • Minimal size test example
        • Error starts in split and then jumps to ends as well
      • [X] Double checked EDGE_Z{0,1} usage for acting on both sides
      • [X] Test with all edge code disabled: failed
      • [X] Test with 3 processors along z: error on both boundaries, starting on left outer one: error on both sides of ends
      • [X] Look into the old end extra-point adjustment code
    • [X] Figure out why splitting along fixed boundary gives different results for MPI [3/3]
      • [X] Add boundary dump code after each GPU routine call
      • [X] Create a serial vs parallel dump comparison program
      • [X] Extend boundary dump to entire field
    • [X] Fix discovered geogrid issue (requires 3 boundary points and not 2)
      • [X] Add additional boundary point to calculation
        • [X] sublattice initialization code uses wholelattice wrap of Nbz: correct as only applies to boundary
        • [X] Update dump routines
        • [X] Update gpu routines to use 3 boundary offsets value and names

3. DONE Add GPU (and CPU) profiling [8/8]

  • [X] Read up on profiling options
  • [X] Turn profiling on in clCreateCommandQueue
  • [X] Add time tracking to kernels [18/18]
    • [X] fluid_dist_eq_initial_3_kernel
    • [X] fluid_dist_eq_next_0_kernel
    • [X] fluid_dist_eqn_next_3_kernel
    • [X] fluid_dist_initial_2_kernel
    • [X] fluid_dist_next_2_kernel
    • [X] fluid_param_initial_2_kernel
    • [X] fluid_param_next_2_kernel
    • [X] fluid_correctu_next_2_kernel
    • [X] xboundaries_fluid_dist_eq_kernel
    • [X] yboundaries_fluid_dist_eq_kernel
    • [X] zboundaries_fluid_dist_eq_kernel
    • [X] xboundaries_fluid_dist_kernel
    • [X] yboundaries_fluid_dist_kernel
    • [X] zboundaries_fluid_dist_kernel
    • [X] xboundaries_fluid_force_kernel
    • [X] yboundaries_fluid_force_kernel
    • [X] zboundaries_fluid_force_kernel
    • [X] remove_momentum_kernel
  • [X] Print profiling at end
  • [X] Add time tracking to memory read/writes [7/7]
    • [X] fluid_dist_3_1_interior_read
    • [X] fluid_dist_eq_3_3_interior_read
    • [X] fluid_force_2_exterior_read
    • [X] fluid_force_2_accumulate_read
    • [X] fluid_dist_3_1_exterior_write
    • [X] fluid_dist_eq_3_3_exterior_write
    • [X] fluid_force_2_accumulate_write
  • [X] Debug segfault introduced in unrelated OpenCL call [1/1]
    • [X] Break apart commits and bisect
      • Accidentally removed queue assignment in clCreateCommandQueue call
  • [X] Revamp profiling to fix leaks and reduce boilerplate [4/4]
    • [X] C++ template magic class to progressively push location information
      • Dreadful failure
    • [X] Abstract with profile records: location, rank, info, value
    • [X] Resolve race conditions/locking by dumping different ranks to different files
    • [X] Switch to binary format for space
      • Compact everything into a location field by combining bits
      • Reduce data required to pass between functions
      • Replace strings with enums
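
One way the compact binary record could look, combining rank and an enum-coded profile point into a single bit-packed location field (field widths and enum names here are hypothetical):

```python
import struct
from enum import IntEnum

class Point(IntEnum):
    # Hypothetical profile-point enums replacing the old strings.
    FLUID_DIST_NEXT_2 = 0
    ATOM_READ = 1
    ATOM_WRITE = 2

def pack_record(rank, point, info, value_ns):
    # Hypothetical layout: 16-bit rank | 10-bit point | 6-bit info,
    # packed with the 64-bit timing value into one little-endian record.
    location = (rank << 16) | (int(point) << 6) | info
    return struct.pack("<Iq", location, value_ns)

def unpack_record(buf):
    location, value_ns = struct.unpack("<Iq", buf)
    return (location >> 16, Point((location >> 6) & 0x3FF),
            location & 0x3F, value_ns)
```

Twelve bytes per event, and one file per rank sidesteps the locking problem noted above.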
  • [X] Add CPU profile points for start and end of fix callouts and expensive CPU operations [8/8]
    • [X] GPU timer requires OpenCL 2.1 (not available on NVIDIA), use CPU timer instead
    • [X] initial_integrate
    • [X] pre_force
    • [X] post_force
    • [X] final_integrate
    • [X] atom_read
    • [X] atom_write
    • [X] fluid_force_accumulate

4. DONE Profile analysis system [9/9]

  • [X] Initial analysis for intra-timestep details
    • [X] Python code to extract the binary data format
    • [X] Switch to R as the Python pandas apply functionality is too slow
      • nearly 8 minutes in Python pandas vs 2 seconds in R tidyverse
      • don't have a good way to load the data directly: use Python to save it as a feather file
    • [X] Initial plot of timestep
      • transfer and atom related routines seem to be taking most time
  • [X] Fix issues revealed in profile code
    • [X] Distinguish between two uses of fluid_dist_read/write routines
    • [X] Fix non-unique fluid_dist_read/write profile points (eq vs non-eq)
  • [X] Add group brackets to step breakout
  • [X] Distinguish between executing and non-executing state in intra analysis
  • [X] Add cross-timestep analysis [2/2]
    • [X] Plot of walltime for each step (group analysis)
      • Neighbour calculation is very expensive (x100 over regular step)
    • [X] Distributions of walltime for each group of GPU calls (inter analysis)
  • [X] Add reference to distribution to group analysis
  • [X] Integrate addition of CPU clocks
    • [X] Redo implicit ordering calculation code (no longer simple)
      • Add substep calculation as CPU and GPU out of order wrt substep
    • [X] Synchronize CPU and GPU clocks from data
      • Have to correct for slight clock skew as well as offset (linear model)
    • [X] Add special cases to handle missing events in CPU only profiles
  • [X] Mark run periods of all LAMMPS fix calls in analysis
  • [X] Drop python code required for initial loading of data
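
The clock synchronization step above (correcting GPU timestamps for skew and offset against the CPU clock) is just a linear fit over matched events; a sketch with synthetic data:

```python
def fit_clock_map(gpu_t, cpu_t):
    # Ordinary least squares for cpu ~= slope*gpu + offset: the slope
    # absorbs clock skew, the offset the difference in start times.
    n = len(gpu_t)
    mg, mc = sum(gpu_t) / n, sum(cpu_t) / n
    slope = (sum((g - mg) * (c - mc) for g, c in zip(gpu_t, cpu_t))
             / sum((g - mg) ** 2 for g in gpu_t))
    return slope, mc - slope * mg

# Synthetic matched events: GPU clock 50 ppm fast, started 123.0 behind.
gpu = [float(t) for t in range(0, 1000, 10)]
cpu = [1.00005 * t + 123.0 for t in gpu]
slope, offset = fit_clock_map(gpu, cpu)
```

Mapped GPU times are then `slope * t + offset`, putting both profiles on the CPU timeline.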

5. DONE Profile discovered optimizations/fixes [5/5]

  • [X] Switch atom_sort_index to write to recorded index position instead of resorting [10/10]
    • 2 nodes x 1 cpu & gpu/processor (p100): 1000 steps of in.polymersphere loop time 96.5 -> 84.3 (-12.6%)
    • [X] Add memory for copy out space (put in end)
    • [X] Adjust memory resize routines
    • [X] Create new GPU function atom_unsort_position2_z and remove old ones
    • [X] Add kernel variables and initialize and deinitialize
    • [X] Add local and global size variables and initialize
    • [X] Calls for initial arguments
    • [X] Update all arguments (need to include force and index too!) on atom resize
    • [X] Add profile enums and remove old ones
    • [X] Call to invoke with final arguments
    • [X] Update atom_read
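
The idea behind atom_unsort_position2_z is that the z-sort already recorded each atom's original index, so restoring order is a single scatter rather than a second sort; in sketch form:

```python
def unsort_scatter(sorted_forces, index):
    # index[i] records where sorted slot i came from, so one scatter
    # restores original atom order -- no second sort needed.
    out = [None] * len(sorted_forces)
    for i, f in enumerate(sorted_forces):
        out[index[i]] = f
    return out

# Forces for atoms 2, 0, 1 after the z-sort, with recorded indices.
restored = unsort_scatter(["f2", "f0", "f1"], [2, 0, 1])
```

On the GPU this is one kernel writing to `out[index[i]]` per thread instead of a full sort pass.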
  • [X] Switch second atom_sort_position2_z to just gather new velocity values instead of resorting everything [10/10]
    • 2 nodes x 1 cpu & gpu/processor (p100): 1000 steps of in.polymersphere loop time 84.3 -> 73.5 (-12.8%)
    • [X] Add memory for copy out space (put in end)
    • [X] Adjust memory resize routines
    • [X] Add new GPU function atom_resort_position2_z
    • [X] Add kernel variable and initialize and de-initialize
    • [X] Add local and global size variables and initialize
    • [X] Calls for initial arguments
    • [X] Update all arguments (need to include index too!) and local and global size on atom resize
    • [X] Add profile enum
    • [X] Call to invoke with final arguments
    • [X] Add atom_rewrite
  • [X] Investigate and fix duplicate running of atom related routines
    • 2 nodes x 1 cpu & gpu/processor (p100): 1000 steps of in.polymersphere loop time 73.5 -> 69.0 (-6.1%)
    • If no forces are added via fix_lb_viscous_gpu and/or fix_lb_rigid_pc_sphere_gpu, a bunch of atom routines could be skipped, including correct_u
    • The positions don't change so the sort mapping is fixed throughout
    • The 1/2-step velocity is the correct one to be using, not the 1-step velocity as is currently done
    • Push atom_force down to post_force (more LAMMPS standard and would like to be able to add forces into the GPU routine)
    • Technically correct_u is a final_integrate sort of thing so put it there
    • Steps overview

      Routine Operation
      fluid_dist_exchange_finish-1 Receive fluid_dist_3_new- exterior [G+S+F] (for previous step)
      fluid_dist_eq_exchange_finish Receive fluid_dist_eq_3_new exterior [1..G+S+F]
      fluid_dist_eqn_next_3 fluid_dist_eq_3_new [0..G+S+F] -> fluid_dist_eqn_3_new [0..G+S+F]
      fluid_dist_next_2 fluid_dist_~{eq,eqn}_3_{new,old} [0..G+S+F], ~fluid_dist_3_old [0..G+S+F], fluid_density_2 [..F] -> fluid_dist_3_new [0..G+S]
      restartWrite fluid_dist_eq_3_new [0..G+S+F], fluid_dist_3_new [0..G+S] -> restart file (if requested)
      fluid_dist_exchange_start Send fluid_dist_3_new interior [-G-S-F]
      fluid_param_next_2_start Start fluid_dist_3_new [0..G+S] -> fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S]
      fluid_param_next_2_finish End fluid_dist_3_new [0..G+S] -> fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S]
      atom_write Atom [G] -> atom_position2 [G], atom_velocity [G], atom_type [G], atom_mass [G]
      atom_position2_z_index atom_position2 [G] -> atom_position2_z [G], atom_index [G]
      atom_sort_position2_z Sort by z atom_position2 [G], atom_velocity [G], atom_mass [G], atom_type [G], atom_position2_z [G], atom_index [G]
      atom_force fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S], atom_position2 [G], atom_velocity [G], atom_mass [G], atom_type [G] -> atom_force [G]
      fluid_force_next_2 atom_position2 [G], atom_position2_z [G], atom_force [G] -> local fluid_force_2 [0..G+S]
      fluid_force_exchange_start Send local fluid_force_2 exterior [0..G+S]
      fluid_force_exchange_finish Receive remote fluid_force_2 interior [-G-S..0]
        Local fluid_force_2 [0], remote fluid_force_2 [0] -> fluid_force_2 [0]
      fluid_correctu_next_2 fluid_velocity_2 [0..G+S], fluid_density_2 [0..G+S], fluid_force_2 [0..G+S] -> fluid_velocity_2 [0..G+S]
      fluid_dist_eq_next_0+1 fluid_velocity_2 [0], fluid_density_2 [0..D], fluid_force_2 [0] -> fluid_dist_eq_3_new [0] (for next step)
      fluid_dist_eq_exchange_start+1 Send fluid_dist_eq_3_new interior [-G-S-F..-1] (for next step)
      atom_sort_index/atom_unsort_position2_z Sort by index atom_force [G], atom_index [G]
      atom_read atom_force [G] -> hydroF [G]
    • Reorganization

          Pre-fix_lb_rigid_pc_sphere_gpu Post-fix_lb_rigid_pc_sphere_gpu Post-fix_lb_rigid_pc_sphere_gpu New
      Step Lammps   (with fix_lb_viscous_gpu) (without fix_lb_viscous_gpu)  
      initial_integrate coordinates 1 fluid_dist_exchange_finish-1 fluid_dist_exchange_finish-1 fluid_dist_exchange_finish-1 fluid_dist_exchange_finish-1
        velocities 1/2 fluid_dist_eq_exchange_finish fluid_dist_eq_exchange_finish fluid_dist_eq_exchange_finish fluid_dist_eq_exchange_finish
          fluid_dist_eqn_next_3 fluid_dist_eqn_next_3 fluid_dist_eqn_next_3 fluid_dist_eqn_next_3
          fluid_dist_next_2 fluid_dist_next_2 fluid_dist_next_2 fluid_dist_next_2
          restartWrite restartWrite restartWrite restartWrite
          fluid_dist_exchange_start fluid_dist_exchange_start fluid_dist_exchange_start fluid_dist_exchange_start
          fluid_param_next_2_start fluid_param_next_2_start fluid_param_next_2_start fluid_param_next_2_start
      post_integrate          
      pre_exchange          
      pre_neighbour          
      post_neighbour          
      pre_force   fluid_param_next_2_finish fluid_param_next_2_finish fluid_param_next_2_finish fluid_param_next_2_finish
          atom_write atom_write atom_write atom_write
          atom_position2_z_index atom_position2_z_index atom_position2_z_index atom_position2_z_index
          atom_sort_position2_z atom_sort_position2_z atom_sort_position2_z atom_sort_position2_z
          atom_force atom_force atom_force  
          fluid_force_next_2 fluid_force_next_2    
          fluid_force_exchange_start fluid_force_exchange_start    
          fluid_force_exchange_finish fluid_force_exchange_finish    
            fluid_correctu_next_2    
          fluid_dist_eq_next_0+1 fluid_dist_eq_next_0+1    
          fluid_dist_eq_exchange_start+1 fluid_dist_eq_exchange_start+1    
      pre_reverse          
      post_force additional forces atom_sort_index atom_sort_index atom_sort_index atom_force
          atom_read atom_read atom_read fluid_force_next_2
                fluid_force_exchange_start
                fluid_force_exchange_finish
                atom_unsort_position2_z
                atom_read
      final_integrate velocity 1       fluid_correctu_next_2
                fluid_dist_eq_next_0+1
                fluid_dist_eq_exchange_start+1
      end_of_step       atom_write  
              atom_position2_z_index  
              atom_sort_position2_z  
              atom_force  
              fluid_force_next_2  
              fluid_force_exchange_start  
              fluid_force_exchange_finish  
              fluid_correctu_next_2  
              fluid_dist_eq_next_0+1  
              fluid_dist_eq_exchange_start+1  
  • [X] Queuing multiple jobs appears to sometimes slow down unrelated computation routines
    • Defective riser card on GPU asserting power brake and cutting clocks to 1/3rd
  • [X] Duplicate rectangle read in profile code (the code wasn't currently used)

6. DONE Free up more concurrency [4/4]

  • [X]

    Remove blocking OpenCL calls (all explicit dependencies)

    Done Routine Input Buffers Source Output Buffers
      INITIAL INTEGRATE      
    X fluid_dist_sync_13_finish (prev) fluid_dist_3 [..0] (prev) fluid_vcm_remove_0 [..0] (prev) fluid_dist_3 [1..G+S+F] (prev)
    X fluid_dist_eq_sync_13_finish (prev) fluid_dist_eq_3 [..0] (prev) fluid_vcm_remove_0 [..0] (prev) fluid_dist_eq_3 [1..G+S+F] (prev)
    X fluid_dist_eqn_sync_13_finish (prev) fluid_dist_eqn_3 [..0] (prev) fluid_vcm_remove_0 [..0] (prev) fluid_dist_eqn_3 [1..G+S+F] (prev)
    X fluid_dist_eq_sync_13_finish (vcm) fluid_dist_eq_3 [..0] (vcm) fluid_dist_eq_next_0 [..0] fluid_dist_eq_3 [1..G+S+F] (vcm)
    X fluid_dist_eqn_next_3 fluid_dist_eq_3 [..G+S+F] (vcm) fluid_dist_eq_sync_13_finish [1..G+S+F] (vcm) fluid_dist_eqn_3 [..G+S+F] (vcm)
    X fluid_dist_next_2 fluid_dist_3 [..G+S+F] (prev) fluid_dist_sync_13_finish [1..G+S+F] (prev) fluid_dist_3 [..G+S] (vcm)
        fluid_dist_eq_3 [..G+S+F] (prev) fluid_dist_eq_sync_13_finish [1..G+S+F] (prev)  
        fluid_dist_eqn_3 [..G+S+F] (prev) fluid_dist_eqn_sync_13_finish [1..G+S+F] (prev)  
        fluid_dist_eq_3 [..G+S+F] (vcm) fluid_dist_eq_sync_13_finish [1..G+S+F]  
        fluid_dist_eqn_3 [..G+S+F] (vcm) fluid_dist_eqn_next_3 [..G+S+F]  
    X fluid_param_next_2 fluid_dist_3 [..G+S] (vcm) fluid_dist_next_2 [..G+S] (vcm) fluid_density_2 [..G+S]
            fluid_velocity_2 [..G+S] (half) (vcm)
      PRE FORCE      
    X atom_nonforce_write_1 atom [..G]   atom_position2 [..G]
            atom_velocity [..G]
            atom_type [..G]
            atom_mass [..G]
    X atom_orders_compute_1 atom_position2 [..G] atom_nonforce_write_1 [..G] atom_position2_z [..G]
            atom_index [..G]
    X atom_nonforce_sort_1 atom_position2 [..G] atom_nonforce_write_1 [..G] atom_position2 [..G] (sorted)
        atom_velocity [..G] atom_nonforce_write_1 [..G] atom_velocity [..G] (sorted)
        atom_mass [..G] atom_nonforce_write_1 [..G] atom_mass [..G] (sorted)
        atom_type [..G] atom_nonforce_write_1 [..G] atom_type [..G] (sorted)
        atom_position2_z [..G] atom_orders_compute_1 [..G] atom_position2_z [..G] (sorted)
        atom_index [..G] atom_orders_compute_1 [..G] atom_index [..G] (sorted)
      POST FORCE      
    X atom_force_write_1 atom [..G]   atom_force [..G]
    X atom_force_next_1 fluid_density_2 [..G+S] fluid_param_next_2 [..G+S] atom_force [..G] (sorted)
        fluid_velocity_2 [..G+S] (half) (vcm) fluid_param_next_2 [..G+S] (half) (vcm)  
        atom_position2 [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
        atom_velocity [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
        atom_mass [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
        atom_type [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
    X fluid_force_next_2 atom_position2 [..G] (sorted) atom_inputs_sort_1 [..G] (sorted) fluid_force_2 [..G+S] (local)
        atom_position2_z [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
        atom_force [..G] (sorted) atom_force_next_1 [..G] (sorted)  
    X fluid_force_sync_20_start fluid_force_2 [0..G+S] (local) fluid_force_next_2 [..G+S] (local)  
    X fluid_force_sync_20_finish fluid_force_2 [-G-S..0] (local) fluid_force_next_2 [..G+S] (local) fluid_force_2 [-G-S..0]
    X fluid_param_u_correct_0 fluid_velocity_2 [..0] (half) (vcm) fluid_param_next_2 [..G+S] (half) (vcm) fluid_velocity_2 [..0] (vcm)
        fluid_density_2 [..0] fluid_param_next_2 [..G+S]  
        fluid_force_2 [..0] fluid_force_sync_20_finish [..0]  
    X atom_force_unsort_1 atom_force [..G] (sorted) atom_force_next_1 [..G] (sorted) atom_force [..G]
        atom_index [..G] (sorted) atom_inputs_sort_1 [..G] (sorted)  
    X atom_force_read_1 atom_force [..G] atom_force_unsort_1 [..G] atom [..G]
      FINAL INTEGRATE      
      END OF STEP      
    X vcm_total_calc_0 atom [..G]   vcm_total
        fluid_density_2 [..0] fluid_param_next_2 [..G+S]  
        fluid_velocity_2 [..0] (vcm) fluid_param_u_correct_0 [..0] (vcm)  
    X fluid_vcm_remove_0 vcm_total vcm_total_calc_0 fluid_velocity_2 [..0]
        fluid_velocity_2 [..0] (vcm) fluid_param_u_correct_0 [..0] (vcm) fluid_dist_3 [..0]
        fluid_dist_3 [..0] (vcm) fluid_dist_next_2 [..G+S] (vcm) fluid_dist_eq_3 [..0]
        fluid_dist_eq_3 [..0] (vcm) fluid_dist_eq_next_0 [..0] (vcm) fluid_dist_eqn_3 [..0]
        fluid_dist_eqn_3 [..0] (vcm) fluid_dist_eqn_next_3 [..G+S+F] (vcm)  
    X restartWrite fluid_force_2 [..0] fluid_force_sync_20_finish [..0]  
        fluid_dist_3 [..0] fluid_vcm_remove_0 [..0]  
        fluid_dist_eq_3 [..0] fluid_vcm_remove_0 [..0]  
        fluid_dist_eqn_3 [..0] fluid_vcm_remove_0 [..0]  
    X fluid_dist_sync_13_start fluid_dist_3 [-G-S-F..-1] fluid_vcm_remove_0 [..0]  
    X fluid_dist_eq_sync_13_start fluid_dist_eq_3 [-G-S-F..-1] fluid_vcm_remove_0 [..0]  
    X fluid_dist_eqn_sync_13_start fluid_dist_eqn_3 [-G-S-F..-1] fluid_vcm_remove_0 [..0]  
    X fluid_dist_eq_next_0 (next) fluid_density_2 [..D] fluid_param_next_2 [..G+S] fluid_dist_eq_3 [..0] (next) (vcm)
        fluid_velocity_2 [..0] fluid_vcm_remove_0 [..0]  
        fluid_force_2 [..0] fluid_force_sync_20_finish [..0]  
    X fluid_dist_eq_sync_13_start (next) (vcm) fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm) fluid_dist_eq_next_0 [..0] (vcm) (next)  
    • fluid_correctu_next_0 (now fluid_param_u_correct_0) adjusts fluid_velocity_2 [0]
      • doesn't adjust any of fluid_dist*_3 [0] (this is correct)
    • fluid_momentum_remove (now fluid_vcm_remove_0) adjusts fluid_dist_3 [0] and fluid_dist_eq_3 [0]
      • should use corrected fluid_velocity_2 [0] instead of computing uncorrected version from fluid_dist*_3 [0]
      • need to also correct fluid_dist_eqn_3 [0..G+S+F] for fluid_dist_next_2 if using exponential integrator
      • can't compute border points [..G+S+F] (only have fluid_velocity_2 [0]) so need to transfer these
      • need to delay fluid_dist_exchange_3_start until after this
      • need to send borders on other outputs too if using exponential integrator
    • fluid_dist_eq_next_0 computes fluid_dist_eq_3 [0] (next)
      • from fluid_velocity_2 [0] (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
    • fluid_dist_eqn_next_3 computes fluid_dist_eqn_3 [0..G+S+F] (next)
      • from fluid_dist_eq_3 [0] (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
      • could run earlier if fluid_vcm_remove_0 isn't being run this timestep
    • fluid_dist_next_2 computes fluid_dist_3 [0..G+S] (next)
      • from fluid_dist_3 [0..G+S] (not affected by fluid_correctu_next_0 and not affected by fluid_momentum_remove)
      • from fluid_dist_{eq,eqn}_3 [0..G+S+F] (not affected by fluid_correctu_next_0 but affected by fluid_momentum_remove)
      • from fluid_dist_{eq,eqn}_3 [0..G+S+F] (next) (affected by fluid_correctu_next_0 but not affected by fluid_momentum_remove)
    INITIAL INTEGRATE  
    fluid_dist_exchange_13_finish (prev) fluid_dist_3_action (impl)
    fluid_dist_eq_exchange_13_finish (prev) fluid_dist_eq_3_action (impl)
    fluid_dist_eqn_exchange_13_finish (prev) fluid_dist_eqn_3_action (impl)
    fluid_dist_eq_exchange_13_vcm_finish fluid_dist_eq_3_vcm_action (impl)
    fluid_dist_eqn_next_3 fluid_dist_eqn_3_vcm_action
    fluid_dist_next_2 fluid_dist_2_vcm_action
    restartWrite  
    fluid_param_next_2 fluid_velocity_2_half_vcm_action
      fluid_density_2_action
    PRE FORCE  
    atom_lammps_write_1 atom_position1_action
      atom_velocity_1_action
      atom_type_1_action
      atom_mass_1_action
    atom_orders_compute_1 atom_position2_z_1_action
      atom_index_1_action
    atom_inputs_sort_1 atom_position2_1_sorted_action
      atom_velocity_1_sorted_action
      atom_mass_1_sorted_action
      atom_type_1_sorted_action
      atom_position2_z_1_sorted_action
      atom_index_1_sorted_action
    POST FORCE  
    atom_force_next_1 atom_force_1_sorted_action
    fluid_force_next_2 fluid_force_2_local_action
    fluid_force_exchange_02_start  
    fluid_force_exchange_20_finish fluid_force_20_remote_action (impl)
    fluid_force_combine_20 fluid_force_0_action
    atom_force_unsort_1 atom_force_1_action
    atom_force_read_1 FINAL INTEGRATE (impl)
    FINAL INTEGRATE  
    fluid_param_u_correct_0 fluid_velocity_0_vcm_action
    (formerly fluid_correctu_next_0)
    END OF STEP  
    vcm_total_calc_0 vcm_total_action (impl)
    fluid_vcm_remove_0 fluid_velocity_0_action
    (formerly fluid_momentum_remove) fluid_dist_0_action
      fluid_dist_eq_0_action
      fluid_dist_eqn_0_action
    fluid_dist_exchange_13_start  
    fluid_dist_eq_exchange_13_start  
    fluid_dist_eqn_exchange_13_start  
    fluid_dist_eq_next_0 (next) fluid_dist_eq_0_vcm_action
    fluid_dist_eq_exchange_13_vcm_start (next)  

    dependencies.svg
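
With the blocking calls removed, execution order is carried purely by the input-buffer/source dependencies tabulated above (as OpenCL event wait lists); a toy topological ordering over a few of those edges:

```python
from graphlib import TopologicalSorter

# A few dependency edges from the table: routine -> routines producing
# its input buffers (illustrative subset only).
deps = {
    "fluid_param_next_2": {"fluid_dist_next_2"},
    "atom_force_next_1": {"fluid_param_next_2", "atom_inputs_sort_1"},
    "fluid_force_next_2": {"atom_inputs_sort_1", "atom_force_next_1"},
    "fluid_force_sync_20_start": {"fluid_force_next_2"},
    "fluid_force_sync_20_finish": {"fluid_force_sync_20_start"},
    "fluid_param_u_correct_0": {"fluid_param_next_2",
                                "fluid_force_sync_20_finish"},
}

# Any order respecting these edges is a valid non-blocking schedule;
# the runtime is free to overlap everything else.
order = list(TopologicalSorter(deps).static_order())
```

In the real code each edge is an event in a kernel's wait list rather than an explicit sort.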

  • [X] Print routines should wait on actions/verify range
    • have to change pointers to values
  • [X] Update profile analysis code for changes
  • [X] Can float fluid_param_u_correct_0 up to post_force

7. DONE Change ordering of array in GPU version [12/12]

  • Can convert routines one at a time by wrapping with transpose code
  • [X] Have a look at paraview dump code
  • [X] Switch order of array in kernels [7/7]
    • As we are manually calculating from a 1D index, it is technically aligned now
    • [X] cl_mem fluid_gridgeo_3_mem
    • [X] cl_mem fluid_dist_3_mem
    • [X] cl_mem fluid_dist_eq_3_mem
    • [X] cl_mem fluid_dist_eqn_3_mem
    • [X] cl_mem fluid_density_2_mem
    • [X] cl_mem fluid_velocity_2_mem
    • [X] cl_mem fluid_force_2_mem
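
The order switch changes which index is fastest-varying in the manual 1D offset calculation; a sketch of the two layouts (grid sizes hypothetical):

```python
# Hypothetical local grid sizes (including ghost layers).
NX, NY, NZ = 8, 6, 4

def idx_c_order(i, j, k):
    # Old layout: k varies fastest (C order), i takes the largest stride.
    return (i * NY + j) * NZ + k

def idx_fortran_order(i, j, k):
    # New layout: i varies fastest (Fortran order), k takes the largest
    # stride -- hence the even/odd exchange offset moving from i to k.
    return (k * NY + j) * NX + i
```

This is also why old restart files stop being valid: the same bytes decode to a transposed grid.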
  • [X] Fix offsetting in dump routines [7/7]
    • [X] print_fluid
    • [X] print_internal
    • [X] buffer_read_rectangle
    • [X] buffer_write_rectangle
    • [X] fluid_force_accumulate_rectangle
    • [X] calc_mass_momentum
    • [X] calc_MPT
  • [X] Switch order of fluid_grid_geo_3_mem
    • Map memory, directly initialize with strides, and drop sublattice
  • [X] Reverse i,j,k loops for optimal stepping
  • [X] Test new order [2/2]
    • [X] Test MPI boundary exchange
    • [X] Test GPU boundary exchange
  • [X] Fix initialization of fluid_grid_geo_3_mem
    • Copied from sublattice via buffer_create with CL_MEM_COPY_HOST_PTR
    • Wrong type in sizeof for memory map
  • [X] Fix MPI boundary exchange
    • Hadn't updated even/odd offset to now be on k instead of i (largest step)
  • [X] Switch to 3D threads
  • [X] Investigate Z-curve coordinate for atom positions (current is C order)
  • [X] Test restart file
    • Are old restart files still valid: no, now in Fortran order
  • [X] Fix restart files under MPI
    • Wrong structure element count in MPI file view type declaration
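
For the Z-curve investigation: a Morton key interleaves the coordinate bits so atoms close in space tend to get close keys, unlike plain C order; a minimal 10-bit-per-axis encoder:

```python
def part1by2(x):
    # Spread the low 10 bits of x so bit b lands at bit 3*b.
    x &= 0x3FF
    x = (x ^ (x << 16)) & 0xFF0000FF
    x = (x ^ (x << 8)) & 0x0300F00F
    x = (x ^ (x << 4)) & 0x030C30C3
    x = (x ^ (x << 2)) & 0x09249249
    return x

def morton3(i, j, k):
    # Interleave 10-bit i, j, k into one 30-bit Z-curve key.
    return part1by2(i) | (part1by2(j) << 1) | (part1by2(k) << 2)
```

Sorting atoms by `morton3` instead of z alone would improve locality of the stencil gathers, at the cost of a slightly more expensive key.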

8. DONE Resync/compare with reference code/LAMMPS [8/8]

  • [X] Comparison with reference code
    • Reference code recomputes forces after the full step when not used with fix_lb_viscous_gpu
      • this was introduced with fix_lb_rigid_pc_sphere
      • recomputed fluid force used to recompute equilibrium distribution too
    • Leaky borders
      • force stencils can extend into walls
      • pits derivative isn't one-sided along walls (is in CPU code, but doesn't matter as kappa_lb=0)
    • pressurebcx treats duplicate end point differently causing ghost point divergence
  • [X] Update pressurebcx to be symmetrical
    • Use the same point on both sides, with the adjustment applied on both left and right
    • Update to the newer density adjustment method
  • [X] Save restart file at end point like particles are saved
    • Need to save the velocity as well, as it is updated by fluid_param_u_correct_0
    • Need to update distribution velocities to give final step restart (see cpu code)
  • [X] Leaky edges with pits (shows as dependency on number of ghost points transferred)
    • GPU bbf only bounces back diagonals on edges if both components are wall normal

      • orientation 7 isn't bouncing back directions: 9-13 (likely reversed with 6)
      • orientation 20 isn't bouncing back directions: 7-8, 13-14 (likely reversed with 19)
      ORI Wall Normal Constructive Solid Geometry Bounce Back Components Original (if different)
      0 - -  
      1 (+1, 0, 0) 1, 7,10,11,14  
      2 ( 0,+1, 0) 2, 7, 8,11,12  
      3 ( 0, 0,+1) 5, 7, 8, 9,10  
      4 (+1, 0, 0) and ( 0, 0,+1) 7,10 1, 5, 7,10
      5 (+1, 0, 0) or ( 0, 0,+1) 1, 5, 7, 8, 9,10,11,14  
      6 ( 0,+1, 0) and ( 0, 0,+1) 7, 8 2, 5, 7, 8, 9,10,11,12
      7 ( 0,+1, 0) or ( 0, 0,+1) 2, 5, 7, 8, 9,10,11,12 2, 5, 7, 8
      8 (-1, 0, 0) or ( 0,-1, 0) 3, 4, 8, 9,10,12,13,14  
      9 (-1, 0, 0) or ( 0,+1, 0) 2, 3, 7, 8, 9,11,12,13  
      10 (+1, 0, 0) and ( 0, 0,+1) or ( 0,+1, 0) and ( 0, 0,+1) 7, 8,10  
      11 (+1, 0, 0) and ( 0, 0,+1) or ( 0,-1, 0) and ( 0, 0,+1) 7, 9,10  
      12 (-1, 0, 0) or ( 0,+1, 0) or ( 0, 0, +1) 2, 3, 5, 7, 8, 9,10,11,12,13  
      13 (-1, 0, 0) or ( 0,-1, 0) or ( 0, 0, +1) 3, 4, 5, 7, 8, 9,10,12,13,14  
      14 (-1, 0, 0) 3, 8, 9,12,13  
      15 ( 0,-1, 0) 4, 9,10,13,14  
      16 ( 0, 0,-1) 6,11,12,13,14  
      17 (-1, 0, 0) and ( 0, 0, 1) 8, 9 3, 5, 8, 9
      18 (-1, 0, 0) or ( 0, 0, 1) 3, 5, 7, 8, 9,10,12,13  
      19 ( 0,-1, 0) and ( 0, 0, 1) 9,10 4, 5, 7, 8, 9,10,13,14
      20 ( 0,-1, 0) or ( 0, 0, 1) 4, 5, 7, 8, 9,10,13,14 4, 5, 9,10
      21 ( 1, 0, 0) or ( 0,-1, 0) 1, 4, 7, 9,10,11,13,14  
      22 ( 1, 0, 0) or ( 0, 1, 0) 1, 2, 7, 8,10,11,12,14  
      23 (-1, 0, 0) and ( 0, 0, 1) or ( 0, 1, 0) and ( 0, 0, 1) 7, 8, 9  
      24 (-1, 0, 0) and ( 0, 0, 1) or ( 0,-1, 0) and ( 0, 0, 1) 8, 9,10  
      25 ( 1, 0, 0) or ( 0, 1, 0) or ( 0, 0, 1) 1, 2, 5, 7, 8, 9,10,11,12,14  
      26 ( 1, 0, 0) or ( 0,-1, 0) or ( 0, 0, 1) 1, 4, 5, 7, 8, 9,10,11,13,14  
      27 ( 0, 1, 0) or ( 0, 0,-1) 2, 6, 7, 8,11,12,13,14  
      28 ( 0,-1, 0) or ( 0, 0,-1) 4, 6, 9,10,11,12,13,14  
      29 ( 1, 0, 0) and ( 0, 0, 1) or ( 0, 1, 0) 2, 7, 8,10,11,12  
      30 ( 1, 0, 0) and ( 0, 0, 1) or ( 0,-1, 0) 4, 7, 9,10,13,14  
      31 (-1, 0, 0) and ( 0, 0, 1) or ( 0, 1, 0) 2, 7, 8, 9,11,12  
      32 (-1, 0, 0) and ( 0, 0, 1) or ( 0,-1, 0) 4, 8, 9,10,13,14  
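
Assuming the D3Q15 numbering the face rows imply (1-6 = +x, +y, -x, -y, +z, -z; 7-14 the corner diagonals), the Bounce Back Components column is reproducible as set algebra: each wall contributes the directions with a positive component along its normal, CSG "or" is union and "and" is intersection. A sketch:

```python
# D3Q15 direction vectors, numbered as the face rows imply:
# 1-6 axis directions, 7-14 corner diagonals.
C = {1: (1, 0, 0), 2: (0, 1, 0), 3: (-1, 0, 0), 4: (0, -1, 0),
     5: (0, 0, 1), 6: (0, 0, -1),
     7: (1, 1, 1), 8: (-1, 1, 1), 9: (-1, -1, 1), 10: (1, -1, 1),
     11: (1, 1, -1), 12: (-1, 1, -1), 13: (-1, -1, -1), 14: (1, -1, -1)}

def into_wall(normal):
    # Directions with a positive component along the wall normal,
    # i.e. the populations that must be bounced back at that wall.
    return {d for d, c in C.items()
            if sum(a * b for a, b in zip(c, normal)) > 0}

face = into_wall((1, 0, 0))                          # ORI 1
edge = into_wall((1, 0, 0)) & into_wall((0, 0, 1))   # ORI 4: "and"
union = into_wall((1, 0, 0)) | into_wall((0, 0, 1))  # ORI 5: "or"
```

This also makes the suspected 6/7 and 19/20 swaps above mechanically checkable against the table.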
  • [X] Clean up old branches (just master from main tyson)
  • [X] Add vector outputs as in current non-GPU code [2/2]
    • See CPU compute_vector, documented at the end of the initial comments
    • [X] Scalar is temperature; isn't working 100%, can skip
      • [X] Add atom_kinetic (temperature) mirroring atom_spread (allocation, etc.)
      • [X] Need to unsort these (actually don't, as we just need to sum)
      • [X] Compute degrees of freedom for atom_spread
    • [X] Vector (length 4) is total mass and total momentum
  • [X] Resync with upstream LAMMPS (make some notes on this)
    • need to build the include file for the kernel code (could include it)
  • [X] Update IO, parsing, and string handling

9. DONE Selection of GPU to use [2/2]

  • [X] Specify which OpenCL devices to use
    • Default selection
    • Hostname based overrides
  • [X] Print OpenCL devices used

10. DONE Implement new two pass interpolation [4/4]

  • Allows exact node mass calculation (non-set gamma)
  • Mass ratio requires particle and fluid mass
  • Can re-weight forces on particle
  • [X] Compute the normalization weights
    • Can do the weight interpolation in initial_integrate if we ensure nve goes first (need to warn the user if that's not the case)
    • Or just put it in post_integrate (put it in pre_force for now as we haven't hooked post_integrate)
  • [X]

    Get forces onto the GPU for spreading between fluid and particles to tie velocities together

    • Force on particle k = -hydro_force + m_particle/(m_particle + m_stencil_fluid) * F_particle-particle,k
    • Force on fluid from particle k = hydro_force + m_stencil_fluid/(m_particle + m_stencil_fluid) * F_particle-particle,k
    • Can just store the first as the difference works out to F_particle-particle,k
    • stencil_density*area * dm_lb is m_stencil_fluid (at end of compute gamma interaction factor comment)
    • The lbviscous routine then would just overwrite the force

    https://www.sciencedirect.com/science/article/pii/S0010465522000364?via%3Dihub

    • Need interpolated fluid mass for CPU calculation (both for force calculation and for temperature)
    • Need to send LAMMPS forces to GPU for force calculation
  • [X] Redo initialization and restart to maintain momentum
  • [X] Implement gamma scaling and negative value hack
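The force split above can be sketched as follows (a minimal illustration with hypothetical names; `f_pp` is the particle-particle force on particle k, `hydro` the hydrodynamic force, `m_p` the particle mass, and `m_sf` the interpolated stencil fluid mass):

```python
def split_force(f_pp, hydro, m_p, m_sf):
    """Split the particle-particle force between particle and fluid."""
    frac_p = m_p / (m_p + m_sf)              # particle's share of the pair force
    f_particle = -hydro + frac_p * f_pp      # force applied to particle k
    f_fluid = hydro + (1.0 - frac_p) * f_pp  # force spread onto the fluid
    return f_particle, f_fluid

# Only the first needs storing: subtracting it from f_pp recovers f_fluid,
# since the hydrodynamic terms cancel and the mass fractions sum to one.
fp, ff = split_force(f_pp=2.0, hydro=0.5, m_p=3.0, m_sf=1.0)
assert abs((2.0 - fp) - ff) < 1e-12
```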

11. TODO Resync with reference CPU code [2/4]

  • [X] Linear initialization
  • [X] Stencils [2/2]
    • [X] Add IBM3
    • [X] Replace Peskin with Keys
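For reference, a sketch of the Keys cubic kernel that replaces the Peskin stencil (this is the textbook a = -1/2 form of the Keys kernel, not lifted from the code):

```python
def keys(x):
    """Keys cubic interpolation kernel (a = -1/2), support |x| < 2."""
    x = abs(x)
    if x < 1.0:
        return 1.5 * x**3 - 2.5 * x**2 + 1.0
    if x < 2.0:
        return -0.5 * x**3 + 2.5 * x**2 - 4.0 * x + 2.0
    return 0.0

# The weights over the 4-point stencil form a partition of unity.
r = 0.3  # fractional offset of the particle within a cell
w = [keys(r - i) for i in (-1, 0, 1, 2)]
assert abs(sum(w) - 1.0) < 1e-12
```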
  • [-] Remove higher order variant [1/2]
    • [X] Remove explicit code
    • [ ]

      Remove extra transfers

      • fluid_dist_next_2: fluid_dist_eq_3_{old,new}, fluid_dist_eqn_3_old
      • fluid_vcm_remove_0: depends on what is needed
      Routine   Input Buffers Source   Output Buffers
      RESTART          
      InitializeFirstRun         fluid_force_2 [..0]
                fluid_dist_3 [..0]
                fluid_dist_eq_3 [..0]
                fluid_dist_eqn_3 [..0]
      fluid_dist_sync_13_start   fluid_dist_3 [-G-S-F..-1] InitializeFirstRun [..0]   fluid_dist_3 [-G-S-F..-1]
      fluid_dist_eq_sync_13_start   fluid_dist_eq_3 [-G-S-F..-1] InitializeFirstRun [..0]   fluid_dist_eq_3 [-G-S-F..-1]
      fluid_dist_eqn_sync_13_start   fluid_dist_eqn_3 [-G-S-F..-1] InitializeFirstRun [..0]   fluid_dist_eqn_3 [-G-S-F..-1]
      SETUP          
      fluid_dist_eq_sync_13_finish (next) (vcm)   fluid_dist_eq_3 [..0] (next) (vcm)     fluid_dist_eq_3 [..G+S+F] (next) (vcm)
          fluid_dist_eq_3 [1..G+S+F] (next) (vcm)      
      fluid_dist_sync_13_finish   fluid_dist_3 [..0]     fluid_dist_3 [..G+S+F]
          fluid_dist_3 [1..G+S+F]      
      manual reset   fluid_dist_3 [..G+S+F]   X fluid_dist_3 [..G+S+F] (vcm)
      fluid_param_next_2   fluid_dist_3 [..G+S] (vcm)   X fluid_density_2 [..G+S/D]
                fluid_velocity_2 [..G+S] (half) (vcm)
      atom_nonforce_write_1   atom [..G]     atom_position2 [..G]
                atom_velocity [..G]
                atom_type [..G]
                atom_mass [..G]
      atom_orders_compute_1   atom_position2 [..G]     atom_position2_z [..G]
                atom_index [..G]
      atom_nonforce_sort_1   atom_position2 [..G]     atom_position2 [..G] (sorted)
          atom_velocity [..G]     atom_velocity [..G] (sorted)
          atom_mass [..G]     atom_mass [..G] (sorted)
          atom_type [..G]     atom_type [..G] (sorted)
          atom_position2_z [..G]     atom_position2_z [..G] (sorted)
          atom_index [..G]   X atom_index [..G] (sorted)
      fluid_weight_sum_2   atom_position2 [..G] (sorted)     fluid_weight_2 [..G+S] (local)
          atom_position2_z [G..] (sorted)      
      fluid_weight_sync_22_start   fluid_weight_2 [-G-S..G+S] (local)     fluid_weight_2 [-G-S..G+S] (local)
      atom_force_write_1   atom [..G]     atom_force [..G]
      atom_force_sort_1   atom_force [..G]     atom_force [..G] (sorted)
          atom_index [..G] (sorted)      
      fluid_weight_sync_22_finish   fluid_weight_2 [..G+S] (local)     fluid_weight_2 [-G-S..G+S]
          fluid_weight_2 [-G-S..G+S] (remote)      
      atom_force_next_1   fluid_weight_2 [..G+S]     atom_spread [..G] (sorted)
          fluid_density_2 [..G+S]     atom_kinetic [..G] (sorted)
          fluid_velocity_2 [..G+S] (half) (vcm)     atom_force [..G] (fluid) (sorted)
          atom_position2 [..G] (sorted),      
          atom_velocity [..G] (sorted)      
          atom_mass [..G] (sorted)      
          atom_type [..G] (sorted)      
          atom_force [..G] (sorted)      
      fluid_force_next_2 (half)   atom_position2 [..G] (sorted)     fluid_force_2 [..G+S] (local)
          atom_position2_z [..G] (sorted)      
          atom_spread [..G] (sorted),      
          atom_force [..G] (fluid) (sorted)      
          fluid_weight_2 [..G+S]      
          fluid_force_2 [..0]      
      fluid_force_sync_20_start   fluid_force_2 [0..G+S] (local)     fluid_force_2 [0..G+S] (local)
      fluid_force_sync_20_finish   fluid_force_2 [0..G+S] (local)   X fluid_force_2 [..0]
          fluid_force_2 [-G-S..0] (remote)      
      fluid_param_u_correct_0   fluid_density_2 [..0]   X fluid_velocity_2 [..0] (vcm)
          fluid_velocity_2 [..0] (half) (vcm)      
          fluid_force_2 [..0]      
      atom_force_unsort_1   atom_force [..G] (fluid) (sorted)   X atom_force [..G] (fluid)
        X atom_index [..G] (sorted)      
      atom_force_read_1 X atom_force [..G] (fluid) atom_force_unsort_1 [..G] (fluid) X atom [..G]
      manual reset X fluid_velocity_2 [..0] (vcm) fluid_param_next_2 [..G+S] (half) (vcm) -half-? X fluid_velocity_2 [..0]
      manual reset X fluid_dist_3 [..G+S+F] (vcm) manual reset [..G+S+F] (vcm) X fluid_dist_3 [..G+S+F]
      fluid_dist_eq_next_0 X fluid_density_2 [..D] fluid_param_next_2 [..G+S/D] X fluid_dist_eq_3 [..0] (next) (vcm)
        X fluid_velocity_2 [..0] manual reset [..0]    
        X fluid_force_2 [..0] fluid_force_sync_20_finish [..0]    
      fluid_dist_eq_sync_13_start X fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm) fluid_dist_eq_next_0 [..0] (next) (vcm) X fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm)
      INITIAL INTEGRATE          
      fluid_dist_sync_13_finish (prev) X fluid_dist_3 [..0] (prev) fluid_vcm_remove_0 [..0] (prev) X fluid_dist_3 [..G+S+F] (prev)
        X fluid_dist_3 [1..G+S+F] (prev) fluid_dist_sync_13_start [-G-S-F..1] (prev)    
      fluid_dist_eq_sync_13_finish (vcm) X fluid_dist_eq_3 [..0] (vcm) fluid_dist_eq_next_0 [..0] X fluid_dist_eq_3 [..G+S+F] (vcm)
        X fluid_dist_eq_3 [1..G+S+F] (vcm) fluid_dist_eq_sync_13_start [-G-S-F..1] (vcm)    
      fluid_dist_eqn_next_3 X fluid_dist_eq_3 [..G+S+F] (vcm) fluid_dist_eq_sync_13_finish [1..G+S+F] (vcm) X fluid_dist_eqn_3 [..G+S+F] (vcm)
      fluid_dist_next_2 X fluid_dist_3 [..G+S+F] (prev) fluid_dist_sync_13_finish [0..G+S+F] (prev) X fluid_dist_3 [..G+S] (vcm)
        X fluid_dist_eqn_3 [..G+S+F] (vcm) fluid_dist_eqn_next_3 [..G+S+F]    
      fluid_param_next_2 X fluid_dist_3 [..G+S] (vcm) fluid_dist_next_2 [..G+S] (vcm) X fluid_density_2 [..G+S/D]
              X fluid_velocity_2 [..G+S] (half) (vcm)
      PRE FORCE          
      atom_nonforce_write_1 X atom [..G] PRE FORCE X atom_position2 [..G]
              X atom_velocity [..G]
              X atom_type [..G]
              X atom_mass [..G]
      atom_orders_compute_1 X atom_position2 [..G] atom_nonforce_write_1 [..G] X atom_position2_z [..G]
              X atom_index [..G]
      atom_nonforce_sort_1 X atom_position2 [..G] atom_nonforce_write_1 [..G] X atom_position2 [..G] (sorted)
        X atom_velocity [..G] atom_nonforce_write_1 [..G] X atom_velocity [..G] (sorted)
        X atom_mass [..G] atom_nonforce_write_1 [..G] X atom_mass [..G] (sorted)
        X atom_type [..G] atom_nonforce_write_1 [..G] X atom_type [..G] (sorted)
        X atom_position2_z [..G] atom_orders_compute_1 [..G] X atom_position2_z [..G] (sorted)
        X atom_index [..G] atom_orders_compute_1 [..G] X atom_index [..G] (sorted)
      fluid_weight_sum_2 X atom_position2 [..G] (sorted) atom_nonforce_sort_1 [..G] X fluid_weight_2 [..G+S] (local)
        X atom_position2_z [G..] (sorted) atom_nonforce_sort_1 [..G]    
      fluid_weight_sync_22_start X fluid_weight_2 [-G-S..G+S] (local) fluid_weight_sum_2 [..G+S] (local) X fluid_weight_2 [-G-S..G+S] (local)
      POST FORCE          
      atom_force_write_1 X atom [..G] POST FORCE X atom_force [..G]
      atom_force_sort_1 X atom_force [..G] atom_force_write_1 [..G] X atom_force [..G] (sorted)
        X atom_index [..G] (sorted) atom_nonforce_sort_1 [..G]    
      fluid_weight_sync_22_finish X fluid_weight_2 [..G+S] (local) fluid_weight_sum_2 [..G+S] (local) X fluid_weight_2 [..G+S]
        X fluid_weight_2 [-G-S..G+S] (remote) fluid_weight_sync_22_start [-G-S..G+S] (local)    
      atom_force_next_1 X fluid_weight_2 [..G+S] fluid_weight_sync_22_finish [..G+S] X atom_spread [..G] (sorted)
        X fluid_density_2 [..G+S] fluid_param_next_2 [..G+S/D]   atom_kinetic [..G] (sorted)
        X fluid_velocity_2 [..G+S] (half) (vcm) fluid_param_next_2 [..G+S] (half) (vcm) X atom_force [..G] (fluid) (sorted)
        X atom_position2 [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_velocity [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_mass [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_type [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_force [..G] (sorted) atom_force [..G] (sorted)    
      fluid_force_next_2 X atom_position2 [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted) X fluid_force_2 [..G+S] (local)
        X atom_position2_z [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
        X atom_spread [..G] (sorted) atom_force_next_1 [..G] (sorted)    
        X atom_force [..G] (fluid) (sorted) atom_force_next_1 [..G] (fluid) (sorted)    
        X fluid_weight_2 [..G+S] fluid_weight_sync_22_finish [..G+S]    
      fluid_force_sync_20_start X fluid_force_2 [0..G+S] (local) fluid_force_next_2 [..G+S] (local) X fluid_force_2 [0..G+S] (local)
      fluid_force_sync_20_finish X fluid_force_2 [..0] (local) fluid_force_next_2 [..G+S] (local) X fluid_force_2 [..0]
        X fluid_force_2 [-G-S..0] (remote) fluid_force_sync_20_start [0..G+S] (local)    
      fluid_param_u_correct_0 X fluid_density_2 [..0] fluid_param_next_2 [..G+S/D] X fluid_velocity_2 [..0] (vcm)
        X fluid_velocity_2 [..0] (half) (vcm) fluid_param_next_2 [..G+S] (half) (vcm)    
        X fluid_force_2 [..0] fluid_force_sync_20_finish [..0]    
      atom_force_unsort_1 X atom_force [..G] (fluid) (sorted) atom_force_next_1 [..G] (fluid) (sorted) X atom_force [..G] (fluid)
        X atom_index [..G] (sorted) atom_nonforce_sort_1 [..G] (sorted)    
      atom_force_read_1 X atom_force [..G] (fluid) atom_force_unsort_1 [..G] (fluid) X atom [..G]
      FINAL INTEGRATE          
      END OF STEP          
      vcm_total_calc_0 X atom [..G] ENDOFSTEP X vcm_total
        X fluid_density_2 [..0] fluid_param_next_2 [..G+S/D]    
        X fluid_velocity_2 [..0] (vcm) fluid_param_u_correct_0 [..0] (vcm)    
      fluid_vcm_remove_0 X vcm_total vcm_total_calc_0 X fluid_velocity_2 [..0]
        X fluid_velocity_2 [..0] (vcm) fluid_param_u_correct_0 [..0] (vcm) X fluid_dist_3 [..0]
        X fluid_dist_3 [..0] (vcm) fluid_dist_next_2 [..G+S] (vcm)   fluid_dist_eq_3 [..0]
          fluid_dist_eq_3 [..0] (vcm) fluid_dist_eq_next_0 [..0] (vcm)   fluid_dist_eqn_3 [..0]
          fluid_dist_eqn_3 [..0] (vcm) fluid_dist_eqn_next_3 [..G+S+F] (vcm)    
      restartWrite   fluid_force_2 [..0] fluid_force_sync_20_finish [..0]    
          fluid_dist_3 [..0] fluid_vcm_remove_0 [..0]    
          fluid_dist_eq_3 [..0] fluid_vcm_remove_0 [..0]    
          fluid_dist_eqn_3 [..0] fluid_vcm_remove_0 [..0]    
      fluid_dist_sync_13_start X fluid_dist_3 [-G-S-F..-1] fluid_vcm_remove_0 [..0] X fluid_dist_3 [-G-S-F..-1]
      fluid_dist_eq_next_0 (next) X fluid_density_2 [..D] fluid_param_next_2 [..G+S/D] X fluid_dist_eq_3 [..0] (next) (vcm)
        X fluid_velocity_2 [..0] fluid_vcm_remove_0 [..0]    
        X fluid_force_2 [..0] fluid_force_sync_20_finish [..0]    
      fluid_dist_eq_sync_13_start (next) (vcm) X fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm) fluid_dist_eq_next_0 [..0] (next) (vcm) X fluid_dist_eq_3 [-G-S-F..-1] (next) (vcm)
  • [ ] Remove explicit equilibrium distribution

12. DONE Discovered code fixes [7/7]

  • [X] Missed one-processor GPU boundary exchange optimization for fluid_force
  • [X] Missing OpenCL event free on most kernel calls
  • [X] Remove momentum code is broken
    • [X] Only waits on OpenCL events (needs to correctly handle MPI)
    • [X] fluid_momentum_remove should be using fluid_velocity_2 and fluid_density_2 so it sees the results of fluid_correctu_next_0
    • [X] Fix should just set a flag to enable and not call directly (to get it in the correct spot)
    • [X] Verify step passed in is correct (should it be step or step+1 – former is correct)
    • [X] Kernel defines dist_3_new = ... (step & 0x01) ... where new should actually be (step+1) & 0x01
    • [X] Test periodic boundary conditions
      • call fix momentum, should leave it zero
      • add a small body force, remove momentum every 10 steps; should see it grow and then drop
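The dist_3_new parity bug above comes down to double-buffer index selection; a minimal sketch of why "new" must use step+1:

```python
# With two distribution buffers, the buffer written at step s must be the
# one read at step s+1, so the write ("new") index is (step + 1) & 0x01.
def buffers(step):
    old = step & 0x01        # buffer holding the current distribution
    new = (step + 1) & 0x01  # buffer the kernel writes into
    return old, new

for s in range(4):
    old, new = buffers(s)
    assert old != new                # never read and write the same buffer
    assert buffers(s + 1)[0] == new  # next step reads what this step wrote
```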
  • [X] Fix lb/fluid/rigid/pc/sphere: `omega` calculation diverges a bit on restart
    • likely an issue with the `setup` `omega` calculation vs the `initial_integrate`/`final_integrate` one
    • replaced by the standard rigid fix (gone in next iteration)
  • [X] fluid_correctu_next_2 is called when we only have fluid_force_2 [0] boundaries
    • verified with Colin and should only be calculating on interior fluid_correctu_next_0
  • [X] Likely shifts momentum in first step due to InitializeFirstRun computing fluid force
    • this would be applied in the first 1/2 step
    • for consistency with other LAMMPS integrators, should be zero for this
    • first calculation should occur in post force routine
  • [X] Not getting good local size from kernel_local_box
    • Working correctly, was limited by the 64 work groups per compute unit requirement

13. TODO Code cleanups [15/31]

  • [ ] fluid_dist_eqn_next_3 has duplicate non-local prefixed n3, stride3, and offsets3 variables
  • [X] Could use some whitespace cleanup as there is some trailing whitespace and a mix of tabs vs spaces
  • [ ] Should make the offsetting in the ?boundaries_* gpu routines more readable (same style as others)
  • [X] Expect the (__global realType*) casts in new gpu code are not required
    • They actually are required, as it is loading a single component from variable locations
  • [ ] See if the gpu bounce back code could now be written in a nicer vectorized way
  • [ ] Checking queue properties of devices should use existing pre-wrapped error check versions
  • [X] Drop local me parameter (currently a mix of local me and comm->me)
  • [ ] Proper setting of typeLB is not verified (required for things like initialization of K0)
  • [ ] gridgeo_2 memory should be const in all (most?) GPU kernels
  • [X] Backtrace cleanup is a mess and printing garbage [3/3]
    • Need to compile with the -rdynamic option to not strip out static function names
    • [X] Bug due to backtrace allocating one chunk for strings and pointers
    • [X] Exception safety via C++11 unique_ptr rewrite
    • [X] Figure out why function names are included
  • [X] Fix differences between fluid dumping and atom dumping (via dump routines)
    • Dump code (ntimestep based) and fluid code (step based) are off by one in their counters:
      • update->ntimestep is 0 in setup and then 1 for the first step
      • step is -1 in setup and then 0 for first step
    • Dump code dumps in both setup and regular steps while fluid code only does regular steps
  • [ ] Check with Colin about treating sw (y sidewalls) boundary the same as z one (currently one point short)
  • [ ] Rename lattice variables to match the GPU gridgeo name
  • [ ] Base lattice size calculations on gpu size variables and not subNb{x,y,z} ones
  • [X]

    Make code cleaner by using post_run (can get rid of everything but send bit in setup)

    • not doing this, as it isn't cleaner due to the setup (or init) mismatch with post_run
      • setup is always run the first time and then only if the run pre flag is set
      • post_run is always run
    • how to handle calculation of forces in setup (1/2 step velocity is no longer available)
      • if setup computes the forces
        • no need for momentum removal (user will have disabled it if they have introduced net momentum)
      • if setup is minimal, we are okay; no recomputations required
    constructor setup initial_integrate final_integrate post_run destructor
    init_dist_* new            
    send_dist_* new   recv_dist_* old        
    send_dist_eq new drop_dist_eq new+          
      calc_force          
      xchg_force          
      calc dist_eq new+          
      send_dist_eq new+ recv_dist_eq new        
               
            send dist_* new   recv_dist_* new
            calc dist_eq new+    
            send dist_eq new+   recv_dist_eq new+
    init_dist_* new   if not ~post_run        
      send dist_* new : recv_dist_* old        
      calc_force          
      xchg_force          
      calc dist_eq new+          
      send_dist_eq new+ : recv_dist_eq new        
               
            send dist_* new recv_dist_* new  
            calc dist_eq new+    
            send dist_eq new+ recv_dist_eq new+  
  • [ ] Should be able to disable profiling as may reduce performance
  • [ ] Can probably drop explicit std::string constructors
  • [X] Add ghost points suffix to fluid_momentum_remove_kernel
    • Was renamed to fluid_vcm_remove_0
  • [X] Clean up gpu boundary exchange naming to match others
  • [ ] Cleanup compilation warnings
    • [ ] Fix size_t related narrowing conversion warnings
  • [X] Combine profile averages and only output from first rank
  • [X] fluid_correctu_next_2 should be just fluid_correctu_next_0 as only fluid_force_2 [0] is available
  • [X] Clarify components acted upon in write, rewrite, sort, resort, and read in names
    • Made in final version of action code
  • [ ] Deduplicate buffer specific inner syncing with macros like outer ones
  • [X] sync_inner_* routines' inner target buffers are larger than they need to be (outside_offset_*[1] = mem_border)
  • [X] atom_force_next should clip stencil to avoid past end access if atom goes out of bounds
  • [X] Floating point constants without the f suffix will cause calculations to be done in double
  • [X] Consistent handling and naming of boundary conditions
    • Prefix with effect (wall, pressure) and
    • Switch to global location condition inside of GPU
  • [ ] Could skip atom_force_read_1 if gamma scaling is negative for all atom types
  • [ ] Should switch atom_force (combined) from fluid to atom units to avoid confusion
  • [ ] Replace long with bigint
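As an illustration of the stencil clipping item above (hypothetical names; `i0` is the first stencil grid index, `width` the stencil size, and `n` the local grid extent), clamping the index range keeps an out-of-bounds atom from indexing past the end of the fluid arrays:

```python
def clipped_stencil(i0, width, n):
    """Clamp a stencil's grid-index range to [0, n).

    An atom that drifts out of bounds yields a partially (or fully)
    empty range instead of an out-of-range array access.
    """
    lo = max(i0, 0)
    hi = min(i0 + width, n)
    return range(lo, hi)

assert list(clipped_stencil(2, 4, 8)) == [2, 3, 4, 5]   # fully interior
assert list(clipped_stencil(-1, 4, 8)) == [0, 1, 2]     # clipped at low end
assert list(clipped_stencil(6, 4, 8)) == [6, 7]         # clipped at high end
```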

14. TODO Documentation updates [0/4]

  • [ ] need documentation as to why the routines
  • [ ] and a description of the comments
  • [ ] add tables to documentation (describe the columns and the bracketed terms)
  • [ ] note about the HEX file needing to be there for the build

15. TODO Investigate potential GPU options before switch to HIP [5/12]

  • [X] Full profile information dump [2/2]
    • [X] Switch from total runtimes to individual time points
    • [X] Output profile information
  • [X] Remove waits from sorting routines [1/1]
    • should we stick all the event handlers in a queue
    • [X]

      Try on atom_sort_position2_z_* routines

      Operation      Wait between  Wait at end  Speedup
      sort_..._bb    7.57          5.98         21.0%
      sort_..._bm    6.82          5.32         22.0%
      combined wall  7.71          5.29         31.4%
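The speedup column is consistent with (before - after)/before; a quick check of the numbers above:

```python
# Timings from the table above (operation labels kept as written there).
timings = {
    "sort_..._bb": (7.57, 5.98),
    "sort_..._bm": (6.82, 5.32),
    "combined wall": (7.71, 5.29),
}
for op, (before, after) in timings.items():
    speedup = 100.0 * (before - after) / before
    print(f"{op}: {speedup:.1f}%")  # matches the table's speedup column
```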
  • [ ] Test run with weight set to 1 and with trilinear
    • Is the atom searching or the stencil costing the most
  • [ ] Can optimize complete map writing with CL_MAP_WRITE_INVALIDATE_REGION instead of CL_MAP_WRITE
  • [ ] Mapping pinned memory should give the fastest access
    • NVIDIA OpenCL optimization guide explains how to suggest pinning
  • [ ] Use memory mapping to copy buffers
    • fluid distribution reads and force array
    • may be able to work with unified memory
  • [X] Record starting and stopping times for profiling to determine key path holdups
  • [ ] Non-blocking GPU operations do not necessarily execute unless there is a flush
    • Likely okay under NVIDIA except for Windows (from NVIDIA OpenCL optimization guide)
    • Could be a reason for the slow down under AMD
  • [ ] Where would it be worthwhile to use local memory
  • [X] Sometimes fluid_force_next_2 is 3.7x slower (1.03 to 3.81ms)
    • tracked down to broken riser card on gra986 asserting hardware power brake on GPU card
  • [ ] Check regular code on thread safe MPI
    • Cluster OpenMPI supports threads if initialized with MPI_Init_thread instead of MPI_Init as in LAMMPS
    • Doesn't seem to be any significant slowdown from replacing MPI_Init with MPI_Init_thread
  • [X] Use events between groups

16. Item for future

  • Explore some ideas around the gridgeo code
    • Use bit vector to track bounce back status for each cell
      • Could store in texture memory as constant for entire simulation
    • Arbitrary geometry support by loading mask of fluid vs non-fluid
      • could the new gridgeo system be used to subsume the wall bounce back code (moving walls likely an issue)
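The bit-vector idea can be sketched as follows (one bounce-back flag per cell, 32 cells packed per word; names are illustrative, not from the code):

```python
# Pack per-cell bounce-back status into a bit vector: 1/32nd the memory of
# a byte flag per cell, and constant for the whole simulation, so it could
# live in texture/constant memory on the GPU.
def make_mask(n_cells):
    return [0] * ((n_cells + 31) // 32)

def set_bb(mask, cell):
    mask[cell >> 5] |= 1 << (cell & 31)

def is_bb(mask, cell):
    return (mask[cell >> 5] >> (cell & 31)) & 1 == 1

mask = make_mask(100)
set_bb(mask, 0)
set_bb(mask, 33)
assert is_bb(mask, 33) and not is_bb(mask, 34)
```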
  • Resync with upstream lammps
    • now includes a cmake build system
  • Cleanup old subNbx initialization calculations
    • had intended to in prior DP, but never got around to it
  • Changes in force to handle multiple particles contributing to a grid point better
    • Can likely incorporate into current one pass on GPU
  • Better CPU/GPU border exchange code (currently all or nothing)
  • verify GPU options work correctly

Author: Tyson Whitehead

Created: 2023-11-28 Tue 15:21
