Probe WRT on GPUs #964


Merged · 2 commits merged into MFlowCode:master on Jul 22, 2025
Conversation

anandrdbz
Contributor

@anandrdbz commented Jul 21, 2025

User description

Description

Bug fix for probe_wrt on GPUs, which writes the acceleration and center of mass to probe files. This subroutine had not been ported to GPUs, so it previously produced NaNs and performed poorly.

Fixes #(issue) [optional]

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Something else

Scope

  • This PR comprises a set of related changes with a common goal

If you cannot check the above box, please split your PR into multiple PRs that each have a common goal.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes, provide instructions so we can reproduce them, and list any relevant details of your test configuration.

  • Test A
  • Test B

Test Configuration:

  • What computers and compilers did you use to test this:

Checklist

  • I have added comments for the new code
  • I added Doxygen docstrings to the new code
  • I have made corresponding changes to the documentation (docs/)
  • I have added regression tests to the test suite so that people can verify in the future that the feature is behaving as expected
  • I have added example cases in examples/ that demonstrate my new feature performing as expected.
    They run to completion and demonstrate "interesting physics"
  • I ran ./mfc.sh format before committing my code
  • New and existing tests pass locally with my changes, including with GPU capability enabled (both NVIDIA hardware with NVHPC compilers and AMD hardware with CRAY compilers) and disabled
  • This PR does not introduce any repeated code (it follows the DRY principle)
  • I cannot think of a way to condense this code and reduce any introduced additional line count

If your code changes any code source files (anything in src/simulation)

To make sure the code is performing as expected on GPU devices, I have:

  • Checked that the code compiles using NVHPC compilers
  • Checked that the code compiles using CRAY compilers
  • Ran the code on V100, A100, or H100 GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Ran the code on MI200+ GPUs and ensured the new feature performed as expected (the GPU results match the CPU results)
  • Enclosed the new feature in nvtx ranges so that it can be identified in profiles
  • Ran a Nsight Systems profile using ./mfc.sh run XXXX --gpu -t simulation --nsys, and have attached the output file (.nsys-rep) and plain text results to this PR
  • Ran a Rocprof Systems profile using ./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace, and have attached the output file and plain text results to this PR.
  • Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

PR Type

Enhancement


Description

  • Add GPU support for probe write functionality
  • Implement GPU memory management for finite difference coefficients (sketched after this list)
  • Convert array operations to GPU-compatible loops
  • Add atomic operations for thread-safe center of mass calculations
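
A minimal sketch of the coefficient handling, shown with plain OpenACC directives rather than the project's $:GPU_* macros (the array shape follows the fd_coeff_x(r, j) indexing in the snippets below; the host-side coefficient computation is elided):

    ! Device-resident finite-difference coefficient arrays (sketch, not the PR's exact code).
    real(wp), allocatable, dimension(:, :) :: fd_coeff_x, fd_coeff_y, fd_coeff_z
    !$acc declare create(fd_coeff_x, fd_coeff_y, fd_coeff_z)

    allocate (fd_coeff_x(-fd_number:fd_number, 0:m))
    ! ... compute the coefficients on the host ...
    !$acc update device(fd_coeff_x)   ! make the host values visible to the GPU loops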


Diagram Walkthrough

flowchart LR
  A["CPU-only probe writes"] --> B["GPU memory allocation"]
  B --> C["GPU parallel loops"]
  C --> D["Atomic operations"]
  D --> E["GPU-accelerated probe writes"]

File Walkthrough

Relevant files
Configuration changes
m_checker.fpp — Prohibit probe writes with IGR

src/simulation/m_checker.fpp

  • Add prohibition check for probe writes with IGR (a hypothetical rendering follows)
  +1/-0
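
A hypothetical rendering of that one-line check, using the @:PROHIBIT macro that MFC's checkers rely on (the actual condition and message in the PR may differ):

    ! Sketch of the added input check in m_checker.fpp (condition and wording hypothetical):
    @:PROHIBIT(igr .and. probe_wrt, "probe_wrt is not supported with IGR")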
Enhancement
m_derived_variables.fpp — GPU support for derived variables computation

src/simulation/m_derived_variables.fpp

  • Add GPU memory declarations and allocations
  • Convert array operations to GPU parallel loops
  • Implement GPU memory transfers for coefficients
  • Add atomic operations for thread-safe calculations
  +240/-128
m_time_steppers.fpp — GPU-compatible time step cycling

src/simulation/m_time_steppers.fpp

  • Replace array-slice operations with explicit GPU loops (before/after sketch below)
  • Convert time step cycling to a GPU-compatible form
  +44/-13
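
The cycling change amounts to replacing host-side array-slice copies with an explicitly collapsed loop nest that the GPU macro can offload. The "before" line is a sketch of the prior pattern; the loop matches the code quoted in the suggestions below:

    ! Before (sketch of the prior host-only pattern, per solution-vector field i):
    q_prim_ts(3)%vf(i)%sf(:, :, :) = q_prim_ts(2)%vf(i)%sf(:, :, :)

    ! After: explicit loops the GPU macro offloads
    $:GPU_PARALLEL_LOOP(collapse=4)
    do i = 1, sys_size
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    q_prim_ts(3)%vf(i)%sf(j, k, l) = q_prim_ts(2)%vf(i)%sf(j, k, l)
                    q_prim_ts(2)%vf(i)%sf(j, k, l) = q_prim_ts(1)%vf(i)%sf(j, k, l)
                    q_prim_ts(1)%vf(i)%sf(j, k, l) = q_prim_ts(0)%vf(i)%sf(j, k, l)
                    q_prim_ts(0)%vf(i)%sf(j, k, l) = q_prim_vf(i)%sf(j, k, l)
                end do
            end do
        end do
    end do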

@anandrdbz requested a review from a team as a code owner on Jul 21, 2025, 09:12

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Duplication

The acceleration component calculation contains significant code duplication across the three coordinate directions (x, y, z): the nested loops and finite difference calculations are nearly identical, with only the momentum component index changing. This violates the DRY principle and makes maintenance difficult; one possible condensation is sketched after the snippet.

    $:GPU_PARALLEL_LOOP(collapse=3)   
    do l = 0, p
        do k = 0, n
            do j = 0, m
                q_sf(j, k, l) = (11._wp*q_prim_vf0(momxb)%sf(j, k, l) &
                                 - 18._wp*q_prim_vf1(momxb)%sf(j, k, l) &
                                 + 9._wp*q_prim_vf2(momxb)%sf(j, k, l) &
                                 - 2._wp*q_prim_vf3(momxb)%sf(j, k, l))/(6._wp*dt)
            end do 
        end do 
    end do 

    if(n == 0) then 
        $:GPU_PARALLEL_LOOP(collapse=4) 
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                       q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxb)%sf(r + j, k, l) 
                    end do 
                end do 
            end do 
        end do
    elseif (p == 0) then 
        $:GPU_PARALLEL_LOOP(collapse=4) 
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                       q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxb)%sf(r + j, k, l) &
                                        + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                        q_prim_vf0(momxb)%sf(j, r + k, l)
                    end do 
                end do 
            end do 
        end do
    else 
        if(grid_geometry == 3) then 
            $:GPU_PARALLEL_LOOP(collapse=4) 
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        do r = -fd_number, fd_number
                           q_sf(j, k, l) = q_sf(j, k, l) &
                                            + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                            q_prim_vf0(momxb)%sf(r + j, k, l) &
                                            + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                            q_prim_vf0(momxb)%sf(j, r + k, l) &
                                            + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                            q_prim_vf0(momxb)%sf(j, k, r + l)/y_cc(k)
                        end do 
                    end do 
                end do 
            end do
        else
            $:GPU_PARALLEL_LOOP(collapse=4) 
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        do r = -fd_number, fd_number
                           q_sf(j, k, l) = q_sf(j, k, l) &
                                            + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                            q_prim_vf0(momxb)%sf(r + j, k, l) &
                                            + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                            q_prim_vf0(momxb)%sf(j, r + k, l) &
                                            + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                            q_prim_vf0(momxb)%sf(j, k, r + l)
                        end do 
                    end do 
                end do 
            end do
        end if
    end if
! Computing the acceleration component in the y-coordinate direction
elseif (i == 2) then
    $:GPU_PARALLEL_LOOP(collapse=3)   
    do l = 0, p
        do k = 0, n
            do j = 0, m
                q_sf(j, k, l) = (11._wp*q_prim_vf0(momxb + 1)%sf(j, k, l) &
                                 - 18._wp*q_prim_vf1(momxb + 1)%sf(j, k, l) &
                                 + 9._wp*q_prim_vf2(momxb + 1)%sf(j, k, l) &
                                 - 2._wp*q_prim_vf3(momxb + 1)%sf(j, k, l))/(6._wp*dt)
            end do 
        end do 
    end do 

    if (p == 0) then 
        $:GPU_PARALLEL_LOOP(collapse=4) 
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                       q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxb + 1)%sf(r + j, k, l) &
                                        + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                        q_prim_vf0(momxb + 1)%sf(j, r + k, l)
                    end do 
                end do 
            end do 
        end do
    else 
        if(grid_geometry == 3) then 
            $:GPU_PARALLEL_LOOP(collapse=4) 
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        do r = -fd_number, fd_number
                           q_sf(j, k, l) = q_sf(j, k, l) &
                                            + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                            q_prim_vf0(momxb + 1)%sf(r + j, k, l) &
                                            + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                            q_prim_vf0(momxb + 1)%sf(j, r + k, l) &
                                            + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                            q_prim_vf0(momxb + 1)%sf(j, k, r + l)/y_cc(k) &
                                            - (q_prim_vf0(momxe)%sf(j, k, l)**2._wp)/y_cc(k)
                        end do 
                    end do 
                end do 
            end do
        else
            $:GPU_PARALLEL_LOOP(collapse=4) 
            do l = 0, p
                do k = 0, n
                    do j = 0, m
                        do r = -fd_number, fd_number
                           q_sf(j, k, l) = q_sf(j, k, l) &
                                            + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                            q_prim_vf0(momxb + 1)%sf(r + j, k, l) &
                                            + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                            q_prim_vf0(momxb + 1)%sf(j, r + k, l) &
                                            + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                            q_prim_vf0(momxb + 1)%sf(j, k, r + l)
                        end do 
                    end do 
                end do 
            end do
        end if
    end if
! Computing the acceleration component in the z-coordinate direction
else
    $:GPU_PARALLEL_LOOP(collapse=3)   
    do l = 0, p
        do k = 0, n
            do j = 0, m
                q_sf(j, k, l) = (11._wp*q_prim_vf0(momxe)%sf(j, k, l) &
                                 - 18._wp*q_prim_vf1(momxe)%sf(j, k, l) &
                                 + 9._wp*q_prim_vf2(momxe)%sf(j, k, l) &
                                 - 2._wp*q_prim_vf3(momxe)%sf(j, k, l))/(6._wp*dt)
            end do 
        end do 
    end do 

    if(grid_geometry == 3) then 
        $:GPU_PARALLEL_LOOP(collapse=4) 
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                       q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxe)%sf(r + j, k, l) &
                                        + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                        q_prim_vf0(momxe)%sf(j, r + k, l) &
                                        + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                        q_prim_vf0(momxe)%sf(j, k, r + l)/y_cc(k) &
                                        + (q_prim_vf0(momxe)%sf(j, k, l)* &
                                           q_prim_vf0(momxb + 1)%sf(j, k, l))/y_cc(k)
                    end do 
                end do 
            end do 
        end do
    else
        $:GPU_PARALLEL_LOOP(collapse=4) 
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    do r = -fd_number, fd_number
                       q_sf(j, k, l) = q_sf(j, k, l) &
                                        + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
                                        q_prim_vf0(momxe)%sf(r + j, k, l) &
                                        + q_prim_vf0(momxb + 1)%sf(j, k, l)*fd_coeff_y(r, k)* &
                                        q_prim_vf0(momxe)%sf(j, r + k, l) &
                                        + q_prim_vf0(momxe)%sf(j, k, l)*fd_coeff_z(r, l)* &
                                        q_prim_vf0(momxe)%sf(j, k, r + l)
                    end do 
                end do 
            end do 
        end do
    end if
end if
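
One possible condensation, sketched below and not taken from the PR, selects the momentum component from the direction index once so that a single loop nest computes the unsteady term for all three directions; the mom_i mapping is hypothetical, and the cylindrical (grid_geometry == 3) source terms would still need per-direction handling:

    ! Hypothetical refactor: map direction i to its momentum component once.
    ! In 3D, momxe == momxb + 2, so mom_i covers the x, y, and z components.
    mom_i = momxb + (i - 1)

    $:GPU_PARALLEL_LOOP(collapse=3)
    do l = 0, p
        do k = 0, n
            do j = 0, m
                q_sf(j, k, l) = (11._wp*q_prim_vf0(mom_i)%sf(j, k, l) &
                                 - 18._wp*q_prim_vf1(mom_i)%sf(j, k, l) &
                                 + 9._wp*q_prim_vf2(mom_i)%sf(j, k, l) &
                                 - 2._wp*q_prim_vf3(mom_i)%sf(j, k, l))/(6._wp*dt)
            end do
        end do
    end do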
Variable Reference

The code uses hardcoded momentum indices like momxb, momxe instead of the original mom_idx%beg, mom_idx%end pattern. This change should be verified to ensure these variables are properly defined and accessible in the GPU context.

(The snippet attached to this observation duplicates the acceleration-computation code quoted under Code Duplication above.)
Atomic Operations

The center of mass calculation uses atomic updates to accumulate values in parallel loops. While necessary for correctness with this loop structure, the atomics could create performance bottlenecks on GPUs by serializing memory access. Consider using reduction operations instead; a reduction-based sketch follows the snippet.

                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 1) = c_m(i, 1) + q_vf(i)%sf(j, k, l)*dV
                    ! x-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 2) = c_m(i, 2) + q_vf(i)%sf(j, k, l)*dV*x_cc(j)
                    ! Volume fraction
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 5) = c_m(i, 5) + q_vf(i + advxb - 1)%sf(j, k, l)*dV
                end do
            end do
        end do
    end do
elseif (p == 0) then !2D simulation
    $:GPU_PARALLEL_LOOP(collapse=3,private='[dV]')
    do l = 0, p !Loop over grid
        do k = 0, n
            do j = 0, m
                $:GPU_LOOP(parallelism='[seq]')
                do i = 1, num_fluids !Loop over individual fluids
                    dV = dx(j)*dy(k)
                    ! Mass
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 1) = c_m(i, 1) + q_vf(i)%sf(j, k, l)*dV
                    ! x-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 2) = c_m(i, 2) + q_vf(i)%sf(j, k, l)*dV*x_cc(j)
                    ! y-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 3) = c_m(i, 3) + q_vf(i)%sf(j, k, l)*dV*y_cc(k)
                    ! Volume fraction
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 5) = c_m(i, 5) + q_vf(i + advxb - 1)%sf(j, k, l)*dV
                end do
            end do
        end do
    end do
else !3D simulation
    $:GPU_PARALLEL_LOOP(collapse=3,private='[dV]')
    do l = 0, p !Loop over grid
        do k = 0, n
            do j = 0, m
                $:GPU_LOOP(parallelism='[seq]')
                do i = 1, num_fluids !Loop over individual fluids

                    dV = dx(j)*dy(k)*dz(l)
                    ! Mass
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 1) = c_m(i, 1) + q_vf(i)%sf(j, k, l)*dV
                    ! x-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 2) = c_m(i, 2) + q_vf(i)%sf(j, k, l)*dV*x_cc(j)
                    ! y-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 3) = c_m(i, 3) + q_vf(i)%sf(j, k, l)*dV*y_cc(k)
                    ! z-location weighted
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 4) = c_m(i, 4) + q_vf(i)%sf(j, k, l)*dV*z_cc(l)
                    ! Volume fraction
                    $:GPU_ATOMIC(atomic='update')
                    c_m(i, 5) = c_m(i, 5) + q_vf(i + advxb - 1)%sf(j, k, l)*dV
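
A reduction-based alternative might look like the following sketch (3D case; plain OpenACC rather than the project's $:GPU_* macros, with the fluid loop hoisted outside the grid loops so each accumulator is a scalar, and host-side accumulation into c_m shown for simplicity):

    real(wp) :: mass, xw, yw, zw, vfrac, dV
    do i = 1, num_fluids
        mass = 0._wp; xw = 0._wp; yw = 0._wp; zw = 0._wp; vfrac = 0._wp
        !$acc parallel loop collapse(3) private(dV) reduction(+:mass, xw, yw, zw, vfrac)
        do l = 0, p
            do k = 0, n
                do j = 0, m
                    dV = dx(j)*dy(k)*dz(l)
                    mass  = mass  + q_vf(i)%sf(j, k, l)*dV
                    xw    = xw    + q_vf(i)%sf(j, k, l)*dV*x_cc(j)
                    yw    = yw    + q_vf(i)%sf(j, k, l)*dV*y_cc(k)
                    zw    = zw    + q_vf(i)%sf(j, k, l)*dV*z_cc(l)
                    vfrac = vfrac + q_vf(i + advxb - 1)%sf(j, k, l)*dV
                end do
            end do
        end do
        ! Fold the reduced scalars into the center-of-mass table on the host.
        c_m(i, 1) = c_m(i, 1) + mass
        c_m(i, 2) = c_m(i, 2) + xw
        c_m(i, 3) = c_m(i, 3) + yw
        c_m(i, 4) = c_m(i, 4) + zw
        c_m(i, 5) = c_m(i, 5) + vfrac
    end do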


qodo-merge-pro bot commented Jul 21, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Add bounds checking for array access

The array access q_prim_vf0(momxb)%sf(r + j, k, l) can cause out-of-bounds
memory access when r + j exceeds array bounds. Add boundary checks to prevent
accessing invalid memory locations, which could cause crashes or incorrect
results.

src/simulation/m_derived_variables.fpp [219-230]

 $:GPU_PARALLEL_LOOP(collapse=4) 
 do l = 0, p
     do k = 0, n
         do j = 0, m
             do r = -fd_number, fd_number
-               q_sf(j, k, l) = q_sf(j, k, l) &
-                                + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
-                                q_prim_vf0(momxb)%sf(r + j, k, l) 
+               if (r + j >= 0 .and. r + j <= m) then
+                   q_sf(j, k, l) = q_sf(j, k, l) &
+                                    + q_prim_vf0(momxb)%sf(j, k, l)*fd_coeff_x(r, j)* &
+                                    q_prim_vf0(momxb)%sf(r + j, k, l) 
+               end if
             end do 
         end do 
     end do 
 end do
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a potential out-of-bounds memory access in q_prim_vf0(momxb)%sf(r + j, k, l), which is a critical bug that could lead to incorrect results or crashes.

High
Fix data dependencies in parallel loops

The cascading assignments create data dependencies that prevent proper GPU
parallelization. The assignments should be done in separate loops or use
temporary variables to avoid race conditions and ensure correct execution order.

src/simulation/m_time_steppers.fpp [1122-1134]

+! Use separate loops to avoid data dependencies
 $:GPU_PARALLEL_LOOP(collapse=4)
 do i = 1, sys_size
     do l = 0, p 
         do k = 0, n 
             do j = 0, m 
-                q_prim_ts(3)%vf(i)%sf(j, k, l) = q_prim_ts(2)%vf(i)%sf(j, k, l)
-                q_prim_ts(2)%vf(i)%sf(j, k, l) = q_prim_ts(1)%vf(i)%sf(j, k, l)
-                q_prim_ts(1)%vf(i)%sf(j, k, l) = q_prim_ts(0)%vf(i)%sf(j, k, l)
-                q_prim_ts(0)%vf(i)%sf(j, k, l) = q_prim_vf(i)%sf(j, k, l)
+                temp_val = q_prim_ts(2)%vf(i)%sf(j, k, l)
+                q_prim_ts(3)%vf(i)%sf(j, k, l) = temp_val
             end do 
         end do 
     end do
 end do
+! Continue with separate loops for other assignments
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a read-after-write data dependency within a parallel loop, which creates a race condition and will lead to incorrect results.

High


codecov bot commented Jul 21, 2025

Codecov Report

Attention: Patch coverage is 15.54054% with 125 lines in your changes missing coverage. Please review.

Project coverage is 44.03%. Comparing base (f2ef560) to head (e576795).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/simulation/m_derived_variables.fpp 6.45% 115 Missing and 1 partial ⚠️
src/simulation/m_time_steppers.fpp 65.21% 8 Missing ⚠️
src/simulation/m_checker.fpp 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #964      +/-   ##
==========================================
- Coverage   44.08%   44.03%   -0.05%     
==========================================
  Files          69       69              
  Lines       19573    19630      +57     
  Branches     2428     2428              
==========================================
+ Hits         8628     8645      +17     
- Misses       9444     9484      +40     
  Partials     1501     1501              


@sbryngelson
Member

Is src/simulation/m_time_steppers.fpp [1122-1134] not a race condition on GPU?

@anandrdbz
Contributor Author

@sbryngelson I don't think so: each thread reads and writes only its own (j, k, l) location in each array, so the chained assignments never cross thread boundaries.
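
Concretely, each (i, j, k, l) iteration of the cycling loop executes only:

    ! All four reads and writes hit index (j, k, l) of each time-level array,
    ! owned by this iteration alone, so no other thread touches these locations:
    q_prim_ts(3)%vf(i)%sf(j, k, l) = q_prim_ts(2)%vf(i)%sf(j, k, l)
    q_prim_ts(2)%vf(i)%sf(j, k, l) = q_prim_ts(1)%vf(i)%sf(j, k, l)
    q_prim_ts(1)%vf(i)%sf(j, k, l) = q_prim_ts(0)%vf(i)%sf(j, k, l)
    q_prim_ts(0)%vf(i)%sf(j, k, l) = q_prim_vf(i)%sf(j, k, l)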

@sbryngelson merged commit 31b52be into MFlowCode:master on Jul 22, 2025
33 checks passed