Fix preempt on Phoenix, add Frontier walltime #954

Merged 1 commit on Jul 19, 2025

Conversation

sbryngelson
Member

@sbryngelson sbryngelson commented Jul 19, 2025

User description

  • Fails the Phoenix job if the job reaches preempt state
  • Adds some walltime to Frontier's CI jobs
  • Removes stupid emojis

PR Type

Bug fix, Enhancement


Description

  • Fix Phoenix job handling for PREEMPTED state

  • Increase Frontier CI job walltime from 2 to 3 hours

  • Replace emoji output with plain text messages

  • Improve SLURM job state monitoring reliability


Diagram Walkthrough

flowchart LR
  A["SLURM Job Submission"] --> B["Job State Monitoring"]
  B --> C["Terminal State Check"]
  C --> D["PREEMPTED State Added"]
  E["Frontier Jobs"] --> F["Walltime: 01:59:00"]
  F --> G["Walltime: 02:59:00"]
  H["Emoji Messages"] --> I["Plain Text Messages"]

File Walkthrough

Relevant files
Configuration changes
submit-bench.sh
Increase benchmark job walltime                                                   

.github/workflows/frontier/submit-bench.sh

  • Increased SLURM job walltime from 01:59:00 to 02:59:00
+1/-1     
submit.sh
Increase standard job walltime                                                     

.github/workflows/frontier/submit.sh

  • Increased SLURM job walltime from 01:59:00 to 02:59:00
+1/-1     
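
For orientation, the walltime change amounts to a one-line edit in each Frontier script's batch header. The fragment below is only a sketch: this page does not show the directive itself, so the `--time` spelling (rather than the short form `-t`) and its placement in the header are assumptions. Only the two walltime values come from the PR.

```shell
#!/bin/bash
# Sketch of an sbatch header (assumed layout, not copied from the repository).
# Walltime after this PR; the previous value was 01:59:00.
#SBATCH --time=02:59:00
```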
Bug fix
submit-bench.sh
Fix preemption handling and messaging                                       

.github/workflows/phoenix/submit-bench.sh

  • Added PREEMPTED to terminal SLURM job states
  • Replaced emoji messages with plain text equivalents
  • Improved job state monitoring output clarity
+6/-6     
submit.sh
Fix preemption handling and messaging                                       

.github/workflows/phoenix/submit.sh

  • Added PREEMPTED to terminal SLURM job states
  • Replaced emoji messages with plain text equivalents
  • Improved job state monitoring output clarity
+6/-6     
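
As a minimal sketch of what adding PREEMPTED to the terminal states means in practice, the check can be written as a small predicate. The function name `is_terminal_state` is hypothetical and does not appear in the scripts. The state list mirrors the one used in this PR and is not SLURM's complete set of job states.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): succeeds when the given state is one
# of the terminal states the Phoenix scripts check for after this change.
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED|TIMEOUT|PREEMPTED) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: classify a running state and the newly handled preempted state.
for state in RUNNING PREEMPTED; do
  if is_terminal_state "$state"; then
    echo "$state: terminal"
  else
    echo "$state: still waiting"
  fi
done
```

Before this change, a state of PREEMPTED would have fallen through to the waiting branch, so the monitoring loop would have kept polling instead of stopping.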

@Copilot Copilot AI review requested due to automatic review settings July 19, 2025 13:28
@sbryngelson sbryngelson added the bug Something isn't working or doesn't seem right label Jul 19, 2025
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR improves CI job handling on HPC systems by addressing job preemption scenarios and extending walltime allocations. The changes focus on making the CI more robust when dealing with SLURM scheduler behavior on Phoenix and Frontier systems.

  • Adds PREEMPTED state handling to Phoenix job submission scripts to properly fail jobs that get preempted
  • Increases walltime from ~2 hours to ~3 hours for Frontier CI jobs to reduce timeout issues
  • Removes emoji characters from log messages for cleaner, more professional output

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File                                        Description
.github/workflows/phoenix/submit.sh         Adds PREEMPTED state handling and removes emojis from status messages
.github/workflows/phoenix/submit-bench.sh   Adds PREEMPTED state handling and removes emojis from status messages
.github/workflows/frontier/submit.sh        Increases job walltime from 01:59:00 to 02:59:00
.github/workflows/frontier/submit-bench.sh  Increases job walltime from 01:59:00 to 02:59:00

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Duplication

The exact same SLURM job monitoring logic (state checking, terminal states, exit code handling) is duplicated between submit.sh and submit-bench.sh files. This creates maintenance overhead and potential for inconsistencies.

while :; do
  # Try sacct first
  STATE=$(sacct -j "$JOBID" --format=State --noheader --parsable2 | head -n1)

  # Fallback to squeue if sacct is empty
  if [[ -z "$STATE" ]]; then
    STATE=$(squeue -j "$JOBID" -h -o "%T" || echo "")
  fi

  # If it’s one of SLURM’s terminal states, break immediately
  case "$STATE" in
    COMPLETED|FAILED|CANCELLED|TIMEOUT|PREEMPTED)
      echo "Completed: SLURM job $JOBID reached terminal state: $STATE"
      break
      ;;
    "")
      echo "Completed: SLURM job $JOBID no longer in queue; assuming finished"
      break
      ;;
    *)
      echo "Waiting: SLURM job $JOBID state: $STATE"
      sleep 10
      ;;
  esac
done

# Now retrieve the exit code and exit with it
EXIT_CODE=$(sacct -j "$JOBID" --noheader --format=ExitCode | head -1 | cut -d: -f1)
echo "Completed: SLURM job $JOBID exit code: $EXIT_CODE"
exit "$EXIT_CODE"
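
One way to remove the duplication would be to move the polling loop into a helper that both scripts source. The sketch below assumes a hypothetical file name (`monitor-job.sh`), hypothetical function names (`get_job_state`, `wait_for_job`), and an illustrative `POLL_INTERVAL` variable; none of these exist in the PR. The `sacct` and `squeue` invocations are the ones quoted above.

```shell
#!/bin/sh
# monitor-job.sh: hypothetical shared helper, not part of this PR.

# Print the job state: sacct first, squeue as a fallback, in the same order
# as the existing scripts.
get_job_state() {
  state=$(sacct -j "$1" --format=State --noheader --parsable2 2>/dev/null | head -n1)
  if [ -z "$state" ]; then
    state=$(squeue -j "$1" -h -o "%T" 2>/dev/null || echo "")
  fi
  echo "$state"
}

# Poll until the job reaches a terminal state or leaves the queue.
# POLL_INTERVAL (seconds) is an illustrative knob; the scripts sleep for 10.
wait_for_job() {
  jobid=$1
  while :; do
    state=$(get_job_state "$jobid")
    case "$state" in
      COMPLETED|FAILED|CANCELLED|TIMEOUT|PREEMPTED)
        echo "Completed: SLURM job $jobid reached terminal state: $state"
        return 0
        ;;
      "")
        echo "Completed: SLURM job $jobid no longer in queue; assuming finished"
        return 0
        ;;
      *)
        echo "Waiting: SLURM job $jobid state: $state"
        sleep "${POLL_INTERVAL:-10}"
        ;;
    esac
  done
}
```

Under these assumptions, `submit.sh` and `submit-bench.sh` would each source the helper and call `wait_for_job "$JOBID"`, so a future change to the terminal-state list would be made in one place.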

Exit Code Bug

The exit code extraction with cut can yield an empty or non-numeric value if sacct returns an unexpected or empty ExitCode field, which would make the script exit with the wrong status. The value should be validated, or a default applied, before it is passed to exit.

EXIT_CODE=$(sacct -j "$JOBID" --noheader --format=ExitCode | head -1 | cut -d: -f1)
echo "Completed: SLURM job $JOBID exit code: $EXIT_CODE"
exit "$EXIT_CODE"
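
A possible form of the validation is sketched below. The helper name `parse_exit_code` and the choice of 1 as the fallback are assumptions made for illustration, not something the PR or the reviewer specifies. The helper takes the raw ExitCode field, which sacct prints as `<exit>:<signal>`, and returns a value that is safe to pass to `exit`.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): normalise sacct's ExitCode field.
# Empty or non-numeric input becomes 1, so the CI step fails explicitly
# rather than relying on how the shell treats a malformed exit argument.
parse_exit_code() {
  code=${1%%:*}                                     # part before the first colon
  code=$(printf '%s' "$code" | tr -d '[:space:]')   # drop column padding
  case "$code" in
    ''|*[!0-9]*) echo 1 ;;      # empty or not purely digits: treat as failure
    *) echo "$code" ;;
  esac
}

# Usage with sample field values:
parse_exit_code "0:0"
parse_exit_code "2:0"
parse_exit_code ""
```

In the monitoring script this would wrap the existing pipeline's result, for example `exit "$(parse_exit_code "$RAW")"` where `RAW` holds the first line of the sacct output.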

qodo-merge-pro bot commented Jul 19, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category    Suggestion    Impact
General
Fix misleading completion message

The message "Completed:" is misleading for failed, cancelled, timeout, or
preempted jobs. Use a more generic message that accurately represents all
terminal states.

.github/workflows/phoenix/submit-bench.sh [89-90]

 COMPLETED|FAILED|CANCELLED|TIMEOUT|PREEMPTED)
-  echo "Completed: SLURM job $JOBID reached terminal state: $STATE"
+  echo "Finished: SLURM job $JOBID reached terminal state: $STATE"
Suggestion importance[1-10]: 5

Why: The suggestion correctly identifies that the log message "Completed:" is inaccurate for non-successful terminal states like FAILED or CANCELLED, and the proposed change improves log clarity.

Low
General
Fix misleading completion message

The message "Completed:" is misleading when a job is no longer in queue, as this
could indicate various states including failures. Use a more neutral message
that doesn't imply successful completion.

.github/workflows/phoenix/submit-bench.sh [94]

-echo "Completed: SLURM job $JOBID no longer in queue; assuming finished"
+echo "Finished: SLURM job $JOBID no longer in queue; assuming finished"
Suggestion importance[1-10]: 4

Why: The suggestion correctly identifies that the log message prefix "Completed:" could be misleading, as a job no longer being in the queue doesn't guarantee successful completion, and offers a more neutral alternative.

Low

@sbryngelson sbryngelson merged commit 0ed69c5 into MFlowCode:master Jul 19, 2025
29 checks passed

codecov bot commented Jul 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 44.06%. Comparing base (565178a) to head (cb794ea).
Report is 4 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #954   +/-   ##
=======================================
  Coverage   44.06%   44.06%           
=======================================
  Files          68       68           
  Lines       18220    18220           
  Branches     2292     2292           
=======================================
  Hits         8029     8029           
  Misses       8821     8821           
  Partials     1370     1370           

@sbryngelson sbryngelson deleted the add-preempt branch July 19, 2025 18:12
Labels
bug Something isn't working or doesn't seem right Review effort 2/5
1 participant