Making GitHub Actions Suck a Little Less
A Simple Auto-Retry Workflow for Transient Failures
Do you find yourself babysitting your CI environment because of transient failures like `ETIMEDOUT`, `ECONNRESET`, `npm ERR! network`, or `502 Bad Gateway`? Yeah, us too.
Our problem has gotten even worse lately because of all the AI coding agents under our command. Great throughput requires great responsibility.
The Solution: A Simple Auto-Retry Workflow
The whole thing is one workflow file:
```yaml
name: Auto-Retry Failed Workflows

on:
  workflow_run:
    workflows: ["Deploy"]   # Your main workflow name
    types: [completed]
    branches: [main, dev]

permissions:
  actions: write            # Needed to re-run workflow jobs

jobs:
  check-and-retry:
    runs-on: ubuntu-latest
    if: github.event.workflow_run.conclusion == 'failure'
    steps:
      - name: Check retry count
        id: check
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # or your own PAT secret
        run: |
          # GitHub increments run_attempt on every re-run
          ATTEMPT=${{ github.event.workflow_run.run_attempt }}
          echo "attempt=$ATTEMPT" >> $GITHUB_OUTPUT

          # Max 3 total attempts (1 original + 2 retries)
          if [ "$ATTEMPT" -ge 3 ]; then
            echo "should_retry=false" >> $GITHUB_OUTPUT
          else
            echo "should_retry=true" >> $GITHUB_OUTPUT
          fi

      - name: Download and analyze logs
        if: steps.check.outputs.should_retry == 'true'
        id: analyze
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Grab only the logs of the jobs that failed
          gh run view ${{ github.event.workflow_run.id }} \
            --repo ${{ github.repository }} \
            --log-failed > failed_logs.txt 2>&1 || true

          # Transient error patterns
          TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"

          if grep -qiE "$TRANSIENT_PATTERNS" failed_logs.txt; then
            echo "is_transient=true" >> $GITHUB_OUTPUT
            MATCHED=$(grep -oiE "$TRANSIENT_PATTERNS" failed_logs.txt | head -1)
            echo "matched_pattern=$MATCHED" >> $GITHUB_OUTPUT
          else
            echo "is_transient=false" >> $GITHUB_OUTPUT
          fi

      - name: Re-run failed jobs
        if: steps.check.outputs.should_retry == 'true' && steps.analyze.outputs.is_transient == 'true'
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh run rerun ${{ github.event.workflow_run.id }} \
            --failed \
            --repo ${{ github.repository }}
```
That's it: drop this in `.github/workflows/auto-retry.yml` and you're done. I'm not seeing anything in here specific to our infra besides the name of our GitHub token in the secrets config, the main workflow we want watched, and a couple of branch names.
By the way, this entire workflow was vibecoded. I described the problem to Claude, it wrote the workflow, I reviewed and merged… case in point about the compounding nature of the problem.
How It Works
**Event-driven trigger.** The `workflow_run` event fires immediately when your main workflow completes. The connection is by name, not filename: in our case, `workflows: ["Deploy"]` matches the `name: Deploy` field in our deploy workflow. When any child job fails, the entire workflow is marked as failed, and the retry workflow fires.
```yaml
# deploy.yml
name: Deploy   # <-- This is what workflow_run matches on

on:
  push:
    branches: [main, dev]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps: [...]
  typecheck:
    runs-on: ubuntu-latest
    steps: [...]
  deploy:
    needs: [lint, typecheck]  # Runs after lint & typecheck pass
    runs-on: ubuntu-latest
    steps: [...]
```

All these jobs (`lint`, `typecheck`, `deploy`) are part of the single "Deploy" workflow. If `lint` fails due to `ETIMEDOUT`, the retry workflow sees that the whole "Deploy" workflow failed and can surgically re-run just the `lint` job.
**Smart retry logic.** GitHub tracks `run_attempt` automatically. We check if we're under 3 total attempts before retrying.
**Log analysis.** Downloads the failed job logs and greps them for known transient patterns. If it finds `ETIMEDOUT` or `502 Bad Gateway`, the failure is probably worth retrying. If it finds `error TS2345: Argument of type 'string'...`, it doesn't retry: a real type error will fail the same way on every attempt.
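If you want to sanity-check the pattern list, you can run the same grep locally against a saved log (the file name here is just an example):

```sh
# Same check the workflow runs, against a log saved locally as failed_logs.txt
TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"
if grep -qiE "$TRANSIENT_PATTERNS" failed_logs.txt; then
  echo "transient -> retry"
else
  echo "real failure -> investigate"
fi
```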
**Surgical retry.** `gh run rerun --failed` re-runs only the jobs that failed, not the entire workflow.
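The same commands work from your terminal, which is handy for testing; the run ID and repo below are placeholders:

```sh
# Pull only the failed-job logs for a run (placeholder run ID and repo)
gh run view 1234567890 --log-failed --repo your-org/your-repo

# Re-run just the failed jobs of that run
gh run rerun 1234567890 --failed --repo your-org/your-repo
```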
**Slack notifications.** Sends a message when retrying (so you know it's handling things) and when it gives up after max attempts (so you know to take a look). You stay informed without having to check. These steps aren't shown in the snippet above; a sketch follows.
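A minimal sketch of one such step, assuming an incoming-webhook URL stored as a `SLACK_WEBHOOK_URL` secret (both the step and the secret name are illustrative, not from our actual workflow):

```yaml
- name: Notify Slack of retry
  if: steps.check.outputs.should_retry == 'true' && steps.analyze.outputs.is_transient == 'true'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  run: |
    # Post a short message to a Slack incoming webhook
    curl -s -X POST "$SLACK_WEBHOOK_URL" \
      -H 'Content-Type: application/json' \
      -d "{\"text\": \"Auto-retrying '${{ github.event.workflow_run.name }}' (attempt ${{ steps.check.outputs.attempt }}, matched: ${{ steps.analyze.outputs.matched_pattern }})\"}"
```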
The Transient Pattern List
These are the patterns that trigger a retry:
| Pattern | What It Catches |
|---|---|
| `ETIMEDOUT` | Network timeout |
| `ECONNRESET` | Connection reset by peer |
| `ENOTFOUND` | DNS resolution failure |
| `npm ERR! network` | Any npm network error |
| `rate limit` | GitHub/npm rate limiting |
| `socket hang up` | Connection dropped |
| `fetch failed` | Generic fetch failure |
| `502 Bad Gateway` | Upstream server error |
| `503 Service` | Service temporarily unavailable |
| `504 Gateway` | Gateway timeout |
| `CERT_HAS_EXPIRED` | TLS certificate issues |
When you discover a new transient pattern in the wild, just add it to the regex.
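For example, if DNS lookups start timing out with `EAI_AGAIN` (a hypothetical addition, not in the list above), extend the alternation:

```sh
# Add EAI_AGAIN (DNS lookup timed out) to the transient patterns
TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|EAI_AGAIN|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"
```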
Why Doesn't GitHub Just Do This Internally?
We're genuinely not sure. GitHub already tracks `run_attempt` and exposes failed-job logs, so a built-in retry-on-transient-failure option seems well within reach. Until it exists, this workflow fills the gap.
Closing Thoughts
Honestly, this is a lot better. Transient failures now retry themselves, and we only get pinged when a run actually needs a human.