Making GitHub Actions Suck a Little Less
A Simple Auto-Retry Workflow for Transient Failures
Do you find yourself babysitting your CI environment because of transient failures like `ETIMEDOUT`, `ECONNRESET`, `npm ERR! network`, or `502 Bad Gateway`? Yeah, us too.
Our problem has gotten even worse lately because of all the AI coding agents under our command. Great throughput requires great responsibility.
The Solution: A Simple Auto-Retry Workflow
The whole thing is one workflow file:
```yaml
name: Auto-Retry Failed Workflows

on:
  workflow_run:
    workflows: ["Deploy"]   # Your main workflow name
    types: [completed]
    branches: [main, dev]

permissions:
  actions: write            # Needed to re-run workflow jobs

jobs:
  check-and-retry:
    runs-on: ubuntu-latest
    if: github.event.workflow_run.conclusion == 'failure'
    steps:
      - name: Check retry count
        id: check
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # or your own PAT secret
        run: |
          # GitHub increments run_attempt on every re-run
          ATTEMPT=${{ github.event.workflow_run.run_attempt }}
          echo "attempt=$ATTEMPT" >> $GITHUB_OUTPUT

          # Max 3 total attempts (1 original + 2 retries)
          if [ "$ATTEMPT" -ge 3 ]; then
            echo "should_retry=false" >> $GITHUB_OUTPUT
          else
            echo "should_retry=true" >> $GITHUB_OUTPUT
          fi

      - name: Download and analyze logs
        if: steps.check.outputs.should_retry == 'true'
        id: analyze
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Grab only the logs of the jobs that failed
          gh run view ${{ github.event.workflow_run.id }} \
            --repo ${{ github.repository }} \
            --log-failed > failed_logs.txt 2>&1 || true

          # Transient error patterns
          TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"

          if grep -qiE "$TRANSIENT_PATTERNS" failed_logs.txt; then
            echo "is_transient=true" >> $GITHUB_OUTPUT
            MATCHED=$(grep -oiE "$TRANSIENT_PATTERNS" failed_logs.txt | head -1)
            echo "matched_pattern=$MATCHED" >> $GITHUB_OUTPUT
          else
            echo "is_transient=false" >> $GITHUB_OUTPUT
          fi

      - name: Re-run failed jobs
        if: steps.check.outputs.should_retry == 'true' && steps.analyze.outputs.is_transient == 'true'
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          gh run rerun ${{ github.event.workflow_run.id }} \
            --failed \
            --repo ${{ github.repository }}
```
That's it: drop this in `.github/workflows/auto-retry.yml` and you're done. I'm not seeing anything in here specific to our infra besides the name of our GitHub token in the secrets config, the main workflow we want watched, and a couple of branch names.
By the way, this entire workflow was vibecoded. I described the problem to Claude, it wrote the workflow, I reviewed and merged… case in point about the compounding nature of the problem.
How It Works
**Event-driven trigger.** The `workflow_run` event fires immediately when your main workflow completes. The connection is by name, not filename: in our case, `workflows: ["Deploy"]` matches the `name: Deploy` field in our deploy workflow. When any child job fails, the entire workflow is marked as failed, and the retry workflow fires.
```yaml
# deploy.yml
name: Deploy   # <-- This is what workflow_run matches on

on:
  push:
    branches: [main, dev]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps: [...]
  typecheck:
    runs-on: ubuntu-latest
    steps: [...]
  deploy:
    needs: [lint, typecheck]  # Runs after lint & typecheck pass
    runs-on: ubuntu-latest
    steps: [...]
```

All these jobs (`lint`, `typecheck`, `deploy`) are part of the single "Deploy" workflow. If `lint` fails due to `ETIMEDOUT`, the retry workflow sees that the whole "Deploy" workflow failed and can surgically re-run just the `lint` job.
**Smart retry logic.** GitHub tracks `run_attempt` automatically. We check if we're under 3 total attempts before retrying.
**Log analysis.** Downloads the failed job logs and greps them for known transient patterns. If it finds `ETIMEDOUT` or `502 Bad Gateway`, the failure is probably worth retrying. If it finds `error TS2345: Argument of type 'string'...`, it doesn't retry: a real type error will fail the same way on every attempt.
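If you want to sanity-check the pattern list, you can run the same grep locally against a saved log (the file name here is just an example):

```sh
# Same check the workflow runs, against a log saved locally as failed_logs.txt
TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"
if grep -qiE "$TRANSIENT_PATTERNS" failed_logs.txt; then
  echo "transient -> retry"
else
  echo "real failure -> investigate"
fi
```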
**Surgical retry.** `gh run rerun --failed` re-runs only the jobs that failed, not the entire workflow.
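The same commands work from your terminal, which is handy for testing; the run ID and repo below are placeholders:

```sh
# Pull only the failed-job logs for a run (placeholder run ID and repo)
gh run view 1234567890 --log-failed --repo your-org/your-repo

# Re-run just the failed jobs of that run
gh run rerun 1234567890 --failed --repo your-org/your-repo
```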
**Slack notifications.** Sends a message when retrying (so you know it's handling things) and when it gives up after max attempts (so you know to take a look). You stay informed without having to check. These steps aren't shown in the snippet above; a sketch follows.
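A minimal sketch of one such step, assuming an incoming-webhook URL stored as a `SLACK_WEBHOOK_URL` secret (both the step and the secret name are illustrative, not from our actual workflow):

```yaml
- name: Notify Slack of retry
  if: steps.check.outputs.should_retry == 'true' && steps.analyze.outputs.is_transient == 'true'
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
  run: |
    # Post a short message to a Slack incoming webhook
    curl -s -X POST "$SLACK_WEBHOOK_URL" \
      -H 'Content-Type: application/json' \
      -d "{\"text\": \"Auto-retrying '${{ github.event.workflow_run.name }}' (attempt ${{ steps.check.outputs.attempt }}, matched: ${{ steps.analyze.outputs.matched_pattern }})\"}"
```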
The Transient Pattern List
These are the patterns that trigger a retry:
| Pattern | What It Catches |
|---|---|
| `ETIMEDOUT` | Network timeout |
| `ECONNRESET` | Connection reset by peer |
| `ENOTFOUND` | DNS resolution failure |
| `npm ERR! network` | Any npm network error |
| `rate limit` | GitHub/npm rate limiting |
| `socket hang up` | Connection dropped |
| `fetch failed` | Generic fetch failure |
| `502 Bad Gateway` | Upstream server error |
| `503 Service` | Service temporarily unavailable |
| `504 Gateway` | Gateway timeout |
| `CERT_HAS_EXPIRED` | TLS certificate issues |
When you discover a new transient pattern in the wild, just add it to the regex.
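For example, if DNS lookups start timing out with `EAI_AGAIN` (a hypothetical addition, not in the list above), extend the alternation:

```sh
# Add EAI_AGAIN (DNS lookup timed out) to the transient patterns
TRANSIENT_PATTERNS="ETIMEDOUT|ECONNRESET|ENOTFOUND|EAI_AGAIN|rate limit|socket hang up|npm ERR! network|fetch failed|503 Service|502 Bad Gateway|504 Gateway|Connection reset|CERT_HAS_EXPIRED"
```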
Why Doesn't GitHub Just Do This Internally?
We're genuinely not sure. GitHub already tracks `run_attempt` and exposes failed-job logs, so a built-in retry-on-transient-failure option seems well within reach. Until it exists, this workflow fills the gap.
Closing Thoughts
Honestly, this is a lot better. Transient failures now retry themselves, and we only get pinged when a run actually needs a human.