Surface pod Events (FailedScheduling, etc.) on waitForPodPhases timeout #366

@remidebette

Description

Problem

When a workflow pod fails to reach Running for a reason that lives in K8s Events rather than on the pod object itself, the hook surfaces only:

##[error]Error: pod failed to come online with error: <generic timeout>
##[error]Executing the custom container implementation failed. Please contact your self hosted runner administrator.

The most common cases:

Event            | What the user actually needs to see
FailedScheduling | 0/14 nodes are available: 13 Too many pods, 1 node was unschedulable.
FailedScheduling | Insufficient cpu / Insufficient memory
FailedScheduling | node(s) had untolerated taint …
Failed (kubelet) | Error: ErrImagePullBackOff predecessors

The ephemeral workflow pod is typically pruned before an operator can kubectl describe it, so the diagnostic is lost.

Scope vs other PRs

  • #336 handles containerStatuses[].state.waiting.{reason,message} on the pod object — covers ImagePullBackOff, missing tags, etc. Doesn't read Events.
  • #341 (merged) fixes the {} empty-message problem in 4 throw sites. Doesn't read Events.
  • #364 (open) fixes the circular-JSON crash. Doesn't read Events.

This issue covers the remaining gap: reading pod Events when the pod object alone doesn't explain the failure.

Proposed fix (~15 LOC)

In k8s/index.ts waitForPodPhases, in the catch / timeout path, before throwing, fetch the most recent Warning events for the pod and append them to the error message — best-effort, swallow any API failure:

let extra = ''
try {
  // Only Warning events that reference this pod.
  const events = await k8sApi.listNamespacedEvent({
    namespace: namespace(),
    fieldSelector: `involvedObject.name=${podName},type=Warning`
  })
  // Newest first; fall back to the epoch when an event carries no timestamp,
  // so the comparator never returns NaN.
  const warnings = (events.items ?? [])
    .sort((a, b) => +new Date(b.lastTimestamp ?? b.eventTime ?? 0) -
                    +new Date(a.lastTimestamp ?? a.eventTime ?? 0))
    .slice(0, 3)
    .map(e => `[${e.reason}] ${e.message}`)
  if (warnings.length) extra = `; events: ${warnings.join('; ')}`
} catch { /* diagnostic best-effort */ }
throw new Error(
  `Pod ${podName} is unhealthy with phase status ${phase}: ${formatError(error)}${extra}`
)

Unit test: mock listNamespacedEvent to return a single Warning event with reason FailedScheduling and message "0/3 nodes are available: 3 Too many pods."; assert that the message appears in the thrown error.
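To make the test easier to write, the formatting step could be pulled out of the try block into a pure function. The sketch below is only an illustration of that idea: `PodEvent`, `formatWarnings`, and the sample events are hypothetical names and data, not part of the hook's code, and the real event type would come from @kubernetes/client-node.

```typescript
// Hypothetical minimal event shape; only the fields the proposal reads.
interface PodEvent {
  reason?: string
  message?: string
  lastTimestamp?: Date | string
  eventTime?: Date | string
}

// Turn Warning events into the "; events: ..." suffix from the proposal.
// Newest events come first; events without a timestamp sort last.
function formatWarnings(items: PodEvent[], limit = 3): string {
  const time = (e: PodEvent): number =>
    new Date(e.lastTimestamp ?? e.eventTime ?? 0).getTime()
  const warnings = [...items] // copy so the caller's array is not reordered
    .sort((a, b) => time(b) - time(a))
    .slice(0, limit)
    .map(e => `[${e.reason}] ${e.message}`)
  return warnings.length ? `; events: ${warnings.join('; ')}` : ''
}

// Sample data standing in for what a mocked listNamespacedEvent might return.
const sampleEvents: PodEvent[] = [
  {
    reason: 'Failed',
    message: 'older warning',
    lastTimestamp: '2024-01-01T00:00:00Z'
  },
  {
    reason: 'FailedScheduling',
    message: '0/3 nodes are available: 3 Too many pods.',
    lastTimestamp: '2024-01-02T00:00:00Z'
  }
]

const suffix = formatWarnings(sampleEvents)
console.log(suffix)
```

With this split, the unit test for waitForPodPhases only needs the mock to return one event, while ordering and truncation can be checked against `formatWarnings` directly without touching the Kubernetes client.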

Happy to open the PR

If the proposal looks right, I can open a PR mirroring the structure of #336 + #364 — small, tightly scoped, with a unit test.
