Problem
When a workflow pod fails to reach Running for a reason that lives in K8s Events rather than on the pod object itself, the hook surfaces only:
##[error]Error: pod failed to come online with error: <generic timeout>
##[error]Executing the custom container implementation failed. Please contact your self hosted runner administrator.
The most common cases:
| Event | What the user actually needs to see |
| --- | --- |
| FailedScheduling | 0/14 nodes are available: 13 Too many pods, 1 node was unschedulable. |
| FailedScheduling | Insufficient cpu / Insufficient memory |
| FailedScheduling | node(s) had untolerated taint … |
| Failed (kubelet) | Error: ErrImagePullBackOff predecessors |
The ephemeral workflow pod is typically pruned before an operator can kubectl describe it, so the diagnostic is lost.
Scope vs other PRs
- #336 handles containerStatuses[].state.waiting.{reason,message} on the pod object, which covers ImagePullBackOff, missing tags, etc. Doesn't read Events.
- #341 (merged) fixes the {} empty-message problem in 4 throw sites. Doesn't read Events.
- #364 (open) fixes the circular-JSON crash. Doesn't read Events.
This issue is the missing third piece: read pod Events when the pod object alone doesn't explain the failure.
Proposed fix (~15 LOC)
In k8s/index.ts waitForPodPhases, in the catch / timeout path, before throwing, fetch the most recent Warning events for the pod and append them to the error message — best-effort, swallow any API failure:
let extra = ''
try {
  // Ask the API server for Warning events on this pod only.
  const events = await k8sApi.listNamespacedEvent({
    namespace: namespace(),
    fieldSelector: `involvedObject.name=${podName},type=Warning`
  })
  // Newest first; fall back to eventTime, then to epoch 0, so the
  // comparator never returns NaN when a timestamp is missing.
  const warnings = (events.items ?? [])
    .sort(
      (a, b) =>
        +new Date(b.lastTimestamp ?? b.eventTime ?? 0) -
        +new Date(a.lastTimestamp ?? a.eventTime ?? 0)
    )
    .slice(0, 3)
    .map(e => `[${e.reason}] ${e.message}`)
  if (warnings.length) extra = `; events: ${warnings.join('; ')}`
} catch {
  /* diagnostic is best-effort: never mask the original error */
}
throw new Error(
  `Pod ${podName} is unhealthy with phase status ${phase}: ${formatError(error)}${extra}`
)
Unit test: mock listNamespacedEvent to return a FailedScheduling: 0/3 nodes are available: 3 Too many pods. warning; assert that substring shows up in the thrown error.
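To make that test easy to write, the event lookup could be pulled into a small helper that takes the list call as a parameter. The sketch below is only an illustration under assumptions: `PodEvent`, `ListWarnings`, `warningSuffix`, and the stub are hypothetical names that do not exist in the hook, and `PodEvent` models only the fields the proposal reads (with string timestamps for simplicity), not the real client type.

```typescript
// Hypothetical minimal event shape: only the fields the proposal reads.
interface PodEvent {
  reason?: string
  message?: string
  lastTimestamp?: string
  eventTime?: string
}

// Hypothetical injectable lister, so a test can run without a cluster.
type ListWarnings = (podName: string) => Promise<{ items?: PodEvent[] }>

// Hypothetical helper: returns '; events: ...' or '', and never throws.
async function warningSuffix(
  podName: string,
  list: ListWarnings,
  limit = 3
): Promise<string> {
  try {
    const { items = [] } = await list(podName)
    // Missing or unparsable timestamps sort as oldest (epoch 0).
    const ts = (e: PodEvent): number => {
      const t = new Date(e.lastTimestamp ?? e.eventTime ?? 0).getTime()
      return isNaN(t) ? 0 : t
    }
    const warnings = [...items]
      .sort((a, b) => ts(b) - ts(a)) // newest first
      .slice(0, limit)
      .map(e => `[${e.reason}] ${e.message}`)
    return warnings.length ? `; events: ${warnings.join('; ')}` : ''
  } catch {
    return '' // best-effort: an Events API failure adds nothing
  }
}

// Stub standing in for the Events API call in a unit test.
const stubList: ListWarnings = async () => ({
  items: [
    {
      reason: 'FailedScheduling',
      message: '0/3 nodes are available: 3 Too many pods.',
      lastTimestamp: '2024-01-01T00:00:00Z'
    }
  ]
})

// Stub that simulates the Events API being unavailable.
const failingList: ListWarnings = async () => {
  throw new Error('events API unavailable')
}
```

In the real hook the lister would wrap `k8sApi.listNamespacedEvent` with the field selector from the proposal; the unit test then only needs to swap in a stub and assert on the suffix or on the thrown error message.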
Happy to open the PR
If the proposal looks right, I can open a PR mirroring the structure of #336 + #364 — small, tightly scoped, with a unit test.