@retain_dev sure , example terminates the instant it crashes. all the time.
artifacts: run with –s3 and the whole thing in outputs/ syncs for your S3 bucket earlier than the example is long gone. crash at epoch 40 — your closing checkpoint is already in S3.
one fair resolution: mid-run snapshotting is on you to cord for your coaching script for now. save checkpoints to outputs/ periodically. crunr handles the remaining.
that mentioned , computerized mid-job checkpointing is transport subsequent week. crash anyplace, resume from precisely the place you left off. no wiring wanted.
the idle example after a crash drawback : already solved. the resume drawback — one week away. 🙏



