Going beyond the Github Actions 6 Hour Job Limit

Github Actions is a handy CICD system conveniently built into Github and nicely integrated with the rest of the source code management side of the house. It's pretty flexible and can be trigger in numerous ways and workflows can be connected together to suit a variety of use cases.
Components of various complexity can be broke out and shared as well. It's really nice stuff and way better than the bad old days of managing a sketchy Jenkins node, but it does have a few limitations.

One such limit is a 6 hour maximum time a job is allowed to run. After that, the job will be unkindly killed and the workflow likely will be marked as failed. This applies not only if you are using the Github hosted runners, but also for self hosted runners as well.

For quick builds and tests this isn't an issue, but on occasion you may run into use cases where you really need something to be able to work beyond this limitation.

By leveraging some features of Github Actions and a few shell tricks, I'll demonstrate one such way to work around this.

Today we will cover:

  • Job dependencies
  • Job outputs
  • Controlling job time with timeout
  • Saving and restoring artifacts
  • Other considerations

Job Dependencies

A Github Actions workflow is made up of one or more job that contains one or more steps. The six hour limit for Github Actions is for the cumulative time of a single job. We can use this to our advantage and string together multiple jobs.

Github Actions handles job dependencies by using the needs syntax. For example I may have two jobs try1, and try2. I can make try2 depend on and wait for try1 by adding the following to the try2 job:

  try2:
    needs: try1

For complex workflows, it is possible to have a list of dependencies, or to make execution depend on an 'if' block. Let's leverage that ability in the next step.

Job Outputs

We would now like to be able to control if our next job executes based on a value from a step in our current job. We we can set environment variables and refer to these within a job like so:

    - name: Set env var
      run: echo "MYVAR='1234'" >> $GITHUB_ENV

This won't quite work for us though. Subsequent jobs will not have access to this environment, and there are various other restrictions on what context you can refer this this var. Instead, we need to look into setting outputs.

First we need to set the output from a step within a job:

    - name: Set step output
      id: mystep
      run: echo "done=true" >> $GITHUB_OUTPUT

This syntax mirrors the environment variable setting syntax, but here we indicate that this is a step output. We will want to refer to this particular step later, so we specify an id of mystep as well.

At this point we can refer to our output like so steps.mystep.outputs.done. We can use this in any way we would normally use a Github context such as in logic statements, or other references.

This still is not referrable outside of this job, however. We will need to explicitly indicate that our job has outputs as well. For example:

  try1:
    outputs:
      done: ${{ steps.mystep.outputs.done }}

At this point we can now refer to the outputs from the job try1 from our subsequent jobs that we setup with our dependencies above.

Let's make our second job execution conditional:

  try2:
    if: needs.try1.outputs.done == 'false'
    needs: try1

Now the job try2 will only execute if the output of job try1 is set to the string 'false'. Note that this output is available from the needs context. Without specifying the job in needs we will not have access to this job output.

Okay, we have our basic flow sorted, now how will we run or skip subsequent jobs based on a timeout?

Controlling job timeout

Github Actions does not have a direct way of exiting out of a job after a specific amount of time. Luckily the GNU coreutils timeout command can help us here. This command will let us set a max wait time for a command and will send a signal after the specified timeout. This is quite configurable and will allow us to determine if we want to preserve the original command status, send a different signal besides the default SIGTERM, or if a subsequent SIGKILL will be sent.

A simple example:

$ timeout 30s sleep 90

We will want to get a bit fancier and combine this with our output setting step above:

    - name: Set step output
      id: mystep
      run: timeout 300m ./longcommand.sh && echo "done=true" >> $GITHUB_OUTPUT || echo "done=false" >> $GITHUB_OUTPUT

In the above example, we set a max timeout of 300 minutes for the script longcommand.sh. If this times out done is set to 'false' if we don't time out done is set to 'true'. This is a simple example and you may wish to have more complex logic here.

Note that the timeout is set to 300 minutes, not 360 minutes. Recall that the max job timeout is for the entire job, including any job setup, cleanup, and storage. Here we have added an amble buffer.

Now our step output, and therefore our job output setting will be dynamically set based on the results of this command.

Saving and Restoring Artifacts

In some scenarios, we may wish to conditionally save our work in progress if our job was interrupted, but to do something different if our work is complete. This can be accomplished by leverage the workflow artifact storage feature, while referencing our outputs we setup above. For example:

      - name: Success!
        if: steps.mystep.outputs.done == 'true'
        run: |
          echo "Success we are done!"

      - name: Saving work.
        if: steps.mystep.outputs.done == 'false'
        uses: actions/upload-artifact@v3
        with:
          name: inprogress
          path: ./work
          retention-days: 2

Our downstream job can then restore this artifact before we restart our work:

      - name: Restore inprogress work
        uses: actions/download-artifact@v3
        with:
          name: inprogress

Other Considerations

A full example of the concepts above can be found here.

Depending on the logic of our jobs, and the number of dependent job runs for your use case, it may make sense to capture some of this in a reusable workflow or perhaps a composite action. That'll be a discussion for another day.