Production Jobs

Anyscale supports production jobs: workloads that are submitted as a standalone package and managed by the platform. Production jobs are best suited for workflows where you want Anyscale to start the cluster and handle failures automatically.

After you submit the job definition, Anyscale automatically creates a cluster, runs the job on it, and monitors the job until it succeeds. If the job fails, it is automatically restarted, up to a configurable number of retries.

Defining and submitting a job

When you submit a production job, you must provide the following:

  • A compute config and cluster environment for the cluster the job will run on.

  • An optional runtime environment containing your application code and dependencies.

    • Note: the working_dir option of the runtime environment must be a remote URI to a zip file, such as an S3 bucket, Google Cloud Storage bucket, or GitHub URL. The cluster running the job must have permission to download from that URI.

  • The entrypoint command that will be run on the cluster to execute the job.

  • Configuration options for the job, such as a name or the number of times it can be retried before being marked "failed."
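To make the shape of a job definition concrete, here is a small illustrative sketch (plain Python, not the Anyscale SDK) that represents the options above as a dict and checks them before submission. The field names mirror the YAML options in this document; the validation rules themselves are assumptions for illustration.

```python
# Illustrative sketch: represent a production job definition as a dict
# and sanity-check it before submitting. This is NOT the Anyscale SDK;
# the validation rules here are assumptions for illustration only.

REMOTE_SCHEMES = ("s3://", "gs://", "https://")

def validate_job(job: dict) -> list:
    """Return a list of problems found in a job definition (empty if OK)."""
    problems = []
    for required in ("name", "entrypoint"):
        if not job.get(required):
            problems.append("missing required field: %s" % required)
    runtime_env = job.get("runtime_env") or {}
    working_dir = runtime_env.get("working_dir")
    if working_dir and not working_dir.startswith(REMOTE_SCHEMES):
        # Per the note above: working_dir must be a remote URI to a zip file.
        problems.append("runtime_env.working_dir must be a remote URI "
                        "(e.g. s3://, gs://, or an HTTPS zip URL)")
    if job.get("max_retries", 0) < 0:
        problems.append("max_retries must be non-negative")
    return problems

# The same job definition shown in example_job.yaml below.
job = {
    "name": "my_production_job",
    "compute_config": "my_cluster_compute_config",
    "cluster_env": "cluster-env-name:1",
    "runtime_env": {"working_dir": "s3://my_bucket/my_job_files.zip"},
    "entrypoint": "python my_job_script.py --option1=value1",
    "max_retries": 3,
}
print(validate_job(job))  # an empty list means the definition looks valid
```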

These options can be specified in a configuration file:

example_job.yaml
name: my_production_job
compute_config: my_cluster_compute_config
cluster_env: "cluster-env-name:1"
runtime_env:
  working_dir: "s3://my_bucket/my_job_files.zip"
entrypoint: "python my_job_script.py --option1=value1"
max_retries: 3

All of these options together define your job, which can then be submitted to Anyscale using the CLI, Python SDK, or HTTP API.
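For example, with the CLI the configuration file can be submitted in one command (this sketch assumes the Anyscale CLI is installed and authenticated; the exact flags may vary by CLI version):

```shell
# Submit the job defined in the configuration file above.
anyscale job submit example_job.yaml
```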

Monitoring a job

In addition to querying status with the CLI or SDK, you can view your job's status and logs in the Anyscale UI.
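As a sketch of the CLI path (command names assumed from the Anyscale CLI and may differ by version):

```shell
# List recent jobs and their states, then fetch logs for a specific job.
anyscale job list
anyscale job logs --name my_production_job
```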

