Production Jobs

Anyscale supports production jobs, which are submitted as a standalone package and managed by the platform. They are best suited to production workflows where you want Anyscale to start the cluster and handle failures automatically.

After you submit a job definition, Anyscale automatically creates a cluster, runs the job on it, and monitors the job until it succeeds. If the job fails, Anyscale restarts it automatically, up to a configurable number of retries.
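
As a mental model only, the retry behavior can be sketched in a few lines of Python. This is not Anyscale's implementation. The names `JobOutcome` and `run_with_retries` and the status labels are hypothetical, and the sketch assumes the first attempt does not count as a retry, so a job with max_retries of 3 can run up to four times.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobOutcome:
    status: str             # "SUCCEEDED" or "FAILED" (hypothetical labels)
    num_retries: int        # retries used so far
    retries_remaining: int  # retries still available


def run_with_retries(attempt: Callable[[], bool], max_retries: int) -> JobOutcome:
    """Run `attempt` once, then retry on failure up to `max_retries` times."""
    num_retries = 0
    while True:
        if attempt():
            return JobOutcome("SUCCEEDED", num_retries, max_retries - num_retries)
        if num_retries == max_retries:
            # Every allowed retry has been used, so the job is marked failed.
            return JobOutcome("FAILED", num_retries, 0)
        num_retries += 1


# Example: a job that fails twice and succeeds on its third attempt.
results = iter([False, False, True])
outcome = run_with_retries(lambda: next(results), max_retries=3)
print(outcome)
```

A job that never succeeds ends in the failed state once all retries are used.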

Defining and submitting a job

When you submit a production job, you must provide the following:

  • A compute config and cluster environment for the cluster the job will run on.

  • An optional runtime environment containing your application code and dependencies.

    • Note: the working_dir option of the runtime environment must be a remote URI that points to a zip file, for example in an S3 or Google Cloud Storage bucket or at a GitHub URL. The cluster running the job must have permission to download from that URI.

  • The entrypoint command that will be run on the cluster to run the job.

  • Configuration options for the job, such as a name or the number of times it can be retried before being marked "failed."
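
Because working_dir has to point at a zip file, local code needs to be packaged before it is uploaded. The sketch below uses only the Python standard library. The helper name `package_working_dir` and the demo file layout are hypothetical. It keeps the project folder as the single top-level entry in the archive, which is what Ray's runtime environment documentation asks of remote zips as far as we know; check this against the version you use. Uploading the archive to your bucket is a separate step and is not shown.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def package_working_dir(src_dir: str, out_base: str) -> str:
    """Zip `src_dir` (kept as the single top-level folder) into `<out_base>.zip`."""
    src = Path(src_dir).resolve()
    return shutil.make_archive(
        out_base, "zip", root_dir=str(src.parent), base_dir=src.name
    )


# Demo in a temporary directory with a hypothetical one-file project.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my_job_files"
    project.mkdir()
    (project / "my_job_script.py").write_text("print('hello from the job')\n")

    # The archive is written next to the project folder, not inside it.
    archive = package_working_dir(str(project), str(Path(tmp) / "my_job_files"))
    with zipfile.ZipFile(archive) as zf:
        archive_names = zf.namelist()

print(archive_names)
```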

These options can be specified in a configuration file:

example_job.yaml
name: my_production_job
compute_config: my_cluster_compute_config
cluster_env: "cluster-env-name:1"
runtime_env:
  working_dir: "s3://my_bucket/my_job_files.zip"
entrypoint: "python my_job_script.py --option1=value1"
max_retries: 3

All of these options together define your job, which can then be submitted to Anyscale using the CLI, Python SDK, or HTTP API.
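
Before submitting, it can help to check a definition against the requirements listed earlier. The following is a minimal sketch and not part of the Anyscale SDK: `check_job_definition` is a hypothetical helper, the set of accepted URI schemes is an assumption based on the storage options named in the note on working_dir, and the definition is written as a Python dict that mirrors example_job.yaml.

```python
from urllib.parse import urlparse

# Required keys mirror the list of inputs above; runtime_env stays optional.
REQUIRED_KEYS = ("compute_config", "cluster_env", "entrypoint")
# Assumed remote schemes: S3, Google Cloud Storage, and HTTPS (such as GitHub).
REMOTE_SCHEMES = {"s3", "gs", "https"}


def check_job_definition(job: dict) -> list[str]:
    """Return a list of problems found in `job`; an empty list means none."""
    problems = [
        f"missing required field: {key}" for key in REQUIRED_KEYS if not job.get(key)
    ]

    working_dir = (job.get("runtime_env") or {}).get("working_dir")
    if working_dir is not None:
        parsed = urlparse(working_dir)
        if parsed.scheme not in REMOTE_SCHEMES or not parsed.path.endswith(".zip"):
            problems.append(f"working_dir is not a remote zip URI: {working_dir}")

    max_retries = job.get("max_retries", 0)
    if not isinstance(max_retries, int) or max_retries < 0:
        problems.append("max_retries must be a non-negative integer")
    return problems


# The same definition as example_job.yaml, expressed as a dict.
example_job = {
    "name": "my_production_job",
    "compute_config": "my_cluster_compute_config",
    "cluster_env": "cluster-env-name:1",
    "runtime_env": {"working_dir": "s3://my_bucket/my_job_files.zip"},
    "entrypoint": "python my_job_script.py --option1=value1",
    "max_retries": 3,
}
print(check_job_definition(example_job))
```

A local path such as ./my_job_files in working_dir would be reported as a problem, since the cluster could not download it.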

# Kick off the job defined above using the CLI.
anyscale job submit example_job.yaml
Job my_production_job created, id: job_2xR6uT6t7jJuu1aCwWMsle
Monitor your job in the Anyscale UI at:
https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/jobs/job_2xR6uT6t7jJuu1aCwWMsle

# Query the status of the job using the ID returned above.
anyscale job status job_2xR6uT6t7jJuu1aCwWMsle
status: RUNNING
num_retries: 0
retries_remaining: 3

# Stop the job.
anyscale job stop job_2xR6uT6t7jJuu1aCwWMsle

# Get the status again. Now it should be STOPPED.
anyscale job status job_2xR6uT6t7jJuu1aCwWMsle
status: STOPPED
num_retries: 0
retries_remaining: 0
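
To script against the status output, the key: value lines can be parsed and polled. This sketch assumes the CLI prints status exactly in the format shown in this section. `parse_job_status` and `wait_for_status` are hypothetical helpers, and the caller supplies the set of states that should end the wait, because only RUNNING and STOPPED appear in the output here. The demo feeds in canned output instead of calling the CLI.

```python
import time
from typing import Callable


def parse_job_status(output: str) -> dict[str, str]:
    """Parse `key: value` lines like those printed by `anyscale job status`."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def wait_for_status(
    fetch: Callable[[], str],
    terminal_states: set[str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> dict[str, str]:
    """Call `fetch` until the parsed status is in `terminal_states`."""
    for _ in range(max_polls):
        fields = parse_job_status(fetch())
        if fields.get("status") in terminal_states:
            return fields
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state in time")


# Demo with canned output copied from the session above.
outputs = iter([
    "status: RUNNING\nnum_retries: 0\nretries_remaining: 3",
    "status: STOPPED\nnum_retries: 0\nretries_remaining: 0",
])
final = wait_for_status(lambda: next(outputs), {"STOPPED"}, poll_seconds=0)
print(final["status"])
```

In real use, `fetch` could run the anyscale job status command through the subprocess module and return its standard output, provided the output matches the format assumed here.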

Monitoring a job

Besides querying the status with the CLI or SDK, you can view the status and logs of your job in the Anyscale UI.

TODO: Screenshots of the UI with explanations!!!
