This tutorial walks you through serving Llama 3 with vLLM on Komodo Cloud. Follow these steps to configure and deploy the model.

Step 1: Create the service config

Create a configuration file for the Llama 3 service. Below is a sample service-llama3.yaml file.

In the configuration file, replace <REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN> with your Hugging Face token so that the model weights can be downloaded. You'll need to request access to Llama 3 on Hugging Face if you haven't already.

envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN>

service:
  replica_policy:
    # by not specifying max_replicas and target_qps_per_second, the service will
    # always run on exactly 1 replica, with no autoscaling
    min_replicas: 1

  readiness_probe:
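    # give the replica up to 30 minutes (1800 s) to download the weights and start vLLM
    # before the health check must pass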
    initial_delay_seconds: 1800
    path: /health

resources:
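  # a single A100 has enough memory for the 8B model in half precision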
  gpus: A100:1
  ports: 8000

setup: |
  pip install vllm vllm-flash-attn

run: |
  vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 1 --port 8000

Step 2: Launch the Llama 3 service

With your configuration file ready, launch the Llama 3 service using the CLI:

komo service launch service-llama3.yaml --name llama3-8b

Step 3: Chat with Llama 3

Once the service status is RUNNING, you can chat with it right from the dashboard. The dashboard also shows the service's endpoint, which you can use to call the model programmatically.
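
For programmatic access, vllm serve exposes an OpenAI-compatible API, so the standard openai Python client works against the service endpoint. Below is a minimal sketch; it assumes the endpoint copied from the dashboard is stored in a hypothetical LLAMA3_ENDPOINT environment variable and that you have installed the openai package.

import os
from openai import OpenAI

# Endpoint copied from the Komodo dashboard; LLAMA3_ENDPOINT is a placeholder name.
endpoint = os.environ["LLAMA3_ENDPOINT"]

# vLLM's server does not check the API key by default, but the client needs a value.
client = OpenAI(base_url=f"{endpoint}/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # must match the model passed to vllm serve
    messages=[{"role": "user", "content": "Summarize what vLLM does in two sentences."}],
)
print(response.choices[0].message.content)

If you prefer raw HTTP, the same request can be sent with curl to the service endpoint's /v1/chat/completions path.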