Serving Llama3 with vLLM

This tutorial walks you through serving Llama3 with vLLM on Komodo Cloud. Follow the steps below to configure and deploy the model.

Step 1: Create the service config

Create a configuration file for the Llama3 service. Below is a sample service-llama3.yaml file. In the configuration file, replace <REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN> with your HuggingFace token so that the model weights can be downloaded. You'll need to request access to Llama3 on HuggingFace if you haven't already.
envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
  HF_TOKEN: <REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN>

service:
  replica_policy:
    # by not specifying max_replicas and target_qps_per_second, the service will
    # always run on exactly 1 replica, with no autoscaling
    min_replicas: 1

  readiness_probe:
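    # allow up to 30 minutes (1800s) for the weights to download and the
    # server to start before health checks begin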
    initial_delay_seconds: 1800
    path: /health

resources:
  gpus: A100:1
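  # expose the port that vLLM listens on (matches --port in the run command)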
  ports: 8000

setup: |
  pip install vllm vllm-flash-attn

run: |
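  # --tensor-parallel-size 1 matches the single GPU requested above;
  # scale both together when serving larger models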
  vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 1 --port 8000
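
Optionally, you can smoke-test the same setup and run commands on any GPU machine before deploying. A minimal sketch, assuming you export your HuggingFace token in the shell; the curl call hits vLLM's /health endpoint, the same path the readiness probe uses:

pip install vllm vllm-flash-attn
export HF_TOKEN=<REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN>
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --tensor-parallel-size 1 --port 8000

# in another terminal, once the model has finished loading:
curl http://localhost:8000/health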

Step 2: Launch the Llama3 service

With your configuration file ready, launch the Llama3 service using the CLI:
komo service launch service-llama3.yaml --name llama3-8b

Step 3: Chat with Llama3

Once the service status is RUNNING, you can chat with it right from the dashboard, which also displays the service's endpoint.
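
Under the hood, vLLM serves an OpenAI-compatible API, so you can also query the service programmatically. A minimal sketch using curl, where <SERVICE_ENDPOINT> is a placeholder for the endpoint shown in the dashboard:

curl http://<SERVICE_ENDPOINT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello! Who are you?"}],
        "max_tokens": 64
      }'

Any OpenAI-compatible client library works the same way: point its base URL at http://<SERVICE_ENDPOINT>/v1 and use the model name above.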