envs:
  MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
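  # Meta-Llama-3 is a gated repo on Hugging Face; the token must belong to
  # an account that has been granted access to the model.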
  HF_TOKEN: <REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN>

service:
  replica_policy:
    # By not specifying max_replicas and target_qps_per_replica, the service
    # will always run on exactly one replica, with no autoscaling.
    min_replicas: 1
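    # Optional autoscaling sketch (example values, not part of the original
    # config): uncomment to let SkyServe scale between min_replicas and
    # max_replicas based on per-replica load.
    # max_replicas: 3
    # target_qps_per_replica: 2.5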
  readiness_probe:
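    # vLLM's OpenAI-compatible server exposes a /health endpoint once the
    # model is loaded; the long initial delay leaves time to download and
    # load the 8B weights on first start.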
    initial_delay_seconds: 1800
    path: /health

resources:
  accelerators: A100:1
  ports: 8000

setup: |
  pip install vllm vllm-flash-attn

run: |
  vllm serve $MODEL_NAME --tensor-parallel-size 1 --port 8000
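
# To deploy (assuming this file is saved as service.yaml; any filename works):
#   sky serve up service.yaml
# Then check replica status and the service endpoint with:
#   sky serve status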