Commit 4bf2c8c · vllm-project/vllm-spyre
📝 docs for local development on CPU (#161)

Add documentation on how to run the `eager` (CPU) examples + tests on `arm64` and `x86_64`.

See it live: https://vllm-spyre--161.org.readthedocs.build/en/161/getting_started/local_development.html

~Will wait for #159 to merge first~

Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>

1 parent 7b260d0 · commit 4bf2c8c

File tree: 6 files changed, +200 −9 lines

_local_envs_for_test.sh

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+# Env vars to set for CPU-only testing
+
+# Need to be set for tests to run
+export MASTER_ADDR=localhost
+export MASTER_PORT=29500
+
+# Run on CPU
+export VLLM_SPYRE_DYNAMO_BACKEND=eager
+
+# TODO: Tests don't work on CPU with MP enabled?
+export VLLM_ENABLE_V1_MULTIPROCESSING=0
+
+# Test related
+export VLLM_SPYRE_TEST_BACKEND_LIST=eager
+# Note: Make sure model name aligns with the model that you downloaded
+export VLLM_SPYRE_TEST_MODEL_LIST="JackFram/llama-160m"
+export VLLM_SPYRE_TEST_MODEL_DIR=""
+# We have to use `HF_HUB_OFFLINE=1` otherwise vllm tries to download a
+# different version of the model using HF API which does not work locally
+export HF_HUB_OFFLINE=1
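A quick way to confirm the script took effect after sourcing it (a minimal sketch; the variable names come straight from the script above):

```sh
# Load the CPU-only test environment, then spot-check a couple of the variables
source _local_envs_for_test.sh
echo "backend=${VLLM_SPYRE_DYNAMO_BACKEND} hf_offline=${HF_HUB_OFFLINE}"
```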

docs/source/contributing/overview.md

Lines changed: 105 additions & 0 deletions
@@ -43,3 +43,108 @@ Commits must include a `Signed-off-by:` header which certifies agreement with
 the terms of the DCO.
 
 Using `-s` with `git commit` will automatically add this header.
+
+## Testing
+
+### Running tests locally on CPU (No Spyre card)
+
+1. (arm64 only) Install xgrammar
+
+    :::{tip}
+    It's installed automatically for x86_64.
+    :::
+
+    ```sh
+    uv pip install xgrammar==0.1.19
+    ```
+
+1. (optional) Download the `JackFram/llama-160m` model for tests
+
+    ```sh
+    python -c "from transformers import pipeline; pipeline('text-generation', model='JackFram/llama-160m')"
+    ```
+
+    :::{caution}
+    Downloading the same model using the HF API does not work locally on `arm64`.
+    :::
+
+    :::{tip}
+    :class: dropdown
+    We assume the model lands here:
+
+    ```sh
+    .cache/huggingface/hub/models--JackFram--llama-160m
+    ```
+    :::
+
+1. Source the env variables needed for tests
+
+    ```sh
+    source _local_envs_for_test.sh
+    ```
+
+1. (optional) Install dev dependencies (if vllm-spyre was installed without them)
+
+    ```sh
+    uv pip install --group dev
+    ```
+
+1. Run the tests:
+
+    ```sh
+    python -m pytest -v -x tests -m "v1 and cpu and e2e"
+    ```
+
+### Continuous Batching Tests (CB)
+
+:::{attention}
+Temporary section until the FMS custom branch is merged to main.
+:::
+
+Continuous batching currently requires a custom installation, until the FMS custom branch is merged to main.
+
+To try it out, after following all of the testing setup steps above:
+
+1. Install the custom FMS branch for CB:
+
+    ```sh
+    uv pip install git+https://github.com/foundation-model-stack/foundation-model-stack.git@paged_attn_mock --force-reinstall
+    ```
+
+#### Run only CB tests
+
+```sh
+python -m pytest -v -x tests/e2e -m cb
+```
+
+## Debugging
+
+We can debug using `debugpy` in VS Code.
+
+This is the content of the `launch.json` file needed for debugging in VS Code:
+
+```json
+{
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "name": "Python Debugger: local",
+            "type": "debugpy",
+            "request": "attach",
+            "connect": {
+                "host": "localhost",
+                "port": 5678
+            },
+            "justMyCode": false
+        }
+    ]
+}
+```
+
+Run using
+
+```sh
+python -m debugpy --listen 5678 -m ...
+```
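As a concrete illustration of the `-m ...` placeholder, attaching the debugger to the test suite might look like this (a sketch; `--wait-for-client` is a standard debugpy flag that pauses the process until the debugger attaches, and the pytest arguments are the ones from the testing section above):

```sh
# Start pytest under debugpy; execution pauses until the VS Code
# "attach" configuration above connects on port 5678
python -m debugpy --listen 5678 --wait-for-client -m pytest -v -x tests -m "v1 and cpu and e2e"
```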

docs/source/getting_started/installation.md

Lines changed: 49 additions & 9 deletions
@@ -5,12 +5,52 @@ installation of the plugin and its dependencies. `uv` provides advanced
 dependency resolution which is required to properly install dependencies like
 `vllm` without overwriting critical dependencies like `torch`.
 
-```bash
-# Install uv
-pip install uv
-
-# Install vllm-spyre
-git clone https://github.com/vllm-project/vllm-spyre.git
-cd vllm-spyre
-VLLM_TARGET_DEVICE=empty uv pip install -e .
-```
+1. Clone vllm-spyre
+
+    ```sh
+    git clone https://github.com/vllm-project/vllm-spyre.git
+    cd vllm-spyre
+    ```
+
+1. Install uv
+
+    ```sh
+    pip install uv
+    ```
+
+1. Create a new env
+
+    ```sh
+    uv venv --python 3.12 --seed .venv
+    ```
+
+1. Activate it
+
+    ```sh
+    source .venv/bin/activate
+    ```
+
+1. Install `vllm-spyre` locally with dev (and optionally lint) dependencies
+
+    ```sh
+    uv sync --frozen --active --inexact
+    ```
+
+    or also with lint:
+
+    ```sh
+    uv sync --frozen --active --inexact --group lint
+    ```
+
+    :::{tip}
+    `--group dev` is enabled by default.
+    :::
+
+1. (optional) Install torch through pip, if you don't have it installed already; it is needed for running examples or tests.
+
+    ```sh
+    pip install torch==2.7.0
+    ```
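After these steps, a quick sanity check that the key packages resolve (a minimal sketch, assuming the venv is active; `torch.__version__` and `vllm.__version__` are the usual version attributes):

```sh
# Confirm the core dependencies import cleanly inside the new venv
python -c "import torch; print(torch.__version__)"
python -c "import vllm; print(vllm.__version__)"
```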

examples/offline_inference/offline_inference_multi_spyre.py

Lines changed: 8 additions & 0 deletions
@@ -1,11 +1,19 @@
 import gc
 import os
+import platform
 import time
 
 from vllm import LLM, SamplingParams
 
 max_tokens = 3
 
+if platform.machine() == "arm64":
+    print("Detected arm64 running environment. "
+          "Setting HF_HUB_OFFLINE=1 otherwise vllm tries to download a "
+          "different version of the model using HF API which might not work "
+          "locally on arm64.")
+    os.environ["HF_HUB_OFFLINE"] = "1"
+
 os.environ["VLLM_SPYRE_WARMUP_PROMPT_LENS"] = '64'
 os.environ["VLLM_SPYRE_WARMUP_NEW_TOKENS"] = str(max_tokens)
 os.environ['VLLM_SPYRE_WARMUP_BATCH_SIZES'] = '1'
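To try this example locally on CPU, the environment from `_local_envs_for_test.sh` is assumed to already be in place (a sketch; the script lives at the repo root, as added in this commit):

```sh
# CPU-only run of the multi-Spyre example, reusing the test env vars
source _local_envs_for_test.sh
python examples/offline_inference/offline_inference_multi_spyre.py
```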

examples/offline_inference/offline_inference_spyre.py

Lines changed: 8 additions & 0 deletions
@@ -1,10 +1,18 @@
 import os
+import platform
 import time
 
 from vllm import LLM, SamplingParams
 
 max_tokens = 3
 
+if platform.machine() == "arm64":
+    print("Detected arm64 running environment. "
+          "Setting HF_HUB_OFFLINE=1 otherwise vllm tries to download a "
+          "different version of the model using HF API which might not work "
+          "locally on arm64.")
+    os.environ["HF_HUB_OFFLINE"] = "1"
+
 os.environ["VLLM_SPYRE_WARMUP_PROMPT_LENS"] = '64'
 os.environ["VLLM_SPYRE_WARMUP_NEW_TOKENS"] = str(max_tokens)
 os.environ['VLLM_SPYRE_WARMUP_BATCH_SIZES'] = '1'

examples/offline_inference/offline_inference_spyre_cb_test.py

Lines changed: 8 additions & 0 deletions
@@ -1,4 +1,5 @@
 import os
+import platform
 import time
 
 from vllm import LLM, SamplingParams
@@ -11,6 +12,13 @@
 max_tokens3 = 7
 max_num_seqs = 2 # defines max batch size
 
+if platform.machine() == "arm64":
+    print("Detected arm64 running environment. "
+          "Setting HF_HUB_OFFLINE=1 otherwise vllm tries to download a "
+          "different version of the model using HF API which might not work "
+          "locally on arm64.")
+    os.environ["HF_HUB_OFFLINE"] = "1"
+
 # defining here to be able to run/debug directly from VSC (not via terminal)
 os.environ['VLLM_SPYRE_DYNAMO_BACKEND'] = 'eager'
 os.environ['VLLM_SPYRE_USE_CB'] = '1'
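Running the CB example locally assumes the custom FMS branch from the contributing docs is installed; the script selects the eager backend and enables CB itself (a sketch):

```sh
# Requires the paged_attn_mock FMS branch (see the contributing docs above)
python examples/offline_inference/offline_inference_spyre_cb_test.py
```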
