GPT OSS 20B on Jetson - 1

GPT OSS 20B on Jetson

One-click SSH deployment of GPT OSS 20B on NVIDIA Jetson with a prebuilt Docker image.

Beginner10minLLM
JetsonLLMdockeredge-ai

What This Solution Does

Deploy GPT OSS 20B to an NVIDIA Jetson device with one click. The container starts llama-server and exposes an OpenAI-compatible HTTP API on port 8080.

Core Value

BenefitDetails
Local inferenceRun a 20B LLM entirely on edge hardware, no cloud dependency
OpenAI-compatible APIUse existing SDKs and tools without modification
One-click deploySSH-based remote deployment, no manual Docker commands

Use Cases

ScenarioHow to Use
Chat bot backendConnect as the AI engine for local chat applications
Voice assistantPair with a speech recognition frontend for offline voice AI
Multi-platform gatewayUse with OpenClaw to serve WeChat, Telegram, and other platforms

Usage Notes

Hardware Requirements:

  • Jetson Orin NX 16GB or higher (20B model requires ~12-15GB VRAM)
  • reComputer J4012 is verified; other Jetson Orin models should confirm sufficient VRAM

API Endpoint:

  • URL: http://<jetson-ip>:8080/v1/chat/completions
  • OpenAI-compatible format — works with existing SDKs
  • Python example: import openai; openai.api_base = "http://<jetson-ip>:8080/v1"

First Request Latency:

  • Initial request may take 2-5 minutes (model warm-up)
  • Check readiness at http://<jetson-ip>:8080/v1/models
  • After warm-up, subsequent requests typically respond in 1-3 seconds

Token & Context:

  • Default context window ~2048 tokens; adjustable during deployment
  • Larger context (Llama Context parameter) uses more VRAM
  • Keep single requests under 1000 tokens to avoid VRAM overflow

Integration Interfaces

http

OpenAI-compatible chat completion API

/v1/chat/completions · Port: 8080 · Method: POST
{"choices":[{"message":{"content":"response text"}}]}

Deployment Options

edge_device
Contact Us
We Are Glad to Be Your Hardware Partner !
Next