Hosting a Hugging Face Model
Aquanode simplifies deploying and serving Hugging Face models with the vLLM inference engine, letting you host large language models (LLMs) and other text-generation models with efficient GPU utilization and low-latency API responses.
Navigate to Model Pipelines
- In the Aquanode Console, go to: Workloads → Model Pipelines
- Select the Serverless vLLM option.
Direct link: Model Pipelines Console
Configure Your Model
Fill in the required fields:
- Model Repository URL → The Hugging Face model repo ID (e.g., meta-llama/Llama-2-7b-chat-hf, tiiuae/falcon-7b)
- HF Token → Your Hugging Face access token (required for gated/private models)
- API Key → Custom API key that will be required to access your deployment
- Additional Settings → (Optional) batch size, max tokens, or other runtime configurations; see the example below this list
This ensures Aquanode can fetch your model and set up the inference server.
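The optional settings generally map onto standard vLLM engine arguments. Below is a minimal sketch of sensible starting values; the option names are real vLLM arguments, but exactly which fields the Aquanode console exposes (and how it names them) is an assumption here:

# Illustrative "Additional Settings", expressed as standard vLLM engine
# arguments; the exact fields the Aquanode console exposes may differ.
additional_settings = {
    "max_model_len": 4096,           # cap context length to bound KV-cache memory
    "gpu_memory_utilization": 0.90,  # fraction of GPU memory vLLM may reserve
    "max_num_seqs": 64,              # upper bound on concurrently batched requests
    "dtype": "bfloat16",             # weight/activation precision
}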
Select Resources
Choose hardware suitable for your model:
- GPU → Select from available options (H100, A100, B200, RTX series). Larger models (13B, 70B) require high-memory GPUs.
- Memory & Storage → Aquanode will recommend defaults, but you can adjust for your workload.
Tip: For 7B-class models such as Llama 2 7B or Falcon-7B, an A100 80GB is usually sufficient. For larger models (13B+), choose an H100 or B200 with more memory.
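As a rough sizing rule, 16-bit weights need about 2 bytes per parameter, plus headroom for the KV cache and activations. A back-of-the-envelope sketch in Python (the 1.3× overhead factor is an assumed rule of thumb, not a measured value):

# Rough GPU memory estimate for serving a model in fp16/bf16.
# The 1.3x overhead factor (KV cache, activations, CUDA context) is an
# assumption; real usage depends on context length and batch size.
def min_gpu_memory_gb(params_billions: float, overhead: float = 1.3) -> float:
    weights_gb = params_billions * 2  # ~2 bytes per parameter in 16-bit precision
    return weights_gb * overhead

print(min_gpu_memory_gb(7))   # ~18 GB  -> fits comfortably on an A100 80GB
print(min_gpu_memory_gb(70))  # ~182 GB -> needs H100/B200-class, often multi-GPU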
Deploy
Click Deploy.
Your deployment may take a few minutes while Aquanode:
- Pulls the Hugging Face model weights
- Sets up the vLLM runtime
- Allocates your selected GPU
Once active, your service will appear in the Active Deployments list.
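If you want to script the wait, here is a hypothetical polling sketch. The GET /v1/deployments/<deployment-id> path and the "status"/"active" values are illustrative assumptions, not confirmed Aquanode API details; check the API reference for the actual endpoint:

import time
import requests

# Hypothetical: the status URL and the "status"/"active" fields below are
# assumptions for illustration, not confirmed Aquanode API details.
STATUS_URL = "https://api.aquanode.io/v1/deployments/<deployment-id>"
API_KEY = "your-api-key"

while True:
    resp = requests.get(STATUS_URL, headers={"Authorization": f"Bearer {API_KEY}"}, timeout=30)
    resp.raise_for_status()
    if resp.json().get("status") == "active":
        print("Deployment is live.")
        break
    time.sleep(15)  # model weights can take several minutes to pull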
Use the API
After deployment, you will receive a unique endpoint URL.
You can make requests to it using your API key:
curl -X POST "https://api.aquanode.io/v1/deployments/<deployment-id>/predict" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Write me a haiku about Aquanode and GPUs"
  }'
The service will return a JSON response with your model’s output.
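The same request from Python, as a minimal sketch using the requests library; the endpoint, header, and payload mirror the curl example above, with your-api-key as a placeholder:

import requests

ENDPOINT = "https://api.aquanode.io/v1/deployments/<deployment-id>/predict"
API_KEY = "your-api-key"  # the custom key you set during configuration

resp = requests.post(
    ENDPOINT,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Write me a haiku about Aquanode and GPUs"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # JSON response containing your model's output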
Notes & Best Practices
- Ensure your Hugging Face token has read permissions for gated/private models.
- Use GPUs with sufficient memory to avoid out-of-memory errors.
- For production use, monitor GPU utilization and autoscale when needed.
- Secure your API key; treat it like a password (a sketch for keeping it out of source code follows this list).
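A minimal sketch of reading the key from an environment variable instead of hardcoding it (AQUANODE_API_KEY is an illustrative name, not an Aquanode convention):

import os

# Illustrative: the variable name AQUANODE_API_KEY is an assumption.
API_KEY = os.environ["AQUANODE_API_KEY"]  # raises KeyError if the key is unset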
Next Steps
- Use Snapshots to save your environment for reuse.
- Combine multiple Hugging Face deployments with other Aquanode services for end-to-end ML workflows.
🎉 You’re all set! Your Hugging Face model is now live on Aquanode, powered by vLLM and GPU acceleration.