Hosting a Hugging Face Model
Aquanode simplifies deploying and serving Hugging Face models with the vLLM inference engine, letting you host large language models (LLMs) and other text-generation models with efficient GPU utilization and low-latency API responses.
Navigate to Model Pipelines
- In the Aquanode Console, go to: Workloads → Model Pipelines
- Select the Serverless vLLM option.
Direct link: Model Pipelines Console
Configure Your Model
Fill in the required fields:
- Model Repository URL → The Hugging Face model repo ID (e.g., meta-llama/Llama-2-7b-chat-hf, tiiuae/falcon-7b)
- HF Token → Your Hugging Face access token (required for gated/private models)
- API Key → Custom API key that will be required to access your deployment
- Additional Settings → (Optional) batch size, max tokens, or other runtime configurations; see the example below this list
This ensures Aquanode can fetch your model and set up the inference server.
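The optional settings generally map onto standard vLLM engine arguments. Below is a minimal sketch of sensible starting values; the option names are real vLLM arguments, but exactly which fields the Aquanode console exposes (and how it names them) is an assumption here:

# Illustrative "Additional Settings", expressed as standard vLLM engine
# arguments; the exact fields the Aquanode console exposes may differ.
additional_settings = {
    "max_model_len": 4096,           # cap context length to bound KV-cache memory
    "gpu_memory_utilization": 0.90,  # fraction of GPU memory vLLM may reserve
    "max_num_seqs": 64,              # upper bound on concurrently batched requests
    "dtype": "bfloat16",             # weight/activation precision
}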
Select Resources
Choose hardware suitable for your model:
- GPU → Select from available options (H100, A100, B200, RTX series). Larger models (13B, 70B) require high-memory GPUs.
- Memory & Storage → Aquanode will recommend defaults, but you can adjust for your workload.
Tip: For 7B-class models such as Llama 2 7B or Falcon-7B, an A100 80GB is usually sufficient. For larger models (13B+), choose an H100 or B200 with more memory.
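As a rough sizing rule, 16-bit weights need about 2 bytes per parameter, plus headroom for the KV cache and activations. A back-of-the-envelope sketch in Python (the 1.3× overhead factor is an assumed rule of thumb, not a measured value):

# Rough GPU memory estimate for serving a model in fp16/bf16.
# The 1.3x overhead factor (KV cache, activations, CUDA context) is an
# assumption; real usage depends on context length and batch size.
def min_gpu_memory_gb(params_billions: float, overhead: float = 1.3) -> float:
    weights_gb = params_billions * 2  # ~2 bytes per parameter in 16-bit precision
    return weights_gb * overhead

print(min_gpu_memory_gb(7))   # ~18 GB  -> fits comfortably on an A100 80GB
print(min_gpu_memory_gb(70))  # ~182 GB -> needs H100/B200-class, often multi-GPU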
Deploy
Click Deploy.
Your deployment may take a few minutes while Aquanode:
- Pulls the Hugging Face model weights
- Sets up the vLLM runtime
- Allocates your selected GPU
Once active, your service will appear in the Active Deployments list.
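If you want to script the wait, here is a hypothetical polling sketch. The GET /v1/deployments/<deployment-id> path and the "status"/"active" values are illustrative assumptions, not confirmed Aquanode API details; check the API reference for the actual endpoint:

import time
import requests

# Hypothetical: the status URL and the "status"/"active" fields below are
# assumptions for illustration, not confirmed Aquanode API details.
STATUS_URL = "https://api.aquanode.io/v1/deployments/<deployment-id>"
API_KEY = "your-api-key"

while True:
    resp = requests.get(STATUS_URL, headers={"Authorization": f"Bearer {API_KEY}"}, timeout=30)
    resp.raise_for_status()
    if resp.json().get("status") == "active":
        print("Deployment is live.")
        break
    time.sleep(15)  # model weights can take several minutes to pull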
Use the API
After deployment, you will receive a unique endpoint URL.
You can make requests to it using your API key:
curl -X POST "https://api.aquanode.io/v1/deployments/<deployment-id>/predict" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Write me a haiku about Aquanode and GPUs"
  }'
The service will return a JSON response with your model’s output.
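The same request from Python, as a minimal sketch using the requests library; the endpoint, header, and payload mirror the curl example above, with your-api-key as a placeholder:

import requests

ENDPOINT = "https://api.aquanode.io/v1/deployments/<deployment-id>/predict"
API_KEY = "your-api-key"  # the custom key you set during configuration

resp = requests.post(
    ENDPOINT,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Write me a haiku about Aquanode and GPUs"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # JSON response containing your model's output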
Notes & Best Practices
- Ensure your Hugging Face token has read permissions for gated/private models.
- Use GPUs with sufficient memory to avoid out-of-memory errors.
- For production use, monitor GPU utilization and autoscale when needed.
- Secure your API key; treat it like a password (a sketch for keeping it out of source code follows this list).
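A minimal sketch of reading the key from an environment variable instead of hardcoding it (AQUANODE_API_KEY is an illustrative name, not an Aquanode convention):

import os

# Illustrative: the variable name AQUANODE_API_KEY is an assumption.
API_KEY = os.environ["AQUANODE_API_KEY"]  # raises KeyError if the key is unset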
Next Steps
- Use Snapshots to save your environment for reuse.
- Combine multiple Hugging Face deployments with other Aquanode services for end-to-end ML workflows.
🎉 You’re all set! Your Hugging Face model is now live on Aquanode, powered by vLLM and GPU acceleration.