Tencent Cloud TI Platform


Built-In Large Model Inference Image Usage Instructions

Last updated: 2026-01-16 15:25:41

Introduction to Inference Image

The TI-ONE training platform provides built-in general-purpose inference images for large models. They can run and serve LLMs in Hugging Face format, expose an HTTP service compatible with the OpenAI interface format for text chat, and support Tencent's self-developed acceleration capabilities for large model inference.

Currently, two large model inference images are preset:
[Recommended] angel-vllm: an optimized version of the open-source vLLM maintained by the TI-ONE team. It supports online int8, nf4, and fp8 weight-only quantization, as well as quantization acceleration methods such as LayerwiseSearchSMQ. Recommended for production environments with high concurrency requirements.
angel-deepspeed: an optimized version of the open-source DeepSpeed maintained by the TI-ONE team. It supports optimizations in operators, communication, quantization, and more.

This document describes how to use TI-ONE's built-in large model inference images to start an inference service that exposes a large model chat API.


Model Directory Preparation

First, place the model to be served in your own CFS file system. The directory structure is the same as that of a Text Generation model on Hugging Face. Both .bin and .safetensors weight formats are supported.
Note:
The angel-vllm image also supports serving models quantized with LayerwiseSearchSMQ via the tilearn-llm library. Make sure the model directory contains the smoothq_model-8bit-auto.safetensors file; it is recommended to delete the original non-quantized weight files.
If you wish to use the LayerwiseSearchSMQ quantization acceleration technology provided by TI-ONE, please submit a ticket for feedback.
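As a quick local sanity check before starting the service, a short helper like the following (an illustrative sketch, not part of the platform; the expected file names come from the notes above) can confirm that a model directory contains weight files and, when LayerwiseSearchSMQ is intended, the quantized checkpoint:

```python
from pathlib import Path


def check_model_dir(model_dir, expect_smoothquant=False):
    """Return a list of problems found in a Hugging Face-style model directory."""
    problems = []
    root = Path(model_dir)
    # The image accepts either .safetensors or .bin weight files.
    weights = list(root.glob("*.safetensors")) + list(root.glob("*.bin"))
    if not weights:
        problems.append("no .safetensors or .bin weight files found")
    # LayerwiseSearchSMQ serving requires the pre-quantized checkpoint.
    if expect_smoothquant and not (root / "smoothq_model-8bit-auto.safetensors").exists():
        problems.append("smoothq_model-8bit-auto.safetensors missing (required for LayerwiseSearchSMQ)")
    return problems
```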


Model Configuration File

To help the inference image detect which model you are serving, it is recommended to save an additional ti_model_config.json configuration file in the model directory, in the following format:
{"model_id": "Baichuan2-13B-Chat", "conv_template_name": "baichuan2-chat"}
Field meanings:

model_id
Meaning: model name.
Recommended value: if you are running an open-source model from Hugging Face, it is recommended to specify the model's Hugging Face ID directly, such as "Baichuan2-13B-Chat" or "Qwen1.5-14B-Chat". Whether the name contains a slash "/" and its letter case do not affect actual usage.
Other ways to specify: if you do not want to add a ti_model_config.json file to the model directory, you can also set the MODEL_ID environment variable when starting the service.

conv_template_name
Meaning: dialogue template name.
Recommended value: if your model_id matches an open-source model and you did not modify the model's dialogue template during the SFT stage, you do not need to set conv_template_name; the inference image will try to match a template automatically. If you know exactly which dialogue template the model uses, it is recommended to specify it manually.
Other ways to specify: if you do not want to add a ti_model_config.json file to the model directory, you can also set the CONV_TEMPLATE environment variable when starting the service.

The commonly used dialogue template names are listed below. For the full list of dialogue templates currently supported by the image, see conversation.py.
Dialogue Template Name | Supported Model Series
generate | non-dialogue models (direct generation, no dialogue template)
shennong_chat | Tencent industry large models (the industry large models require allowlist activation)
llama-3 | llama-3-8b-instruct and llama-3-70b-instruct models
llama-2 | llama-2-chat series models
qwen-7b-chat | qwen-chat series models (ChatML format)
baichuan2-chat | baichuan2-chat series models
baichuan-chat | baichuan-13b-chat model
chatglm3 | chatglm3-6b model
chatglm2 | chatglm2-6b model
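If you prefer to generate the configuration file programmatically, a minimal sketch (the field names follow the format shown above; the helper name and the example paths are illustrative):

```python
import json
from pathlib import Path


def write_ti_model_config(model_dir, model_id, conv_template_name=None):
    """Create ti_model_config.json in the model directory."""
    config = {"model_id": model_id}
    if conv_template_name:  # optional; omit to let the image auto-match a template
        config["conv_template_name"] = conv_template_name
    path = Path(model_dir) / "ti_model_config.json"
    path.write_text(json.dumps(config, ensure_ascii=False))
    return path

# e.g. write_ti_model_config("/data/model", "Baichuan2-13B-Chat", "baichuan2-chat")
```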

Starting Inference Service

You can start the inference service through the Model Service > Online Service > Create Service entry on the Tencent Cloud TI Platform. The following is a guide to service instance configuration.

Model Source and Execution Environment

For the model source of the large model inference service, select [CFS] and fill in the CFS path of the model directory prepared above.
For the execution environment, select [Built-in / LLM / angel-vllm] or [Built-in / LLM / angel-deepspeed]; angel-vllm is recommended.

Resource Application Recommendations

The machine resources required for large model inference depend on the model's parameter count. We recommend configuring the inference service resources according to the rules below (allocate CPU and memory proportionally to the node's actually available resources and the number of GPUs occupied; it is recommended to allocate at least more memory than the GPU's VRAM, otherwise an OOMKilled exception can easily occur).
Number of Model Parameters | GPU Card Type and Quantity
6 ~ 8B | L20 * 1 / A10 * 1 / A100 * 1 / V100 * 1
12 ~ 14B | L20 * 1 / A10 * 2 / A100 * 1 / V100 * 2
65 ~ 72B | L20 * 8 / A100 * 8
Note:
Some models support long contexts such as 32k or 128k, for which the KV cache requires a large amount of GPU VRAM. If you do not need such long contexts during inference, you can manually reduce the context length.
Note:
If GPU VRAM is insufficient and the service cannot start, try the following:
1. Allocate more GPUs to the service, or switch to a GPU model with more VRAM;
2. [Recommended] Enable int8 weight-only quantization, which halves the VRAM required for model weights, e.g. QUANTIZATION=ifq;
3. [Recommended] Reduce the context length, which lowers the VRAM required for the KV cache, e.g. MAX_MODEL_LEN=4096;
4. Adjust the GPU VRAM reservation ratio, e.g. GPU_MEMORY_UTILIZATION=0.95;
5. Use eager mode by adding --enforce-eager to the startup parameters.
The meanings of these parameters are described in the advanced settings of the angel-vllm image.
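As a rough back-of-envelope check (a sketch only; real usage also includes the KV cache and activations, which this ignores), the VRAM needed just for model weights is the parameter count times the bytes per parameter, which is why int8 weight-only quantization halves the weight footprint relative to float16:

```python
def weight_vram_gib(num_params_billion, bytes_per_param):
    """Approximate GiB of GPU memory for model weights alone (ignores KV cache and activations)."""
    return num_params_billion * 1e9 * bytes_per_param / 2**30

fp16_gib = weight_vram_gib(13, 2)  # float16: 2 bytes per parameter, ~24 GiB for a 13B model
int8_gib = weight_vram_gib(13, 1)  # int8 weight-only: 1 byte per parameter, half of float16
```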

angel-vllm Image Advanced Settings

Most of the time, you can start the image directly without any modification. If you need advanced settings, you can adjust the container's startup parameters through [Start Command] or [Environment Variable].
The default startup command of the image is
run
The currently supported startup parameters and environment variables are as follows (a startup parameter and the environment variable listed with it are equivalent; if both are set, the startup command parameter takes priority):
Startup Parameter | Environment Variable | Default Value

(none) | MODEL_ID | (none)
Model name; see the description of model_id in the model configuration file section.

(none) | CONV_TEMPLATE | (none)
Dialogue template name; see the description of conv_template_name in the model configuration file section.

--model-path | MODEL_PATH | /data/model
The model load path. The CFS path specified when starting the service is automatically mounted to /data/model inside the container by default, so in most cases you do not need to modify this.

--num-gpus | TP | number of GPUs allocated to the container
The number of GPUs used for model-parallel inference. Used to load the model across multiple GPUs when a single GPU's VRAM is insufficient. Defaults to the number of GPU resources allocated to a single container.
Note: the angel-vllm (2.0) image supports distributed inference; when you select multi-machine distributed deployment, this value is automatically set to the total number of GPUs in the distributed inference cluster.

--worker-num | WORKER_NUM | 1
The number of inference workers, i.e. the number of inference service processes in a single container. Defaults to 1. In general, the number of GPUs allocated to a container equals TP * WORKER_NUM.

--limit-worker-concurrency | MAX_CONCURRENCY | 128
The maximum number of concurrent requests per inference worker. When the number of in-flight requests exceeds this value, new requests are queued.

--dtype | DTYPE | float16
Model precision type; available values: ["auto", "float16", "bfloat16", "float32"]. With "auto", the precision is chosen automatically based on the precision used during model training.

--seed | SEED | (none)
Random seed for generation; not set by default.

--max-model-len | MAX_MODEL_LEN | model context length
The maximum number of context tokens supported by the inference service. By default, the context length is read from the model configuration. If VRAM is insufficient when loading a long-context model with the default value, you can lower this manually.

--quantization | QUANTIZATION | none
Quantization acceleration mode; available values: ["none", "ifq", "ifq_nf4", "fp8", "smoothquant", "auto"].
"none": disable quantization acceleration;
"ifq": enable online int8 weight-only quantization, which accelerates inference with almost no quality loss and reduces the VRAM occupied by model weights;
"ifq_nf4": enable online NF4 weight-only quantization; 4-bit quantization further accelerates inference and further reduces the VRAM occupied by model weights [supported since the angel-vllm (2.0) image];
"fp8": enable online FP8 weight-only quantization, which further accelerates inference and reduces the VRAM occupied by model weights [supported since the angel-vllm (2.0) image] [NVIDIA Hopper series GPUs only];
"smoothquant": enable LayerwiseSearchSMQ quantization, which further accelerates inference with a slight quality loss (requires preparing the quantized model file in advance; currently only some models are supported);
"auto": determine the quantization mode automatically:
if the model or GPU does not support quantization, quantization is disabled;
if the model directory contains the smoothq_model-8bit-auto.safetensors file, LayerwiseSearchSMQ quantization acceleration is enabled;
otherwise, online int8 weight-only quantization (ifq) is enabled by default.
If you have high inference speed requirements or tight VRAM resources, enabling quantization acceleration is recommended.

--use-lookahead | USE_LOOKAHEAD | 0
Lookahead parallel decoding acceleration; available values: ["0", "1", "true", "false"]. Set to "1" or "true" to enable it. Lookahead parallel decoding accelerates generation best in scenarios such as RAG, where the input is relatively long and the generated content is largely contained in the input text. [Supported since the angel-vllm (2.0) image]

--num-speculative-tokens | NUM_SPECULATIVE_TOKENS | 8
The number of tokens decoded per parallel decoding step; takes effect when parallel decoding is enabled. For single-concurrency workloads, 12 is recommended; when the batch size is large, reduce it appropriately, e.g. to 6 or 4. [Supported since the angel-vllm (2.0) image]

--api-keys | OPENAI_API_KEYS | (none)
API key list, not set by default. The format is a comma-separated string, e.g. "sk-xxxx,sk-yyyy,sk-zzzz". When set, Bearer Token authentication is enabled for the service's API: each request must carry one of the configured API keys as the access token, e.g. by adding -H "Authorization: Bearer sk-xxxx" to a curl request, or by setting OPENAI_API_KEY when calling through the openai SDK. With authentication enabled, a failed authentication returns HTTP 401. Please note: after API key authentication is enabled, online experience is no longer supported; only service invocation is supported.

--max-num-batched-tokens | (none) | max(model context length, 2048)
The maximum number of tokens processed per iteration. The default is the larger of 2048 and the model's supported context length. If there is plenty of spare VRAM and the input text is relatively long, increasing this value can improve throughput.

--max-num-seqs | (none) | 256
The maximum number of generated sequences processed per iteration.

--gpu_memory_utilization | GPU_MEMORY_UTILIZATION | 0.9
The fraction of GPU memory reserved for model weights, activations, and KV cache. The larger the value, the larger the supported KV cache, but the easier it is to overflow GPU memory.

--enforce-eager | ENFORCE_EAGER | 1
Whether to force PyTorch's eager mode. When off, CUDA graphs are additionally used for further acceleration, at the cost of extra VRAM and longer service startup time. If startup fails with insufficient GPU memory, you can add --enforce-eager to the startup command to save memory, with a slight decline in inference performance. [Enabled by default since the angel-vllm (2.0) image]
The image also supports starting the inference service through vLLM's native entrypoint. You can modify the startup command to start with python3 -m vllm.entrypoints.openai.api_server, in which case the various parameters natively supported by the open-source vLLM 0.4.2 release are available. For details, see the vLLM official documentation.

On top of this, the platform image additionally supports the following parameters:
--use-lookahead: enable lookahead parallel decoding. It accelerates generation best in scenarios such as RAG, where the input is relatively long and the generated content is largely contained in the input text. --use-v2-block-manager must be enabled at the same time.
--num-speculative-tokens: the length of a single parallel decoding step. For single-concurrency workloads, 12 is recommended; when the batch size is large, reduce it appropriately, e.g. to 6 or 4.
--quantization: additionally supports the ifq and smoothquant quantization methods; see the quantization acceleration mode descriptions above.

For example:
python3 -m vllm.entrypoints.openai.api_server --model /data/model --served-model-name model --trust-remote-code --quantization ifq --use-v2-block-manager --use-lookahead --num-speculative-tokens 6


angel-deepspeed Image Advanced Settings

Most of the time, you can start the image directly without any modification. If you need advanced settings, you can adjust the container's startup parameters through [Start Command] or [Environment Variable].
The default startup command of the image is
run
The currently supported startup parameters and environment variables are as follows (a startup parameter and the environment variable listed with it are equivalent; if both are set, the startup command parameter takes priority):
Startup Parameter | Environment Variable | Default Value

(none) | MODEL_ID | (none)
Model name; see the description of model_id in the model configuration file section.

(none) | CONV_TEMPLATE | (none)
Dialogue template name; see the description of conv_template_name in the model configuration file section.

--model-path | MODEL_PATH | /data/model
The model load path. The CFS path specified when starting the service is automatically mounted to /data/model inside the container by default, so in most cases you do not need to modify this.

--num-gpus | TP | 1
The number of GPUs used for model-parallel inference; defaults to 1.

--worker-num | WORKER_NUM | 1
The number of inference workers, i.e. the number of inference service processes in a single container. Generally modified only when the number of GPUs allocated to a container is larger than the --num-gpus setting.

--limit-worker-concurrency | MAX_CONCURRENCY | 32
The maximum number of concurrent requests per inference worker. When the number of in-flight requests exceeds this value, new requests are queued.

--dtype | DTYPE | float16
Model precision type; available values: ["float16", "bfloat16", "float32"]. With "auto", the precision is chosen automatically based on the precision used during model training.

--seed | SEED | (none)
Random seed for generation; not set by default.

--ds-dtype | DS_DTYPE | float16
Precision type when using Angel-DeepSpeed acceleration; available values: ["float16", "bfloat16", "int8"].

--max-batch-size | MAX_BATCH_SIZE | 16
The maximum number of requests per batch during dynamic batching.

--batch-wait-timeout | BATCH_WAIT_TIMEOUT | 0.1
The maximum wait time (in seconds) per batch during dynamic batching.


Dialogue API Documentation

After the inference image container starts, it listens on port 8501 inside the container by default. You can enter the container for debugging through the Enter Container button in the instance list of the online service.

If you need external access to the model's chat API, there are several ways:
1. Online experience: use the frontend online experience page directly. This method passes through content moderation and supports both built-in large models and custom large models;
2. Service invocation: refer to the Online Service Invocation document. Obtain the service's public network or VPC private network call address on the Service Invocation tab (service invocation is not supported for inference services using the platform's built-in large models; it is supported for inference services that use your own model together with the built-in large model image). You can develop your large model application based on this API. If you need content moderation, you must integrate it yourself. If you have high requirements for invocation performance, the VPC private network address is recommended.

1. API Description

POST /v1/chat/completions
The chat API is largely compatible with the format of OpenAI's Chat Completions API.

2. Input Parameters

HTTP header parameters:
Name | Value
Content-Type | application/json

HTTP request body parameters:

messages | Required | Array of Message
The session content, arranged in chronological dialogue order.

max_tokens | Optional | Integer
Defaults to the context length supported by the model. The maximum number of tokens the model may generate; it cannot exceed the context length supported by the model.

stream | Optional | Boolean
Default false, meaning a non-streaming response; set to true for a streaming response. If you are sensitive to first-token latency, streaming is recommended for a better experience.

temperature | Optional | Float
Default 0.7, range [0.0, 2.0]. The sampling temperature, which adjusts how randomly the model samples. Higher values make output more random; lower values make it more concentrated and deterministic. Set to 0, greedy sampling is used and the result has no randomness. This parameter affects the quality and variety of the model's replies.

top_p | Optional | Float
Default 1.0, range (0.0, 1.0]. Tokens whose cumulative probability does not exceed top_p are included in the candidate list. Larger values produce more diverse text. This parameter affects the quality and variety of the model's replies.

top_k | Optional | Integer
Default -1, meaning top-k filtering is disabled. When larger than 0, the top-k highest-probability tokens are filtered first, and top_p sampling is then applied. This parameter affects the quality and variety of the model's replies.

presence_penalty | Optional | Float
Default 0.0, range [-2.0, 2.0]. The presence penalty: a positive value penalizes tokens that already appear in the current text, increasing the probability that the model discusses new topics. This parameter affects the quality and variety of the model's replies.

frequency_penalty | Optional | Float
Default 0.0, range [-2.0, 2.0]. The frequency penalty: a positive value penalizes tokens in proportion to how often they appear in the current text, reducing the probability of the model repeating the same text. This parameter affects the quality and variety of the model's replies.

repetition_penalty | Optional | Float
Default 1.0, range [1.0, 2.0]. The repetition penalty coefficient: values larger than 1 suppress repeated words and reduce consecutive repetition. This parameter affects the quality and variety of the model's replies.

stop | Optional | Array of String
Default None, meaning no additional stop words. You can set one or more stop words as needed; generation stops automatically when a stop word is encountered.

ignore_eos | Optional | Boolean
Default false, meaning stop words are honored. When true, generation is forced to continue until max_tokens tokens have been produced. Generally only used for debugging (this parameter is only supported by the angel-vllm image).

seed | Optional | Integer
Default None, meaning no fixed random seed. When set, the same seed is used for every generation; generally used to reproduce specific results (this parameter is only supported by the angel-vllm image).
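Putting the body parameters above together, a request body might look like this (all values are illustrative, and the stop word in particular is an assumption for your model's template):

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"}
  ],
  "temperature": 0.7,
  "top_p": 1.0,
  "max_tokens": 512,
  "stream": false
}
```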
Message

role | Required | String
The role of the dialogue message. Currently supported values:
system: the system prompt (appears at most once, at the very beginning);
user: the user;
assistant: the dialogue assistant.
For example, a multi-turn conversation alternates as follows: user: "Please translate the following terms related to Tencent Cloud into English." assistant: "Create and bind a policy" user: "Continue." assistant: "Query an instance" user: "What else?" assistant: "Reset the access password of an instance"

content | Required | String
The content of the dialogue message.

3. Output Parameters

In a non-streaming situation, data with content-type: application/json is returned, in the following format:

id | String
Unique identifier for the dialogue.

choices | Array of Choice
The generated results.

created | Integer
Unix timestamp (in seconds) when the dialogue started.

model | String
The model name used for the dialogue.

object | String
The type of this object, fixed as chat.completion.

usage | Usage
Token count statistics:
completion_tokens: the number of tokens in the generated result;
prompt_tokens: the number of tokens in the prompt content;
total_tokens: the total number of tokens in the request (prompt content plus generated result).
Streaming output returns data with content-type: text/event-stream; charset=utf-8 in SSE form.
Each line of streaming output has the form data: {chunk}. The chunk data format is as follows:
id | String
Unique identifier for the dialogue.

choices | Array of StreamChoice
The generated results.

created | Integer
Unix timestamp (in seconds) when the dialogue started.

model | String
The model name used for the dialogue.

object | String
The type of this object, fixed as chat.completion.chunk.
Streaming output will end with data: [DONE]. For details, see Example.

4. Example

Example 1 Non-Streaming Request

Input Example
curl -H "content-type: application/json" http://localhost:8501/v1/chat/completions -d '{"messages":[{"role": "user", "content": "hello"}], "temperature": 0.0}'
Output sample
Hello! What can I do for you?
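The line above is the assistant's reply text; the full JSON body follows the output-parameter table, roughly as below (the id, timestamp, and token counts are illustrative):

```json
{
  "id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ",
  "object": "chat.completion",
  "created": 1698290857,
  "model": "baichuan-13b-chat",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! What can I do for you?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 1, "completion_tokens": 9, "total_tokens": 10}
}
```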

Example 2 Streaming Request

Input Example
curl -H "content-type: application/json" http://localhost:8501/v1/chat/completions -d '{"messages":[{"role": "user", "content": "hello"}], "temperature": 0.0, "stream": true}'
Output sample
data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}

data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": null}]}

data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": "!"}, "finish_reason": null}]}

data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": " What"}, "finish_reason": null}]}

data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": " can I"}, "finish_reason": null}]}

data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": " do for"}, "finish_reason": null}]}

data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": " you?"}, "finish_reason": null}]}

data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "object": "chat.completion.chunk", "created": 1698290857, "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]
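Client-side, a stream like the one above is reassembled by joining the delta contents of each chunk. A minimal sketch (an illustrative helper, not part of the platform):

```python
import json


def join_stream(lines):
    """Accumulate the delta.content fields from raw SSE 'data: ...' lines."""
    text = ""
    for line in lines:
        # Skip non-data lines and the terminating [DONE] marker.
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        delta = chunk["choices"][0]["delta"]
        text += delta.get("content", "")  # role-only chunks contribute nothing
    return text
```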

5. Developer Resources

Python SDK

You can call this inference service directly with the openai SDK. An example follows:
import os
from openai import OpenAI

# Example 1: local access address inside the container
os.environ["OPENAI_BASE_URL"] = "http://127.0.0.1:8501/v1"
# Example 2: public network access address, obtained from the service invocation page (append /v1 to the address when using the openai SDK)
os.environ["OPENAI_BASE_URL"] = "https://service-********.sh.tencentapigw.com:443/tione/v1"
# If the --api-keys parameter or the OPENAI_API_KEYS environment variable enabled authentication
# when the service started, set OPENAI_API_KEY to any one of the configured api_keys;
# otherwise any value works.
os.environ["OPENAI_API_KEY"] = "EMPTY"
client = OpenAI()

# Non-streaming request example:
print("----- standard request -----")
completion = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": "Hello",
        },
    ],
    temperature=0.7,
    top_p=1.0,
    max_tokens=128,
)
print(completion.choices[0].message.content)

# Streaming request example:
print("----- streaming request -----")
stream = client.chat.completions.create(
    model="model",
    messages=[
        {
            "role": "user",
            "content": "Hello",
        },
    ],
    temperature=0.7,
    top_p=1.0,
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    print(chunk.choices[0].delta.content, end="")
print()


Command Line Dialogue Demo

You can also use the common Python requests library to call the chat API. Below is a demo of interacting with the large model inference service from the command line:
import argparse
import json

import requests


def chat(messages):
    data = {
        "messages": messages,
        "temperature": args.temperature,
        "max_tokens": args.max_tokens,
        "top_p": args.top_p,
        "stream": True,  # enable streaming output
    }
    header = {
        "Content-Type": "application/json",
    }
    if args.token:
        header["Authorization"] = f"Bearer {args.token}"

    # stream=True lets us consume the response as a real-time data stream
    response = requests.post(f"{args.server}/v1/chat/completions", json=data, headers=header, stream=True)
    if response.status_code != 200:
        print(response.json())
        exit()

    result = ""
    print("Assistant: ", end="", flush=True)
    for part in response.iter_lines():
        if part:
            line = part.decode("utf-8")
            if "content" in line:
                # Strip the leading "data: " prefix, parse the JSON chunk, and extract the text
                content = json.loads(line[5:])["choices"][0]["delta"]["content"]
                result += content
                print(content, end="", flush=True)
    print("\n")

    return result


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Chat CLI Demo.")
    parser.add_argument("--server", type=str, default="http://127.0.0.1:8501")
    parser.add_argument("--max-tokens", type=int, default=512)
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--top_p", type=float, default=1.0)
    parser.add_argument("--system", type=str, default=None)
    parser.add_argument("--token", type=str, default=None)
    args = parser.parse_args()

    messages = []
    if args.system:
        messages.append({"role": "system", "content": args.system})

    while True:
        user_input = input("User: ")
        messages.append({"role": "user", "content": user_input})
        response = chat(messages)
        messages.append({"role": "assistant", "content": response})


