The model directory must contain the model weight files, in .bin format or .safetensors format. To use SmoothQuant quantization, place the quantized smoothq_model-8bit-auto.safetensors file in the model directory; it is recommended to delete the original non-quantized model file. The model directory must also contain a ti_model_config.json configuration file. The format is as follows: {"model_id": "Baichuan2-13B-Chat", "conv_template_name": "baichuan2-chat"}
Instead of adding the model_id to the ti_model_config.json file in the model directory, you can also specify it with the MODEL_ID environment variable when starting the service; likewise, instead of adding the conv_template_name to ti_model_config.json, you can specify it with the CONV_TEMPLATE environment variable. The supported dialogue templates are listed below.
Dialogue Template Name | Supported Model Series |
generate | Non-dialogue model (generate directly, no dialogue template) |
shennong_chat | Tencent Industry Large Model (the industry large model needs to enable the allowlist) |
llama-3 | llama-3-8b-instruct, llama-3-70b-instruct models |
llama-2 | llama-2-chat series models |
qwen-7b-chat | qwen-chat series models (chatml format) |
baichuan2-chat | baichuan2-chat series models |
baichuan-chat | baichuan-13b-chat model |
chatglm3 | chatglm3-6b model |
chatglm2 | chatglm2-6b model |
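As an alternative to placing ti_model_config.json in the model directory, both values can be supplied as environment variables. A minimal sketch, using the example values from the configuration above:

```shell
# Equivalent to the ti_model_config.json example above:
export MODEL_ID="Baichuan2-13B-Chat"
export CONV_TEMPLATE="baichuan2-chat"
```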
Number of Model Parameters | GPU Card Type and Quantity |
6 ~ 8B | L20 * 1 / A10 * 1 / A100 * 1 / V100 * 1 |
12 ~ 14B | L20 * 1 / A10 * 2 / A100 * 1 / V100 * 2 |
65 ~ 72B | L20 * 8 / A100 * 8 |
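The card counts above roughly track the float16 weight footprint of the model. A hedged rule-of-thumb sketch, assuming 2 bytes per parameter and ignoring KV cache and activations (which also need headroom):

```python
def fp16_weight_gib(params_billion: float) -> float:
    """Approximate float16 weight size in GiB: 2 bytes per parameter.

    This is only a sizing heuristic; KV cache, activations, and CUDA
    overhead require additional VRAM on top of this figure.
    """
    return params_billion * 1e9 * 2 / 2**30

# A 13B model needs roughly 24 GiB just for weights, which is why the
# table above pairs 12-14B models with 2x V100 (16 GiB each) or 1x A100.
print(round(fp16_weight_gib(13), 1))  # prints 24.2
```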
If GPU memory is insufficient at startup, you can try the following: enable quantization acceleration with QUANTIZATION=ifq; reduce the supported context length, for example MAX_MODEL_LEN=4096; increase the reserved GPU memory ratio, for example GPU_MEMORY_UTILIZATION=0.95; add --enforce-eager to the startup parameters; then run the service again. The startup parameters and environment variables are as follows:
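A hedged sketch of the memory-saving settings expressed as environment variables; all of them appear in the parameter table below, and the exact values should be tuned per model:

```shell
export QUANTIZATION=ifq            # online Int8 weight-only quantization
export MAX_MODEL_LEN=4096          # cap context length to shrink the KV cache
export GPU_MEMORY_UTILIZATION=0.95 # reserve more VRAM for the engine
export ENFORCE_EAGER=1             # skip CUDA graph capture to save VRAM
```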
Startup Parameter | Environment Variable | Default Value | Meaning |
None | MODEL_ID | None | Model name, see the introduction of model_id in model configuration file |
None | CONV_TEMPLATE | None | Dialogue template name, see the introduction of conv_template_name in model configuration file |
--model-path | MODEL_PATH | /data/model | The model load path. The CFS path specified when starting the service is automatically mounted to /data/model in the container by default; you usually do not need to modify this. |
--num-gpus | TP | The number of GPUs allocated to the container | The number of GPUs used for model-parallel inference. It is used when the VRAM of a single GPU is insufficient, so that the model is loaded across multiple GPUs. By default, it equals the number of GPU resources allocated to a single container. Note: the angel-vllm (2.0) image supports distributed inference; when you select multi-machine distributed deployment, it is automatically set to the total number of GPUs in the distributed inference cluster. |
--worker-num | WORKER_NUM | 1 | The number of inference Workers, which determines the number of processes running the inference service in a single container. It defaults to 1. Generally, the number of GPUs allocated to a container equals TP * WORKER_NUM. |
--limit-worker-concurrency | MAX_CONCURRENCY | 128 | The maximum number of concurrent requests supported by each inference Worker. When the number of requests being processed exceeds this value, new requests are queued. |
--dtype | DTYPE | float16 | Model precision type, available values: ["auto", "float16", "bfloat16", "float32"]. If "auto" is configured, it will be configured automatically based on the precision used during model training. |
--seed | SEED | None | Random seed for generation; not set by default, so each run uses a different seed. |
--max-model-len | MAX_MODEL_LEN | Model context length | The maximum number of context tokens supported by the inference service. By default, the context length is read from the model configuration. If GPU VRAM is insufficient when loading some long-context models with the default configuration, you can manually lower this value. |
--quantization | QUANTIZATION | none | Quantization acceleration mode, available values: ["none", "ifq", "ifq_nf4", "fp8", "smoothquant", "auto"]. "none": disables quantization acceleration. "ifq": enables online Int8 weight-only quantization, which accelerates inference with almost no loss in quality and reduces the GPU VRAM occupied by model weights. "ifq_nf4": enables online NF4 weight-only quantization; 4-bit quantization further accelerates inference and reduces the GPU VRAM occupied by model weights. [Supported from the angel-vllm (2.0) image] "fp8": enables online FP8 weight-only quantization, which further accelerates inference and reduces the GPU VRAM occupied by model weights. [Supported from the angel-vllm (2.0) image; NVIDIA Hopper-series GPUs only] "smoothquant": enables LayerwiseSearchSMQ quantization, which further accelerates inference with a slight loss in quality (requires preparing the quantized model file in advance; currently only some models are supported). "auto": determines the quantization mode automatically: if the GPU does not support quantization, quantization is disabled; if the model directory contains the smoothq_model-8bit-auto.safetensors file, LayerwiseSearchSMQ quantization acceleration is enabled; otherwise, online Int8 weight-only quantization (ifq) is enabled by default. If you have high requirements for inference speed or VRAM is tight, enabling quantization acceleration is recommended. |
--use-lookahead | USE_LOOKAHEAD | 0 | Lookahead parallel decoding acceleration, available values: ["0", "1", "true", "false"]. Set to "1" or "true" to enable Lookahead parallel decoding. It performs best in scenarios such as RAG, where the input is relatively long and the generated content is largely contained in the input text. [Supported from the angel-vllm (2.0) image] |
--num-speculative-tokens | NUM_SPECULATIVE_TOKENS | 8 | The number of tokens generated per parallel-decoding step; takes effect only when parallel decoding is enabled. For single concurrency, 12 is recommended; when the batch size is large, the step length can be reduced, for example to 6 or 4. [Supported from the angel-vllm (2.0) image] |
--api-keys | OPENAI_API_KEYS | None | API_KEY list, not set by default. The supported format is a comma-separated string, for example "sk-xxxx,sk-yyyy,sk-zzzz". When set, Bearer Token authentication is enabled for the service's API. Each request must carry one of the configured API_KEYs as the access token, for example by adding -H "Authorization: Bearer sk-xxxx" to a curl request, or by setting OPENAI_API_KEY when calling through the openai SDK. After authentication is enabled, an HTTP 401 error is returned if authentication fails.
Please note: after enabling API_KEY authentication, Online experience is no longer supported; only service invocation is supported. |
--max-num-batched-tokens | None | max(model context length, 2048) | The maximum number of tokens processed per iteration. The default value is the larger value between 2048 and the model's supported context length. If there is considerable surplus GPU VRAM and the input text is relatively long, you can increase this parameter appropriately to get better throughput performance. |
--max-num-seqs | None | 256 | The maximum number of generated sequences processed per iteration |
--gpu_memory_utilization | GPU_MEMORY_UTILIZATION | 0.9 | The reserved GPU memory ratio for model weights, activations, and KV cache. The larger the value, the larger the supported KV cache, but the easier it is to cause GPU memory overflow. |
--enforce-eager | ENFORCE_EAGER | 1 | Whether to force PyTorch eager mode. It is off by default, in which case CUDA graphs are additionally used for further acceleration, at the cost of extra GPU VRAM and a longer service startup time. If startup fails with insufficient GPU memory, you can add the --enforce-eager parameter to the startup command to save memory, with a slight decline in inference performance. [Enabled by default in the angel-vllm (2.0) image] |
In the angel-vllm (2.0) image, you can also start the service directly with python3 -m vllm.entrypoints.openai.api_server. In this mode, the service supports the parameters natively supported by the open-source vLLM 0.4.2 release; for details, see the official vLLM documentation. Notes: --use-lookahead requires --use-v2-block-manager to be enabled at the same time; --num-speculative-tokens sets the parallel-decoding length (12 is recommended for single concurrency; reduce it, for example to 6 or 4, when the batch size is large); --quantization additionally supports the ifq and smoothquant methods described in the quantization acceleration modes above. Example: python3 -m vllm.entrypoints.openai.api_server --model /data/model --served-model-name model --trust-remote-code --quantization ifq --use-v2-block-manager --use-lookahead --num-speculative-tokens 6
The startup parameters and environment variables for this image are as follows:
Startup Parameter | Environment Variable | Default Value | Meaning |
None | MODEL_ID | None | Model name, see the introduction of model_id in model configuration file |
None | CONV_TEMPLATE | None | Dialogue template name, see the introduction of conv_template_name in model configuration file |
--model-path | MODEL_PATH | /data/model | The model load path. The CFS path specified when starting the service is automatically mounted to /data/model in the container by default; you usually do not need to modify this. |
--num-gpus | TP | 1 | The number of GPUs used for model-parallel inference; defaults to 1. |
--worker-num | WORKER_NUM | 1 | The number of inference Workers, which determines the number of processes running the inference service in a single container. It generally needs to be changed only when the number of GPUs allocated to the container is larger than the --num-gpus setting. |
--limit-worker-concurrency | MAX_CONCURRENCY | 32 | The maximum number of concurrent requests supported by each reasoning Worker. When the number of requests being processed exceeds this value, new requests will be queued. |
--dtype | DTYPE | float16 | Model precision type, available values: ["auto", "float16", "bfloat16", "float32"]. If "auto" is configured, the precision used during model training is selected automatically. |
--seed | SEED | None | Random seed for generation; not set by default, so each run uses a different seed. |
--ds-dtype | DS_DTYPE | float16 | Precision type when using Angel-Deepspeed for acceleration, available values: ["float16", "bfloat16", "int8"] |
--max-batch-size | MAX_BATCH_SIZE | 16 | The maximum number of requests for each batch when dynamically grouping Batches |
--batch-wait-timeout | BATCH_WAIT_TIMEOUT | 0.1 | Maximum wait time (in seconds) for each batch when dynamically grouping Batches |
Request path: /v1/chat/completions. Request headers:
Name | Value |
Content-Type | application/json |
Parameter Name | Required | Type | Description |
messages | Yes | Array of Message | The conversation messages, arranged in chronological order. |
max_tokens | No | Integer | Defaults to the context length supported by the model. Indicates the maximum number of tokens the model may generate; it cannot exceed the context length supported by the model. |
stream | No | Boolean | Default false, indicating non-streaming return; set to true indicates streaming return. If you are latency-sensitive to the first character return, it is recommended to use streaming for a better experience. |
temperature | No | Float | Default 0.7, value range [0.0, 2.0]. The sampling temperature, which adjusts the randomness of sampling from the model: higher values make the output more random, lower values make it more concentrated and deterministic. Set to 0 to use greedy sampling, in which case the result has no randomness. This parameter affects the quality and variety of the generated reply. |
top_p | No | Float | Default value: 1.0. Value range: (0.0, 1.0]. Nucleus sampling: tokens are included in the candidate set, in descending order of probability, until their cumulative probability reaches top_p. The larger the value, the greater the diversity of the generated text. This parameter affects the quality and variety of the generated reply. |
top_k | No | Integer | Default: -1, indicating that top-k filtering sampling is disabled. When the value is larger than 0, filter the top-k tokens with the highest likelihood first and then use top_p sampling. This parameter impacts the quality and variety of the reply generated by the model. |
presence_penalty | No | Float | Default: 0.0. Value range: [-2.0, 2.0]. Presence penalty: when positive, tokens that have already appeared in the text are penalized, increasing the probability that the model discusses new topics. This parameter affects the quality and variety of the generated reply. |
frequency_penalty | No | Float | Default: 0.0. Value range: [-2.0, 2.0]. Frequency penalty: when positive, tokens are penalized in proportion to how often they have already appeared in the text, reducing the probability that the model repeats the same text. This parameter affects the quality and variety of the generated reply. |
repetition_penalty | No | Float | Default: 1.0. Value range: [1.0, 2.0]. It indicates the repetition penalty coefficient. When it is larger than 1, it will suppress the generation of repeated words and reduce the phenomenon of consecutive repetition. This parameter impacts the quality and variety of the reply generated by the model. |
stop | No | Array of String | The default is None, indicating no additional stop words; you can set one or more stop words according to actual business needs, and generation will automatically stop when a stop word is encountered. |
ignore_eos | No | Boolean | false by default, meaning the end-of-sequence token is not ignored. When set to true, generation continues until max_tokens tokens have been produced. It is generally only used during debugging (this parameter is only supported by the angel-vllm image). |
seed | No | Integer | None by default, which means not setting a random seed. After being set to a specific value, this random seed will be used every time generation is performed. It is generally used to reproduce specific generation results (this parameter is only supported by the angel-vllm image). |
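Putting the request-body parameters above together, a minimal sketch of a payload; the stop word shown is a hypothetical example and should be adjusted per model:

```python
import json

payload = {
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 128,
    "stream": False,
    "temperature": 0.7,
    "top_p": 1.0,
    "top_k": -1,          # -1 disables top-k filtering
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "repetition_penalty": 1.0,
    "stop": ["</s>"],     # hypothetical stop word; adjust per model
}
body = json.dumps(payload)  # JSON string to POST to /v1/chat/completions
```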
Parameter Name | Required | Type | Description |
role | Yes | String | The role of the dialogue message. Currently supported values: system (system prompt, appears once at the very beginning), user, assistant. A multi-turn example: user: Please translate the following terms related to Tencent Cloud into English. assistant: Create and bind a policy. user: Continue. assistant: Query an instance. user: What else? assistant: Reset the access password of an instance. |
content | Yes | String | The content of a dialogue message |
On success, a response with content-type: application/json is returned. The format is as follows:
Parameter Name | Type | Description |
id | String | Unique identifier for the dialogue |
choices | Array of Choice | Generated result |
created | Integer | Unix timestamp (in seconds) when the dialogue started |
model | String | Model name used in the dialogue |
object | String | Type of this object, fixed as chat.completion |
usage | Usage | Token count statistics: prompt_tokens is the number of tokens in the prompt content; completion_tokens is the number of tokens in the generated result; total_tokens is the total for the request (prompt content plus generated result). |
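The response fields above can be read straight off the parsed JSON. A small sketch using a hypothetical response body shaped like the table:

```python
import json

# Hypothetical non-streaming response matching the fields above.
raw = '''{
  "id": "chatcmpl-demo",
  "object": "chat.completion",
  "created": 1698290857,
  "model": "baichuan-13b-chat",
  "choices": [{"index": 0,
               "message": {"role": "assistant", "content": "Hello! What can I do for you?"},
               "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 3, "completion_tokens": 9, "total_tokens": 12}
}'''

resp = json.loads(raw)
answer = resp["choices"][0]["message"]["content"]
usage = resp["usage"]
# total_tokens is prompt_tokens plus completion_tokens
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
print(answer)
```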
For streaming responses, data is returned in SSE form with content-type: text/event-stream; charset=utf-8, as a series of data: shards. The shard data format is as follows:
Parameter Name | Type | Description |
id | String | Unique identifier for the dialogue |
choices | Array of StreamChoice | Generated result |
created | Integer | Unix timestamp (in seconds) when the dialogue started |
model | String | Model name used in the dialogue |
object | String | Type of this object, fixed as chat.completion.chunk |
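A minimal sketch of extracting the text delta from a single SSE "data:" line, following the chunk schema above (the sample line is a hypothetical chunk):

```python
import json

# Hypothetical SSE line shaped like the chunk schema above.
line = ('data: {"id": "chatcmpl-1", "object": "chat.completion.chunk", '
        '"created": 0, "model": "m", "choices": [{"index": 0, '
        '"delta": {"content": "hi"}, "finish_reason": null}]}')

def delta_content(sse_line):
    """Return the text delta of one "data:" line, or None for [DONE]."""
    payload = sse_line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None  # end-of-stream marker, no JSON to parse
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

print(delta_content(line))  # prints: hi
```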
curl -H "content-type: application/json" http://localhost:8501/v1/chat/completions -d '{"messages":[{"role": "user", "content": "hello"}], "temperature": 0.0}'
Hello! What can I do for you?
curl -H "content-type: application/json" http://localhost:8501/v1/chat/completions -d '{"messages":[{"role": "user", "content": "hello"}], "temperature": 0.0, "stream": true}'
data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"role": "assistant"}, "finish_reason": null}]}
data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": "Hello"}, "finish_reason": null}]}
data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": "! What"}, "finish_reason": null}]}
data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": " can I do"}, "finish_reason": null}]}
data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {"content": " for you?"}, "finish_reason": null}]}
data: {"id": "chatcmpl-3fYW8fqN3YYMJiebiZgpzZ", "object": "chat.completion.chunk", "created": 1698290857, "model": "baichuan-13b-chat", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]
import os
from openai import OpenAI

# Example 1: Container local access address
os.environ["OPENAI_BASE_URL"] = "http://127.0.0.1:8501/v1"
# Example 2: Public network access address, which can be obtained through the service invocation page (append /v1 to the address when using the openai sdk)
os.environ["OPENAI_BASE_URL"] = "https://service-********.sh.tencentapigw.com:443/tione/v1"
# If the --api-keys parameter or the OPENAI_API_KEYS environment variable enabled authentication when the service started, OPENAI_API_KEY must be set to one of the available api_keys; otherwise it can be any value.
os.environ["OPENAI_API_KEY"] = "EMPTY"

client = OpenAI()

# Non-streaming request example:
print("----- standard request -----")
completion = client.chat.completions.create(
    model="model",
    messages=[
        {"role": "user", "content": "Hello"},
    ],
    temperature=0.7,
    top_p=1.0,
    max_tokens=128,
)
print(completion.choices[0].message.content)

# Streaming request example:
print("----- streaming request -----")
stream = client.chat.completions.create(
    model="model",
    messages=[
        {"role": "user", "content": "Hello"},
    ],
    temperature=0.7,
    top_p=1.0,
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    print(chunk.choices[0].delta.content, end="")
print()
import argparse
import json

import requests

def chat(messages):
    data = {
        "messages": messages,
        "temperature": args.temperature,
        "max_tokens": args.max_tokens,
        "top_p": args.top_p,
        "stream": True,  # Enable streaming output
    }
    header = {
        "Content-Type": "application/json",
    }
    if args.token:
        header["Authorization"] = f"Bearer {args.token}"
    # Set stream=True to receive the data stream in real time
    response = requests.post(f"{args.server}/v1/chat/completions", json=data, headers=header, stream=True)
    if response.status_code != 200:
        print(response.json())
        exit()
    result = ""
    print("Assistant: ", end="", flush=True)
    for part in response.iter_lines():
        if part:
            if "content" in part.decode("utf-8"):
                # Strip the "data:" prefix, parse the JSON, then extract the text
                content = json.loads(part.decode("utf-8")[5:])["choices"][0]["delta"]["content"]
                result += content
                print(content, end="", flush=True)
    print("\n")
    return result

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Chat CLI Demo.")
    parser.add_argument("--server", type=str, default="http://127.0.0.1:8501")
    parser.add_argument("--max-tokens", type=int, default=512)
    parser.add_argument("--temperature", type=float, default=0.7)
    parser.add_argument("--top_p", type=float, default=1.0)
    parser.add_argument("--system", type=str, default=None)
    parser.add_argument("--token", type=str, default=None)
    args = parser.parse_args()
    messages = []
    if args.system:
        messages.append({"role": "system", "content": args.system})
    while True:
        user_input = input("User: ")
        messages.append({"role": "user", "content": user_input})
        response = chat(messages)
        messages.append({"role": "assistant", "content": response})