TACO LLM Inference Acceleration Engine

Last updated: 2025-04-30 16:03:16

1. Product Introduction

TACO-LLM (Tencent Cloud Accelerated Computing Optimization LLM) is an inference acceleration engine for large language models (LLMs), built on Tencent Cloud's heterogeneous computing products to improve LLM inference efficiency. By fully leveraging the parallel computing capabilities of the underlying hardware, TACO-LLM can process more LLM inference requests simultaneously, providing an optimization solution that balances high throughput and low latency. TACO-LLM reduces the waiting time for generation results, improves the efficiency of the inference process, and helps you optimize business costs.
Advantages of TACO-LLM:
High ease of use
TACO-LLM is designed with a simple, easy-to-use API that is fully compatible with vLLM, the widely used open-source LLM inference framework. If you already use vLLM as your inference engine, you can migrate to TACO-LLM seamlessly and obtain better performance than vLLM with little effort. The simple API also allows users of other inference frameworks to get started quickly.
Support for multiple computing platforms
TACO-LLM supports multiple computing platforms, including GPUs (NVIDIA/AMD/Intel), CPUs (Intel/AMD), and TPUs, and will support major domestic computing platforms in the future.
High efficiency
TACO-LLM employs multiple LLM inference acceleration technologies, such as continuous batching, paged attention, speculative sampling, automatic prefix caching, CPU-assisted acceleration, and long-sequence optimization. It tunes performance for different computing resources and improves LLM inference performance across the board.
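To make the continuous-batching idea above concrete, here is a toy Python sketch (an illustration only, not TACO-LLM's actual scheduler): instead of waiting for an entire batch to finish, the scheduler retires each finished request after every decoding step and immediately admits a waiting request into the freed slot.

```python
from collections import deque

def continuous_batching(requests, max_batch_size=2):
    """Toy continuous-batching scheduler.

    requests: list of (request_id, num_tokens_to_generate).
    Returns request IDs in the order they complete.
    """
    waiting = deque(requests)
    running = {}          # request_id -> tokens still to generate
    completion_order = []
    while waiting or running:
        # Admit new requests whenever a batch slot is free.
        while waiting and len(running) < max_batch_size:
            rid, length = waiting.popleft()
            running[rid] = length
        # One decoding step: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:   # finished: free the slot immediately
                del running[rid]
                completion_order.append(rid)
    return completion_order

# Short requests finish and release their slots without waiting for the
# long request to complete.
print(continuous_batching([("long", 5), ("short", 1), ("tiny", 1)]))
# → ['short', 'tiny', 'long']
```

This is why continuous batching raises throughput: batch slots are never held idle by requests that have already finished generating.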

2. Supported Models

TACO-LLM supports multiple generative Transformer models in the Hugging Face model format. The following lists the currently supported model architectures and the corresponding commonly used models.

Decoder-Only Language Model

| Architecture | Models | Example HuggingFace Models | LoRA |
|---|---|---|---|
| BaiChuanForCausalLM | Baichuan & Baichuan2 | baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B, etc. | ✓ |
| BloomForCausalLM | BLOOM, BLOOMZ, BLOOMChat | bigscience/bloom, bigscience/bloomz, etc. | - |
| ChatGLMModel | ChatGLM | THUDM/chatglm2-6b, THUDM/chatglm3-6b, etc. | ✓ |
| FalconForCausalLM | Falcon | tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc. | - |
| GemmaForCausalLM | Gemma | google/gemma-2b, google/gemma-7b, etc. | ✓ |
| Gemma2ForCausalLM | Gemma2 | google/gemma-2-9b, google/gemma-2-27b, etc. | ✓ |
| GPT2LMHeadModel | GPT-2 | gpt2, gpt2-xl, etc. | - |
| GPTBigCodeForCausalLM | StarCoder, SantaCoder, WizardCoder | bigcode/starcoder, bigcode/gpt_bigcode-santacoder, WizardLM/WizardCoder-15B-V1.0, etc. | ✓ |
| GPTJForCausalLM | GPT-J | EleutherAI/gpt-j-6b, nomic-ai/gpt4all-j, etc. | - |
| GPTNeoXForCausalLM | GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM | EleutherAI/gpt-neox-20b, EleutherAI/pythia-12b, OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc. | - |
| InternLMForCausalLM | InternLM | internlm/internlm-7b, internlm/internlm-chat-7b, etc. | ✓ |
| InternLM2ForCausalLM | InternLM2 | internlm/internlm2-7b, internlm/internlm2-chat-7b, etc. | - |
| LlamaForCausalLM | Llama 3.1, Llama 3, Llama 2, LLaMA, Yi | meta-llama/Meta-Llama-3.1-405B-Instruct, meta-llama/Meta-Llama-3.1-70B, meta-llama/Meta-Llama-3-70B-Instruct, meta-llama/Llama-2-70b-hf, 01-ai/Yi-34B, etc. | ✓ |
| MistralForCausalLM | Mistral, Mistral-Instruct | mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1, etc. | ✓ |
| MixtralForCausalLM | Mixtral-8x7B, Mixtral-8x7B-Instruct | mistralai/Mixtral-8x7B-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1, mistral-community/Mixtral-8x22B-v0.1, etc. | ✓ |
| NemotronForCausalLM | Nemotron-3, Nemotron-4, Minitron | nvidia/Minitron-8B-Base, mgoin/Nemotron-4-340B-Base-hf-FP8, etc. | ✓ |
| OPTForCausalLM | OPT, OPT-IML | facebook/opt-66b, facebook/opt-iml-max-30b, etc. | ✓ |
| PhiForCausalLM | Phi | microsoft/phi-1_5, microsoft/phi-2, etc. | ✓ |
| Phi3ForCausalLM | Phi-3 | microsoft/Phi-3-mini-4k-instruct, microsoft/Phi-3-mini-128k-instruct, microsoft/Phi-3-medium-128k-instruct, etc. | - |
| Phi3SmallForCausalLM | Phi-3-Small | microsoft/Phi-3-small-8k-instruct, microsoft/Phi-3-small-128k-instruct, etc. | - |
| PhiMoEForCausalLM | Phi-3.5-MoE | microsoft/Phi-3.5-MoE-instruct, etc. | - |
| QWenLMHeadModel | Qwen | Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, etc. | - |
| Qwen2ForCausalLM | Qwen2 | Qwen/Qwen2-beta-7B, Qwen/Qwen2-beta-7B-Chat, etc. | ✓ |
| Qwen2MoeForCausalLM | Qwen2MoE | Qwen/Qwen1.5-MoE-A2.7B, Qwen/Qwen1.5-MoE-A2.7B-Chat, etc. | - |
| StableLmForCausalLM | StableLM | stabilityai/stablelm-3b-4e1t, stabilityai/stablelm-base-alpha-7b-v2, etc. | - |
| Starcoder2ForCausalLM | Starcoder2 | bigcode/starcoder2-3b, bigcode/starcoder2-7b, bigcode/starcoder2-15b, etc. | - |
| XverseForCausalLM | Xverse | xverse/XVERSE-7B-Chat, xverse/XVERSE-13B-Chat, xverse/XVERSE-65B-Chat, etc. | - |

Multimodal Language Model

| Architecture | Models | Modalities | Example HuggingFace Models | LoRA |
|---|---|---|---|---|
| InternVLChatModel | InternVL2 | Image(E+) | OpenGVLab/InternVL2-4B, OpenGVLab/InternVL2-8B, etc. | - |
| LlavaForConditionalGeneration | LLaVA-1.5 | Image(E+) | llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, etc. | - |
| LlavaNextForConditionalGeneration | LLaVA-NeXT | Image(E+) | llava-hf/llava-v1.6-mistral-7b-hf, llava-hf/llava-v1.6-vicuna-7b-hf, etc. | - |
| LlavaNextVideoForConditionalGeneration | LLaVA-NeXT-Video | Video | llava-hf/LLaVA-NeXT-Video-7B-hf, etc. (see note) | - |
| PaliGemmaForConditionalGeneration | PaliGemma | Image(E) | google/paligemma-3b-pt-224, google/paligemma-3b-mix-224, etc. | - |
| Phi3VForCausalLM | Phi-3-Vision, Phi-3.5-Vision | Image(E+) | microsoft/Phi-3-vision-128k-instruct, microsoft/Phi-3.5-vision-instruct, etc. | - |
| PixtralForConditionalGeneration | Pixtral | Image(+) | mistralai/Pixtral-12B-2409 | - |
| QWenLMHeadModel | Qwen-VL | Image(E+) | Qwen/Qwen-VL, Qwen/Qwen-VL-Chat, etc. | - |
| Qwen2VLForConditionalGeneration | Qwen2-VL (see note) | Image(+) / Video(+) | Qwen/Qwen2-VL-2B-Instruct, Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2-VL-72B-Instruct, etc. | - |
Note:
E: pre-computed embeddings can be passed as multimodal input.
+: a single prompt can include multiple multimodal inputs.
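To illustrate the "+" capability above, the sketch below builds a single chat message that carries two image inputs, using the OpenAI-style vision message format (a list of text and image_url content parts). The image URLs are placeholders; a model marked Image(+), such as Qwen2-VL, would be needed to actually serve such a request.

```python
def build_multi_image_message(text, image_urls):
    """Build one user message with a text part plus one part per image."""
    content = [{"type": "text", "text": text}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": content}

message = build_multi_image_message(
    "What is different between these two images?",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
)
print(len(message["content"]))  # → 3 (1 text part + 2 image parts)
```

For a model marked Image(E), by contrast, only a single image input per prompt is accepted.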

3. Installing TACO LLM

Environment Preparation

TACO-LLM depends on GPU-related base software, such as the GPU driver and CUDA. To prevent missing base-software dependencies from stopping TACO-LLM from running normally, we provide a TACO-LLM Docker environment image. We recommend that you use this image as the runtime environment for TACO-LLM. You can obtain the Docker image and start the container environment with the following commands:
docker run -it \
--privileged \
--net=host \
--ipc=host \
--shm-size=16g \
--name=taco_llm \
--gpus all \
-v /home/workspace:/home/workspace \
ccr.ccs.tencentyun.com/taco/tacollm-dev:latest /bin/bash

Installing Whl Package

Notes:
If you have business requirements and want to try out TACO-LLM, submit a ticket to contact the TACO team and obtain the installation package.
1. After obtaining the TACO-LLM whl installation package by submitting a ticket, you can install TACO-LLM in the container environment with the following command:
pip3 install taco_llm-${version}-cp310-cp310-linux_x86_64.whl
2. When the TACO-LLM whl package is installed, the related Python dependency packages are installed automatically.

4. Using TACO LLM

TACO-LLM provides an HTTP server that implements the OpenAI Completions and Chat APIs. You can use it by following the steps below.

Start Service

First, execute the following commands to start the service:
taco_llm serve facebook/opt-125m --api-key taco-llm-test

Send the request

You can use OpenAI's official Python client to send a request:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="taco-llm-test",
)

completion = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)
You can also use an HTTP client to send a request:
import json
import requests

api_key = "taco-llm-test"

headers = {
    "Authorization": f"Bearer {api_key}"
}

pload = {
    "prompt": "Hello!",
    "stream": True,
    "max_tokens": 128,
}

response = requests.post("http://localhost:8000/v1/completions",
                         headers=headers,
                         json=pload,
                         stream=True)

for chunk in response.iter_lines(chunk_size=8192,
                                 decode_unicode=False,
                                 delimiter=b"\0"):
    if chunk:
        data = json.loads(chunk.decode("utf-8"))
        output = data["text"][0]
        print(output)

Complete Client Parameter Configuration

Except for a few unsupported parameters, TACO-LLM fully supports OpenAI's parameter configuration. Refer to the OpenAI API official documentation for the complete list of API parameters. The unsupported parameters are as follows:
Chat: tools and tool_choice.
Completions: suffix.
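As a minimal sketch of staying within the supported parameter set, the helper below assembles a Chat request payload from standard OpenAI sampling parameters (temperature, top_p, max_tokens) and rejects the parameters listed above as unsupported. The helper itself is an illustration, not part of the TACO-LLM API.

```python
# Parameters the section above lists as unsupported for the Chat API.
UNSUPPORTED_CHAT_PARAMS = {"tools", "tool_choice"}

def build_chat_payload(model, messages, **params):
    """Assemble a Chat request body, refusing unsupported parameters."""
    bad = UNSUPPORTED_CHAT_PARAMS & params.keys()
    if bad:
        raise ValueError(f"unsupported parameters for Chat: {sorted(bad)}")
    return {"model": model, "messages": messages, **params}

payload = build_chat_payload(
    "facebook/opt-125m",
    [{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    top_p=0.9,
    max_tokens=128,
)
print(sorted(payload))
# → ['max_tokens', 'messages', 'model', 'temperature', 'top_p']
```

The resulting dict can be posted to /v1/chat/completions in the same way as the Completions example above.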

