Overview
Fault Diagnosis and Recovery is designed to promptly detect anomalies in computing resources, rapidly identify the type and root cause of issues, and restore computing power through automated or manual intervention to ensure business continuity and stability.
This document provides a detailed introduction to the fault diagnosis mechanism provided by the TI-ONE platform and lists typical abnormal scenarios and their handling solutions. Additionally, it provides guidance on configuring alert rules to enable timely notifications and minimize operational risks.
Fault Diagnosis Mechanism
The platform offers both automated and manual diagnostic methods, providing comprehensive coverage for various ops scenarios.
1. Auto-Diagnosis
CVM Auto-Diagnosis: Cloud Virtual Machine supports automatic detection of CVM instance anomalies and proactively initiates maintenance tasks.
TI-ONE Auto-Detection: TI-ONE platform supports scheduled detection of node availability status in the backend to prevent potential failures.
2. Manual Diagnosis
TI-ONE Sanity Check: TI-ONE platform resource groups provide the feature to manually create Sanity Check tasks, supporting active troubleshooting of node network connectivity and environment consistency.
CVM Auto-Diagnosis
CVM Auto-Diagnosis is a standardized fault handling service provided by Cloud Virtual Machine. When CVM detects a sudden exception on an instance (such as unexpected downtime of the underlying host, or a proactive prediction of hardware or software failure risk so that downtime can be avoided in advance), it automatically creates a corresponding maintenance task and sends a notification. You can view and monitor the instance recovery status in the maintenance task list of the CVM console. For details on the exception types that trigger maintenance tasks, their specific meanings, and handling strategies, see Maintenance Task Types and Handling Suggestions.
Note:
If a CVM instance added to a TI-ONE platform resource group has maintenance tasks issued by CVM, the platform will change the node status to "Pending Maintenance" and guide you to the CVM console to complete authorization. For the processing method, see CVM Maintenance Tasks on Node.
TI-ONE Auto-Diagnosis
TI-ONE auto-diagnosis is a fault diagnostic capability provided by TI-ONE resource groups. It continuously monitors node status and component availability during node addition, management, and release operations. When issues such as GPU disconnections, XID errors, or VPC network unreachability are detected, the system leverages diagnostic tools to pinpoint the exact cause and automatically categorizes the issue into one of the three handling approaches outlined in the table below.
| Processing Method | Description | Example |
| --- | --- | --- |
| TI-ONE closed-loop processing | The "diagnosis-troubleshooting-recovery" end-to-end process is completed by the TI-ONE platform, with the console only showing essential information (including exception causes and error messages). | When adding a CVM node to a resource group, the addition fails because a TKE cluster could not be created. |
| TI-ONE & user co-processing | Upon detecting an anomaly, the TI-ONE platform notifies the user for intervention. After guiding the user through the required actions (such as authorization), the platform proceeds with troubleshooting and resolution. | After CVM issues a maintenance task for a node in the resource group, the user completes authorization, and TI-ONE and CVM then restore the node. |
| User self-service handling | The TI-ONE platform provides the user with the detected anomaly information and guides them through self-service troubleshooting steps. | The CVM computing cost or software subscription fee for the node will expire soon; the user is advised to renew via self-service. |
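Signals like the XID errors mentioned above surface in the kernel log. As a rough illustration of how such a signal can be extracted (this is a hedged sketch assuming the standard NVIDIA driver log format, not the platform's actual diagnostic tooling):

```python
import re

# Minimal sketch: extract NVIDIA Xid error codes from kernel log text, the kind
# of signal the auto-diagnosis described above looks for. Assumes the standard
# driver log format "NVRM: Xid (PCI:0000:3b:00): 79, ...".
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<dev>[^)]+)\): (?P<code>\d+)")

def extract_xid_errors(dmesg_text: str) -> list[tuple[str, int]]:
    """Return (PCI device, Xid code) pairs found in the log text."""
    return [(m.group("dev"), int(m.group("code")))
            for m in XID_PATTERN.finditer(dmesg_text)]

sample = (
    "[12345.6] NVRM: Xid (PCI:0000:3b:00): 79, pid=1000, GPU has fallen off the bus.\n"
    "[12346.7] some unrelated kernel message\n"
)
print(extract_xid_errors(sample))  # [('PCI:0000:3b:00', 79)]
```

In practice the platform maps such codes to handling approaches automatically; this only shows the kind of raw evidence the diagnosis works from.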
TI-ONE Sanity Check
TI-ONE Sanity Check is a proactive fault diagnosis capability provided by TI-ONE resource groups. It supports actively initiating Sanity Check tasks after node addition to verify key node parameters, such as network communication performance and environment consistency, ensuring the nodes meet operation requirements. By detecting node availability in advance, you can avoid various operational issues caused by node faults, for example:
Avoid GPU resource waste: When tasks or services enter a continuous retry loop due to node faults (such as driver anomalies), they cannot execute normally even after completing time-consuming initialization operations like model loading. This not only occupies valuable GPU resources but also requires additional time for fault troubleshooting and task resubmission. Sanity checks can identify and exclude faulty nodes before a task starts, ensuring efficient resource utilization.
Prevent task/service performance degradation: During operation, issues such as slow network communication may lead to low training efficiency or delayed online service responses. Since such performance bottlenecks are hard to predict before a task starts, they are often discovered only after causing substantial impact. Sanity checks can expose potential performance defects in advance, preventing tasks from starting on unhealthy nodes and safeguarding business performance at the source.
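The environment-consistency idea behind these checks can be illustrated with a small sketch. The check items below (driver_version, cuda_version) are hypothetical examples for illustration, not the platform's actual detection items:

```python
# Illustrative sketch of an environment-consistency check across nodes, in the
# spirit of the sanity checks described above: collect the value of each check
# item per node and report items where nodes disagree.

def find_inconsistencies(node_envs: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return, per check item, the set of distinct values when nodes disagree."""
    items: dict[str, set[str]] = {}
    for env in node_envs.values():
        for key, value in env.items():
            items.setdefault(key, set()).add(value)
    return {key: values for key, values in items.items() if len(values) > 1}

envs = {
    "node-1": {"driver_version": "535.104", "cuda_version": "12.2"},
    "node-2": {"driver_version": "550.54",  "cuda_version": "12.2"},
}
print(find_inconsistencies(envs))
```

A node reporting a driver version different from its peers would be flagged before any task is scheduled onto it, which is exactly the "exclude faulty nodes before task initiation" benefit described above.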
The following describes in detail how to create a Sanity Check task within a resource group.
1. Create a Sanity Check task
Enter the Node Management tab in the resource group details page, select the node scope to detect, and click the Sanity Check button in the upper right corner. In the popup, select the detection item to complete the creation.
2. View task records and results
Switch to the "Sanity Check Records" tab to view task execution history and detection result logs.
Typical Scenarios and Handling Solutions
TI-ONE Closed-Loop Processing
The following table lists over ten abnormal scenarios that TI-ONE diagnoses and fixes automatically. When the platform detects such an abnormality, it automatically triggers the repair workflow and coordinates the relevant resources to apply the necessary fixes, with no manual intervention required. Please wait patiently. If the issue remains unresolved, you can submit a ticket to contact us for further support.
| Trigger Stage | Node Status | Exception Cause | Abnormal Scenario |
| --- | --- | --- | --- |
| Add nodes | Failed to deploy | TKE cluster creation failure | Due to TKE version limits, the API call to create a cluster failed. Due to unsynchronized TKE API updates, adding agent nodes to the cluster failed. Due to disabled node registration capability or bugs in TKE, node registration in the TKE cluster failed. |
| | | Unable to add nodes to the TKE cluster | Due to bugs in the registration script or incompatibility with new OS versions, the script execution failed. Due to CVM operating system updates, the existing operating system image is no longer supported, causing the API call to install the operating system to fail. Due to read-only permission on the CVM node disk, the registration script execution failed. |
| | | Initialization failure of the TKE cluster | Due to a GPU/RDMA failure on the CVM node, the TKE cluster system components are in the Pending state. Due to network anomalies on the CVM node, the system service is inaccessible, causing the TKE cluster system components to be in a crash state. Since TKE is not adapted to the GPU model of the CVM node, resources such as qgpu, rdma, and eni are 0. Turbocfs component status exception: since turbocfs is not adapted to the latest OS, the storage component on the CVM node is unavailable. |
| Using nodes | Exception | Node is not in "Running" status | Since the CVM node failed to start or was taken out of service, it cannot be reached via ping or SSH. Since key service components failed to initialize after the CVM node started, resources are unavailable (qgpu/rdma/eni resources are 0). |
| | | NPD detected anomalies in the node | Due to GPU/RDMA/CPU/memory/local disk/OS/K8S component object failures, the CVM node is unavailable. |
| | | Abnormal system components in the TKE cluster | Due to RDMA network card downtime, the RDMA/GPU components of the CVM node are in a crash state. Due to memory leaks in the operating system, the CVM node is unavailable. |
| | | qGPU component status exception | Since qGPU is not adapted to the corresponding GPU card model or the driver version exceeds 550, the qGPU component of the CVM node is unavailable. |
For the above scenarios, the TI-ONE console displays the necessary prompts, including the failure reason and specific exception information.
TI-ONE & User Co-Processing
The following table lists four abnormal scenarios handled jointly by TI-ONE and the user. When such an abnormality is detected, the platform clarifies the cause and guides you to the specified interface to complete the necessary troubleshooting or repair operations.
| Trigger Stage | Node Status | Exception Cause | Abnormal Scenario |
| --- | --- | --- | --- |
| Add nodes | Failed to deploy | Unable to add nodes to the TKE cluster | Due to network restrictions in the user's VPC, the node cannot be connected. Due to residual historical LVM volumes on the CVM node, the registration script execution failed. |
| Using nodes | Under maintenance | There are maintenance tasks issued by CVM on the node | CVM proactively detects node exceptions and initiates repair tasks (e.g., hardware faults in the CPU, hard disk, motherboard, or network interface card, or operation exceptions such as GPU or NIC errors). For details, see CVM Auto-Diagnosis. |
| | Exception | Network connection exception | Due to security group or route configuration errors in the user's VPC, the node cannot be connected. |
Handling Solutions for Each Exception
VPC Network Restrictions Lead to Deployment Failure or Connection Exception
Solution:
2. Create instance: Click the Create button to create a CCN instance. For details, see Create CCN Instance.
3. Associate VPC: On the CCN instance details page, under the Associated Instances tab, click the Add Instance button in the top-left corner of the list, then in the popup select the network instance type, region, and the specific VPC instances to associate. For details, see Associate Network Instances. (Note: the VPC instances to add are the "Platform VPC" and "Node VPC" displayed in the console.)
4. Check the route table: In the Route Table tab, check whether the routing policy of each subnet under the VPCs associated with the CCN has taken effect. If the network segments of the associated network instances conflict, invalid routes will occur.
CVM Repair Task on the Node
Solution:
1. Click the Authorize button in the prompt message to go to the maintenance task list in the CVM console, and find the task corresponding to the CVM instance.
2. Click the Authorize/Reserve button on the right side of the list, select the maintenance method and reserve a maintenance time in the popup, then click Confirm to complete the authorization.
Historical Residual LVM Volumes on the Node
Solution:
1. Click the CVM console button in the prompt message to go to the CVM console, and find the corresponding machine in the instance list.
2. On the instance details page, click the Unmount button in the mounted data disk list to complete the cleanup.
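The route-conflict condition noted in the route-table check above (overlapping network segments producing invalid routes) can be verified programmatically. A minimal sketch using Python's standard ipaddress module, with illustrative CIDR blocks:

```python
import ipaddress
from itertools import combinations

# Minimal sketch: detect overlapping CIDR blocks among VPCs to be associated
# with a CCN instance. Overlapping network segments produce invalid routes,
# as noted in the route-table check above. The example CIDRs are illustrative.

def find_cidr_conflicts(cidrs: list[str]) -> list[tuple[str, str]]:
    """Return pairs of CIDR blocks whose address ranges overlap."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [(str(a), str(b)) for a, b in combinations(nets, 2) if a.overlaps(b)]

print(find_cidr_conflicts(["10.0.0.0/16", "10.0.1.0/24", "192.168.0.0/24"]))
# [('10.0.0.0/16', '10.0.1.0/24')]
```

Running such a check on the "Platform VPC" and "Node VPC" CIDRs before associating them with the CCN avoids discovering the conflict only after routes fail to take effect.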
User Self-Service Processing
The following table lists three common scenarios that users can handle themselves. When the platform detects these anomalies, it provides clear explanations and handling guidelines. You can follow the Note on the specified page to complete the repair independently and quickly restore business operation.
| Trigger Stage | Node Status | Exception Cause | Abnormal Scenario |
| --- | --- | --- | --- |
| Using nodes | Exception | Node is not in "Running" status | Due to proactive restart/shutdown operations by the user in the CVM console, or node expiration, the node is unavailable. |
| | Running | Node will expire soon | The computing cost of the CVM instance will expire soon; users are advised to renew via self-service. The software subscription fee for the TI-ONE resource group node will expire soon; users are advised to renew via self-service. |
Handling Solutions for Each Exception
Node Is Not in "Running" Status
Solution:
1. Click the CVM console button in the prompt message to go to the CVM console, and find the corresponding machine in the instance list.
2. Confirm whether the instance status is "Restarting", "Shut down", or "To be recycled".
2.1 If the machine is in one of these statuses and you confirm it is no longer needed, go to the instance recycle bin and click the Release button.
2.2 If you need to continue using it, click the Renew or Restore button to restore the instance's availability.
Node Will Expire Soon
Handling solution 1: the CVM computing cost is about to expire
1. Click the CVM console button in the prompt message to go to the CVM console, and find the corresponding machine in the instance list.
2. Click the Renew button on the right, select the renewal duration, then click Confirm.
Handling solution 2: the TI-ONE software subscription fee is about to expire
Click the Renew button on the right side of the list, select the renewal duration, and click Confirm.
Alarm Configuration
During resource group maintenance, the platform integrates node status changes into the alarm rules of Tencent Cloud Observability Platform (TCOP). When a node enters a key lifecycle status such as "Exception", "Under Maintenance", or "Isolated", the system proactively pushes alarms to help you promptly track resource changes and maintain O&M efficiency. The alarm trigger statuses currently covered include: Exception, Purchasing, Under Maintenance, Pending Maintenance, Running, Terminated, and Isolated.
The steps to configure alarm rules are as follows:
2. On the Create Alarm Policy page, fill in the policy name and description, then configure the alarm rule with the following parameters:
2.1 Monitoring type: select "Cloud Service Monitoring".
2.2 Policy type: select "Tencent Cloud TI Platform TI-ONE / Resource / Resource Status".
2.3 Alarm object: select "Specified instances" (resource group nodes) or "All objects" as needed.
2.4 Trigger condition: configure manually; in the metric field, select the node status that should trigger an alarm, such as "instance anomaly".
3. Click Next: Configure Alarm Notification, select a notification template, then click Complete.
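For teams that manage alarm policies as code rather than through the console, the parameters above can also be assembled into an API payload. The sketch below is a hedged illustration only: the namespace and metric names are assumptions, not confirmed identifiers; consult the TCOP API documentation for the actual values before calling the monitoring API.

```python
import json

# Hypothetical sketch: assemble the alarm-policy parameters from the console
# steps above as a JSON payload. "MonitorType": "MT_QCE" denotes cloud service
# monitoring; the Namespace and MetricName values here are illustrative
# assumptions, not confirmed platform identifiers.

def build_alarm_policy(name: str, statuses: list[str]) -> str:
    """Return a JSON payload with one trigger condition per node status."""
    policy = {
        "PolicyName": name,
        "MonitorType": "MT_QCE",          # cloud service monitoring
        "Namespace": "ti_one/resource",   # illustrative value only
        "Conditions": [
            {"MetricName": "resource_status", "Filter": {"Status": s}}
            for s in statuses
        ],
    }
    return json.dumps(policy, indent=2)

print(build_alarm_policy("ti-one-node-alarm", ["exception", "isolated"]))
```

Keeping one condition per watched status mirrors the console flow, where each node status selected in the trigger condition becomes a separate alarm rule.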