Product Feature Issues
1. What Is TSA-CFG?
Tencent Cloud Smart Advisor (TSA) is a cloud governance platform that provides a visualized cloud architecture IDE and multiple ITOM applications. Guided by the product philosophy of "One Platform with Multiple Applications", TSA leverages extensive expertise templates of Tencent Cloud to help you build an excellent architecture and achieve convenient, flexible one-stop cloud governance. TSA-Chaotic Fault Generator (TSA-CFG), a product within the TSA family, offers efficient, convenient, secure, and reliable fault experiment services based on a visualized cloud architecture. TSA-CFG helps you promptly identify potential disaster recovery risks in your business and validate the effectiveness of high-availability contingency plans, which ultimately improves the availability and resilience of your systems.
2. What Is a Visualized Experiment on Cloud Architecture?
The Chaotic Fault Generator (CFG) plugin, built on the TSA cloud architecture view, performs visualized chaos experiments directly on a cloud architecture.
When adding experiment objects, you can click graphic elements in the cloud architecture and select corresponding instance resources, which makes the blast radius clearer and more controllable. During an experiment, information such as the Virtual Private Cloud (VPC), availability zone (AZ), type, quantity, and upstream and downstream dependencies of each instance is visible. This allows you to better observe the fault propagation scope and monitor the action execution status and metrics for each instance in real time.
3. What Object Types Can TSA-CFG Perform Fault Injections On?
TSA-CFG can perform fault injection on Tencent Cloud object types such as Cloud Virtual Machine (CVM), Tencent Kubernetes Engine (TKE), TencentDB for MySQL, Tencent Cloud Distributed Cache (Redis ®* OSS-Compatible), NAT Gateway, Cloud Load Balancer (CLB), Direct Connect (DC), and Tencent Real-Time Communication (TRTC) to check system availability.
4. What Is an Industry Template Library?
To help users quickly reuse proven experiment solutions, TSA-CFG provides templates for multiple industries such as e-commerce, gaming, and multimedia. These templates cover various typical application scenarios, including cross-AZ disaster recovery experiments, hybrid cloud disaster recovery experiments, service stress experiments, and network fault experiments. To use a template:
1. Log in to the TSA console, choose Architecture Governance, and select Folder to go to the business architecture diagram page.
2. Select Governance Mode at the top of the page and click CFG.
3. Click Template Library in the menu below and click Open Template to browse information about templates in the industry template library.
4. Click Quickly Create Experiment. The system automatically imports the action orchestration solution and other template content into the creation form, so you only need to select instance resources to create the experiment. This improves experiment efficiency.
Troubleshooting Experiment Action Execution Failures
1. How Do I Handle an Instance Lock Exception?
Action execution failed. The returned message indicates that another action is already injecting a fault into the instance.
If faults of the same type are injected into the same instance at the same time, the state and performance of the instance cannot be observed accurately. To prevent this, the platform implements an instance lock mechanism, which ensures that only a single fault action can be executed on an instance at any given time while other actions are blocked. However, certain actions do not interfere with each other and therefore do not need to compete for the same lock, so the platform distinguishes lock types based on the action type. For example, a CPU-related fault action and a memory-related fault action are allowed to execute simultaneously, but two actions of the same type cannot be executed at the same time.
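The platform's lock is internal, but the behavior can be illustrated with a minimal sketch using flock: the lock file name combines the instance ID and the lock type, so a CPU action and a memory action on the same instance use different locks and can run in parallel, while two same-type actions contend for one lock. The instance ID and file path below are hypothetical.

```shell
# Illustration only; the platform's actual lock mechanism is internal.
instance="ins-0example"   # hypothetical instance ID
lock_type="cpu"
lockfile="/tmp/cfg-${instance}-${lock_type}.lock"

# One lock file per (instance, lock type); a memory action would use
# /tmp/cfg-ins-0example-mem.lock and therefore not block this one.
exec 9>"$lockfile"
if flock -n 9; then
  status="acquired"       # no other same-type action holds this instance
else
  status="blocked"        # a same-type action is already running
fi
echo "lock ${status}: ${instance}/${lock_type}"
```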
2. What Are the Common Causes for Execution Failures of CVM-related Fault Actions?
A stress testing action (such as an action to test high CPU utilization or high disk partition utilization) failed to execute. The returned message indicates that stress-ng failed to be installed.
The operating system of the server does not support installing stress-ng. Change the operating system according to the action requirements.
Network fault action execution failed. The message "Error: Exclusivity flag on, cannot modify." is returned.
The TC rule issued for the fault conflicts with existing rules and cannot overwrite them. Check whether any previous experiment actions have not been promptly restored, or modify the configuration parameters to forcibly overwrite the rule.
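A hedged sketch of the manual check: inspect the queueing disciplines left on the target NIC and, if a stale rule from an unrecovered experiment remains, remove the root qdisc. The device name eth0 is an assumption; substitute your actual NIC.

```shell
# Sketch: inspect and clear leftover tc rules from an unrecovered experiment.
clear_leftover_tc() {
  dev="${1:-eth0}"                                  # NIC name is an assumption
  tc qdisc show dev "$dev"                          # inspect current rules first
  tc qdisc del dev "$dev" root 2>/dev/null || true  # then drop the stale root qdisc
}
```

Clearing the root qdisc removes all shaping on the device, so only do this when you have confirmed the rules belong to an unfinished experiment.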
IO Hang action execution failed. The returned message indicates that the operating system is not obtained.
The user's operating system does not support the execution of this action. The platform currently supports the following operating system versions: CentOS 7.2 and later, Debian 8.2 and later, Ubuntu 16.04 and later, and TencentOS.
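You can check the host's distribution and version against the supported list from /etc/os-release, which is present on all of the systems above:

```shell
# Read the distribution name and version from the standard os-release file.
if [ -r /etc/os-release ]; then
  . /etc/os-release
  os_info="${NAME} ${VERSION_ID}"
else
  os_info="unknown"
fi
echo "OS: ${os_info}"
```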
3. What Are the Common Causes for Execution Failures of Redis ®* OSS-Compatible-related Fault Actions?
Primary-secondary switch action execution failed. The returned message indicates that a primary-secondary switch cannot be executed because the instance does not have cross-AZ replicas.
The instance has been upgraded to support cross-AZ deployment, but no cross-AZ nodes are present, so Tencent Cloud Distributed Cache (Redis ®* OSS-Compatible) cannot perform a primary-secondary switch. Go to the Redis ®* OSS-Compatible instance details page and add replicas in other AZs before simulating the primary-secondary switch.
4. What Are the Common Causes for Execution Failures of Container-related Fault Actions?
Action execution failed. The fault action returned a YAML file containing the message "Cannot connect to the Docker daemon at unix:///var/run/docker.sock.".
The node where the Pod or container for this action resides restarted dockerd, causing the docker.sock mounted by the agent tool to become invalid.
Solution:
Delete the chaosblade-tool Pod from the node with the specified IP address and recreate the Pod.
Log in to the TSA console and choose Chaotic Fault Generator > Agent Management in the left sidebar to go to the agent management page. On this page, uninstall and then reinstall the agent in the corresponding cluster.
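The first recovery path can be sketched with kubectl as follows. The Pod name pattern and the tchaos namespace are assumptions based on the tool names in this document; verify them in your cluster first.

```shell
# Hypothetical sketch: find the chaosblade-tool Pod on the node with the
# given IP and delete it so it gets recreated.
recreate_tool_pod() {
  node_ip="$1"
  pod=$(kubectl -n tchaos get pods -o wide --no-headers 2>/dev/null \
        | awk -v ip="$node_ip" '/chaosblade-tool/ && $0 ~ ip {print $1; exit}')
  [ -n "$pod" ] && kubectl -n tchaos delete pod "$pod"
}
```

Example usage: `recreate_tool_pod 10.0.0.12` (the IP is a placeholder).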
High disk load action executed successfully. The recovery action returned a YAML file containing the message "error: 'pods/exec': k8s exec failed, err: command terminated with exit code 137".
When the disk usage of the node reaches a certain threshold, container deletion and recreation are triggered (exit code 137 indicates that the container was killed). The impact caused by the fault action is automatically resolved after recreation, so you do not need to execute the recovery action repeatedly.
High node memory load action execution failed. The message "object is being deleted: chaosblades.chaosblade.io "mem-load.5262.cls-xxx" already exists." is returned.
When the memory usage of a node reaches a certain threshold, the kubelet process running on the node is blocked. The recovery action only issues a command to delete the experiment and returns a success status. However, the actual execution depends on the kubelet process, which is blocked due to insufficient memory resources. As a result, the experiment deletion task remains unexecuted. In this case, if the user attempts to perform fault injection again, the "mem-load.5262.cls-xxx already exists" error will be reported. Both node monitoring and TKE node health checks will report exceptions. The system will only recover after the node completes the execution of the previously blocked recovery action, which typically takes about 30 minutes.
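While waiting for the blocked recovery to complete, you can watch the affected node's conditions (MemoryPressure in particular) with a small helper like the one below; the node name argument is a placeholder.

```shell
# Hedged sketch: print each node condition (e.g. MemoryPressure=True) so you
# can see when the node returns to normal.
check_node_conditions() {
  node="$1"
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```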
Agent Installation and Management Issues
1. Agent Resource Usage
CVM Agent
The fault injection agent for CVM is an executable program pre-installed on CVM hosts (located in the /data/cfg/chaos-executor directory). When a specific fault is selected, installation of the fault injection agent is required. During fault injection, this program is executed. The agent consumes less than 1 MB of disk resources. During network fault injection, CPU and memory utilization does not exceed 1% of the system resources. In scenarios involving memory and CPU stress, the resource usage is roughly equivalent to the target values configured for stress testing.
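To verify the agent's on-disk footprint on a CVM host, you can check the directory named above directly:

```shell
# Check the size of the pre-installed agent directory (path from this document).
agent_dir="/data/cfg/chaos-executor"
if [ -d "$agent_dir" ]; then
  usage=$(du -sh "$agent_dir" | awk '{print $1}')
else
  usage="not installed"
fi
echo "agent disk usage: ${usage}"
```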
Container Agent
After the fault injection agent for containers is installed, the following resources will be created in the cluster:
1. Namespace: tchaos.
2. ClusterRole: chaosmonkey. The rules for this role are as follows, indicating that the agent operator will be granted corresponding permissions on Kubernetes APIs.
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  - nodes
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - update
  - delete
  - create
  - patch
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
3. ServiceAccount: chaosmonkey. Note that it is created in the tchaos namespace.
4. ClusterRoleBinding: Bind the ClusterRole and the ServiceAccount.
5. Operator: A deployment named chaos-operator is started in the tchaos namespace, with one replica. The Pod uses the ServiceAccount that is named chaosmonkey and created in the previous step. The maximum resource usage is 1 CPU core and 2 GB of memory (Limit). After the agent is installed, the chaos-operator will remain running, consuming cluster resources. To control costs, uninstall the agent immediately after fault injection.
6. During fault injection, the operator temporarily creates a helperpod on the target node to inject faults. The helperpod is non-invasive to the target Pod. In other words, the helperpod is not a sidecar. Additionally, to achieve the specified stress scenarios, no resource usage limits are imposed on the helperpod. During fault recovery, the temporary helperpod is automatically deleted.
7. The fault injection logs and experiment records of the helperpod are saved in the /var/log/chaos directory of the node, typically less than 10 KB in size.
Note:
The fault injection logs of the helperpod are not deleted when the agent is uninstalled. If necessary, manually delete these logs.
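The manual cleanup can be done with a short helper run on the node; the log path comes from this document.

```shell
# Check the size of the leftover experiment logs, then remove them.
clean_chaos_logs() {
  log_dir="/var/log/chaos"
  [ -d "$log_dir" ] && du -sh "$log_dir"   # show the size before deleting
  rm -rf "$log_dir"                        # remove the logs if no longer needed
}
```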
2. Abnormal Agent Status Detected
Issue Example
An abnormal agent status is detected. For solutions, see the corresponding FAQs document.
Solution
Check the deployment workload chaos-operator in the tchaos namespace to see whether the Pod has started. If it has not started, check the event logs for exception information. The following are some event types that may prevent the Pod from starting, along with the corresponding solutions:
| Event Type | Solution |
| --- | --- |
| OutOfMemory or OutOfCPU | Check whether there are sufficient resources in the cluster to run the agent. You may need to increase cluster resources or adjust other workloads to free up resources. |
| InsufficientStorage | Check whether there is sufficient storage space in the cluster to run the agent. You may need to increase storage capacity or clean up unnecessary data to free up storage space. |
| FailedScheduling | This event may occur because no node in the cluster can meet the Pod's scheduling requirements. Check the Pod's scheduling constraints, as well as the status and tags of the nodes in the cluster. |
| CrashLoopBackOff or Error | This event may occur due to a program error or a configuration issue in the agent. View the Pod's logs for more detailed information and troubleshoot the issue based on the error messages found in the logs. |
| ImagePullBackOff | This event may occur due to the inability to pull images from the image registry. Check whether your image registry address and credentials are correct, and whether the network connection is functioning properly. |
| NotTriggerScaleUp | This event may occur because the auto-scaling policy of the cluster is not triggered. Check the auto-scaling policy configuration of your cluster to ensure that the policy can be correctly triggered for scale-out when needed. |
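The checks described above can be sketched with kubectl (the deployment and namespace names are taken from this document):

```shell
# Inspect the agent workload, its Pods, and recent events in one pass.
check_agent() {
  kubectl -n tchaos get deployment chaos-operator          # is the operator up?
  kubectl -n tchaos get pods                               # Pod status
  kubectl -n tchaos get events --sort-by=.lastTimestamp    # recent events
  kubectl -n tchaos describe pods | sed -n '/Events:/,$p'  # per-Pod event details
}
```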
3. An Agent That Cannot Be Automatically Uninstalled Has Been Detected. You Need to Manually Uninstall It First.
Issue Example
An agent that cannot be automatically uninstalled has been detected. You need to manually uninstall it first. For details, see the corresponding FAQs document.
Solution
In this case, you need to manually delete the following Kubernetes resources:
clusterrole: chaosmonkey
clusterrolebinding: chaosmonkey
serviceaccount: chaosmonkey (in the tchaos namespace)
namespace: tchaos
deployment: cloudchaos-operator (in the tchaos namespace)
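The deletions above can be performed with kubectl, for example (resource names are taken from this document; deleting the tchaos namespace last also removes anything left inside it):

```shell
# Manually remove the agent's Kubernetes resources.
manual_uninstall() {
  kubectl delete clusterrolebinding chaosmonkey
  kubectl delete clusterrole chaosmonkey
  kubectl -n tchaos delete serviceaccount chaosmonkey
  kubectl -n tchaos delete deployment cloudchaos-operator
  kubectl delete namespace tchaos
}
```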
Note:
After manually uninstalling the agent, you do not need to manually install a new one. Instead, go to the Agent Management page to reinstall it from the console.
After uninstalling the agent, ensure that your cluster is in a normal status so that the new agent can be installed smoothly. If you encounter any issues during the installation process, view the relevant logs for more detailed information and troubleshoot the issues based on the error messages found in the logs.