Huawei introduced its MindOps Intelligent Computing O&M Solution in Barcelona, targeting higher availability and operational stability for large-scale AI computing clusters. The company positioned MindOps as an integrated operations and maintenance (O&M) platform spanning compute, storage, and networking infrastructure in AI data centers. Huawei said the system aims to raise cluster availability from an industry average of 90% to 99.9%, addressing the operational demands of AI training and inference workloads moving into production environments.
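To put the availability figures in concrete terms, the short sketch below converts an availability percentage into expected downtime per year. It assumes availability means the fraction of wall-clock time a cluster is usable, which the announcement does not define; the function name is illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of unavailability per year for an availability fraction.

    Assumption: availability is the share of wall-clock time the cluster
    is usable. The source does not say how Huawei measures it.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


baseline = annual_downtime_hours(0.90)   # cited industry average
target = annual_downtime_hours(0.999)    # figure Huawei is aiming for
print(f"90%   -> {baseline:.2f} hours of downtime per year")
print(f"99.9% -> {target:.2f} hours of downtime per year")
```

Under that assumption, 90% availability corresponds to roughly 876 hours of downtime a year, while 99.9% corresponds to under 9 hours, a hundredfold reduction.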
MindOps is built on a seven-layer digital twin architecture for AI data centers (AIDC), providing observability from facility-level infrastructure up through AI models and applications. The layers are L1 data center infrastructure, L2 compute cluster infrastructure, RoCE networking, collective communications, AI platforms, models, and applications. Huawei integrated its EDNS 2.0 professional large model into the platform to enable minute-level fault demarcation, predictive risk perception, and automated switchover. The company said the system delivers “second-level” visibility into operational status, enabling proactive remediation of issues such as slow accelerators, network congestion, and model performance degradation.
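As a rough illustration of what fault demarcation across a layered stack can look like, the sketch below models the seven layers as an ordered enumeration and reports the lowest layer that is unhealthy. This is not Huawei's method: the numbering of layers 3 through 7 is inferred from the order in which they are listed, and the rule that faults propagate upward from the lowest failing layer is a simplifying assumption. All names here are hypothetical.

```python
from enum import IntEnum
from typing import Mapping, Optional


class Layer(IntEnum):
    """Seven layers in the order the article lists them (numbering 3-7 assumed)."""
    DATACENTER_INFRA = 1
    COMPUTE_CLUSTER = 2
    ROCE_NETWORK = 3
    COLLECTIVE_COMM = 4
    AI_PLATFORM = 5
    MODEL = 6
    APPLICATION = 7


def demarcate(healthy: Mapping[Layer, bool]) -> Optional[Layer]:
    """Return the lowest layer reporting unhealthy, or None if all are healthy.

    Assumption: a fault in a lower layer surfaces as symptoms in the layers
    above it, so the lowest failing layer is the first place to investigate.
    Layers missing from the mapping are treated as healthy.
    """
    failing = [layer for layer in Layer if not healthy.get(layer, True)]
    return min(failing) if failing else None


# Hypothetical snapshot: network congestion also degrades model throughput.
status = {Layer.ROCE_NETWORK: False, Layer.MODEL: False}
suspect = demarcate(status)
print(suspect.name if suspect else "all layers healthy")
```

In this example the model layer looks degraded, but the sketch points at the RoCE network as the layer to examine first.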
The solution also introduces equipment health self-check capabilities. Using risk perception algorithms, MindOps periodically assesses critical components, including liquid cooling systems, coolant distribution units (CDUs), and optical modules. Huawei said the platform generates pre-failure alerts and guides O&M teams through mitigation steps before services are affected. By combining digital twin modeling with AI-driven diagnostics and automated failover, the company said, MindOps redefines intelligent computing O&M to ensure the long-term stability and sustained performance of AI computing platforms.
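One simple way to produce a pre-failure alert from periodic readings is to fit a trend line and estimate when it will cross a limit. The sketch below does this with an ordinary least-squares slope over evenly spaced samples. It is an illustrative stand-in, not the risk perception algorithm Huawei describes; the function name, the sample temperatures, and the threshold are all hypothetical.

```python
from typing import Optional, Sequence


def hours_until_threshold(
    readings: Sequence[float], interval_hours: float, threshold: float
) -> Optional[float]:
    """Estimate hours until evenly spaced readings reach a threshold.

    Returns 0.0 if the latest reading is already at or above the threshold,
    and None if there are fewer than two readings or the trend is not rising.
    Assumption: the metric drifts roughly linearly over the window.
    """
    n = len(readings)
    if n < 2:
        return None
    if readings[-1] >= threshold:
        return 0.0
    # Least-squares slope, in metric units per sample.
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den
    if slope <= 0:
        return None
    return (threshold - readings[-1]) / slope * interval_hours


# Hypothetical hourly optical-module temperatures (degrees C), limit of 50.
eta = hours_until_threshold([40.0, 42.0, 44.0, 46.0], 1.0, 50.0)
if eta is not None:
    print(f"Pre-failure alert: threshold expected in about {eta:.1f} hours")
```

A production system would likely use more robust models and component-specific limits; the point is only that an alert can be raised while the reading is still within its limit.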






