Cloud and AI System Intern
Intel
PRC, Shanghai
Posted 24 April 2026
Job Description
In this position, you will work with a system reliability research team focusing on RAS (Reliability, Availability, Serviceability) and silent data error (SDE) characterization and mitigation on AI and general-purpose compute platforms, including heterogeneous systems (CPU + GPU/accelerators) and large-scale server clusters. You will help design and run experiments under representative AI training/inference and cloud workloads, analyze fleet-scale logs/telemetry, and prototype detection/diagnosis methods to improve end-to-end data integrity and platform robustness across the HW/FW/OS/runtime stack.
Your responsibilities will include but not be limited to:
- Collect, clean, and analyze platform telemetry / error logs from CPU servers and accelerator-enabled nodes (e.g., memory/DDR/HBM, storage, interconnect, PCIe/CXL, fabrics) to identify error signatures and failure patterns.
- Design and execute fault injection, stress tests, or workload-driven experiments to reproduce silent data corruption scenarios for AI training/inference and general compute workloads, and validate hypotheses.
- Research and analyze in-field scan and lockstep mode features (coverage, limitations, trigger conditions, and impact on AI/CPU workloads), and help evaluate how they can be leveraged to improve silent error detection and data integrity in production.
- Research and analyze Silicon Lifecycle Management (SLM) solutions, and integrate them with platform telemetry to enable in-field health monitoring, degradation/trend analysis, and proactive reliability improvements for AI/CPU platforms.
- Develop scripts/tools (Python preferred) to automate data processing, experiment orchestration, and report generation; build dashboards or repeatable pipelines when needed.
- Study and evaluate mitigation techniques for AI + CPU platforms (e.g., ECC/CRC/EDAC, scrubbing policies, retry/recovery, checkpoint/restart, end-to-end checks at data/communication boundaries) and quantify effectiveness vs. performance/cost impact.
- Collaborate with cross-functional teams (HW, FW, OS, driver/runtime, datacenter operations) to trace error propagation paths and drive actionable improvements; document findings and present progress regularly.