Job Description
About this role
Role Overview:
We’re seeking a Site Reliability Engineering (SRE) Lead to design, build, and maintain resilient, high-scale systems supporting BlackRock’s Private Markets platform. In this hands-on leadership role, you’ll apply deep engineering expertise to solve complex challenges, guide a global team, shape technical direction, and communicate effectively with senior stakeholders, ensuring the reliability of mission-critical systems that power private market investment workflows and decision-making. You will drive the adoption of AI-driven solutions to accelerate incident detection and triage, reduce toil, improve forecasting and capacity planning, and strengthen end-to-end observability and resilience.
Role Responsibilities
Take ownership of project priorities, deadlines and deliverables using Agile methodologies, with clear outcomes around reliability automation and AI-enabled operations
Understand and refine business and functional requirements, translating them into SLOs/SLIs and AI-assisted observability and support capabilities
Take a hands-on approach to getting work done: this role requires a “roll your sleeves up” mentality, including building and operationalizing reliability tooling and automation that measurably reduces toil and improves stability
Be a leader with vision and a partner in brainstorming solutions for team productivity and efficiency to improve engineering effectiveness
Drive priority setting of the engineering teams, balancing foundational reliability work with delivery of new product features
Improve engineering culture by encouraging continuous focus on reliability across the entire application lifecycle, and by adopting AI-enabled SRE practices (e.g., intelligent alerting, automated diagnosis, and self-healing where appropriate)
Participate proactively in architectural and design decisions, including AI-ready telemetry, data quality, and model integration patterns for operational analytics
Design and implement end-to-end monitoring solutions for application and infrastructure components, leveraging modern observability platforms plus AI/ML techniques for anomaly detection, correlation, and alert noise reduction
Drive the engineering of capacity management and demand forecasting solutions, including predictive analytics/ML approaches where they add measurable value
Act as a culture carrier and leader, passing on SRE knowledge and best practices to the engineering team
Drive detailed root cause investigations for production incidents with rigorous focus on issue avoidance, using AI-assisted correlation/analysis to accelerate time-to-insight
Create/coordinate retros for significant incidents, ensuring learnings are captured in automated/AI-assisted runbooks and embedded into prevention mechanisms
Perform additional core engineering functions, such as adding custom telemetry metrics/logs/traces to the code base of in-scope applications to enable AI/ML-driven operational insights
Anticipate new opportunities to continuously evolve the resiliency profile of scoped applications and infrastructure
Skills/Qualifications
Must Have
B.S. / M.S. degree in Computer Science, Engineering or a related discipline with 10 years of experience
Experience leading high-performing engineering/SRE teams, with a track record of driving continuous improvement through automation and AI-enabled operations
Demonstrated ability to represent engineering/SRE priorities, status, and risk to senior leadership stakeholders with clear, executive-ready communication
Hands-on experience building or operating AI-assisted capabilities (AIOps, ML-based anomaly detection, or GenAI workflows) in an engineering/production environment
A passion for providing engineering support for highly available, performant full-stack applications with a “Student of Technology” attitude
Experience with relational databases and NoSQL databases (e.g., Redis, Apache Cassandra)
Our benefits
To help you st ... (truncated, view full listing at source)