
Workplace Platforms - Site Reliability Engineer (SRE) Lead - Dallas

The Goldman Sachs Group
United States, Texas, Dallas
May 13, 2026

Team Overview

The Workplace Engineering organization is responsible for the reliability, resilience, and operational integrity of the firm's endpoint compute platforms and services, including:



  • Corporate-owned physical devices
  • Virtual and cloud-hosted desktops
  • Core endpoint services such as device lifecycle management, access and identity integration, profile and session services, and application delivery frameworks


The Endpoint Compute SRE function applies Site Reliability Engineering (SRE) principles to ensure these platforms and services are highly available, observable, scalable, and recoverable, while meeting operational and regulatory expectations.

Role Summary

We are seeking an Endpoint Compute SRE Lead to own reliability engineering and operational excellence across endpoint compute platforms and their foundational services.

This role is focused on systems and services, not applications, and covers the reliability of:



  • Endpoint compute platforms (physical, virtual, cloud desktops)
  • Device and desktop lifecycle services
  • Access and sign-in dependency platforms
  • Profile, policy, and session services
  • Application delivery and execution frameworks (packaging, deployment, and availability, not app functionality)


The successful candidate will define service-level objectives, observability strategies, failure models, and operational practices that ensure a predictable and resilient end-user compute experience at enterprise scale.

Job Responsibilities

Reliability Engineering Across Endpoint Services



  • Own end-to-end reliability of endpoint compute platforms and supporting services
  • Define service boundaries, dependencies, and critical paths from user sign-in through productive desktop use
  • Model failure modes and blast radius across lifecycle, access, and delivery services
  • Drive designs that support graceful degradation and fast recovery


Observability & Telemetry



  • Establish observability standards across endpoint compute services, including:

    • Enrollment and provisioning success rates
    • Access and session establishment health
    • Policy and profile delivery latency/failures
    • Application delivery availability

  • Ensure telemetry enables:

    • Fast incident detection
    • Root cause analysis
    • Proactive trend identification



SLOs, SLIs & Error Budgets



  • Define SLOs and SLIs for key endpoint services (e.g., sign-in success, provisioning time, policy convergence)
  • Implement error budget frameworks to guide change, security control rollout, and platform evolution
  • Use reliability signals to influence platform design and operational priorities


Incident, Problem & Resilience Management



  • Lead reliability aspects of incident response involving endpoint compute or services
  • Drive post-incident reviews focused on systemic corrections
  • Identify recurring failure patterns in:

    • Lifecycle flows
    • Access paths
    • Policy or profile delivery

  • Sponsor and track permanent fixes, not workarounds


Operational Excellence & Automation



  • Define and maintain runbooks, playbooks, and escalation models for endpoint services
  • Drive automation to reduce:

    • Manual remediation
    • Repeat incidents
    • Operational toil

  • Influence engineering designs to improve operability and debuggability


Risk & Governance Alignment



  • Partner with Technology Risk and Security teams to:

    • Demonstrate reliability and recoverability controls
    • Support operational risk and resilience assessments
    • Provide audit-ready evidence for availability and incident management

  • Ensure reliability metrics support control effectiveness narratives


Leadership & Collaboration



  • Act as the reliability authority for endpoint compute and services
  • Partner closely with:

    • Endpoint platform engineers
    • Device management teams
    • Security engineering and identity teams

  • Mentor engineers in applying SRE principles to workplace platforms
  • Communicate reliability posture clearly to leadership


Basic Qualifications



  • 8+ years in SRE, platform operations, reliability engineering, or workplace infrastructure roles
  • Strong experience operating endpoint compute platforms and core supporting services at enterprise scale
  • Proven ability to define and implement:

    • Observability frameworks
    • SLOs / SLIs
    • Incident and problem management models

  • Strong systems thinking across lifecycle, access, and service dependencies
  • Excellent documentation and communication skills


Preferred Qualifications



  • Experience applying SRE concepts to end-user computing or digital workplace platforms
  • Deep understanding of:

    • Device lifecycle and provisioning services
    • Identity and access dependencies (availability-focused)
    • Profile, policy, and session orchestration

  • Experience in regulated or high-assurance environments
  • Strong ability to influence architecture using data-driven reliability insights


What Success Looks Like



  • Endpoint compute and services have clear reliability targets
  • Lifecycle, access, and delivery failures are predictable, observable, and quickly remediated
  • Incidents are less frequent, shorter, and less impactful
  • Platforms are designed with operability and resilience built in
  • Leadership has confidence in desktop stability as a service
