Sr Site Reliability Engineer (Remote) at AT&T in Chicago
The brightest minds, the boldest possibilities.

Sr Site Reliability Engineer (Remote)

Chicago, Illinois


The Resiliency Lead Architect will be responsible for partnering with the various Consumer Technology Platform (CTP), Chief Technology Information Office (CTIO), Operations/Infrastructure, and network teams to implement a comprehensive resiliency engineering framework. The architect will be responsible for planning, designing, and rolling out proactive resiliency practices that protect customer journeys from disruption and avoid re-engineering costs through the early detection of existing and emerging resiliency threats. The successful candidate will be a strong technologist who is flexible, resilient, and an innovative thinker, as well as a natural collaborator with solution architects, software engineers, developers, and senior management from across the organization. The Resiliency Architect is expected to lead through influence and to communicate effectively through clarity of thought and a demonstrated understanding of business and technical requirements. In addition, the candidate must possess strong technical leadership skills and demonstrated success in working with teams, particularly in a matrix organization.

Key Responsibilities:

  • Design and roll out a robust impact assessment framework that validates the impact of changes on the performance of individual applications as well as the consumer technology ecosystem
  • Design, develop and implement chaos engineering practices for the consumer technology ecosystem
  • Work with performance architects to design performance tests based on customer journeys that will be used to validate performance and resiliency of the consumer technology ecosystem
  • Collaborate with operations and application engineering teams to design and execute production game day scenarios that will help enhance emergency response processes
  • Provide key SME leadership within Consumer Quality Engineering (CQE) team on resiliency programs and initiatives
  • Work closely with LOB Security architects and GTI infrastructure technologists to develop remediation solutions, where appropriate
  • Ensure all implemented resiliency solutions have validation plans in place including continuous improvement plans
  • Define and implement post-mortem / root-cause analysis processes, and develop improved testing scenarios based on that analysis
  • Develop requirements to enhance observability of performance visuals, implement telemetry controls, and consult on self-healing capabilities for identified/prioritized failure scenarios
  • Design self-healing and resiliency patterns

Qualifications:

  • Experience with a development technology stack and tools such as Docker, Python, Django, Celery, and Postgres is a must
  • 10+ years of strong hands-on experience and technical depth in one or more technology areas, including software engineering, solution architecture, production operations, distributed technologies, performance engineering, resiliency/chaos engineering, or cloud-based ecosystems.
  • Experience with microservice architecture and containerization technologies like Docker and Kubernetes.
  • Working knowledge of infrastructure components (e.g. routers, load balancers, cloud products, container systems, compute, storage, and networks).
  • Knowledge of application architecture concepts, including topology, protocols, components, and principles, would be an advantage
  • Some programming experience in one or more languages (scripting/functional/imperative: C/C++, Java, Python, Scala, R, SQL, etc.) would be an advantage
  • Proven leader with a successful track record of architecting and rolling out technology transformation initiatives
  • Strength in both business and technical requirements analysis
  • Strong written and verbal communication skills
  • Ability to think strategically about how to create firm-wide solutions to business requirements and to communicate effectively with both business and technical audiences
  • Ability to orchestrate and drive complex strategies and solutions
  • Proven ability to build strong, cohesive partnerships with the business, operations, technology & other key stakeholders, including external vendor partners, and work effectively in a matrix organization.
  • Superior analytical and problem-solving skills
  • Working knowledge of the following technologies: Kubernetes, containers, CI/CD, Jenkins, and chaos testing
  • Fault domain analysis experience for both core infrastructure services and modern micro-segmented application designs
  • Subject matter expert in business/service continuity, availability, disaster recovery and/or similar topics

Job ID 2105142-1 Date posted 01/21/2021

Associate Director Technology Development


Opportunity to work on cutting edge technologies.
Support for women in technical leadership roles.
Pride in diversity & inclusion with 12 Employee Resource Groups with 40k+ members.
Great benefits, including 4+ weeks of vacation, a 6% 401(k) salary match, paid maternity/paternity leave, and financial support for adoption.
Flexibility to work from home or office in newly renovated collaboration zones.
Lots of opportunity to move around the company & work on new products.


Process heavy with lots of administrative overhead.

Current Employee - Associate Director Technology Development


This is the life – the #LifeAtATT, that is. We’re creating what’s next and having a blast doing it. You’re looking for proof? Well, see for yourself.
