Site Reliability Engineer (SRE) - Big Data at AT&T Careers

We’ve got the data,
you bring the insight.

Big Data Jobs

As a member of our Big Data team, you’ll work with awesome people in a start-up environment. With one of the biggest data sets in the world, you’ll use data-driven analytics to tackle our business challenges and drive the innovation that changes the lives of our customers. In fact, your work will impact major decisions that go all the way to the top. This is your chance to turn the massive scale of possibilities at AT&T into an equally big opportunity.

Site Reliability Engineer (SRE) - Big Data

New York, New York



As a Site Reliability Engineer (SRE) on the Big Data Operations (BDO) Team, you will be responsible for building, operating and supporting our heterogeneous Data Systems Platform in the Technical Operations group. The Data Systems Platform consists of large Hadoop, HBase and Kafka installations, several messaging platforms, and real-time data platforms. The platform currently ingests 200TB of new data and performs 20,000 ETL jobs every day across 5 Hadoop, 4 HBase and 6 Vertica clusters.

About the Team:

The Technical Operations (TechOps) Team is distributed across the globe and handles a wide variety of responsibilities, from providing tech support to architecting long-range build-out and day-to-day operations at our six global data centers. We have well over 7,000 servers, which process over 1 million Ad Serving Requests per second (billions per day). We are in search of troubleshooters and those who love to tinker and innovate with technology.

About the Job:

• Monitor, maintain and provision components of the Data Systems Platform

• Perform software upgrades on the components of the Data Systems Platform

• Work with the Data Engineering team to help design and implement the next iteration of scaling, and evaluate open source and commercial software and hardware solutions

• Work closely with the systems performance, systems operations, and network engineering teams as needed to ensure high performance and availability

• Develop and/or implement tools to automate aspects of supporting, maintaining and building the Data Systems Platform, including upgrades where appropriate

• Participate in prototyping and proof-of-concept system development and benchmarking

• Support, maintain and build storage restructuring

• Participate in on-call rotation responding to alerts and systems issues

• Manage user access and resource allocations for the Data Systems Platform


Required skills and experience:

• 5+ years of relevant experience in implementing, troubleshooting, and supporting the Unix/Linux operating system with concrete knowledge of system administration/internals

• 5+ years of relevant experience in scripting/writing/modifying code for monitoring/deployment/automation in one of the following (or comparable): Python, Shell, Go, Perl, Java, C

• 3+ years of relevant experience with all of the following technologies: Hadoop-HDFS, YARN-MapReduce, HBase, Kafka

• 3+ years of relevant experience with Puppet, Chef, Ansible or an equivalent configuration management tool

• 2+ years of relevant experience with TCP/IP networking (DNS, DHCP, HTTP, etc.)

Beneficial skills and experience (if you don’t have all of them, you can learn them at Xandr):

• Experience with JVM and GC tuning is a plus

• Regular expression fluency

• Experience with Nagios or similar monitoring tools

• Experience with data collection/graphing tools like Cacti, Ganglia, Graphite and Grafana

• Experience with tcpdump, ethereal, tshark and other packet capture and analysis tools

More About You:

• You are passionate about a culture of learning and teaching. You love challenging yourself to constantly improve, and sharing your knowledge to empower others

• You like to take risks when looking for novel solutions to complex problems. If faced with roadblocks, you continue to reach higher to make greatness happen

• You care about solving big, systemic problems. You look beyond the surface to understand root causes so that you can build long-term solutions for the whole ecosystem

• You believe in not only serving customers, but also empowering them by providing knowledge and tools

Job ID 1931316 Date posted 06/18/2019

Big Data Intern


Good experience overall. Colleagues were very helpful. Atmosphere was chilled out and there was no rush to complete the project. Was given ample amount of time to understand the project and contribute.


People are laid back and don't take the initiative to do something new or optimize existing stuff. It's not primarily an engineering company, so if you know how to talk, you will go much further in your career.



This is the life – the #LifeAtATT, that is. We’re creating what’s next and having a blast doing it. You’re looking for proof? Well, see for yourself.
