StackPath is a platform of secure Internet services built at the cloud's edge. StackPath services enable developers to build protection and performance into any cloud-based solution (apps, games, websites, and beyond) without needing cloud security and delivery expertise of their own. More than 800,000 customers already use StackPath services, ranging from early-stage enterprises to Fortune 100 organizations. Headquartered in Dallas, Texas, StackPath has offices across the U.S. and around the world.
For more information, follow StackPath at www.fb.com/stackpathllc and www.twitter.com/stackpath.
About the Role
The StackPath Site Reliability Engineering (SRE) team combines software, systems and network engineering to deploy and run a portfolio of high-performance edge services including CDN, WAF and Compute. SRE’s daily focus is on the availability, change velocity, performance and capacity of customer-facing services and supporting internal systems.
On the SRE team, you will have the opportunity to apply your experience to systems at scale, where a single week can involve shifting terabits of traffic between sites, deploying configuration changes to shave milliseconds off billions of requests, or enabling a new software feature on thousands of systems using automated tooling you designed and built.
This role reports to our VP of Site Reliability Engineering.
Essential Duties and Responsibilities
- Respond to incidents during on-call duty
- Respond to complex customer escalations, which often cross system, network and software boundaries
- Design, develop and maintain internal service metrics (SLA, SLO, SLI) in collaboration with other teams
- Design, develop and maintain dashboards, tooling, alarms and playbooks in collaboration with operations teams to support service-level objectives
- Design, develop and maintain reusable monitoring and canary infrastructure
- Design, execute and evaluate performance experiments
- Collaborate with development teams to complete production readiness checklists prior to major feature launches
- Collaborate with operations and engineering teams in determining root cause of major incidents, performance anomalies, or other customer-impacting issues
Desired Skills and Experience
- Experience with monitoring and alerting platforms (Prometheus and Alertmanager, Grafana, Zabbix, Nagios)
- Experience with a Linux server environment
- Experience with scripting languages (Python, Ruby, Perl)
- Experience with systems programming languages (Go, C)
- Experience with configuration management systems (Puppet, Ansible, Chef)
- Expert-level proficiency in systems, network or software engineering
- Excited about working on a remote-first engineering team
- Proficient at troubleshooting complex systems
- Production experience in a service provider environment
- Comfortable with a software engineering workflow for collaboration and configuration management — branches, pull requests, merges, conflicts
Projects you might work on
- Product launches
- Software and platform feature releases
- Live streaming event planning and execution
- Network reach and capability expansion
- Network and system automation tooling development
- Telemetry and monitoring system development
- Defining service metrics (SLA, SLO, SLI) during new product development
This job description is not intended to be all-inclusive.
StackPath is an Equal Opportunity Employer. EOE/AA M/F/D/V
If your experience and qualifications match our current needs, a member of our human resources team will contact you. We look forward to hearing from you.