Senior Performance and Capacity Engineer
StackPath is a cloud platform built at the internet’s edge, providing infrastructure and services physically closer to the source or destination of data than hyperscale cloud service providers. StackPath edge compute (including Virtual Machines and Containers) and edge applications (including CDN and WAF) are strategically located in the world’s most densely populated areas and united by a secure private network backbone and a single management system. Customers ranging from Fortune 50 enterprises to one-person startups trust StackPath to give their latency-sensitive workloads and applications the speed, security, and efficiency they require. For more information, visit stackpath.com and follow StackPath at www.fb.com/stackpathllc and www.twitter.com/stackpath.
About the Role
As a Senior Performance and Capacity Engineer, you will work closely with the Site Reliability Engineering team to provide accurate and insightful capacity projections for the senior management team at StackPath. This role is critical for maintaining the server and network resources needed to serve customers across the Edge Delivery and Edge Compute platforms. You will lead the effort to deliver accurate and timely capacity checks for new and growing customer deployments. You will also engage in performance troubleshooting to identify and remove live bottlenecks in the delivery environment.
This role will report to: VP Site Reliability Engineering
Essential Duties and Responsibilities
- Handle complex enterprise issues, which often cross system, network, and software boundaries.
- Design, develop and maintain internal service metrics (SLA, SLO, SLI) in cross-team collaborations.
- Design, develop and maintain dashboards, tooling, alarms, and playbooks in collaboration with operations teams to support service-level objectives.
- Design, develop and maintain reusable monitoring and canary infrastructure.
- Design, execute and evaluate performance experiments.
- Collaborate with operations and engineering teams in determining root cause of major incidents, performance anomalies, or other customer-impacting issues.
- Discover and analyze system performance-related bottlenecks.
- Discover and analyze anomalies and system issues, with the goal of identifying and mitigating their root causes.
- Write ETLs to extract performance-related KPIs and present those KPIs in a systematic manner.
- Perform capacity planning using regression-based machine learning models and other statistical methods when applicable.
- Automate everyday repeatable tasks.
- Model traffic growth and make server purchasing recommendations.
- Develop enterprise client traffic flow modeling, distribution, and capacity checks.
- Direct and participate in automating performance and capacity checks and in automating detection of the need for capacity augmentation.
Desired Skills and Experience
- High-level knowledge of Linux and operating systems.
- High level of WAN networking knowledge.
- Scripting languages (Bash, Python, PHP, Perl).
- Experience with Prometheus.
- Experience with Grafana, Docker, GCP, Telegraf, and Tableau.
- DB knowledge (MySQL, PostgreSQL, TimescaleDB and others).
- High-level understanding of Statistics.
- High-level understanding of Machine Learning.
- Experience with traffic analysis tools (Catchpoint, Kentik, Cedexis...).
- Experience with CI/CD/CM tools (Jenkins, Ansible, Puppet, Chef...).
- Experience with Virtualization (KVM, QEMU...).
This job description is not intended to be all-inclusive.
StackPath is an Equal Opportunity Employer. EOE/AA M/F/D/V
If your experience and qualifications match our current needs, a member of our human resources team will contact you. We look forward to hearing from you.