Job Description
Responsibilities
- Defining our systems’ reliability goals via Service Level Objectives (SLOs)
- Improving our systems’ production posture via targeted observability and operability enhancements (telemetry, alerting, incident management, change management, safe production changes)
- Building reusable automation and processes that empower multiple teams to achieve their reliability goals
- Influencing the product architecture and roadmap to make sure the customer-experienced reliability is always a key consideration when evolving the product
We would like to talk to you if you are looking for a role built around the following themes:
- Drive reliability throughout Azure Monitor via observability, informed architectural improvements, and automation
- Develop clean and thorough designs and code that exemplify quality, simplicity, and maintainability with global scalability
- Embody the Microsoft Leadership Principles by creating clarity, generating energy, and ultimately delivering the right outcomes from ideation to implemented solution
- Mentor and teach engineers across Azure to improve visibility, use diagnostic tools effectively, and scale learnings through improved documentation and training
- Encourage a culture of observability and provide technical leadership to implement and scale observability across Azure
Required Qualifications:
- 7+ years technical experience in software engineering, network engineering, or systems administration
- OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering or network engineering
- OR Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering or network engineering
- Experience working with large-scale distributed systems (e.g., cloud computing providers or SaaS services, ideally with millions or billions of users) or similarly complex environments
Preferred Qualifications:
- Awareness of, and ability to reason about, modern software and systems architectures and cloud infrastructure, including load balancing, queueing, caching, microservices, and distributed systems failure modes
- 4+ years of experience designing, building, or implementing distributed service health solutions, specifically a deep understanding of MELT (Metrics, Events, Logs, Traces) design and implementation patterns for large-scale distributed services
- Previous experience as a technical lead who can drive engineering solutions
- Prior experience building Azure services is a plus
- Aspiration to grow as a person, as a teammate, and as an engineer