Site Reliability Engineer

ClearScale, an AWS Premier Consulting Partner headquartered in San Francisco, California, USA, has offered a full range of professional cloud computing services for over 10 years: architecture design, DevOps automation, refactoring and cloud-native application development, integration, migration, security services (from basic security checks to preventing cyber-attacks), and 24/7 technical support using advanced technologies.
Our customers are diverse: from government organizations (ClearScale is an official cloud partner of the State of California) and educational institutions (University of California, San Francisco) to well-known global brands (IBM, Samsung, GoPro, HP, Conde Nast, Carl Zeiss, etc.). We have served well over 850 satisfied customers, some of whom are featured in the Case Studies section of the company’s website.
We were the third company to gain a new AWS competency: Applied AI and Machine Learning Operations (MLOps). Fewer than 15 partners hold this competency. We also have:
- Confirmed status in the Database Freedom program (fewer than 20 companies in total)
- Proven expertise as a Managed Services Provider (fewer than 16 companies in total)
This means that ClearScale can provide full-cycle consulting and services, from audits and system or software development to 24/7 support. You can read more about us on the company’s Managed Services page.
Since the company was founded, we have worked 100% remotely from various cities and countries.
Job Responsibilities:
- Execute on the observability strategy
- Define and document standards for logging, tracing and SLO definitions for engineering teams to follow
- Propose effective ways to manage dashboards, traces, monitors, metrics and logs in Datadog
- Integrate Datadog with incident management tools and Slack
- Establish comprehensive monitoring using Datadog
- Centralize logging and develop mechanisms for efficient debugging
- Implement systems for distributed tracing visualization
- Adopt OpenTelemetry standards across microservices
- Roll out observability to development and production environments in close collaboration with engineering and operations teams
- Define training practices that help engineering teams adopt observability standards and operational practices for healthy and sustainable incident management processes
- Implement POCs and demonstrate them to engineering teams
- Introduce engineering practices for healthy alerting mechanisms, dashboard definitions and blind-spot elimination, with a focus on eliminating alert fatigue
- Establish near-real-time reporting to minimize MTTA and MTTR and improve developer experience
Requirements:
- Extensive experience with AWS infrastructure at scale
- Experience working in SRE, DevOps or Developer Experience teams in engineering organizations is a must
- Deep knowledge of observability tooling (Datadog, Grafana, Splunk, OTEL) and hands-on experience developing, extending and operating these tools across different environments, including high-load production systems
- Expert knowledge of Terraform
- Ability to propose solutions that scale across engineering teams and balance speed of response and cognitive load
- Experience leading incident responses using operational tools including logging, tracing, SLO patterns and synthetics
- Experience establishing technical roadmaps from operational strategies for SRE, DevOps or Developer Experience teams in mid-sized to large organizations, and the ability to drive their adoption in the engineering teams
- Experience applying analytical practices to define SLAs in close coordination with engineering teams and stakeholders
- Deep understanding of SRE best practices and standards, and experience advocating for and rolling them out to engineering teams
- Mindset of “minimal tooling for maximum impact”
- Experience with on-call rotations, creating and executing scalable practices in engineering teams
- Experience integrating observability tooling with Teams and Slack
- Leadership skills to drive alignment between different departments and get buy-in from different stakeholders
- Exemplary oral and written communication skills for technical and non-technical stakeholders
- AWS certifications are a plus
Additional Information (TL;DR):
- Type of Position: SRE or DevOps Engineer with deep knowledge of observability and reliability practices.
- Skills: Datadog-specific knowledge; proficiency in Python, Terraform, Git, and pipeline setup.
- Focus on implementing observability for engineering teams.
- Demonstrated ability to apply analytical skills to cloud compute environments.
We offer:
- 100% remote position
- Full-time position
- Annual rate review
- USD salary
Professional Development:
- Work with innovative Silicon Valley companies and traditional American companies at the cutting edge of digital transformation
- We work with the newest AWS cloud technologies, open-source tools, and tools like Jira, Confluence, Lucidchart, and Slack
- We operate in an honest and competitive environment, and we are one of AWS’s top 10 key partners
- A team willing to share its experience
- Paid AWS certifications: we provide training material, paid time off and the examination itself
- Horizontal and vertical career growth: we keep growing, and our people keep growing with us

To apply for this job please visit www.linkedin.com.
