Job Overview:
We are seeking a DataDog Engineer / Site Reliability Engineer with strong expertise in DataDog, APM (Application Performance Monitoring), and observability for a project with one of our clients. You will be responsible for improving the performance, reliability, and overall health of critical systems, working closely with application teams and stakeholders to implement monitoring best practices and ensure system stability. This is a remote position on a project expected to last 1-3 years.
Key Responsibilities:
* Expertise in DataDog: Utilize advanced knowledge of DataDog, including its latest features such as RUM (Real User Monitoring) and Vulnerability Scanning, to optimize monitoring solutions.
* Monitoring & Observability: Identify gaps in monitoring coverage and system performance, improve monitoring processes, and implement observability best practices such as consistent tagging conventions.
* Collaboration: Work closely with application teams and key stakeholders to understand requirements and proactively address monitoring needs.
* Troubleshooting & Incident Management: Utilize strong troubleshooting skills to resolve complex issues and integrate DataDog with ticketing systems (e.g., ServiceNow) for efficient incident management and tracking.
* AWS Integration: Leverage hands-on experience with AWS to manage and optimize cloud infrastructure monitoring.
* Network Infrastructure Monitoring: Implement SNMP integrations to enhance network infrastructure monitoring, using tools such as SolarWinds, to ensure full system visibility.
* Documentation & Training: Create and maintain clear documentation for DataDog configurations and processes. Provide training to team members on best practices for monitoring and utilizing DataDog effectively.
* Project Leadership: Take ownership of projects from start to finish, ensuring continuous improvement and alignment with the client’s needs and business goals.
Qualifications:
* Experience with DataDog, including recent experience with features like RUM and Vulnerability Scanning.
* Proficiency in APM and observability tools.
* Advanced proficiency in English, both written and spoken, to ensure effective communication within an international remote team.
* Experience with AWS and cloud infrastructure monitoring.
* Strong understanding of SNMP and network monitoring tools such as SolarWinds.
* Hands-on experience integrating monitoring platforms with ticketing systems (e.g., ServiceNow) for incident handling.
* Proactive, self-motivated, and capable of identifying and addressing gaps in monitoring and performance.
* Excellent interpersonal and communication skills, with the ability to work well with diverse teams.
* Ability to lead projects and drive successful outcomes from start to finish.
Preferred Qualifications:
* Experience creating and maintaining comprehensive documentation for monitoring solutions.
* Ability to mentor and guide team members on monitoring best practices and DataDog usage.