About Us
At MetaCTO, we specialize in helping startups and growing companies turn visionary ideas into successful digital products through expert app development and fractional CTO services. As a Site Reliability Engineer (SRE), you will play a critical role in ensuring the reliability, scalability, and security of the backend infrastructure that powers innovative applications for our clients. This role will involve managing cloud environments, optimizing databases, automating deployments, and improving system observability.
Job Description
As a Site Reliability Engineer (SRE) at MetaCTO, you will be responsible for designing, implementing, and maintaining highly available, scalable, and secure infrastructure solutions. You will collaborate with software engineers to improve system performance, automate operations, and ensure the smooth functioning of critical backend services. You’ll work extensively with cloud platforms like AWS, leveraging technologies such as Terraform, Docker, Kubernetes, and CI/CD pipelines to enhance system reliability.
Responsibilities
Architect, build, and maintain cloud infrastructure on AWS (Lambda, EC2, RDS, S3, EKS, SQS, CloudWatch).
Manage and optimize databases (MySQL, PostgreSQL) for performance, reliability, and security.
Implement monitoring, alerting, and logging solutions to ensure system health and performance, with specific experience using Zabbix and Elastic Logging.
Design and maintain CI/CD pipelines for automated deployment and scaling of applications.
Work with containerization and orchestration tools such as Docker and Kubernetes.
Develop and enforce security best practices for cloud environments and infrastructure.
Automate operational processes using Infrastructure-as-Code (Terraform, CloudFormation) and scripting languages like Python or Bash.
Troubleshoot and resolve infrastructure-related incidents and optimize system performance.
Collaborate with backend engineers to ensure high availability, fault tolerance, and scalable system design, with a strong focus on Django-based applications.
Qualifications
5-10 years of experience in Site Reliability Engineering (SRE), DevOps, or Cloud Engineering roles.
Strong expertise in AWS cloud services (EC2, RDS, S3, Lambda, CloudFront, EKS, SQS, IAM).
Hands-on experience with containerization (Docker) and orchestration (Kubernetes, ECS, or EKS).
Deep knowledge of relational databases (MySQL, PostgreSQL), including performance tuning, query optimization, monitoring, and migration management.
Proficiency in Infrastructure-as-Code tools such as Terraform, CloudFormation, or Pulumi.
Strong experience with CI/CD pipelines and automation tools (GitHub Actions, Jenkins, CircleCI, or GitLab CI/CD).
Proficiency in monitoring tools, specifically Zabbix, and logging solutions like Elastic Logging.
Scripting experience with Python, Bash, or Go for automating operational tasks.
Experience working with Django-based applications in a cloud environment.
Experience implementing security best practices for cloud-based applications.
Knowledge of distributed systems and microservices architecture.
Preferred Skills
AWS certifications (Solutions Architect, DevOps Engineer) are a plus.
Experience with serverless computing and event-driven architectures.
Familiarity with message queue services (SQS, RabbitMQ, Kafka).
Understanding of zero-downtime deployments and disaster recovery strategies.
Position Details
Type: Full-Time
Location: 100% Remote
Hours: US Pacific Time
How to Apply
If you are passionate about scalability, automation, and reliability, and thrive in a collaborative, fast-paced environment, we’d love to hear from you. Please submit your resume and an optional brief cover letter outlining your relevant experience.
MetaCTO is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.