Algotale (Safe Security) - DevOps Engineer
About Algotale (Client: Safe Security)
Algotale is a data-centric IT consulting and staffing firm dedicated to reshaping industries through AI, machine learning, and cutting-edge cloud solutions. We pride ourselves on building resilient digital infrastructures and empowering businesses to unlock the true potential of their data. As we scale our AI-driven platforms, we are looking for a Senior DevOps Engineer with an SRE mindset to ensure our systems are fast, reliable, and highly scalable.
The Role
We are seeking an expert DevOps/Site Reliability Engineer to bridge the gap between development and operations. You will be the architect of our cloud infrastructure on AWS, ensuring that our Kubernetes-orchestrated services run with high availability and efficiency. With over 5 years of experience, you will lead the charge in automation, performance tuning, and infrastructure security.
Key Responsibilities
- Infrastructure Orchestration: Own, scale, and optimize multi-region production workloads using Kubernetes (EKS). Design Helm charts and maintain infrastructure as code (IaC) using Terraform.
- Data & AI Pipeline Support: Architect and monitor resilient, high-throughput message queues and event-driven architectures using Apache Kafka and AWS SQS to ingest real-time cybersecurity telemetry for our AI modeling engines.
- Automation & Scripting: Write clean, production-grade automation scripts and internal tools in Python to eliminate manual intervention, handle auto-scaling, and manage data lifecycles.
- CI/CD Deployment: Build, secure, and maintain automated CI/CD deployment pipelines to safely ship code multiple times a day to Dev, Staging, and Production environments with zero downtime.
- Security & Compliance (DevSecOps): Implement shift-left security practices. Integrate automated vulnerability scanning (SAST/DAST, container scanning) into build pipelines, and manage IAM roles, network isolation, and encryption to meet stringent enterprise compliance standards (SOC 2, ISO).
- Observability & Reliability: Set up robust logging, metrics monitoring, and alerting frameworks (e.g., Prometheus, Grafana, ELK, Datadog) to proactively guarantee 99.9% uptime for our core AI SaaS platform.
Requirements (Qualifications & Experience)
Technical Must-Haves:
- Experience: Minimum 5 years of dedicated experience in DevOps / Site Reliability Engineering roles within a SaaS production environment.
- Containerization & Orchestration: Advanced, hands-on production experience managing Kubernetes clusters at scale (cluster upgrades, network policies, namespace isolation, ingress control).
- Cloud Expertise: Strong command of core AWS services (EC2, VPC, EKS, IAM, RDS, S3, Route 53, CloudWatch).
- Advanced Scripting: Strong proficiency in Python for automation, system utilities, and cloud engineering tasks (beyond basic Bash scripting).
- Streaming & Messaging Tech: Proven track record of configuring, tuning, and troubleshooting distributed message queues, specifically Apache Kafka or AWS SQS, under heavy data loads.
- Infrastructure as Code: Strong experience with Terraform or CloudFormation.
Good-to-Have / Preferred:
- Experience working in cybersecurity, fintech, or highly regulated enterprise software companies.
- Experience setting up infrastructure specialized for AI/ML workloads (handling GPU nodes, managing pipelines for LLMs, or orchestrating vector database deployments).
- Certified Kubernetes Administrator (CKA) or AWS Certified DevOps Engineer - Professional certification.