ML Engineer / PySpark
Job Description
Role: ML Engineer – PySpark & Statistical Modeling
Experience: 3+ Years
Location: Remote, with one visit to the client office in Bangalore during onboarding and one more if required
Mandatory Skills: Python (3.9+), PySpark & Spark Internals, Databricks, Statistics/ML Libraries (statsmodels, scikit-learn, SciPy, pandas, NumPy), Causal Inference & Experimentation (Difference-in-Differences (DID), Synthetic Control, A/B testing, hypothesis testing, panel data methods), API Development, Azure Cloud Platform, Kubernetes, Docker, pytest.
Role Overview: We're looking for an ML Engineer to join our Test & Learn Platform team. You'll build and scale our experimentation and causal inference services, from statistical engines to API integrations and cloud pipelines, so that business teams globally can make data-driven decisions.
Responsibilities:
1. Develop and maintain statistical/ML modules (DID, Synthetic Control, A/B Testing, Multi-Treatment Effects) in Python
2. Build and extend FastAPI services and integrate them with our web application via SDK wrappers
3. Design and optimize large-scale data pipelines using PySpark, Delta Lake, and Azure Data Lake
4. Profile and resolve out-of-memory (OOM) issues in PySpark jobs: optimize memory allocation, partitioning, broadcast joins, caching strategies, and Spark configurations
5. Deploy and manage workloads on Databricks, including job clusters, notebooks, and Delta Lake tables
6. Containerize and deploy services using Docker, Kubernetes, and CI/CD pipelines
7. Ensure code quality and security via SonarCloud, Snyk, and pytest
8. Collaborate with data scientists and product teams to translate research into production-ready modules
Requirements:
1. 3+ years of production experience in Python (3.9+)
2. PySpark & Spark Internals - strong experience with Spark memory model, executor tuning, shuffle optimization, and diagnosing/resolving OOM errors (broadcast thresholds, partition skew, spill-to-disk, GC tuning)
3. Databricks - hands-on with job orchestration, cluster configuration, notebook workflows, and Delta Lake optimization (Z-ordering, compaction, caching)
4. Causal Inference & Experimentation - DID, synthetic control, A/B testing, hypothesis testing, panel data methods
5. Statistics/ML Libraries - statsmodels, scikit-learn, scipy, pandas, numpy
6. API Development - building RESTful services with FastAPI (or similar)
7. Cloud (Azure) - Azure Storage, Azure ML, Azure Data Lake
8. Docker & Kubernetes - containerization and orchestration for ML workloads
9. Testing - writing robust unit/integration tests with pytest
Good-to-Have:
1. Experience with Celery/Redis for async task orchestration
2. Familiarity with Polars, PyArrow, or SQLAlchemy
3. Background in econometrics or experimental design
4. Spark UI profiling and performance benchmarking
5. CI/CD tooling (SonarCloud, Snyk, GitHub Actions)