Job Description
Role Overview:
We are seeking a Data Engineer who can ramp up quickly in a production environment and work
effectively with cross-functional teams. The role is highly hands-on, focused on building,
optimizing, and maintaining scalable batch data pipelines on Databricks and AWS. Strong
communication skills are essential, as the engineer will collaborate directly with business
stakeholders, analysts, and data scientists.
Core Responsibilities:
1. Design, develop, and optimize ETL/ELT pipelines using Databricks, Apache Spark, and Delta Lake.
2. Implement performance tuning strategies for Spark-based batch workloads.
3. Develop cloud-based data solutions leveraging AWS services such as S3, Lambda, ECS, and
MSK (managed Kafka).
4. Build and maintain data ingestion and egress processes using APIs and Kafka.
5. Support real-time and near-real-time data processing (Structured Streaming is a plus).
6. Apply data governance, observability, and monitoring best practices (Unity Catalog or similar).
7. Perform deep troubleshooting of data pipeline issues, identifying root causes and implementing long-term fixes.
8. Collaborate closely with analysts, data scientists, and business stakeholders to deliver data
solutions aligned with business needs.
Required Technical Skills:
1. Strong Python proficiency for data engineering (Pandas, NumPy).
2. Hands-on experience with PySpark and distributed data processing.
3. Solid experience with Databricks (Spark, Delta Lake, MLflow, Delta Live Tables, Spark SQL).
4. Experience working with AWS, particularly ECS, S3, Lambda, and related services.
5. Strong SQL skills, including performance tuning and optimization.
6. Experience with Apache Spark internals and performance considerations.