Job Description
Key Responsibilities:
- Architect and Design Solutions:
Lead the architecture and design of Databricks-based data solutions that support data engineering, machine learning, and real-time analytics.
- Data Pipeline Design:
Design and implement ETL (Extract, Transform, Load) pipelines using Databricks, Apache Spark, and other big data tools to process and integrate large-scale data from multiple sources (see the sketch after this list).
- Collaborate with Stakeholders:
Work with business and data teams to understand requirements, identify opportunities for automation, and design solutions that improve data workflows.
- Optimize Data Architecture:
Create highly optimized, scalable, and cost-effective architectures for processing large data sets and managing big data workloads using Databricks, Delta Lake, and Apache Spark.
- Implement Best Practices:
Define and promote best practices for Databricks implementation, including data governance, security, performance optimization, and monitoring.
- Manage Databricks Clusters:
Manage and optimize Databricks clusters for performance, cost, and reliability. Troubleshoot performance issues and optimize the use of cloud resources.
- Data Governance and Security:
Implement best practices for data governance, security, and compliance on the Databricks platform to ensure that data processing and storage meet organizational and regulatory standards.
- Automation and Optimization:
Automate repetitive tasks, streamline data processes, and optimize data workflows to improve efficiency and reduce operational costs.
- Mentorship and Training:
Mentor and provide guidance to junior engineers, ensuring the team follows best practices in the development of data pipelines and analytics solutions.
- Keep Up to Date with Trends:
Stay current with emerging technologies in the big data and cloud space, and recommend new solutions or improvements to existing processes.
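
To illustrate the pipeline-design and Delta Lake work described above, here is a minimal PySpark sketch of a batch ETL step. It assumes a Databricks-style environment with Delta Lake available; the storage path, column names, and target table are hypothetical placeholders, not part of this role's actual stack.

```python
# Minimal batch ETL sketch: extract from cloud storage, transform, load to Delta Lake.
# All paths, column names, and table names below are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically in Databricks notebooks

# Extract: read raw JSON events from object storage (hypothetical bucket/path).
raw = spark.read.json("s3://example-bucket/raw/events/")

# Transform: deduplicate, enforce a proper timestamp type, derive a partition column.
cleaned = (
    raw.dropDuplicates(["event_id"])
       .withColumn("event_ts", F.to_timestamp("event_ts"))
       .filter(F.col("event_ts").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
)

# Load: write to a managed Delta Lake table, partitioned for downstream analytics.
(
    cleaned.write.format("delta")
           .mode("overwrite")
           .partitionBy("event_date")
           .saveAsTable("analytics.events_clean")
)
```

In practice, a step like this would run as a scheduled Databricks job with monitoring, access controls, and cost-aware cluster settings, in line with the governance and optimization responsibilities listed above.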
Required Skills & Qualifications:
- Technical Expertise:
- Extensive experience with Databricks, Apache Spark, and cloud platforms (AWS, Azure, or GCP).
- Proficiency in programming languages such as Python, Scala, or SQL.
- Strong understanding of distributed computing, data modeling, and data storage technologies.
- Hands-on experience with Delta Lake, Spark SQL, and MLlib.
- Experience with Cloud Services:
- Expertise in deploying and managing data platforms and workloads on cloud environments like AWS, Azure, or GCP.
- Familiarity with cloud-native services like S3, Redshift, Azure Blob Storage, and BigQuery.
- Data Engineering Skills:
- Experience designing, building, and optimizing ETL data pipelines.
- Familiarity with data warehousing concepts, OLAP, and OLTP systems.
- Machine Learning (ML) Knowledge:
  - Experience in integrating machine learning workflows with Databricks, building models, and automating model deployment (a minimal sketch follows this list).
- Leadership and Collaboration:
- Strong leadership and communication skills to interact with both technical and non-technical stakeholders.
- Experience in leading cross-functional teams and mentoring junior team members.
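
As a pointer to the MLlib experience mentioned above, the following is a minimal sketch of a Spark MLlib training step. The feature table, column names, and label are hypothetical, and the choice of logistic regression is purely illustrative.

```python
# Minimal Spark MLlib training sketch; table, columns, and label are illustrative only.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Load prepared features from a (hypothetical) Delta table.
features = spark.table("analytics.customer_features")

# Assemble numeric columns into a single feature vector and fit a simple classifier.
assembler = VectorAssembler(inputCols=["tenure_days", "monthly_spend"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="churned")
model = Pipeline(stages=[assembler, lr]).fit(features)

# In a Databricks workflow, the fitted model would typically be tracked with MLflow
# and promoted to serving through an automated deployment job.
```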
Preferred Skills:
- Advanced Databricks Knowledge:
In-depth experience with Databricks components, such as notebooks, jobs, and collaboration features.
- DevOps & CI/CD:
Experience with DevOps practices, automation, and CI/CD pipelines in data engineering (a unit-test sketch follows this section).
- Data Governance:
Strong knowledge of data governance principles, such as metadata management, data lineage, and data quality.
- Certifications:
- Databricks Certified Associate Developer for Apache Spark.
- Cloud certifications (e.g., AWS Certified Solutions Architect, Azure Solutions Architect Expert).
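
To make the CI/CD point above concrete, here is a minimal sketch of the kind of PySpark unit test a CI pipeline could run before deploying pipeline code. The transformation function, column names, and pytest-style conventions are assumptions for illustration only.

```python
# Minimal pytest-style unit test for a PySpark transformation; the function and
# column names are hypothetical examples, not part of any specific codebase.
from pyspark.sql import SparkSession, functions as F


def add_event_date(df):
    """Transformation under test: derive a date column from an event timestamp."""
    return df.withColumn("event_date", F.to_date("event_ts"))


def test_add_event_date():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = (
        spark.createDataFrame([("2024-01-15 10:30:00",)], ["event_ts"])
             .withColumn("event_ts", F.to_timestamp("event_ts"))
    )
    result = add_event_date(df).select("event_date").first()[0]
    assert str(result) == "2024-01-15"
```

A CI job (for example, triggered on commit to the pipeline repository) would run tests like this against a small local Spark session before any Databricks deployment step.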
Education & Experience:
- Education:
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field (or equivalent work experience).
- Experience:
5+ years of experience in data architecture, data engineering, and working with cloud platforms (preferably with Databricks and Apache Spark).