Job Description:
Responsibilities:
• Design, develop, and maintain data pipelines using Databricks and PySpark to process and manipulate large-scale datasets.
• Optimize Apache Spark batch processing workflows.
• Build and maintain streaming data pipelines.
• Optimize and fine-tune existing Databricks jobs and PySpark scripts for enhanced performance and reliability.
• Troubleshoot issues related to data pipelines, identify bottlenecks, and implement effective solutions.
• Implement best practices for data governance, security, and compliance within Databricks environments.
• Work closely with Data Scientists and Analysts to support their data requirements and enable efficient access to relevant datasets.
• Stay updated with industry trends and advancements in Databricks and PySpark technologies to propose and implement innovative solutions.
• Optimize systems for low-latency, high-throughput performance.
• Use Spark SQL and the DataFrame API for dynamic data transformations (see the sketch after this list).
• Implement advanced filtering logic in Databricks notebooks and scripts using Python or Scala.
• Apply distributed-systems principles to message brokering.
• Collaborate with cross-functional teams to gather requirements, understand data needs, and implement scalable solutions.
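For illustration only, here is a minimal PySpark sketch of the kind of pipeline, filtering, and Spark SQL work described above. It assumes a Databricks environment with Delta Lake; the paths, the events table, and the event_id, event_ts, and status columns are hypothetical placeholders, not part of this role.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("example-pipeline").getOrCreate()

# Batch: read raw data and apply filtering logic via the DataFrame API.
raw = spark.read.format("delta").load("/mnt/bronze/events")  # hypothetical path
clean = (
    raw.filter(F.col("status") == "active")
       .withColumn("event_date", F.to_date("event_ts"))
       .dropDuplicates(["event_id"])
)

# An equivalent filter-and-deduplicate step expressed with Spark SQL.
raw.createOrReplaceTempView("events")
clean_sql = spark.sql("""
    SELECT DISTINCT event_id, to_date(event_ts) AS event_date, status
    FROM events
    WHERE status = 'active'
""")

# Write the curated output back to a Delta table.
clean.write.format("delta").mode("overwrite").save("/mnt/silver/events")

# Streaming: the same filter applied continuously with Structured Streaming.
query = (
    spark.readStream.format("delta").load("/mnt/bronze/events")
         .filter(F.col("status") == "active")
         .writeStream.format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/events")
         .start("/mnt/silver/events_stream")
)
```

Both the DataFrame API and Spark SQL versions compile to the same optimized execution plan, so the choice between them is largely a matter of readability.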
Requirements:
• Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
• 5+ years of proven experience as a Data Engineer with a strong emphasis on Databricks.
• Experience with AWS tooling is mandatory; at minimum, fundamental knowledge of core AWS services is required.
• Proficiency in PySpark and extensive hands-on experience building and optimizing data pipelines in Databricks.
• Solid understanding of the components within Databricks, such as clusters, notebooks, jobs, and libraries.
• Strong knowledge of SQL, data modeling, and ETL processes.