Job Description:
We are seeking a skilled Python/PySpark Developer to join our data engineering team. The ideal candidate will have experience building scalable data pipelines and working with large datasets in distributed environments.
Key Responsibilities:
- Design, develop, and maintain robust data pipelines using PySpark.
- Collaborate with data scientists and analysts to understand data requirements and deliver solutions.
- Optimize and tune Spark jobs for performance and scalability.
- Write clean, maintainable, and efficient code.
- Perform data validation and quality checks on datasets.
- Troubleshoot and debug issues in data processing workflows.
- Document technical specifications and processes.
- Stay updated with the latest technologies and best practices in data engineering.
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 3+ years of experience in Python programming.
- Strong experience with Apache Spark and PySpark.
- Familiarity with data processing frameworks and ETL tools.
- Knowledge of SQL and experience with relational databases (e.g., PostgreSQL, MySQL).
- Experience with cloud platforms (e.g., AWS, Azure, GCP) is a plus.
- Understanding of data warehousing concepts and architectures.
- Strong problem-solving skills and attention to detail.
- Excellent communication and teamwork abilities.
Preferred Skills:
- Experience with distributed computing concepts.
- Familiarity with machine learning libraries (e.g., scikit-learn, TensorFlow).
- Knowledge of data visualization tools (e.g., Tableau, Power BI).