Key Responsibilities
Data Pipeline Development: Design, develop, and maintain scalable data pipelines using PySpark to
process large volumes of data from various sources (a minimal illustrative sketch follows this list).
Data Integration: Integrate data from multiple data sources and formats, ensuring high data quality and
reliability.
Optimization: Optimize and tune data processing jobs for performance and cost-efficiency.
Collaboration: Work closely with data scientists, analysts, and other stakeholders to understand data
requirements and deliver high-quality data solutions.
ETL Processes: Develop and maintain ETL processes to extract, transform, and load data into data
warehouses and data lakes.
Data Quality: Implement data validation and monitoring processes to ensure data accuracy and
consistency.
Documentation: Document data engineering processes, workflows, and best practices.
Troubleshooting: Identify, troubleshoot, and resolve data-related issues promptly.
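
For illustration only, here is a minimal sketch of the kind of PySpark pipeline work described above: extract raw records, apply a simple data-quality filter, and load the result into a data lake. All paths, column names, and validation rules in the sketch are hypothetical, not part of this role's actual stack.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical sketch: paths, columns, and rules are illustrative only.
spark = SparkSession.builder.appName("orders_pipeline_sketch").getOrCreate()

# Extract: read raw JSON events from a (hypothetical) landing zone.
raw = spark.read.json("s3://example-bucket/landing/orders/")

# Transform: de-duplicate, enforce a basic data-quality rule, derive a partition column.
cleaned = (
    raw
    .dropDuplicates(["order_id"])
    .filter(F.col("amount").isNotNull() & (F.col("amount") > 0))
    .withColumn("order_date", F.to_date("order_ts"))
)

# Load: write partitioned Parquet to a (hypothetical) curated zone of the data lake.
cleaned.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://example-bucket/curated/orders/"
)

spark.stop()
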
Required Qualifications
Experience: 3+ years of experience in data engineering or a related field.
Education: Bachelor's degree in Computer Science, Information Technology, Engineering, or a related
field.
Technical Skills:
Proficiency in PySpark and Python.
Strong knowledge of big data technologies such as Hadoop, Hive, and Spark.
Experience with cloud platforms (e.g., AWS, Azure, GCP) and their data services.
Familiarity with data warehousing solutions (e.g., Amazon Redshift, Google BigQuery, Snowflake).
Knowledge of relational and NoSQL databases (e.g., MySQL, MongoDB, Cassandra).
Data Processing: Experience with ETL/ELT processes and data pipeline orchestration tools (e.g., Apache
Airflow, Apache NiFi); a minimal orchestration sketch follows this list.
Problem-Solving: Strong analytical and problem-solving skills.
Communication: Excellent verbal and written communication skills, with the ability to explain complex
technical concepts to non-technical stakeholders.
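
As an illustration of the orchestration experience mentioned under Data Processing, a minimal Apache Airflow (2.x) DAG that schedules a PySpark job might look like the sketch below; the DAG id, schedule, and spark-submit command are assumptions, not a description of this role's actual environment.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical sketch: DAG id, schedule, and job path are illustrative only.
with DAG(
    dag_id="orders_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_orders_pipeline = BashOperator(
        task_id="run_orders_pipeline",
        bash_command="spark-submit --master yarn /opt/jobs/orders_pipeline.py",
    )
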
Desired Skills and Experience
Python, ETL, Big Data Technologies, PySpark, Data Engineer, Microsoft Azure, AWS, GCP
Bachelor's degree in Computer Science