Description

Key Responsibilities

Data Pipeline Development: Design, develop, and maintain scalable data pipelines using PySpark to process large volumes of data from various sources.

Data Integration: Integrate data from multiple sources and formats, ensuring high data quality and reliability.

Optimization: Optimize and tune data processing jobs for performance and cost-efficiency.

Collaboration: Work closely with data scientists, analysts, and other stakeholders to understand data requirements and deliver high-quality data solutions.

ETL Processes: Develop and maintain ETL processes to extract, transform, and load data into data warehouses and data lakes.

Data Quality: Implement data validation and monitoring processes to ensure data accuracy and consistency.

Documentation: Document data engineering processes, workflows, and best practices.

Troubleshooting: Identify, troubleshoot, and resolve data-related issues promptly.

Required Qualifications

Experience: 3+ years of experience in data engineering or a related field.

Education: Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.

Technical Skills

Proficiency in PySpark and Python.

Strong knowledge of big data technologies such as Hadoop, Hive, and Spark.

Experience with cloud platforms (e.g., AWS, Azure, GCP) and their data services.

Familiarity with data warehousing solutions (e.g., Amazon Redshift, Google BigQuery, Snowflake).

Knowledge of relational and NoSQL databases (e.g., MySQL, MongoDB, Cassandra).

Data Processing: Experience with ETL/ELT processes and data pipeline orchestration tools (e.g., Apache Airflow, Apache NiFi).

Problem-Solving: Strong analytical and problem-solving skills.

Communication: Excellent verbal and written communication skills, with the ability to explain complex technical concepts to non-technical stakeholders.


Desired Skills and Experience
Python, ETL, Big Data Technologies, PySpark, Data Engineer, Microsoft Azure, AWS, GCP

Education

Bachelor's degree in Computer Science