Design, develop, and maintain robust and scalable ETL workflows and data pipelines using tools like Hive, Spark, and Airflow.
Implement and manage data storage and processing solutions using Apache Hudi and BigQuery.
Develop and optimize data pipelines for structured and unstructured data in GCP environments, leveraging GCS for data storage.
Write clean, maintainable, and efficient code in Scala and Python to process and transform data.
Ensure data quality, integrity, and consistency by implementing appropriate data validation and monitoring techniques.
Work with cross-functional teams to understand business requirements and deliver data solutions that drive insights and decision-making.
Troubleshoot and resolve performance and scalability issues in data processing jobs and pipelines.
Stay current with developments in big data technologies and tools, and incorporate them into the workflow where appropriate.
Required Skills and Qualifications
Proven experience as a Data Engineer, preferably in a big data environment.
Expertise in Hive, Spark, and Apache Hudi for big data processing and storage.
Hands-on experience with BigQuery and Google Cloud Platform (GCP) services such as GCS, Dataflow, and Pub/Sub.
Strong programming skills in Scala and Python, with experience in building data pipelines and ETL processes.
Proficiency with workflow orchestration tools like Apache Airflow.
Solid understanding of data warehousing concepts, data modelling, and schema design.
Knowledge of distributed systems and parallel processing.
Strong problem-solving skills and the ability to work with large datasets in a fast-paced environment.
Bachelor of Engineering, Master of Computer Applications