Work with the business team to gather requirements and participate in Agile planning meetings to finalize scope; actively participate in Story Time, Sprint Planning, and Sprint Retrospective meetings.
Develop queries using joins and partitions for large data sets per business requirements, load the filtered data from source to destination tables, and validate the data.
Develop shell scripts to schedule full and incremental loads and to check data quality.
Implement optimization techniques in Hive, such as partitioning tables, de-normalizing data, and bucketing, and Spark techniques such as data serialization and broadcasting (a minimal sketch appears after this list).
Create, debug, schedule, and monitor jobs using Airflow DAGs (see the example DAG below).
Troubleshoot performance issues through ETL/HiveQL tuning.
Participate in performance tuning of Spark applications, setting the right level of parallelism and tuning memory.
Help the team build GCP native tables, validate data between tables in BigQuery (see the BigQuery validation sketch below), and resolve production and support issues.
Work with the infrastructure team to troubleshoot connectivity issues involving LDAP, Ranger, AD, the Knox gateway, ODBC/JDBC connections, and Kerberos accounts and keytabs.
Automate the metadata sync between GCS (Google Cloud Storage) and GCP Hive.
Import and export data into HDFS and Hive using Sqoop, and migrate large volumes of data from different databases (e.g., Oracle, SQL Server) to Hadoop.
Develop and monitor Spark jobs using the Python (PySpark) API.
Use Spark SQL to create structured data with DataFrames and query Hive and other data sources (see the Spark SQL sketch below).
Participate in job management and develop job processing scripts using an automated scheduler.
Provide on-call support over weekends and monitor production-critical jobs.
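The Hive and Spark optimization items above can be illustrated with a minimal PySpark sketch. The table and column names (sales_raw, dim_store, sales_opt, dt, store_id) and the configuration values are assumptions for illustration, not the actual production schema or settings.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Kryo serialization and an explicit shuffle-partition count are common Spark
# tuning levers; the values below are placeholders, not tuned production settings.
spark = (
    SparkSession.builder
    .appName("hive-optimization-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.shuffle.partitions", "200")
    .enableHiveSupport()
    .getOrCreate()
)

# Broadcasting the small dimension table avoids shuffling the large fact table,
# and the join de-normalizes the dimension attributes into the fact rows.
fact = spark.table("sales_raw")   # assumed large source table with a dt column
dim = spark.table("dim_store")    # assumed small dimension table
joined = fact.join(broadcast(dim), "store_id")

# Write a partitioned, bucketed table to the Hive metastore: partitioning by
# load date prunes scans, bucketing by store_id speeds up later joins on that key.
(
    joined
    .write
    .mode("overwrite")
    .partitionBy("dt")
    .bucketBy(32, "store_id")
    .sortBy("store_id")
    .format("orc")
    .saveAsTable("sales_opt")
)
```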
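A minimal Airflow DAG sketch of how full and incremental loads might be scheduled and monitored. The DAG id, schedule, owner, and the shell scripts it invokes (load_incremental.sh, check_quality.sh) are hypothetical placeholders, not the actual production jobs.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",        # assumed team name
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

# Hypothetical daily incremental load followed by a data-quality check.
with DAG(
    dag_id="daily_incremental_load",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["hive", "gcp"],
) as dag:

    incremental_load = BashOperator(
        task_id="incremental_load",
        # placeholder load script; {{ ds }} passes the logical run date
        bash_command="bash /opt/etl/load_incremental.sh {{ ds }} ",
    )

    data_quality_check = BashOperator(
        task_id="data_quality_check",
        bash_command="bash /opt/etl/check_quality.sh {{ ds }} ",
    )

    incremental_load >> data_quality_check
```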
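A sketch of data validation between BigQuery tables using the google-cloud-bigquery client. The project, dataset, and table names and the count-based check are assumptions for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # placeholder project id

# Compare row counts between a staging table and its GCP-native destination.
query = """
    SELECT
      (SELECT COUNT(*) FROM `my-analytics-project.staging.sales`)   AS staging_rows,
      (SELECT COUNT(*) FROM `my-analytics-project.warehouse.sales`) AS warehouse_rows
"""
row = list(client.query(query).result())[0]

if row.staging_rows != row.warehouse_rows:
    raise ValueError(
        f"Row count mismatch: staging={row.staging_rows}, warehouse={row.warehouse_rows}"
    )
```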
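A short PySpark sketch of using Spark SQL and DataFrames to query Hive and an external JDBC source, then comparing row counts as a simple source-to-target validation. The JDBC URL, credentials, table names, and date filter are assumptions, and the Oracle JDBC driver is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-sql-validation-sketch")
    .enableHiveSupport()          # lets spark.sql() see Hive tables
    .getOrCreate()
)

# Query a Hive table through Spark SQL (table and partition are placeholders).
hive_df = spark.sql(
    "SELECT order_id, store_id, amount FROM sales_opt WHERE dt = '2024-01-01'"
)

# Read the same slice from a hypothetical Oracle source over JDBC.
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")  # placeholder URL
    .option(
        "dbtable",
        "(SELECT order_id, store_id, amount FROM sales "
        "WHERE load_dt = DATE '2024-01-01') src",
    )
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Minimal validation: compare row counts between source and destination.
src_count = source_df.count()
dst_count = hive_df.count()
if src_count != dst_count:
    raise ValueError(f"Row count mismatch: source={src_count}, hive={dst_count}")
```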
Position requires a Master’s degree in Computer Science or a related field.