- Developed Python scripts to run PySpark jobs, transforming data by converting huge TSV files to compressed Parquet format (see the PySpark sketch following this list).
- Gathered requirements for every sprint and assigned priorities and time estimates to each user story.
- Set up Apache Presto and Apache Drill on an AWS EMR (Elastic MapReduce) cluster to combine multiple data sources such as MySQL and Hive, enabling joins and inserts across heterogeneous sources from a single platform (see the Presto sketch below).
- Developed a shell script that collects user-generated logs and stores them in AWS S3 (Simple Storage Service) buckets. The logs trace all user activity, which aids security auditing, helps identify cluster terminations, and protects data integrity (a Python equivalent is sketched below).
- Worked with AWS CloudFormation, designing prototype JSON templates that automatically create the services they declare. In production this is a better alternative to provisioning resources manually (see the stack-creation sketch below).
- Configured the Apache Hue console and its hive-site.xml property files.
- Applied partitioning and bucketing in the Apache Hive database, which improves retrieval speed when a query runs (see the partitioned-write sketch below).
- Created an AWS RDS (Relational Database Service) instance to host the Hive metastore externally to the EMR cluster (see the metastore configuration sketch below).
- Worked with the DynamoDB NoSQL database to record scheduled-job logs via a Python script (see the DynamoDB sketch below).
- Used AWS CodeCommit repositories to store program logic and scripts so they could be retrieved onto new clusters.
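A minimal sketch of the TSV-to-Parquet conversion described in the first bullet, assuming illustrative S3 paths and the Snappy codec:

```python
# Minimal PySpark sketch: convert large TSV files to compressed Parquet.
# The bucket paths and Snappy codec are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tsv_to_parquet").getOrCreate()

# Read tab-separated files, inferring the schema from the data.
df = (spark.read
      .option("sep", "\t")
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3://example-bucket/raw/*.tsv"))

# Write Parquet with Snappy compression; columnar storage plus
# compression shrinks the footprint and speeds up later scans.
(df.write
   .option("compression", "snappy")
   .mode("overwrite")
   .parquet("s3://example-bucket/curated/"))
```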
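A sketch of the kind of federated Presto query the EMR setup enables, using the presto-python-client package; the host, user, and table names are assumptions (8889 is EMR's default Presto port):

```python
# Hypothetical federated query joining a Hive table to a MySQL table
# through Presto; connection details and table names are assumptions.
import prestodb

conn = prestodb.dbapi.connect(
    host="emr-master.example.com",  # assumed EMR master node address
    port=8889,                      # default Presto port on EMR
    user="hadoop",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
# One statement spans both catalogs, controlled from a single platform.
cur.execute("""
    SELECT e.user_id, p.plan_name
    FROM hive.default.events e
    JOIN mysql.billing.plans p ON e.plan_id = p.plan_id
""")
print(cur.fetchall())
```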
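The log-collection bullet describes a shell script; below is a rough Python/boto3 equivalent of the same idea, with the log directory and bucket name as assumptions:

```python
# Rough Python/boto3 equivalent of the log-collection shell script.
# The local log directory and bucket name are illustrative assumptions.
import os
import boto3

s3 = boto3.client("s3")
LOG_DIR = "/var/log/user-activity"   # hypothetical local log directory
BUCKET = "example-audit-logs"        # hypothetical S3 bucket

for name in os.listdir(LOG_DIR):
    path = os.path.join(LOG_DIR, name)
    if os.path.isfile(path):
        # Keep all activity logs under one prefix for later auditing.
        s3.upload_file(path, BUCKET, f"cluster-logs/{name}")
```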
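A minimal sketch of driving a JSON CloudFormation template from Python with boto3; the template (which provisions a single S3 bucket) and the stack name are illustrative assumptions:

```python
# Sketch: create AWS services from a JSON CloudFormation template
# instead of provisioning them manually. Template contents and the
# stack name are illustrative assumptions.
import json
import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "AuditBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "example-audit-logs"},
        }
    },
}

cf = boto3.client("cloudformation")
cf.create_stack(
    StackName="example-prototype-stack",
    TemplateBody=json.dumps(template),
)
```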
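A sketch of writing a partitioned, bucketed Hive table from PySpark; the table and column names are assumptions:

```python
# Sketch: partitioned, bucketed Hive table written from PySpark.
# Table and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive_partition_bucket")
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("default.events_staging")  # hypothetical source table

# Partitioning by date lets queries prune whole directories; bucketing
# by user_id clusters rows within each partition for faster lookups.
(df.write
   .partitionBy("event_date")
   .bucketBy(32, "user_id")
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("default.events"))
```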
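A sketch of the EMR configuration that points hive-site.xml at an external RDS MySQL metastore; the endpoint, database name, and credentials are assumptions:

```python
# Hypothetical EMR "hive-site" classification pointing the Hive
# metastore at an external RDS MySQL instance; endpoint, database,
# and credentials are assumptions. Pass it as Configurations= when
# calling boto3.client("emr").run_job_flow(...).
hive_metastore_config = [{
    "Classification": "hive-site",
    "Properties": {
        "javax.jdo.option.ConnectionURL":
            "jdbc:mysql://example-metastore.rds.amazonaws.com:3306/hive",
        "javax.jdo.option.ConnectionUserName": "hive",
        "javax.jdo.option.ConnectionPassword": "example-password",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
    },
}]
```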
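A sketch of recording scheduled-job log entries in DynamoDB from Python; the table name and attribute names are illustrative assumptions:

```python
# Sketch: log scheduled-job runs to a DynamoDB table via boto3.
# Table and attribute names are illustrative assumptions.
import time
import boto3

table = boto3.resource("dynamodb").Table("example-job-logs")

def log_run(job_name: str, status: str) -> None:
    # Composite key: job name plus the run timestamp.
    table.put_item(Item={
        "job_name": job_name,
        "run_at": int(time.time()),
        "status": status,
    })

log_run("nightly-tsv-to-parquet", "SUCCEEDED")
```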
Qualifications
Bachelor's degree in Computer Science or a closely related field