Description

Role Overview
 

We are seeking a highly skilled Software Development Engineer (SDE) with advanced expertise in Python and SQL for handling large datasets on local machines. The ideal candidate will be proficient in developing, optimizing, and maintaining data pipelines, and in processing and querying large files in a variety of formats. This role requires in-depth knowledge of Python for data engineering, along with advanced SQL query optimization techniques for large-scale data operations.

Responsibilities:

  • Data Pipeline Development: Design and develop scalable data pipelines to handle, transform, and process large datasets efficiently on local systems.
  • File Handling and Querying Large Datasets: Work with large files in various formats (e.g., CSV, JSON, Parquet) and use Python to efficiently read, process, and query the data while ensuring performance and scalability.
  • Advanced Python Programming: Utilize advanced Python techniques such as:
    • Data structure optimization for better memory management and speed.
    • Custom generators and iterators for processing large data lazily to minimize memory usage.
    • Parallel and asynchronous processing using multi-threading, multi-processing, and asyncio to optimize data pipelines.
  • Performance Optimization: Apply memory and performance tuning techniques, including:
    • Handling large datasets using chunked file processing and streaming, as illustrated in the chunked-processing sketch after this list.
    • Managing memory through compression techniques (e.g., gzip, zlib) and optimizing code with profiling tools like cProfile and memory_profiler.
  • Advanced Data Processing with Pandas & NumPy: Optimize Pandas operations and use NumPy’s vectorized operations for efficient numerical computations on large datasets.
  • SQL Query Optimization: Write and optimize advanced SQL queries for large datasets with:
    • Indexing and partitioning to improve query performance.
    • Advanced SQL concepts like window functions, CTEs, and subqueries.
    • Query plan analysis using EXPLAIN and performance profiling tools.
  • Bulk Data Operations and Aggregation: Efficiently manage bulk inserts, updates, and deletions on large databases. Perform advanced aggregations, joins, and set operations to extract insights from large datasets.
  • Data Partitioning and Storage Management: Implement partitioning techniques for large files, ensuring that files are split into manageable chunks for fast processing and querying.
  • Handling Complex Data Pipelines: Build and manage ETL (Extract, Transform, Load) pipelines to process and transform data efficiently, ensuring high performance even with large volumes of data.
  • Parallel Processing Tools: Use libraries like Joblib and Dask to parallelize operations for faster execution (see the parallel-processing sketch after this list); PySpark knowledge is a plus.
  • Data Quality Assurance: Implement robust validation and error-handling mechanisms within data pipelines to ensure data quality.
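
As a rough illustration of the chunked, generator-based style referred to above, the following minimal sketch streams a large CSV and aggregates it chunk by chunk. The sales.csv file and the region/amount columns are hypothetical placeholders, not details of the role.

```python
import pandas as pd


def iter_clean_chunks(path, chunksize=100_000):
    """Yield cleaned chunks lazily so only one chunk is held in memory at a time."""
    for chunk in pd.read_csv(path, chunksize=chunksize):
        chunk = chunk.dropna(subset=["amount"])          # simple validation step
        chunk["amount"] = chunk["amount"].astype("float64")
        yield chunk


def total_by_region(path):
    """Stream a large CSV and aggregate it without loading the full file."""
    totals = {}
    for chunk in iter_clean_chunks(path):
        grouped = chunk.groupby("region")["amount"].sum()
        for region, value in grouped.items():
            totals[region] = totals.get(region, 0.0) + value
    return totals


if __name__ == "__main__":
    print(total_by_region("sales.csv"))  # hypothetical input file
```

The same pattern extends to compressed input (pandas infers gzip from a .csv.gz extension) and to writing each processed chunk out as its own Parquet partition.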
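
Likewise, a minimal sketch of the parallel-processing idea: per-partition work is fanned out across CPU cores with the standard library's multiprocessing module (Joblib and Dask offer similar, higher-level APIs). The partitions/ directory and the amount column are hypothetical, and reading Parquet with pandas assumes pyarrow or fastparquet is installed.

```python
from multiprocessing import Pool
from pathlib import Path

import pandas as pd


def summarise_partition(path):
    """Process one partition file independently; runs in a worker process."""
    df = pd.read_parquet(path)
    return path.name, len(df), df["amount"].sum()


def summarise_all(partition_dir, workers=4):
    """Map per-partition work across CPU cores and collect the results."""
    paths = sorted(Path(partition_dir).glob("*.parquet"))
    with Pool(processes=workers) as pool:
        return pool.map(summarise_partition, paths)


if __name__ == "__main__":
    for name, rows, total in summarise_all("partitions/"):
        print(name, rows, total)
```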

Required Skills and Experience:

  • Proficiency in Python for Data Engineering:
    • Strong experience with Python libraries like Pandas, NumPy, and SQLAlchemy for efficient data handling.
    • Expertise in Python’s memory management, file handling, and data streaming techniques.
    • Proficient in using generators, iterators, and advanced parallel processing techniques to optimize large-scale data operations.
  • Advanced SQL Expertise:
    • Proven ability to write and optimize complex SQL queries on large datasets.
    • Expertise in query optimization techniques including indexing, partitioning, and analyzing execution plans.
    • Experience with window functions, CTEs, and aggregations for large-scale data analytics (a SQL sketch follows this list).
  • Data Processing Techniques:
    • Experience with processing and querying large datasets using advanced Python techniques (e.g., generators, iterators, multi-threading, multi-processing).
    • Ability to handle large files efficiently by using chunk processing, streaming, and compressed data formats (e.g., GZIP, Snappy).
  • File Formats and Data Storage:
    • Extensive experience handling data in formats like CSV, JSON, and Parquet.
    • Knowledge of advanced file management and storage optimization techniques.
  • Performance Profiling and Tuning:
    • Experience with tools like cProfile, line_profiler, and memory_profiler for profiling code and identifying performance bottlenecks (see the profiling sketch after this list).
  • Parallel and Asynchronous Programming:
    • Advanced knowledge of multi-threading, multi-processing, and asyncio for processing large datasets efficiently.
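
To make the profiling expectation concrete, here is a minimal sketch using the standard library's cProfile and pstats modules; slow_pipeline is a hypothetical stand-in for a real transformation step, not part of the role.

```python
import cProfile
import pstats


def slow_pipeline():
    """Hypothetical stand-in for an expensive transformation step."""
    return sum(i * i for i in range(1_000_000))


profiler = cProfile.Profile()
profiler.enable()
slow_pipeline()
profiler.disable()

# Rank functions by cumulative time to find the bottlenecks worth optimizing.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)  # show the top 10 entries
```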
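
And to illustrate the SQL side, a minimal sketch of a CTE combined with a window function, run against an in-memory SQLite database via the standard library (window functions require SQLite 3.25+). The orders table and its columns are hypothetical, and EXPLAIN output and indexing/partitioning syntax differ across engines such as PostgreSQL or MySQL.

```python
import sqlite3

# Hypothetical query: rank customers by monthly spend using a CTE and a window function.
QUERY = """
WITH monthly AS (
    SELECT customer_id,
           strftime('%Y-%m', order_date) AS month,
           SUM(amount) AS month_total
    FROM orders
    GROUP BY customer_id, month
)
SELECT customer_id,
       month,
       month_total,
       RANK() OVER (PARTITION BY month ORDER BY month_total DESC) AS month_rank
FROM monthly
ORDER BY month, month_rank
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL)")
# An index on the grouping columns; real engines would also weigh partitioning.
conn.execute("CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date)")

# Inspect the plan before running the query (SQLite's analogue of EXPLAIN).
for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY):
    print(row)

for row in conn.execute(QUERY):
    print(row)
```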

Optional Skills:

  • Experience with PySpark for distributed data processing is a plus.
  • Familiarity with cloud storage systems like AWS S3 or Google Cloud Storage is desirable.

Education

Any Graduate