We have an immediate, long-term opportunity with one of our prime clients for a Data Engineer position, working on a remote basis.
MUST HAVES (TOP 3):
- ClickHouse
- Kubernetes - understanding infra
- AWS
- Need to know how everything connects
- Experience at scale and with multi-tenant environments is a plus
- Some candidates interviewed so far said they had experience at scale, but it turned out to be small scale and not large enough for what the client needs
PROJECT DETAILS (Size, Scale, Scope):
- What is the scope of work that is being completed? Building out a data lake
- What part of the project is this resource supporting? Building out the data lake and creating the ClickHouse database to scale across the org
- Day to day responsibilities
- Are there deliverables and milestones the team is working towards? If so, what? Working towards go-live on 12/31; starting 1/1, they will have 12 months to scale the data lake across the entire enterprise
- What is the next phase of this project? Go live with production on 12/31
Transcription Notes:
The meeting focused on discussing the development and implementation of a monitoring data lake, with specific emphasis on the technologies and strategies involved. Here are the key points:
- Data Pipeline and Storage: Frohman detailed the process of collecting logs from various sources, structuring them through a pipeline that uses Vector (vector.dev) for rate limiting and security, and storing them in ClickHouse as the database.
- Querying and Visualization: They plan to offer Grafana for standard querying but will support other tools like Databricks, Power BI, and Excel. An open-source tool called Keep will serve as the middle layer for event management and rules engine.
- Event Management: Keep will query the lake for triggering events in ServiceNow, acting as an intermediary due to ServiceNow's inability to handle millions of rules directly.
- Scale and Talent Needs: The platform aims to handle 100 petabytes a month initially, requiring sharp talent, especially in ClickHouse and Vector (vector.dev). They are seeking vendor support for architecture and operational expertise.
- Challenges and Solutions: The discussion also covered the challenges of scaling, the need for a multi-tenant solution, and the importance of optimizing SQL queries for efficiency. The goal is to consolidate various data sources into a single, queryable database to facilitate correlation and analysis.
- Vendor and Staffing Strategy: Frohman mentioned the plan to host the platform themselves due to governance and legal timeframes, with a preference for platform-as-a-service. They are exploring offshore and onshore staffing options, with specific bill rates and skill sets in mind.
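
To give candidates a feel for the scale mentioned in the notes above, the 100 petabytes a month figure can be converted into a sustained ingest rate. This is only a rough back-of-the-envelope sketch: the decimal unit convention (1 PB = 10^15 bytes), the 30-day month, and the function name are assumptions for illustration and do not come from the meeting.

```python
# Back-of-the-envelope: sustained ingest rate implied by a monthly data volume.
# Assumptions (not from the meeting notes): decimal units and a 30-day month.

PETABYTE = 10**15          # bytes, decimal convention
GIGABYTE = 10**9           # bytes, decimal convention
SECONDS_PER_DAY = 86_400


def sustained_rate_gb_per_s(petabytes_per_month: float, days_in_month: int = 30) -> float:
    """Average GB/s needed to ingest the given monthly volume evenly over the month."""
    total_bytes = petabytes_per_month * PETABYTE
    seconds = days_in_month * SECONDS_PER_DAY
    return total_bytes / seconds / GIGABYTE


rate = sustained_rate_gb_per_s(100)
print(f"{rate:.1f} GB/s sustained")  # roughly 38.6 GB/s under the assumptions above
```

Real traffic is unlikely to be evenly distributed, so peak rates would be higher than this average.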