Alkemet News

67 points

byhugocorreia90

a day ago |

10 comments
Hi HN, I'm Hugo. I've been building Rocky over the past month, shipping fast in the open. The binary is on GitHub Releases, `dagster-rocky` on PyPI, and the VS Code extension on the Marketplace. I held off on a broader announcement until the trust-system surface was coherent enough to talk about as one thing. The governance waveplan — column classification, per-env masking, 8-field audit trail on every run, `rocky compliance` rollup, role-graph reconciliation, retention policies — landed end-to-end last week in engine-v1.16.0 and rounded out in v1.17.4 (tagged 2026-04-26). That's the milestone I'd been waiting for.

The pitch: keep Databricks or Snowflake. Bring Rocky for the DAG. Rocky is a Rust-based control plane for warehouse pipelines. Storage and compute stay with your warehouse. Rocky owns the graph — dependencies, compile-time types, drift, incremental logic, cost, lineage, governance. The things your current stack can't give you because it doesn't own the DAG.

A few things I think are interesting:

- Branches + replay. `rocky branch create stg` gives you a logical copy of a pipeline's tables (schema-prefix today; native Delta SHALLOW CLONE and Snowflake zero-copy are next). `rocky replay <run_id>` reconstructs which SQL ran against which inputs. Git-grade workflow on a warehouse.

- Column-level lineage from the compiler, not a post-hoc graph crawl. The type checker traces columns through joins, CTEs, and windows. VS Code surfaces it inline via LSP.

- Governance as a first-class surface. Column classification tags plus per-env masking policies, applied to the warehouse via Unity Catalog (Databricks) or masking policies (Snowflake). 8-field audit trail on every run. `rocky compliance` rollup that CI can gate on. Role-graph reconciliation via SCIM + per-catalog GRANT. Retention policies with a warehouse-side drift probe.

- Cost attribution. Every run produces per-model cost (bytes, duration). `[budget]` blocks in `rocky.toml`; breaches fire a `budget_breach` hook event.

- Compile-time portability + blast radius. Dialect-divergence lint across Databricks / Snowflake / BigQuery / DuckDB (12 constructs). `SELECT *` downstream-impact lint.

- Schema-grounded AI. Generated SQL goes through the compiler — AI suggestions type-check before they can land.

What Rocky isn't:

- Not a warehouse — it's the control plane on top.

- Not a Fivetran replacement. `rocky load` handles files (CSV/Parquet/JSONL); for SaaS sources use Fivetran, Airbyte, or warehouse-native CDC.

- Not dbt Cloud — no hosted UI, no managed scheduler. First-class Dagster integration if you need orchestration.

Adapters: Databricks (GA), Snowflake (Beta), BigQuery (Beta), DuckDB (local dev / playground). Apache 2.0.

I'd love feedback on the trust-system framing, the governance surface (particularly classification-to-masking resolution in `rocky compile` and the `rocky compliance` CI gate), the branches/replay design, the cost-attribution primitives, or anything else that catches your eye. Happy to go deep in the thread.