New to SQL? Use Start here first (tables, rows, reading a query in English). Interview / on-call? Use Basics → Expert below. Snowflake dialect is close to ANSI SQL—always confirm function names in docs for your engine.
SQL is a language for asking questions about tables of data. You describe what you want; the database figures out how to fetch it (mostly—you still learn good habits).
amount).id).SELECT in EnglishLogical order to reason about (not always execution order): FROM (where is the data?) → WHERE (which rows keep?) → SELECT (which columns show?) → ORDER BY (how to sort?).
SELECT name, amount FROM sales WHERE region = 'EU' ORDER BY amount DESC;
Plain English: “From the sales table, keep only rows where region is EU, show name and amount, sort biggest amount first.”
= NULL instead of IS NULL—two unknowns are not equal in SQL.'EU'); double quotes are often for identifiers.FROM a, b WHERE …)—prefer explicit JOIN … ON … so you do not accidentally multiply rows.Think: FROM → WHERE → SELECT → ORDER (logical order you reason about, not always how the optimizer runs).
SELECT columns or expressions; AS renames output.FROM one table or a join of tables / subqueries.WHERE filters rows before grouping (no aggregates here in standard SQL).ORDER BY sorts; LIMIT / OFFSET cap rows (syntax varies: SQL Server uses TOP, etc.).SELECT id, name, amount * 1.1 AS amount_with_tax FROM sales WHERE region = 'EU' AND amount > 0 ORDER BY amount DESC LIMIT 100;
| Join | Meaning in one line |
|---|---|
INNER JOIN | Only rows that match both sides. |
LEFT JOIN | All left rows; right columns NULL if no match (great for “missing dimension” checks). |
FULL OUTER JOIN | Keep both sides; unmatched columns NULL. |
CROSS JOIN | Cartesian product—easy to explode row count by mistake. |
NULL = NULL is unknown, not true—use IS NULL / IS NOT NULL.COALESCE(a, b) first non-null; NULLIF(x, y) null if equal.SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue FROM sales WHERE order_date >= DATE '2026-01-01' GROUP BY region HAVING SUM(amount) > 10000;
HAVING filters after aggregation (like a WHERE for groups).
Readable “pipeline in SQL”: name intermediate steps. Interview win: you structure logic instead of nested subqueries only.
WITH daily AS ( SELECT order_date::DATE AS d, SUM(amount) AS rev FROM sales GROUP BY 1 ) SELECT d, rev, AVG(rev) OVER (ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rev_7d_avg FROM daily;
CASE — conditional valuesCASE WHEN amount >= 1000 THEN 'large' WHEN amount >= 100 THEN 'medium' ELSE 'small' END AS tier
UNION combines results (dedup); UNION ALL keeps duplicates (usually faster). INTERSECT / EXCEPT where supported.
EXISTS (SELECT 1 FROM … WHERE …) stops once a match exists—often clearer than IN (subquery) for large sets.
Compute per row using a window of related rows without collapsing to one row per group. Pattern: function() OVER (PARTITION BY … ORDER BY … frame).
SELECT user_id, event_time, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS seq, LAG(event_type, 1) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_type FROM events;
ROW_NUMBER() unique rank; RANK() ties skip numbers; DENSE_RANK() ties don’t skip.SUM(x) OVER (…) running totals when you set a frame (e.g. rows between unbounded preceding and current row).QUALIFYFilter on window results without a subquery—great for “latest row per key” patterns (verify support on your engine).
MERGE (upsert)One statement: match rows, then WHEN MATCHED update/delete, WHEN NOT MATCHED insert—common in slowly changing dimensions and sync jobs.
EXPLAIN / profile as “bytes scanned, spill, skew”—not magic.Read this first if you are new: A warehouse is built for big questions over history, not for replacing your checkout database. “Fresher data” usually means shorter gaps between batches (minutes) or continuous small loads, then SQL that only touches new or changed rows instead of re-reading everything.
“Real time” in warehouses is usually micro-batch or continuous ingest, then SQL transforms on fresh data—not single-row OLTP inside the warehouse.
MERGE or delete-by-partition patterns.| Pattern | SQL idea |
|---|---|
| Incremental load | Filter on updated_at > :watermark or use change feeds / streams (Snowflake) / CDC tables. |
| Idempotent load | MERGE on natural key; or delete+insert partition; avoid blind duplicates. |
| Late-arriving events | Windows with RANGE / watermarks in streaming engines; in SQL, QUALIFY or self-join on “as of” time. |
| Sessionization | SUM(is_new_session) OVER (ORDER BY time) or gap-based grouping—classic window interview problem. |
Pair with SnowPro study for Snowflake-specific objects (VARIANT, streams, tasks) on top of this SQL base.
Short model answers—adapt to the engine (Postgres, SQL Server, Snowflake, etc.).
Structured Query Language: a declarative language to define, query, and manipulate relational data. You describe sets of rows and columns; the engine plans how to execute.
JOIN in plain words?Combining rows from two tables when they match on a key (e.g. orders and customers on customer_id). Inner join keeps only matches; left join keeps all rows from the left table and fills missing right-side data with NULL.
A table is stored data with a name. A query result is a temporary set of rows produced by SELECT—you can save it as a view, table, or CTE depending on the product.
WHERE and HAVING?WHERE filters rows before aggregation. HAVING filters groups after GROUP BY (can use aggregates like SUM(amount) > 100).
INNER JOIN vs LEFT JOIN.Inner: only rows that match on both tables. Left: all rows from the left table; columns from the right are NULL when there is no match—useful for “what’s missing?” checks.
UNION and UNION ALL?UNION combines result sets and removes duplicate rows (extra sort/hash work). UNION ALL keeps everything including duplicates—usually faster when you know duplicates aren’t a problem.
Primary key: uniquely identifies a row in a table (often surrogate id). Foreign key: column(s) referencing another table’s key—enforces relational integrity when the DB enforces constraints.
Organizing tables to reduce redundancy: e.g. put repeating groups in their own table, use keys to relate. Tradeoff: more joins vs update anomalies. Denormalize selectively for read-heavy analytics.
An index is a separate structure to find rows faster (like a book index). Helps selective lookups; hurts bulk loads and heavy writes (maintain index). Wrong index may be ignored by the optimizer if the predicate isn’t selective.
NULLs behave in comparisons and joins?NULL = NULL is unknown, not true—use IS NULL. In joins, NULL doesn’t match NULL unless you use special predicates—classic source of “missing rows” bugs.
A subquery that references columns from an outer query. It can run once per outer row (expensive)—often replaceable with a join or window for clarity and performance.
GROUP BY?When you need per-row output with aggregates over a partition (running total, rank, previous row) without collapsing to one row per group.
SELECT user_id, COUNT(*) AS c FROM events GROUP BY user_id HAVING COUNT(*) > 1;
A transaction groups operations that commit or roll back together. ACID: Atomicity (all or nothing), Consistency (valid state), Isolation (concurrent transactions don’t corrupt each other—levels vary), Durability (after commit, survives crashes).
DELETE, TRUNCATE, and DROP?DELETE removes rows (can filter, logged row-wise). TRUNCATE clears the table quickly (usually minimal row logging; resets storage semantics vary). DROP removes the table object entirely.
A predicate can use an index/zone map—e.g. WHERE order_date >= '2026-01-01'. Often not sargable: WHERE YEAR(order_date) = 2026 because a function wraps the column.
LIMIT/OFFSET scale?Large OFFSET scans/skips many rows each page—gets slow. Better patterns: keyset pagination (WHERE id > :last_id ORDER BY id LIMIT 50) or cursors—depends on product.
Join order and types (nested loop vs hash), estimated vs actual rows (cardinality mis-estimates), spills to disk, which indexes were used, and biggest cost operators to target first.
See also Python cheat sheet for paired data-engineering interviews.