SQL — basics → expert (interview & real time)

New to SQL? Use Start here first (tables, rows, reading a query in English). Interview / on-call? Use Basics → Expert below. Snowflake dialect is close to ANSI SQL—always confirm function names in docs for your engine.

Start here — if you are a student or fresher

SQL is a language for asking questions about tables of data. You describe what you want; the database figures out how to fetch it (mostly—you still learn good habits).

Words that matter

Database: a collection of related data managed by software.
Table: like one sheet in a spreadsheet—named columns, many rows.
Row: one record (e.g. one order). Column: one field on every row (e.g. amount).
Primary key: column(s) that uniquely identify a row (often an id).

Read a `SELECT` in English

Logical order to reason about (not always execution order): FROM (where is the data?) → WHERE (which rows keep?) → SELECT (which columns show?) → ORDER BY (how to sort?).

SELECT name, amount
FROM sales
WHERE region = 'EU'
ORDER BY amount DESC;

Plain English: “From the sales table, keep only rows where region is EU, show name and amount, sort biggest amount first.”

Common fresher mistakes

Using = NULL instead of IS NULL—two unknowns are not equal in SQL.
Mixing quotes: text uses single quotes in standard SQL ('EU'); double quotes are often for identifiers.
Old-style comma joins (FROM a, b WHERE …)—prefer explicit JOIN … ON … so you do not accidentally multiply rows.

Next step: When the above feels obvious, jump to Basics for joins and null-safe patterns.

Basics — read and filter data

Think: FROM → WHERE → SELECT → ORDER (logical order you reason about, not always how the optimizer runs).

Core statements

SELECT columns or expressions; AS renames output.
FROM one table or a join of tables / subqueries.
WHERE filters rows before grouping (no aggregates here in standard SQL).
ORDER BY sorts; LIMIT / OFFSET cap rows (syntax varies: SQL Server uses TOP, etc.).

SELECT id, name, amount * 1.1 AS amount_with_tax
FROM sales
WHERE region = 'EU' AND amount > 0
ORDER BY amount DESC
LIMIT 100;

Joins (interview favourite)

Join	Meaning in one line
`INNER JOIN`	Only rows that match both sides.
`LEFT JOIN`	All left rows; right columns NULL if no match (great for “missing dimension” checks).
`FULL OUTER JOIN`	Keep both sides; unmatched columns NULL.
`CROSS JOIN`	Cartesian product—easy to explode row count by mistake.

Nulls and safety

NULL = NULL is unknown, not true—use IS NULL / IS NOT NULL.
COALESCE(a, b) first non-null; NULLIF(x, y) null if equal.

Intermediate — aggregates, shapes, reuse

Group and filter groups

SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM sales
WHERE order_date >= DATE '2026-01-01'
GROUP BY region
HAVING SUM(amount) > 10000;

HAVING filters after aggregation (like a WHERE for groups).

Common Table Expressions (CTEs)

Readable “pipeline in SQL”: name intermediate steps. Interview win: you structure logic instead of nested subqueries only.

WITH daily AS (
  SELECT order_date::DATE AS d, SUM(amount) AS rev FROM sales GROUP BY 1
)
SELECT d, rev, AVG(rev) OVER (ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rev_7d_avg
FROM daily;

`CASE` — conditional values

CASE WHEN amount >= 1000 THEN 'large' WHEN amount >= 100 THEN 'medium' ELSE 'small' END AS tier

Set operations

UNION combines results (dedup); UNION ALL keeps duplicates (usually faster). INTERSECT / EXCEPT where supported.

Semi-join patterns

EXISTS (SELECT 1 FROM … WHERE …) stops once a match exists—often clearer than IN (subquery) for large sets.

Expert — windows, analytics, SQL that “feels like code”

Window functions

Compute per row using a window of related rows without collapsing to one row per group. Pattern: function() OVER (PARTITION BY … ORDER BY … frame).

SELECT
  user_id,
  event_time,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS seq,
  LAG(event_type, 1) OVER (PARTITION BY user_id ORDER BY event_time) AS prev_type
FROM events;

ROW_NUMBER() unique rank; RANK() ties skip numbers; DENSE_RANK() ties don’t skip.
SUM(x) OVER (…) running totals when you set a frame (e.g. rows between unbounded preceding and current row).

Snowflake / modern dialect: `QUALIFY`

Filter on window results without a subquery—great for “latest row per key” patterns (verify support on your engine).

`MERGE` (upsert)

One statement: match rows, then WHEN MATCHED update/delete, WHEN NOT MATCHED insert—common in slowly changing dimensions and sync jobs.

Performance mindset (interview gold)

Sargable predicates: filter on raw columns when possible; wrapping columns in functions can block index / zone-map use.
Cardinality: know when nested loops vs hash join wins (conceptually).
Read EXPLAIN / profile as “bytes scanned, spill, skew”—not magic.

Real time & pipelines (SQL’s role)

Read this first if you are new: A warehouse is built for big questions over history, not for replacing your checkout database. “Fresher data” usually means shorter gaps between batches (minutes) or continuous small loads, then SQL that only touches new or changed rows instead of re-reading everything.

“Real time” in warehouses is usually micro-batch or continuous ingest, then SQL transforms on fresh data—not single-row OLTP inside the warehouse.

Vocabulary (one line each)

Watermark: a saved “we processed up to here” timestamp or ID so the next run knows what is new.
Idempotency: running the same load twice does not double-count—usually natural keys + MERGE or delete-by-partition patterns.
At-least-once delivery: the pipeline may deliver duplicates; your SQL must tolerate or dedupe them.

Why micro-batch is common

Streaming engines (Kafka, Flink, …) excel at per-event windows; warehouses excel at set-based SQL—teams often meet in the middle with small files or micro-batches landing every minute.
Dashboards rarely need millisecond freshness; they need correct, bounded-latency numbers and predictable cost.
Full re-scans of huge history every run explode credits—incremental SQL is the cost lever.

Pattern	SQL idea
Incremental load	Filter on `updated_at > :watermark` or use change feeds / streams (Snowflake) / CDC tables.
Idempotent load	`MERGE` on natural key; or delete+insert partition; avoid blind duplicates.
Late-arriving events	Windows with `RANGE` / watermarks in streaming engines; in SQL, `QUALIFY` or self-join on “as of” time.
Sessionization	`SUM(is_new_session) OVER (ORDER BY time)` or gap-based grouping—classic window interview problem.

Realtime interview line: “We bounded latency at ingest (seconds), then used incremental SQL downstream so dashboards didn’t full-scan history every run.”

Interview angles — how to sound experienced

Clarify the question: engine? data size? latency SLA? idempotency?
State tradeoffs: readability vs one giant query; CTE materialization behavior varies by platform.
One concrete example: “We replaced a correlated subquery with a window because …”
Testing: row counts, checksums, NULL rates, duplicate keys—how you’d prove correctness after a rewrite.
Honesty: “I’d validate syntax in docs for this engine” beats bluffing dialect trivia.

Pair with SnowPro study for Snowflake-specific objects (VARIANT, streams, tasks) on top of this SQL base.

Top SQL interview questions (click to expand)

Short model answers—adapt to the engine (Postgres, SQL Server, Snowflake, etc.).

What is SQL?

Structured Query Language: a declarative language to define, query, and manipulate relational data. You describe sets of rows and columns; the engine plans how to execute.

What is a JOIN in plain words?

Combining rows from two tables when they match on a key (e.g. orders and customers on customer_id). Inner join keeps only matches; left join keeps all rows from the left table and fills missing right-side data with NULL.

What is a table vs a query result?

A table is stored data with a name. A query result is a temporary set of rows produced by SELECT—you can save it as a view, table, or CTE depending on the product.

What is the difference between WHERE and HAVING?

WHERE filters rows before aggregation. HAVING filters groups after GROUP BY (can use aggregates like SUM(amount) > 100).

Explain INNER JOIN vs LEFT JOIN.

Inner: only rows that match on both tables. Left: all rows from the left table; columns from the right are NULL when there is no match—useful for “what’s missing?” checks.

What is the difference between UNION and UNION ALL?

UNION combines result sets and removes duplicate rows (extra sort/hash work). UNION ALL keeps everything including duplicates—usually faster when you know duplicates aren’t a problem.

What are primary keys and foreign keys?

Primary key: uniquely identifies a row in a table (often surrogate id). Foreign key: column(s) referencing another table’s key—enforces relational integrity when the DB enforces constraints.

What is database normalization (high level)?

Organizing tables to reduce redundancy: e.g. put repeating groups in their own table, use keys to relate. Tradeoff: more joins vs update anomalies. Denormalize selectively for read-heavy analytics.

What is an index and when can it hurt?

An index is a separate structure to find rows faster (like a book index). Helps selective lookups; hurts bulk loads and heavy writes (maintain index). Wrong index may be ignored by the optimizer if the predicate isn’t selective.

How do NULLs behave in comparisons and joins?

NULL = NULL is unknown, not true—use IS NULL. In joins, NULL doesn’t match NULL unless you use special predicates—classic source of “missing rows” bugs.

What is a correlated subquery?

A subquery that references columns from an outer query. It can run once per outer row (expensive)—often replaceable with a join or window for clarity and performance.

When would you use a window function instead of GROUP BY?

When you need per-row output with aggregates over a partition (running total, rank, previous row) without collapsing to one row per group.

How would you find duplicate keys in a table?

SELECT user_id, COUNT(*) AS c
FROM events
GROUP BY user_id
HAVING COUNT(*) > 1;

What is a transaction? Mention ACID briefly.

A transaction groups operations that commit or roll back together. ACID: Atomicity (all or nothing), Consistency (valid state), Isolation (concurrent transactions don’t corrupt each other—levels vary), Durability (after commit, survives crashes).

What is the difference between DELETE, TRUNCATE, and DROP?

DELETE removes rows (can filter, logged row-wise). TRUNCATE clears the table quickly (usually minimal row logging; resets storage semantics vary). DROP removes the table object entirely.

What does “sargable” mean?

A predicate can use an index/zone map—e.g. WHERE order_date >= '2026-01-01'. Often not sargable: WHERE YEAR(order_date) = 2026 because a function wraps the column.

How does pagination with LIMIT/OFFSET scale?

Large OFFSET scans/skips many rows each page—gets slow. Better patterns: keyset pagination (WHERE id > :last_id ORDER BY id LIMIT 50) or cursors—depends on product.

What would you look at in a query execution plan (high level)?

Join order and types (nested loop vs hash), estimated vs actual rows (cardinality mis-estimates), spills to disk, which indexes were used, and biggest cost operators to target first.

See also Python cheat sheet for paired data-engineering interviews.

Start here — if you are a student or fresher

Words that matter

Read a SELECT in English

Common fresher mistakes

Basics — read and filter data

Core statements

Joins (interview favourite)

Nulls and safety

Intermediate — aggregates, shapes, reuse

Group and filter groups

Common Table Expressions (CTEs)

CASE — conditional values

Set operations

Semi-join patterns

Expert — windows, analytics, SQL that “feels like code”

Window functions

Snowflake / modern dialect: QUALIFY

MERGE (upsert)

Performance mindset (interview gold)

Real time & pipelines (SQL’s role)

Vocabulary (one line each)

Why micro-batch is common

Interview angles — how to sound experienced

Top SQL interview questions (click to expand)

Read a `SELECT` in English

`CASE` — conditional values

Snowflake / modern dialect: `QUALIFY`

`MERGE` (upsert)