Why SQL will Never Die

R
Tables
databases
bigdata
Author

Ted Laderas

Published

November 4, 2024

I had an analyst learning R ask me whether their SQL (Structured Query Language) skills are still going to be valued down the line. I’m writing this to reassure them that SQL will always be relevant to data analysts and to demonstrate to you why it will always be useful.

Part of the reason why is that most modern database systems such as DuckDB and Snowflake have SQL built into them. So many people think in SQL, and it is a remarkably expressive language to build queries in.

A Quick Note

I have used databases for a long time, but I got a C in my databases class. Your mileage may vary from my comparisons.

Setup

We’ll use DuckDB to connect to a dataset.

We’ll start by connecting to the DuckDB system and write the penguins table as a view called penguins_view. This is the table view that we’ll query with both dbplyr/duckdbplyr and duckdb:

We can confirm that we loaded our data with the following statement:

We encountered collect() when we investigated parquet files.

tbl() vs dbGetQuery()

When we query the table in DuckDB, we have two main options. We can treat the table like a data source using tbl(), which means we can pipe it to dplyr() functions:

tbl(con, "penguins_view")

Or we can directly query the table using SQL statements and the dbGetQuery() function, such as the following:

dbGetQuery(con, "SELECT * from penguins_view")

Both of these methods get us where we want to go. The main difference is how we query and manipulate the data.

SQL is Easy to Read and Understand

If you look at the definition of the Structured Query Language (SQL), and data science frameworks such as pandas or the {tidyverse}, you will notice that both of these frameworks borrow terminology heavily from SQL.

For example, the select() statement in tidyverse lets you select columns:

Here’s the equivalent in SQL. Our statement is in quotes below:

Filtering

Let’s compare how we filter.

We use WHERE in SQL to specify our filters.

In dplyr, we use filter():

Joining Tables

One of the biggest advantages of using SQL is that you can shape the Query Plan. Good SQL writers can optimize a query so that it will execute faster.

Where SQL excels is in the joining of multiple tables at once, and with multiple criteria.

Query Optimizers

Most modern database systems include a query optimizer - routines that look at your overall query plan and tries to find the most efficient way to execute them.

Indexing

One of the fundamental things we can do in SQL that we can’t do in dplyr is creating indexes on table columns for faster searching.

Subqueries and Temporary Tables

I think one of the strongest arguments for learning SQL is because of subqueries and temporary tables.

Resources

Citation

BibTeX citation:
@online{laderas2024,
  author = {Laderas, Ted},
  title = {Why {SQL} Will {Never} {Die}},
  date = {2024-11-04},
  langid = {en}
}
For attribution, please cite this work as:
Laderas, Ted. 2024. “Why SQL Will Never Die.” November 4, 2024.