5  Week 4: Subqueries and Views

In this lecture, we will explore subqueries and views in SQL. These concepts help us break complicated queries into smaller, more manageable parts, improving our ability to read and maintain our code.

5.1 Connecting to our database

As always, to access the data, we create our database connection to our data.

library(duckdb)
library(DBI)

con <- DBI::dbConnect(
  duckdb::duckdb(), 
  "data/GiBleed_5.3_1.1.duckdb"
)

With our data loaded and ready to go, let’s get started!

5.2 Subqueries

A subquery is a query nested inside another query. Subqueries let us process smaller computations inside larger outer queries.

5.2.1 Using a Subquery in the SELECT Clause

The following is a great example from The Data School, offering a visualization of how a subquery works. In this case, we use a subquery to calculate the total number of friends across all individuals, subdivided by state. Here, we are making use of the subquery within our SELECT clause. Let’s dive a little deeper into this type of example using our own data.

5.2.1.1 A brief tangent: using DATEDIFF to compare dates

The DATEDIFF function in SQL can be used to calculate differences between days. DATEDIFF takes three parameters: the unit of time, a first date, and a second date. For instance, calling:

SELECT DATEDIFF('month', DATE '2020-01-01', DATE '2024-03-07')

calculates the number of months between January 1st, 2020, and March 7th, 2024. All three parameters are required. You can refer to the documentation for DATEDIFF here to see other options for time intervals.

Note

Dates in SQL typically follow the ISO 8601 format of ‘YYYY-MM-DD’. Other date formats may work depending on the database system being used, though there is a chance for misinterpretation.

Note that in our example, we explicitly cast our two dates as DATE variables - while this is not necessary depending on the database system, it enhances readability and interpretability of the code for both other users as well as the database system.

  1. What do you think happens if we swap the order of dates in the DATEDIFF command?

5.2.2 Example: Using a Subquery in the SELECT Clause

Let’s use a subquery to dynamically calculate the age of each individual (as of March 7th, 2024) in our database while collecting other patient demographic data. To handle this, we’ll make use of the person table in our dataframe and the birth_datetime column.

SELECT 
  person_id, 
  birth_datetime,
  gender_source_value, 
  race_source_value, 
  ethnicity_source_value,
  (SELECT 
    DATE_DIFF('year', birth_datetime, DATE '2024-03-07')
  ) AS age
FROM 
  person;

As we can see in the above example, we’ve performed the computation of calculating patient age in a subquery:

SELECT 
    DATE_DIFF('year', birth_datetime, DATE '2024-03-07')

This subquery is integrated into the larger query of collecting patient data.

5.2.2.1 Check on learning

Fill in the blank in the query below to dynamically calculate the number of days between the condition start date and condition end date for all conditions from the condition_occurrence table

SELECT 
  person_id,
  visit_occurrence_id,
  condition_occurrence_id,
  condition_concept_id, 
  condition_start_date,
  condition_end_date,
  (SELECT 
    DATE_DIFF(_____, _____, _____)
  ) AS condition_time_span
FROM 
  condition_occurrence;

5.2.3 Filtering with a Subquery

We’ve now worked through a couple of examples where we use subqueries to create new variables within our SELECT clause. Another type of query we can tackle is the filtration of data based on conditions calculated in a subquery.

Here’s another great example from The Data School, where we apply a subquery in the filtration component of our larger query to find individuals on Facebook who have the same number of Facebook connections as anyone else on LinkedIn.

5.2.3.1 A brief review: the IN clause

The IN clause in SQL is used to filter records where a column matches any value in a specified list or subquery result. It is a shorthand for multiple OR conditions and is commonly used for readability and efficiency.

For instance, the basic syntax of:

SELECT column_name
FROM table_name
WHERE column_name IN (value1, value2, value3);

is equivalent to:

SELECT column_name
FROM table_name
WHERE column_name = value1 
   OR column_name = value2 
   OR column_name = value3;

Now back to using a subquery for filtering!

5.2.4 Example: Filtering with a Subquery

For our own database, let’s collect patient demographic data for all patients who had some kind of procedure performed after December 31st, 2018. We’ll make use of the person and procedure_occurrence tables for this query.

We can start by writing the computation for our subquery - collection patient IDs for individuals who had a procedure after December 31st, 2018.

SELECT 
  person_id 
FROM 
  procedure_occurrence
WHERE 
  procedure_datetime >= DATE '2019-01-01';

Now, we can insert this query into the WHERE clause of our larger query that collects patient demographic information!

SELECT 
  person_id, 
  birth_datetime, 
  gender_source_value, 
  race_source_value, 
  ethnicity_source_value
FROM 
  person
WHERE 
  person_id IN (
    SELECT 
      person_id 
    FROM 
      procedure_occurrence
    WHERE 
      procedure_datetime >= DATE '2019-01-01'
  );

5.2.4.1 Check on learning

Write out a query to collection patient IDs for individuals who had a condition start date after December 31st, 2018. This query will become the subquery in our larger computation.

SELECT 
  person_id 
FROM 
  condition_occurrence
WHERE 
  condition_start_date >= ______

Now, fill in the blank in the following SQL query with the subquery that you just developed to collect patient demographic data for any patient in the condition_occurrence table who had a condition start date on or after January 1st, 2019.

SELECT 
  person_id, 
  birth_datetime, 
  gender_source_value, 
  race_source_value, 
  ethnicity_source_value
FROM 
  person
WHERE 
  person_id IN (_________);

5.2.5 When to use subqueries

Subqueries are powerful because they allow you to break down complex queries into smaller, more manageable parts. You should use subqueries when:

  • You need to use a computed value in a query: See our previous examples!

  • You want to avoid duplicating code: Instead of repeating a calculation, you can use a subquery to define it a single time and reuse it (i.e. age).

  • You want to avoid performing unnecessary JOIN’s: Subqueries let you filter results row-by-row based on information from another table without requiring a JOIN.

  • You need to improve your code’s readability: Subqueries help make queries more modular and easier to debug. Conceptually, it can be easier to create a multi-step query and check intermediate phases than do perform a bunch of JOIN’s.

  1. Can you think of any examples where it might be better to use a JOIN over a subquery?

5.3 Views

A view is a stored SQL query that acts as a virtual table. Views improve code reuse and readability. The following image, taken from SQLShack, depicts how a complicated query can be turned into a customized view that can be used in downstream data processing.

5.3.1 When to use views

Similar to subqueries, views allow us to organize our data into more modular, accessible, and easy-to-read components. You should use views when:

  • You want to simplify complex queries and improve code maintainability: Instead of frequently reusing a complex query, you can store the data generated from the query into a view and access it readily.

  • You want to enhance security and restrict data access to others without authorization: Making your own view can limit access to sensitive columns while still allowing other users to query the necessary data

  • You want to promote data consistency: Performing a calculation in a view ensures that everyone uses the same calculation to grab consistent data (e.g. calculating age of patients)

Note

A view itself does not actually store data like a physical table does. Instead, a view is a saved SQL query that gets executed each time you query the view.

5.3.2 A brief tangent: indexing

Indexing is a technique used to speed up data retrieval from a database table. An index improves the efficiency of queries by allowing the database to locate rows faster without having to scan the entire table. This is similar to how a table of contents in a book helps you quickly find chapters instead of reading every page.

However, views are not indexed: Since views are virtual tables, they do not store data or have their own indexes. Instead, they rely on the indexes that come from the underlying tables. Because views do not have indexes, querying a view can be slower than querying a physical table. Indeed, since the database recomputes the view’s query each time, more complex views can lead to performance issues.

5.3.3 Example: Creating a View

With our own data, let’s create a view from the concept table that focuses on the drugs in our dataset.

CREATE VIEW drugs AS
SELECT * FROM concept
WHERE domain_id == 'Drug';

Now, we can use this view just like a table:

SELECT * 
FROM drugs
LIMIT 5;
5 records
concept_id concept_name domain_id vocabulary_id concept_class_id standard_concept concept_code valid_start_date valid_end_date invalid_reason
1118088 celecoxib 200 MG Oral Capsule [Celebrex] Drug RxNorm Branded Drug S 213469 1970-01-01 2099-12-31 NA
40213201 pneumococcal polysaccharide vaccine, 23 valent Drug CVX CVX S 33 2008-12-01 2099-12-31 NA
1557272 Alendronate Drug RxNorm Ingredient S 46041 1970-01-01 2099-12-31 NA
19129655 Ampicillin 100 MG/ML Injectable Solution Drug RxNorm Clinical Drug S 789980 2008-03-30 2099-12-31 NA
44923712 celecoxib 200 MG Oral Capsule [Celebrex] Drug NDC 11-digit NDC NA 00025152531 2000-01-01 2099-12-31 NA
Note

If a view already exists in your database, then trying to create a new view with the same name will generate an error! To delete a view from memory, using the DROP VIEW command. E.g.:

DROP VIEW IF EXISTS drugs;

5.3.3.1 Check on learning

Fill in the blank in the query below to create a view that stores only measurements from the concept table

CREATE VIEW measurements AS
SELECT * FROM concept
WHERE domain_id == ________;
SELECT *
FROM measurements
LIMIT 5;

5.4 Query Optimization

While writing efficient SQL queries is important, database performance optimization is a complex topic that is mostly beyond the scope of “Intro to SQL”. However, here are some key takeaways to keep in mind:

  • Do not manually create indexes: Indexing can significantly improve query performance, but in most cases, it is the responsibility of the Database Administrator (DBA) to manage indexes appropriately. If you believe an index is needed, consult with your DBA.

  • When in doubt, talk to your database administrator: Especially when your database is transactional, you should not be the one doing these modifications! DBAs have the expertise to optimize database performance, manage indexing, and ensure efficient query execution. Trust your DBA!

5.5 Summary

  • Subqueries allow us to use the result of one query inside another
  • Views provide a way to store and reuse complex queries as virtual tables
  • Using subqueries and views can make SQL queries more modular and maintanable.

5.6 Always close the connection

When we’re done, always close the connection with dbDisconnect().

5.7 References

  • The Data School - all SUBQUERY animations come from here
  • SQL Shack - the image depicting the creation of a VIEW comes from here
  • W3 Schools - a reference for parameter options for DATEDIFF