2 Week 1: DESCRIBE, SELECT, WHERE

2.1 Our Database Management System (DBMS) for this course

Client: R/RStudio w/ SQL
Database Engine: DuckDB
Data Storage: single file in data/ folder

2.2 Connecting to our database

To access the data, we need to create a database connection. We use dbConnect() from the DBI package to do this. The first argument specifies the Database engine (duckdb()), and the second provides the file location: "data/data/GiBleed_5.3_1.1.duckdb".

library(duckdb)

Loading required package: DBI

library(DBI)

con <- DBI::dbConnect(duckdb::duckdb(), "data/GiBleed_5.3_1.1.duckdb")

Once open, we can use con (our database connection)

2.3 Keep in Mind: SQL ignores letter case

These are the same to the database engine:

SELECT person_id FROM person;

select PERSON_ID FROM person;

And so on. Our convention is that we capitalize SQL clauses such as SELECT so you can differentiate them from other information.

2.4 Looking at the Entire Database

One of the first things we can learn is to show the contents of the entire database; we can do this with SHOW TABLES:

SHOW TABLES;

Displaying records 1 - 10
name
care_site
cdm_source
concept
concept_ancestor
concept_class
concept_relationship
concept_synonym
condition_era
condition_occurrence
cost

We can get further information about the tables within our database using DESCRIBE; This will give us more information about individual tables:

DESCRIBE;

Displaying records 1 - 10
database	schema	name	column_names	column_types	temporary
GiBleed_5	main	care_site	care_site_id , care_site_name , place_of_service_concept_id , location_id , care_site_source_value , place_of_service_source_value	INTEGER, VARCHAR, INTEGER, INTEGER, VARCHAR, VARCHAR	FALSE
GiBleed_5	main	cdm_source	cdm_source_name , cdm_source_abbreviation , cdm_holder , source_description , source_documentation_reference, cdm_etl_reference , source_release_date , cdm_release_date , cdm_version , vocabulary_version	VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, DATE , DATE , VARCHAR, VARCHAR	FALSE
GiBleed_5	main	concept	concept_id , concept_name , domain_id , vocabulary_id , concept_class_id, standard_concept, concept_code , valid_start_date, valid_end_date , invalid_reason	INTEGER, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, DATE , DATE , VARCHAR	FALSE
GiBleed_5	main	concept_ancestor	ancestor_concept_id , descendant_concept_id , min_levels_of_separation, max_levels_of_separation	INTEGER, INTEGER, INTEGER, INTEGER	FALSE
GiBleed_5	main	concept_class	concept_class_id , concept_class_name , concept_class_concept_id	VARCHAR, VARCHAR, INTEGER	FALSE
GiBleed_5	main	concept_relationship	concept_id_1 , concept_id_2 , relationship_id , valid_start_date, valid_end_date , invalid_reason	INTEGER, INTEGER, VARCHAR, DATE , DATE , VARCHAR	FALSE
GiBleed_5	main	concept_synonym	concept_id , concept_synonym_name, language_concept_id	INTEGER, VARCHAR, INTEGER	FALSE
GiBleed_5	main	condition_era	condition_era_id , person_id , condition_concept_id , condition_era_start_date , condition_era_end_date , condition_occurrence_count	INTEGER, INTEGER, INTEGER, DATE , DATE , INTEGER	FALSE
GiBleed_5	main	condition_occurrence	condition_occurrence_id , person_id , condition_concept_id , condition_start_date , condition_start_datetime , condition_end_date , condition_end_datetime , condition_type_concept_id , condition_status_concept_id , stop_reason , provider_id , visit_occurrence_id , visit_detail_id , condition_source_value , condition_source_concept_id , condition_status_source_value	INTEGER , INTEGER , INTEGER , DATE , TIMESTAMP, DATE , TIMESTAMP, INTEGER , INTEGER , VARCHAR , INTEGER , INTEGER , INTEGER , VARCHAR , INTEGER , VARCHAR	FALSE
GiBleed_5	main	cost	cost_id , cost_event_id , cost_domain_id , cost_type_concept_id , currency_concept_id , total_charge , total_cost , total_paid , paid_by_payer , paid_by_patient , paid_patient_copay , paid_patient_coinsurance , paid_patient_deductible , paid_by_primary , paid_ingredient_cost , paid_dispensing_fee , payer_plan_period_id , amount_allowed , revenue_code_concept_id , revenue_code_source_value, drg_concept_id , drg_source_value	INTEGER, INTEGER, VARCHAR, INTEGER, INTEGER, FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , INTEGER, FLOAT , INTEGER, VARCHAR, INTEGER, VARCHAR	FALSE

We’ll look at a few tables in our work:

person - Contains personal & demographic data
procedure_occurrence - procedures performed on patients and when they happened
concept - contains the specific information (names of concepts) that map into all three above tables

2.5 Describing a table

We can use DESCRIBE to get more information (the metadata) about a table.

DESCRIBE person

Displaying records 1 - 10
column_name	column_type	null	key	default	extra
person_id	INTEGER	YES	NA	NA	NA
gender_concept_id	INTEGER	YES	NA	NA	NA
year_of_birth	INTEGER	YES	NA	NA	NA
month_of_birth	INTEGER	YES	NA	NA	NA
day_of_birth	INTEGER	YES	NA	NA	NA
birth_datetime	TIMESTAMP	YES	NA	NA	NA
race_concept_id	INTEGER	YES	NA	NA	NA
ethnicity_concept_id	INTEGER	YES	NA	NA	NA
location_id	INTEGER	YES	NA	NA	NA
provider_id	INTEGER	YES	NA	NA	NA

We will pay attention to column_name and column_type for the moment.

2.6 Data Types

If you look at the column_type for one of the DESCRIBE statements above, you’ll notice there are different data types:

INTEGER
TIMESTAMP
DATE
VARCHAR

Each column of a database needs to be typed. The data type of a column determines what kinds of calculations or operations we can do on them. For example, we can do things like date arithmetic on DATETIME columns, asking the engine to calculate 5 days after the dates.

You can see all of the datatypes that are available in DuckDB here.

2.7 `SELECT` and `FROM`

If we want to see the contents of a table, we can use SELECT and FROM.

SELECT *          # select all columns
  FROM person     # from the person table
  LIMIT 10;       # return only 10 rows

SELECT * 
  FROM person 
  LIMIT 10;

Displaying records 1 - 10
person_id	gender_concept_id	year_of_birth	month_of_birth	day_of_birth	birth_datetime	race_concept_id	location_id	provider_id	care_site_id	person_source_value	gender_source_value	race_source_value	ethnicity_source_value
6	8532	1963	12	31	1963-12-31	8516	NA	NA	NA	001f4a87-70d0-435c-a4b9-1425f6928d33	F	black	west_indian
123	8507	1950	4	12	1950-04-12	8527	NA	NA	NA	052d9254-80e8-428f-b8b6-69518b0ef3f3	M	white	italian
129	8507	1974	10	7	1974-10-07	8527	NA	NA	NA	054d32d5-904f-4df4-846b-8c08d165b4e9	M	white	polish
16	8532	1971	10	13	1971-10-13	8527	NA	NA	NA	00444703-f2c9-45c9-a247-f6317a43a929	F	white	american
65	8532	1967	3	31	1967-03-31	8516	NA	NA	NA	02a3dad9-f9d5-42fb-8074-c16d45b4f5c8	F	black	dominican
74	8532	1972	1	5	1972-01-05	8527	NA	NA	NA	02fbf1be-29b7-4da8-8bbd-14c7433f843f	F	white	english
42	8532	1909	11	2	1909-11-02	8527	NA	NA	NA	0177d2e0-98f5-4f3d-bcfd-497b7a07b3f8	F	white	irish
187	8507	1945	7	23	1945-07-23	8527	NA	NA	NA	07a1e14d-73ed-4d3a-9a39-d729745773fa	M	white	irish
18	8532	1965	11	17	1965-11-17	8527	NA	NA	NA	0084b0fe-e30f-4930-b6d1-5e1eff4b7dea	F	white	english
111	8532	1975	5	2	1975-05-02	8527	NA	NA	NA	0478d6b3-bdb3-4574-9b93-cf448d725b84	F	white	english

2.8 Try it Out

Look at the first few rows of procedure_occurrence.

SELECT * 
  FROM ____ 
  LIMIT 10;

Why is there a person_id column in this table as well?

2.9 `SELECT`ing a few columns in our table

We can use the SELECT clause to grab specific columns in our data.

SELECT person_id, birth_datetime, gender_concept_id # Columns in our table
  FROM person;                                      # Our Table

SELECT person_id, birth_datetime, gender_concept_id 
  FROM person
  LIMIT 10;

Displaying records 1 - 10
person_id	birth_datetime	gender_concept_id
6	1963-12-31	8532
123	1950-04-12	8507
129	1974-10-07	8507
16	1971-10-13	8532
65	1967-03-31	8532
74	1972-01-05	8532
42	1909-11-02	8532
187	1945-07-23	8507
18	1965-11-17	8532
111	1975-05-02	8532

2.10 Try it Out

What happens if we ask for a column that doesn’t exist in our data?

SELECT person_id, birth_datetime, gender_concept_id, blah
  FROM person;

2.11 Check on Learning

Add race_concept_id and year_of_birth to your SELECT query:

SELECT person_id, birth_datetime, gender_concept_id, ____, ____
  FROM person;

2.12 `WHERE` - filtering our table

Adding WHERE to our SQL statement lets us add filtering to our query:

SELECT person_id, gender_source_value, race_source_value, year_of_birth 
  FROM person 
  WHERE year_of_birth < 2000

Displaying records 1 - 10
person_id	gender_source_value	race_source_value	year_of_birth
6	F	black	1963
123	M	white	1950
129	M	white	1974
16	F	white	1971
65	F	black	1967
74	F	white	1972
42	F	white	1909
187	M	white	1945
18	F	white	1965
111	F	white	1975

One critical thing to know is that you don’t need to include the columns you’re filtering on in the SELECT part of the statement. For example, we could do the following as well, removing year_of_birth from our SELECT:

SELECT person_id, gender_source_value, race_source_value 
  FROM person 
  WHERE year_of_birth < 2000

Displaying records 1 - 10
person_id	gender_source_value	race_source_value
6	F	black
123	M	white
129	M	white
16	F	white
65	F	black
74	F	white
42	F	white
187	M	white
18	F	white
111	F	white

2.12.1 Single quotes and `WHERE`

SQL convention: single quotes (‘M’) refer to values, and double quotes refer to columns (“person_id”). If you try to use double quotes in the below, it will look for a column called “M”.

This will trip you up several times if you’re not used to it.

SELECT person_id, gender_source_value 
  FROM person 
  WHERE gender_source_value = 'M'
  LIMIT 10;

Displaying records 1 - 10
person_id	gender_source_value
123	M
129	M
187	M
40	M
53	M
78	M
69	M
248	M
105	M
49	M

Reminder: use single (’’) quotes in your SQL statements to refer to values, not double quotes (“).

2.12.2 Quick Note

For R users, notice the similarity of select() with SELECT. We can rewrite the above in dplyr code as:

person |>
  select(person_id, gender_source_value, race_source_value)

A lot of dplyr was inspired by SQL. In fact, there is a package called dbplyr that translates dplyr statements into SQL. A lot of us use it, and it’s pretty handy.

2.13 `COUNT` - how many entries?

Sometimes you want to know the size of your result, not necessarily return the entire set of results. That is what COUNT is for.

SELECT COUNT(*)
  FROM procedure_occurrence;

1 records
count_star()
37409

Similarly, when we want to count the number of person_ids returned, we can use COUNT(person_id):

SELECT COUNT(procedure_concept_id)
  FROM procedure_occurrence;

1 records
count(procedure_concept_id)
37409

There are repeat procedure_concept_ids in the procedure_occurrence table. When you have repeated values in the rows, COUNT(DISTINCT ) can help you find the number of unique values in a column:

SELECT COUNT(DISTINCT procedure_concept_id)
  FROM procedure_occurrence

1 records
count(DISTINCT procedure_concept_id)
51

We can also return the actual DISTINCT values by removing COUNT:

SELECT DISTINCT procedure_concept_id
  FROM procedure_occurrence;

Displaying records 1 - 10
procedure_concept_id
4010253
4293740
4195803
4062501
4037675
4181917
4234705
4043071
4252419
4057420

2.14 Check on Learning

Count the distinct values of gender_source_value in person:

2.15 Revisiting `DESCRIBE`

Let’s return to our table metadata and look at it more in depth:

DESCRIBE person

Displaying records 1 - 10
column_name	column_type	null	key	default	extra
person_id	INTEGER	YES	NA	NA	NA
gender_concept_id	INTEGER	YES	NA	NA	NA
year_of_birth	INTEGER	YES	NA	NA	NA
month_of_birth	INTEGER	YES	NA	NA	NA
day_of_birth	INTEGER	YES	NA	NA	NA
birth_datetime	TIMESTAMP	YES	NA	NA	NA
race_concept_id	INTEGER	YES	NA	NA	NA
ethnicity_concept_id	INTEGER	YES	NA	NA	NA
location_id	INTEGER	YES	NA	NA	NA
provider_id	INTEGER	YES	NA	NA	NA

One of the important properties of data in a relational database is that there are no repeat rows in the database. Each table that meets this restriction has what is called a primary key.

Scanning the rows, which field/column is the primary key for person?

Try and find the primary key for procedure_occurrence. What is it?

DESCRIBE procedure_occurrence

Displaying records 1 - 10
column_name	column_type	null	key	default	extra
procedure_occurrence_id	INTEGER	YES	NA	NA	NA
person_id	INTEGER	YES	NA	NA	NA
procedure_concept_id	INTEGER	YES	NA	NA	NA
procedure_date	DATE	YES	NA	NA	NA
procedure_datetime	TIMESTAMP	YES	NA	NA	NA
procedure_type_concept_id	INTEGER	YES	NA	NA	NA
modifier_concept_id	INTEGER	YES	NA	NA	NA
quantity	INTEGER	YES	NA	NA	NA
provider_id	INTEGER	YES	NA	NA	NA
visit_occurrence_id	INTEGER	YES	NA	NA	NA

We'll see that primary keys need to be unique (so they can map to each row).

What column is the same in both tables? That is a hint for what we'll cover next week: JOINing tables.

2.16 Always close the connection

When we’re done, it’s best to close the connection with dbDisconnect().

dbDisconnect(con)