Loading required package: DBI
2 Week 1: DESCRIBE, SELECT, WHERE
2.1 Our Composable Database System
- Client: R/RStudio w/ SQL
- Database Engine: DuckDB
- Data Storage: single file in
data/folder
2.2 Connecting to our database
To access the data, we need to create a database connection. We use dbConnect() from the DBI package to do this. The first argument specifies the Database engine (duckdb()), and the second provides the file location: "data/data/GiBleed_5.3_1.1.duckdb".
Once open, we can use con (our database connection)
These are the same to the database engine:
SELECT person_id FROM person;
select PERSON_ID FROM person;
And so on. Our convention is that we capitalize SQL clauses such as SELECT so you can differentiate them from other information.
2.3 Looking at the Entire Database
One of the first things we can learn is to show the contents of the entire database; we can do this with SHOW TABLES:
SHOW TABLES;| name |
|---|
| care_site |
| cdm_source |
| concept |
| concept_ancestor |
| concept_class |
| concept_relationship |
| concept_synonym |
| condition_era |
| condition_occurrence |
| cost |
We can get further information about the tables within our database using DESCRIBE; This will give us more information about individual tables:
DESCRIBE;| database | schema | name | column_names | column_types | temporary |
|---|---|---|---|---|---|
| GiBleed_5 | main | care_site | care_site_id , care_site_name , place_of_service_concept_id , location_id , care_site_source_value , place_of_service_source_value | INTEGER, VARCHAR, INTEGER, INTEGER, VARCHAR, VARCHAR | FALSE |
| GiBleed_5 | main | cdm_source | cdm_source_name , cdm_source_abbreviation , cdm_holder , source_description , source_documentation_reference, cdm_etl_reference , source_release_date , cdm_release_date , cdm_version , vocabulary_version | VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, DATE , DATE , VARCHAR, VARCHAR | FALSE |
| GiBleed_5 | main | concept | concept_id , concept_name , domain_id , vocabulary_id , concept_class_id, standard_concept, concept_code , valid_start_date, valid_end_date , invalid_reason | INTEGER, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, VARCHAR, DATE , DATE , VARCHAR | FALSE |
| GiBleed_5 | main | concept_ancestor | ancestor_concept_id , descendant_concept_id , min_levels_of_separation, max_levels_of_separation | INTEGER, INTEGER, INTEGER, INTEGER | FALSE |
| GiBleed_5 | main | concept_class | concept_class_id , concept_class_name , concept_class_concept_id | VARCHAR, VARCHAR, INTEGER | FALSE |
| GiBleed_5 | main | concept_relationship | concept_id_1 , concept_id_2 , relationship_id , valid_start_date, valid_end_date , invalid_reason | INTEGER, INTEGER, VARCHAR, DATE , DATE , VARCHAR | FALSE |
| GiBleed_5 | main | concept_synonym | concept_id , concept_synonym_name, language_concept_id | INTEGER, VARCHAR, INTEGER | FALSE |
| GiBleed_5 | main | condition_era | condition_era_id , person_id , condition_concept_id , condition_era_start_date , condition_era_end_date , condition_occurrence_count | INTEGER, INTEGER, INTEGER, DATE , DATE , INTEGER | FALSE |
| GiBleed_5 | main | condition_occurrence | condition_occurrence_id , person_id , condition_concept_id , condition_start_date , condition_start_datetime , condition_end_date , condition_end_datetime , condition_type_concept_id , condition_status_concept_id , stop_reason , provider_id , visit_occurrence_id , visit_detail_id , condition_source_value , condition_source_concept_id , condition_status_source_value | INTEGER , INTEGER , INTEGER , DATE , TIMESTAMP, DATE , TIMESTAMP, INTEGER , INTEGER , VARCHAR , INTEGER , INTEGER , INTEGER , VARCHAR , INTEGER , VARCHAR | FALSE |
| GiBleed_5 | main | cost | cost_id , cost_event_id , cost_domain_id , cost_type_concept_id , currency_concept_id , total_charge , total_cost , total_paid , paid_by_payer , paid_by_patient , paid_patient_copay , paid_patient_coinsurance , paid_patient_deductible , paid_by_primary , paid_ingredient_cost , paid_dispensing_fee , payer_plan_period_id , amount_allowed , revenue_code_concept_id , revenue_code_source_value, drg_concept_id , drg_source_value | INTEGER, INTEGER, VARCHAR, INTEGER, INTEGER, FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , FLOAT , INTEGER, FLOAT , INTEGER, VARCHAR, INTEGER, VARCHAR | FALSE |
We’ll look at a few tables in our work:
-
person- Contains personal & demographic data -
procedure_occurrence- procedures performed on patients and when they happened -
condition_occurrence- patient conditions (such as illnesses) and when they occurred -
concept- contains the specific information (names of concepts) that map into all three above tables
We’ll talk much more later about the relationships between these tables.
2.4 SELECT and FROM
If we want to see the contents of a table, we can use SELECT and FROM.
SELECT * # select all columns
FROM person # from the person table
LIMIT 10; # return only 10 rows
SELECT * FROM person LIMIT 10;| person_id | gender_concept_id | year_of_birth | month_of_birth | day_of_birth | birth_datetime | race_concept_id | ethnicity_concept_id | location_id | provider_id | care_site_id | person_source_value | gender_source_value | gender_source_concept_id | race_source_value | race_source_concept_id | ethnicity_source_value | ethnicity_source_concept_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 8532 | 1963 | 12 | 31 | 1963-12-31 | 8516 | 0 | NA | NA | NA | 001f4a87-70d0-435c-a4b9-1425f6928d33 | F | 0 | black | 0 | west_indian | 0 |
| 123 | 8507 | 1950 | 4 | 12 | 1950-04-12 | 8527 | 0 | NA | NA | NA | 052d9254-80e8-428f-b8b6-69518b0ef3f3 | M | 0 | white | 0 | italian | 0 |
| 129 | 8507 | 1974 | 10 | 7 | 1974-10-07 | 8527 | 0 | NA | NA | NA | 054d32d5-904f-4df4-846b-8c08d165b4e9 | M | 0 | white | 0 | polish | 0 |
| 16 | 8532 | 1971 | 10 | 13 | 1971-10-13 | 8527 | 0 | NA | NA | NA | 00444703-f2c9-45c9-a247-f6317a43a929 | F | 0 | white | 0 | american | 0 |
| 65 | 8532 | 1967 | 3 | 31 | 1967-03-31 | 8516 | 0 | NA | NA | NA | 02a3dad9-f9d5-42fb-8074-c16d45b4f5c8 | F | 0 | black | 0 | dominican | 0 |
| 74 | 8532 | 1972 | 1 | 5 | 1972-01-05 | 8527 | 0 | NA | NA | NA | 02fbf1be-29b7-4da8-8bbd-14c7433f843f | F | 0 | white | 0 | english | 0 |
| 42 | 8532 | 1909 | 11 | 2 | 1909-11-02 | 8527 | 0 | NA | NA | NA | 0177d2e0-98f5-4f3d-bcfd-497b7a07b3f8 | F | 0 | white | 0 | irish | 0 |
| 187 | 8507 | 1945 | 7 | 23 | 1945-07-23 | 8527 | 0 | NA | NA | NA | 07a1e14d-73ed-4d3a-9a39-d729745773fa | M | 0 | white | 0 | irish | 0 |
| 18 | 8532 | 1965 | 11 | 17 | 1965-11-17 | 8527 | 0 | NA | NA | NA | 0084b0fe-e30f-4930-b6d1-5e1eff4b7dea | F | 0 | white | 0 | english | 0 |
| 111 | 8532 | 1975 | 5 | 2 | 1975-05-02 | 8527 | 0 | NA | NA | NA | 0478d6b3-bdb3-4574-9b93-cf448d725b84 | F | 0 | white | 0 | english | 0 |
- Why are there
birth_datetimeand themonth_of_birth,day_of_birth,year_of_birth- aren’t these redundant?
2.5 Try it Out
Look at the first few rows of procedure_occurrence.
SELECT * FROM ____ LIMIT 10;- Why is there a
person_idcolumn in this table as well?
2.6 SELECTing a few columns in our table
We can use the SELECT clause to grab specific columns in our data.
SELECT person_id, birth_datetime, gender_concept_id # Columns in our table
FROM person; # Our Table
SELECT person_id, birth_datetime, gender_concept_id
FROM person
LIMIT 10;| person_id | birth_datetime | gender_concept_id |
|---|---|---|
| 6 | 1963-12-31 | 8532 |
| 123 | 1950-04-12 | 8507 |
| 129 | 1974-10-07 | 8507 |
| 16 | 1971-10-13 | 8532 |
| 65 | 1967-03-31 | 8532 |
| 74 | 1972-01-05 | 8532 |
| 42 | 1909-11-02 | 8532 |
| 187 | 1945-07-23 | 8507 |
| 18 | 1965-11-17 | 8532 |
| 111 | 1975-05-02 | 8532 |
2.7 Try it Out
What happens if we ask for a column that doesn’t exist in our data?
SELECT person_id, birth_datetime, gender_concept_id, blah
FROM person;2.8 Check on Learning
Add race_concept_id and year_of_birth to your SELECT query:
SELECT person_id, birth_datetime, gender_concept_id, ____, ____
FROM person;
2.9 WHERE - filtering our table
Adding WHERE to our SQL statement lets us add filtering to our query:
SELECT person_id, gender_source_value, race_source_value, year_of_birth
FROM person
WHERE year_of_birth < 1980| person_id | gender_source_value | race_source_value | year_of_birth |
|---|---|---|---|
| 6 | F | black | 1963 |
| 123 | M | white | 1950 |
| 129 | M | white | 1974 |
| 16 | F | white | 1971 |
| 65 | F | black | 1967 |
| 74 | F | white | 1972 |
| 42 | F | white | 1909 |
| 187 | M | white | 1945 |
| 18 | F | white | 1965 |
| 111 | F | white | 1975 |
One critical thing to know is that you don’t need to include the columns you’re filtering on in the SELECT part of the statement. For example, we could do the following as well, removing year_of_birth from our SELECT:
SELECT person_id, gender_source_value, race_source_value
FROM person
WHERE year_of_birth < 2000| person_id | gender_source_value | race_source_value |
|---|---|---|
| 6 | F | black |
| 123 | M | white |
| 129 | M | white |
| 16 | F | white |
| 65 | F | black |
| 74 | F | white |
| 42 | F | white |
| 187 | M | white |
| 18 | F | white |
| 111 | F | white |
2.9.1 Single quotes and WHERE
SQL convention: single quotes (‘M’) refer to values, and double quotes refer to columns (“person_id”). If you try to use double quotes in the below, it will look for a column called “M”.
This will trip you up several times if you’re not used to it.
SELECT person_id, gender_source_value, race_source_value
FROM person
WHERE gender_source_value = 'M'
LIMIT 10;| person_id | gender_source_value | race_source_value |
|---|---|---|
| 123 | M | white |
| 129 | M | white |
| 187 | M | white |
| 40 | M | white |
| 53 | M | white |
| 78 | M | white |
| 69 | M | asian |
| 248 | M | white |
| 105 | M | white |
| 49 | M | white |
Reminder: use single (’’) quotes in your SQL statements to refer to values, not double quotes (“).
For R users, notice the similarity of select() with SELECT. We can rewrite the above in dplyr code as:
person |>
select(person_id, gender_source_value, race_source_value)A lot of dplyr was inspired by SQL. In fact, there is a package called dbplyr that translates dplyr statements into SQL. A lot of us use it, and it’s pretty handy.
2.10 COUNT - how many rows?
Sometimes you want to know the size of your result, not necessarily return the entire set of results. That is what COUNT is for.
SELECT COUNT(*)
FROM person
WHERE year_of_birth < 2000;| count_star() |
|---|
| 2694 |
Similarly, when we want to count the number of person_ids returned, we can use COUNT(person_id):
SELECT COUNT(person_id)
FROM person
WHERE year_of_birth < 2000;| count(person_id) |
|---|
| 2694 |
Let’s switch gears to the procedure_concept_id table. Let’s count the overall number of procedure_concept_ids in our table:
SELECT COUNT(procedure_concept_id)
FROM procedure_occurrence;| count(procedure_concept_id) |
|---|
| 37409 |
Hmmm. That’s quite a lot, but are there repeat procedure_concept_ids?
When you have repeated values in the rows, COUNT(DISTINCT ) can help you find the number of unique values in a column:
SELECT COUNT(DISTINCT procedure_concept_id)
FROM procedure_occurrence| count(DISTINCT procedure_concept_id) |
|---|
| 51 |
We can also return the actual DISTINCT values by removing COUNT:
SELECT DISTINCT procedure_concept_id
FROM procedure_occurrence;| procedure_concept_id |
|---|
| 4058899 |
| 4295880 |
| 4216130 |
| 4024289 |
| 4202451 |
| 4330583 |
| 4238715 |
| 4186930 |
| 4242997 |
| 4078793 |
2.11 Check on Learning
Count the distinct values of gender_source_value in person:
SELECT COUNT(DISTINCT --------------)
FROM -------;2.12 Keys: Linking tables together
One of the important properties of data in a relational database is that there are no repeat rows in the database. Each table that meets this restriction has what is called a primary key.
We can use DESCRIBE to get more information (the metadata) about a table. This gives us information about our tables.
DESCRIBE person| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| person_id | INTEGER | YES | NA | NA | NA |
| gender_concept_id | INTEGER | YES | NA | NA | NA |
| year_of_birth | INTEGER | YES | NA | NA | NA |
| month_of_birth | INTEGER | YES | NA | NA | NA |
| day_of_birth | INTEGER | YES | NA | NA | NA |
| birth_datetime | TIMESTAMP | YES | NA | NA | NA |
| race_concept_id | INTEGER | YES | NA | NA | NA |
| ethnicity_concept_id | INTEGER | YES | NA | NA | NA |
| location_id | INTEGER | YES | NA | NA | NA |
| provider_id | INTEGER | YES | NA | NA | NA |
Scanning the rows, which field/column is the primary key for person?
Try and find the primary key for procedure_occurrence. What is it?
DESCRIBE procedure_occurrence| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| procedure_occurrence_id | INTEGER | YES | NA | NA | NA |
| person_id | INTEGER | YES | NA | NA | NA |
| procedure_concept_id | INTEGER | YES | NA | NA | NA |
| procedure_date | DATE | YES | NA | NA | NA |
| procedure_datetime | TIMESTAMP | YES | NA | NA | NA |
| procedure_type_concept_id | INTEGER | YES | NA | NA | NA |
| modifier_concept_id | INTEGER | YES | NA | NA | NA |
| quantity | INTEGER | YES | NA | NA | NA |
| provider_id | INTEGER | YES | NA | NA | NA |
| visit_occurrence_id | INTEGER | YES | NA | NA | NA |
We’ll see that keys need to be unique (so they can map to each row). In fact, each key is a way to connect one table to another.
What column is the same in both tables? That is a hint for what we’ll cover next week: JOINing tables.
2.13 Data Types
If you look at the column_type for one of the DESCRIBE statements above, you’ll notice there are different data types:
INTEGERTIMESTAMPDATEVARCHAR
Each column of a database needs to be typed. The data type of a column determines what kinds of calculations or operations we can do on them. For example, we can do things like date arithmetic on DATETIME columns, asking the engine to calculate 5 days after the dates.
You can see all of the datatypes that are available in DuckDB here.
2.14 Always close the connection
When we’re done, it’s best to close the connection with dbDisconnect().
dbDisconnect(con)