Chapter 3 Working with data structures
In our second lesson, we start to look at two data structures, vectors and dataframes, that can handle a large amount of data.
3.2 Vectors
In the first exercise, you started to explore data structures, which store information about data types. You played around with vectors, which are ordered collections of a single data type. Each element of a vector contains a value of that data type, and there is no limit on how big a vector can be, as long as it fits within the computer’s memory (RAM).
We can now store a vast amount of information in a vector, and assign it to a single variable. We can now use operations and functions on a vector, modifying many elements within the vector at once! This fits with the feature of “encapsulate complex data via data structures to allow efficient manipulation of data” described in the first lesson!
We often create vectors using the combine function, c():
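As a minimal sketch, the two vectors used in the rest of this section could be created as follows. The values of chrNum are inferred from the outputs printed later in this chapter; the third element of staff does not appear anywhere in the text, so it is a hypothetical placeholder.

```r
# A character vector; "jeff" is a hypothetical third element.
staff = c("chris", "shasta", "jeff")

# A numeric vector.
chrNum = c(2, 3, 1)

staff
chrNum
length(staff)  # number of elements in the vector
```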
If we try to create a vector with mixed data types, R will try to coerce them to the same data type, or give an error:
## [1] "chris" "shasta" "123"
Our numeric got converted to character so that the entire vector is all characters.
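The output above is consistent with code along these lines. This is only a sketch, and the variable name mixed is an assumption made for illustration.

```r
# Combine two characters and one numeric into a single vector.
mixed = c("chris", "shasta", 123)

mixed         # the numeric 123 has been coerced to the character "123"
class(mixed)  # the whole vector is of type character
```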
3.2.1 Using operations on vectors
Recall from the first class:
Expressions are built out of operations or functions.
Operations and functions combine data types to return another data type.
Now that we are working with data structures, the same principle applies:
- Operations and functions combine data structures to return another data structure (or data type!).
What happens if we use some familiar operations we used for numerics on a numerical vector? If we multiply a numerical vector by a numeric, what do we get?
## [1] 6 9 3
All of chrNum’s elements tripled! Our multiplication operation, when used on a numeric vector with a numeric, has a new meaning: it multiplied all the elements by 3. Multiplication is an operation that can be used for multiple data types or data structures: we call this property operator overloading. Here’s another example: a numeric vector multiplied by another numeric vector:
## [1] 12 18 0
But there are also limits: adding a numeric vector to a character vector produces an error:
When we work with operations and functions, we must be mindful of what inputs the operation or function takes in, and what outputs it gives, no matter how “intuitive” the operation or function name is.
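The three cases in this subsection can be sketched as follows. The values of chrNum and of the second numeric vector are inferred from the printed outputs (6 9 3 and 12 18 0); the third element of staff and the variable name result are hypothetical. The error is wrapped in tryCatch() only so that the sketch runs from top to bottom.

```r
chrNum = c(2, 3, 1)
staff = c("chris", "shasta", "jeff")  # "jeff" is a hypothetical placeholder

# Numeric vector times a single numeric: every element is multiplied by 3.
chrNum * 3

# Numeric vector times numeric vector: elements are multiplied pairwise.
chrNum * c(6, 6, 0)

# Numeric vector plus character vector: R raises an error.
# tryCatch() captures the error message instead of stopping the script.
result = tryCatch(chrNum + staff, error = function(e) conditionMessage(e))
result
```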
3.2.2 Subsetting vectors explicitly
In the exercise this past week, you looked at a new operation to subset elements of a vector using brackets.
Inside the bracket is either a single numeric value or a numerical indexing vector containing numerical values. They dictate which elements of the vector to return.
## [1] "shasta"
## [1] "chris" "shasta"
In the last line, we created a new vector small_staff that is a subset of staff given the indexing vector c(1, 2). We have three vectors referenced in one line of code. This is tricky and we need to always refer to our rules step-by-step: evaluate the expression right of the =, which contains a vector bracket. Follow the rule of the vector bracket. Then store the returned value to the variable left of =.
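A sketch of the bracket operation described above, assuming staff has three elements (the third is a hypothetical placeholder, since it never appears in the printed output):

```r
staff = c("chris", "shasta", "jeff")  # "jeff" is a hypothetical placeholder

# A single numeric value returns one element.
staff[2]

# A numerical indexing vector returns several elements.
# Step 1: evaluate the right side, staff[c(1, 2)].
# Step 2: store the returned vector in small_staff.
small_staff = staff[c(1, 2)]
small_staff
```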
Alternatively, instead of using numerical indexing vectors, we can use a logical indexing vector. The logical indexing vector must be the same length as the vector to be subsetted, with TRUE indicating an element to keep, and FALSE indicating an element to drop. The following block of code gives the same small_staff as before:
## [1] "chris"
## [1] "chris" "shasta"
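The two outputs above match a sketch like the one below, again assuming a three-element staff vector whose last element is a hypothetical placeholder.

```r
staff = c("chris", "shasta", "jeff")  # "jeff" is a hypothetical placeholder

# Keep only the first element.
staff[c(TRUE, FALSE, FALSE)]

# Keep the first two elements and drop the third.
small_staff = staff[c(TRUE, TRUE, FALSE)]
small_staff
```

The logical vector has one TRUE or FALSE per element of staff, so its length is 3 here.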
3.2.3 Subsetting vectors implicitly
Here are two applications of subsetting on vectors that we need to distinguish in order to write the correct code:
Explicit subsetting: Suppose someone approaches you with a length-10 vector of people’s ages, and says that they want to subset to the 1st, 3rd, and 9th elements.
Implicit subsetting: Suppose someone approaches you with a length-10 vector of people’s ages, and says that they want to subset to the elements where the age is greater than 50.
Consider the following vector.
We could subset age explicitly in two ways. Suppose we want to subset the 1st, 5th, and 9th elements. One can do it with numerical indexing vectors:
## [1] 89 66 30
or by logical indexing vectors:
## [1] 89 66 30
You can do it in one step, as we have done, or in two steps by storing the indexing vector as a variable. Either way is fine.
## [1] 89 66 30
## [1] 89 66 30
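The four outputs above can be reproduced with the sketch below. The values of age are reconstructed from the outputs printed in this section; the names num_idx and lgl_idx are hypothetical.

```r
age = c(89, 70, 64, 90, 66, 71, 55, 60, 30, 16)

# One step, numerical indexing vector.
age[c(1, 5, 9)]

# One step, logical indexing vector (one value per element of age).
age[c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)]

# Two steps: store the indexing vector first, then subset.
num_idx = c(1, 5, 9)
age[num_idx]

lgl_idx = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
age[lgl_idx]
```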
For implicit subsetting, we don’t know which elements to select off the top of our head! (We could count, but this method does not scale up.)
Rather, we can figure out which elements to select by using a comparison operator, which returns a logical indexing vector.
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
The comparison operator > compared each element of age to 50 and returned a logical vector that has TRUE where the element is greater than 50 and FALSE otherwise.
Then, we use this logical indexing vector to subset age:
## [1] 89 70 64 90 66 71 55 60
## [1] 89 70 64 90 66 71 55 60
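A sketch of implicit subsetting that matches the outputs above. The values of age are reconstructed from the printed outputs, and the name idx is hypothetical.

```r
age = c(89, 70, 64, 90, 66, 71, 55, 60, 30, 16)

# The comparison returns a logical indexing vector.
age > 50

# One step: put the comparison directly inside the brackets.
age[age > 50]

# Two steps: store the logical indexing vector, then subset.
idx = age > 50
age[idx]
```

Neither version requires counting positions by hand, so the same code works if age grows or changes.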
To summarize:
Subset a vector implicitly, in 3 steps:
- Come up with a criterion for subsetting: “I want to subset to values greater than 50”.
- We can use a comparison operator to create a logical indexing vector that fits this criterion.
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
- Use this logical indexing vector to subset.
## [1] 89 70 64 90 66 71 55 60
## [1] 89 70 64 90 66 71 55 60
And you are done.
3.2.4 Comparison Operators
We have the following comparison operators in R:
- < : less than
- <= : less than or equal to
- == : equal to
- != : not equal to
- > : greater than
- >= : greater than or equal to
You can also put these comparison operators together to form more complex statements, which you will explore in this week’s exercise.
Another example:
## [1] 90
## [1] 89 70 64 66 71 55 60 30 16
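The two outputs above are consistent with the == and != operators applied to age, as in this sketch (the values of age are reconstructed from the printed outputs):

```r
age = c(89, 70, 64, 90, 66, 71, 55, 60, 30, 16)

# Keep only the elements equal to 90.
age[age == 90]

# Keep every element that is not equal to 90.
age[age != 90]
```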
For most of our subsetting tasks on vectors (and dataframes below), we will be encouraging implicit subsetting. The power of implicit subsetting is that you don’t need to know what your vector contains to do something with it! This technique is related to abstraction in programming mentioned in the first lesson: by using expressions to find the specific value you are interested in instead of hard-coding the value explicitly, it generalizes your code to handle a wider variety of situations.
3.3 Dataframes
Before we dive into dataframes, check that the tidyverse package is properly installed by loading it in your R Console with library(tidyverse).
Here is the data structure you have been waiting for: the dataframe. A dataframe is like a spreadsheet in which all values within a column must have the same data type. Think of a bunch of vectors organized as columns, and you get a dataframe.
For the most part, we load in dataframes from a file path (although they are sometimes created by combining several vectors of the same length, but we won’t be covering that here):
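The course's actual file path is not shown in this text, so the sketch below uses a stand-in: it writes a three-row CSV to a temporary file and then loads it with read.csv(). The names toy and csv_path are hypothetical; the values are the first rows printed later in this chapter.

```r
# Hypothetical stand-in for the course data file.
toy = data.frame(
  ModelID = c("ACH-000001", "ACH-000002", "ACH-000003"),
  CellLineName = c("NIH:OVCAR-3", "HL-60", "CACO2"),
  Age = c(60, 36, 72)
)
csv_path = tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Load a dataframe from a file path.
metadata = read.csv(csv_path)
metadata
```

With a real dataset, csv_path would simply be replaced by the path to the file on disk.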
3.3.1 Using functions and operations on dataframes
We can run some useful functions on dataframes to get some useful properties, similar to how we used length() for vectors:
## [1] 1864
## [1] 30
## [1] 1864 30
## [1] "ModelID" "PatientID" "CellLineName"
## [4] "StrippedCellLineName" "Age" "SourceType"
## [7] "SangerModelID" "RRID" "DepmapModelType"
## [10] "AgeCategory" "GrowthPattern" "LegacyMolecularSubtype"
## [13] "PrimaryOrMetastasis" "SampleCollectionSite" "Sex"
## [16] "SourceDetail" "LegacySubSubtype" "CatalogNumber"
## [19] "CCLEName" "COSMICID" "PublicComments"
## [22] "WTSIMasterCellID" "EngineeredModel" "TreatmentStatus"
## [25] "OnboardedMedia" "PlateCoating" "OncotreeCode"
## [28] "OncotreeSubtype" "OncotreePrimaryDisease" "OncotreeLineage"
The last function, colnames(), returns a character vector of the column names of the dataframe. This is an important property of dataframes that we will use to subset them.
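The four outputs above appear to correspond to nrow(), ncol(), dim(), and colnames(). The sketch below runs them on a small stand-in dataframe built from five rows and five columns printed in this chapter, so the numbers it returns are 5 and 5 rather than 1864 and 30.

```r
# Small stand-in for the full metadata dataframe.
metadata = data.frame(
  ModelID = c("ACH-000001", "ACH-000002", "ACH-000003", "ACH-000004", "ACH-000005"),
  PatientID = c("PT-gj46wT", "PT-5qa3uk", "PT-puKIyc", "PT-q4K2cp", "PT-q4K2cp"),
  CellLineName = c("NIH:OVCAR-3", "HL-60", "CACO2", "HEL", "HEL 92.1.7"),
  Age = c(60, 36, 72, 30, 30),
  OncotreeLineage = c("Ovary/Fallopian Tube", "Myeloid", "Bowel", "Myeloid", "Myeloid"),
  stringsAsFactors = FALSE
)

nrow(metadata)      # number of rows
ncol(metadata)      # number of columns
dim(metadata)       # number of rows, then number of columns
colnames(metadata)  # character vector of column names
```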
We introduce an operation for dataframes: the dataframe$column_name operation selects a column by its column name and returns the column as a vector. For instance:
## [1] "Ovary/Fallopian Tube" "Myeloid" "Bowel"
## [4] "Myeloid" "Myeloid"
## [1] 60 36 72 30 30
We treat the resulting value as a vector, so we can perform implicit subsetting:
## [1] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [8] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [15] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [22] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [29] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [36] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [43] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [50] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [57] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [64] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
## [71] "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid" "Myeloid"
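A sketch of the $ operation followed by implicit subsetting, run on a five-row stand-in for metadata (values taken from the rows printed in this chapter). The name myeloid is hypothetical, and the stand-in returns 3 matches rather than the 77 shown above.

```r
# Small stand-in for the full metadata dataframe.
metadata = data.frame(
  ModelID = c("ACH-000001", "ACH-000002", "ACH-000003", "ACH-000004", "ACH-000005"),
  Age = c(60, 36, 72, 30, 30),
  OncotreeLineage = c("Ovary/Fallopian Tube", "Myeloid", "Bowel", "Myeloid", "Myeloid"),
  stringsAsFactors = FALSE
)

# Select columns by name; each result is a vector.
metadata$OncotreeLineage
metadata$Age

# Implicit subsetting on the selected column.
myeloid = metadata$OncotreeLineage[metadata$OncotreeLineage == "Myeloid"]
myeloid
```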
The bracket operation [ ] on a dataframe can also be used for subsetting. dataframe[row_idx, col_idx] subsets the dataframe by a row indexing vector row_idx, and a column indexing vector col_idx.
## ModelID CellLineName
## 1 ACH-000001 NIH:OVCAR-3
## 2 ACH-000002 HL-60
## 3 ACH-000003 CACO2
## 4 ACH-000004 HEL
## 5 ACH-000005 HEL 92.1.7
We can refer to the column names directly:
## ModelID CellLineName
## 1 ACH-000001 NIH:OVCAR-3
## 2 ACH-000002 HL-60
## 3 ACH-000003 CACO2
## 4 ACH-000004 HEL
## 5 ACH-000005 HEL 92.1.7
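Both outputs above can be reproduced on a small stand-in for metadata, once with numerical column indices and once with column names. The exact indices used in the course are not shown in this text, so c(1, 3) is an assumption that fits the column order printed by colnames(); by_position and by_name are hypothetical names.

```r
# Small stand-in for the full metadata dataframe.
metadata = data.frame(
  ModelID = c("ACH-000001", "ACH-000002", "ACH-000003", "ACH-000004", "ACH-000005"),
  PatientID = c("PT-gj46wT", "PT-5qa3uk", "PT-puKIyc", "PT-q4K2cp", "PT-q4K2cp"),
  CellLineName = c("NIH:OVCAR-3", "HL-60", "CACO2", "HEL", "HEL 92.1.7"),
  stringsAsFactors = FALSE
)

# Rows 1 to 5, columns 1 and 3 by position.
by_position = metadata[1:5, c(1, 3)]
by_position

# The same subset, referring to the columns by name.
by_name = metadata[1:5, c("ModelID", "CellLineName")]
by_name
```

Referring to columns by name keeps working even if the column order changes.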
We can leave the column index or row index empty to just subset columns or rows.
## ModelID PatientID CellLineName StrippedCellLineName Age SourceType
## 1 ACH-000001 PT-gj46wT NIH:OVCAR-3 NIHOVCAR3 60 Commercial
## 2 ACH-000002 PT-5qa3uk HL-60 HL60 36 Commercial
## 3 ACH-000003 PT-puKIyc CACO2 CACO2 72 Commercial
## 4 ACH-000004 PT-q4K2cp HEL HEL 30 Commercial
## 5 ACH-000005 PT-q4K2cp HEL 92.1.7 HEL9217 30 Commercial
## SangerModelID RRID DepmapModelType AgeCategory GrowthPattern
## 1 SIDM00105 CVCL_0465 HGSOC Adult Adherent
## 2 SIDM00829 CVCL_0002 AML Adult Suspension
## 3 SIDM00891 CVCL_0025 COAD Adult Adherent
## 4 SIDM00594 CVCL_0001 AML Adult Suspension
## 5 SIDM00593 CVCL_2481 AML Adult Mixed
## LegacyMolecularSubtype PrimaryOrMetastasis SampleCollectionSite
## 1 Metastatic ascites
## 2 Primary haematopoietic_and_lymphoid_tissue
## 3 Primary Colon
## 4 Primary haematopoietic_and_lymphoid_tissue
## 5 bone_marrow
## Sex SourceDetail LegacySubSubtype CatalogNumber
## 1 Female ATCC high_grade_serous HTB-71
## 2 Female ATCC M3 CCL-240
## 3 Male ATCC HTB-37
## 4 Male DSMZ M6 ACC 11
## 5 Male ATCC M6 HEL9217
## CCLEName COSMICID PublicComments
## 1 NIHOVCAR3_OVARY 905933
## 2 HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE 905938
## 3 CACO2_LARGE_INTESTINE NA
## 4 HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE 907053
## 5 HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE NA
## WTSIMasterCellID EngineeredModel TreatmentStatus OnboardedMedia PlateCoating
## 1 2201 MF-001-041 None
## 2 55 MF-005-001 None
## 3 NA Unknown MF-015-009 None
## 4 783 Post-treatment MF-001-001 None
## 5 NA MF-001-001 None
## OncotreeCode OncotreeSubtype OncotreePrimaryDisease
## 1 HGSOC High-Grade Serous Ovarian Cancer Ovarian Epithelial Tumor
## 2 AML Acute Myeloid Leukemia Acute Myeloid Leukemia
## 3 COAD Colon Adenocarcinoma Colorectal Adenocarcinoma
## 4 AML Acute Myeloid Leukemia Acute Myeloid Leukemia
## 5 AML Acute Myeloid Leukemia Acute Myeloid Leukemia
## OncotreeLineage
## 1 Ovary/Fallopian Tube
## 2 Myeloid
## 3 Bowel
## 4 Myeloid
## 5 Myeloid
## ModelID CellLineName
## 1 ACH-000001 NIH:OVCAR-3
## 2 ACH-000002 HL-60
## 3 ACH-000003 CACO2
## 4 ACH-000004 HEL
## 5 ACH-000005 HEL 92.1.7
## 6 ACH-000006 MONO-MAC-6
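A sketch of leaving one index empty, again on a small stand-in for metadata. The names first_rows and two_cols are hypothetical.

```r
# Small stand-in for the full metadata dataframe.
metadata = data.frame(
  ModelID = c("ACH-000001", "ACH-000002", "ACH-000003", "ACH-000004", "ACH-000005"),
  PatientID = c("PT-gj46wT", "PT-5qa3uk", "PT-puKIyc", "PT-q4K2cp", "PT-q4K2cp"),
  CellLineName = c("NIH:OVCAR-3", "HL-60", "CACO2", "HEL", "HEL 92.1.7"),
  stringsAsFactors = FALSE
)

# Empty column index: keep rows 1 to 3 and all columns.
first_rows = metadata[1:3, ]
first_rows

# Empty row index: keep all rows and only two columns.
two_cols = metadata[, c("ModelID", "CellLineName")]
two_cols
```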
The bracket operation on a dataframe can be difficult to interpret because multiple expressions for the row and column indices are a lot of information for one line of code. You will see easier-to-read functions for dataframe subsetting in the next lesson.
Lastly, try running View(metadata) in the RStudio Console. Whew, a nice way to examine your dataframe like a spreadsheet program!
3.4 Exercises
You can find exercises and solutions on Posit Cloud, or on GitHub.