 
Chapter 3 Working with data structures
In our second lesson, we start to look at two data structures, vectors and dataframes, that can handle a large amount of data.
Before we jump into these bigger things, we introduce a new kind of operator:
3.1 Comparison Operations
Sometimes, we want to make comparisons between data types, such as if one is bigger than the other, or whether they are the same.
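A minimal sketch of the first comparison, assuming age was assigned the value 35 beforehand (the value reported in the text):

```r
# Assumed setup: the text states that age is 35
age = 35

# Is age greater than 18?
age > 18
```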
## [1] TRUE
We asked whether age is greater than 18, and it is TRUE because age is 35. We can follow up by asking whether age is less than or equal to 65:
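Under the same assumption that age holds 35, this follow-up might be written as:

```r
age = 35  # assumed setup, as stated in the text

# Is age less than or equal to 65?
age <= 65
```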
## [1] TRUE
Besides comparing numbers, we can ask whether a character value is equal to a specific value:
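A sketch of the character comparison, assuming building_name was assigned "Arnold" (the value the text reports):

```r
# Assumed setup: the text states that building_name is "Arnold"
building_name = "Arnold"

# == tests equality; a single = would assign instead
building_name == "Weintraub"
```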
## [1] FALSE
We asked whether building_name is “Weintraub” via the == comparison operator (extremely easy to confuse with =), and it is FALSE because building_name is “Arnold”. We can follow up by asking whether building_name is not equal to “Weintraub”:
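With the same assumed value for building_name, the "not equal" check could look like this:

```r
building_name = "Arnold"  # assumed setup, as stated in the text

# != is TRUE when the two values differ
building_name != "Weintraub"
```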
## [1] TRUE
3.2 Full list of Comparison Operations
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== equal to
!= not equal to
You can also write out multiple comparisons at once, which you will see more of in your exercise this week…
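As a hedged sketch of what combining comparisons can look like, the example below uses R's logical operators & (and) and | (or). Whether the exercise uses exactly these forms is an assumption, as are the variable names in_range and out_of_range.

```r
age = 35  # assumed value, taken from the earlier example

# TRUE only when both comparisons are TRUE
in_range = age > 18 & age <= 65

# TRUE when at least one comparison is TRUE
out_of_range = age < 18 | age > 65

in_range
out_of_range
```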
3.3 Vectors
In the first exercise, you started to explore data structures, which store information about data types. You played around with vectors, which are ordered collections of a single data type. Each element of a vector contains a data type, and all elements of a vector must be the same type, such as numeric, character, or logical.
We often create vectors using the combine function, c() :
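A small sketch of c() in use. The vector names staff and chrNum appear later in the text, but their full contents are not shown, so the values here are assumptions chosen to agree with the outputs in this section.

```r
# The third name is a placeholder; only "chris" and "shasta" appear in the text
staff = c("chris", "shasta", "jeff")

# Assumed values, consistent with the tripled output shown later
chrNum = c(2, 3, 1)

staff
chrNum
```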
If we try to create a vector with mixed data types, R will try to make them be the same data type, or give an error:
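One way to see this conversion, sketched with a hypothetical variable name mixed:

```r
# Two characters and one numeric in the same call to c()
mixed = c("chris", "shasta", 123)

# R converts the numeric 123 to the character "123"
mixed
```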
## [1] "chris"  "shasta" "123"
Our numeric got converted to character so that the entire vector is all characters.
3.3.1 Using operations on vectors
Recall from the first class:
- Expressions are built out of operations or functions.
- Operations and functions combine data types to return another data type. 
Now that we are working with data structures, the same principle applies:
- Operations and functions combine data structures to return another data structure (or data type!).
What happens if we use some familiar operations we used for numerics on a numerical vector? If we multiply a numerical vector by a numeric, what do we get?
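A sketch of this multiplication, assuming chrNum holds the values 2, 3, and 1 (inferred from the output, not shown in the text):

```r
# Assumed contents of chrNum
chrNum = c(2, 3, 1)

# Multiply every element by the single numeric 3
chrNum * 3
```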
## [1] 6 9 3
All of chrNum’s elements tripled! Our multiplication operation, when used on a numeric vector with a numeric, has a new meaning: it multiplied all the elements by 3. Here’s another example: a numeric vector multiplied by another numeric vector:
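The second vector is not shown in the text. The sketch below assumes chrNum is c(2, 3, 1) and picks a hypothetical second vector that reproduces the output; the point is that the multiplication pairs up elements by position.

```r
chrNum = c(2, 3, 1)  # assumed contents

# Hypothetical second vector of the same length
other = c(6, 6, 0)

# Element-wise: 2 * 6, 3 * 6, 1 * 0
chrNum * other
```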
## [1] 12 18  0
Or how about comparison operators?
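The exact comparison is not shown in the text. As an assumed example, comparing each element of chrNum to 0 returns one logical value per element:

```r
chrNum = c(2, 3, 1)  # assumed contents

# Each element is compared to 0 separately
chrNum > 0
```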
## [1] TRUE TRUE TRUE
But there are also limits: a numeric vector added to a character vector creates an error:
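A sketch of this failure, using assumed contents for chrNum and staff. Running chrNum + staff directly stops with an error, so the example wraps it in tryCatch() to capture the message instead.

```r
chrNum = c(2, 3, 1)                    # assumed contents
staff = c("chris", "shasta", "jeff")   # third name is a placeholder

# Addition is not defined for character vectors, so R raises an error;
# tryCatch() returns the error message as a character value
result = tryCatch(
  chrNum + staff,
  error = function(e) conditionMessage(e)
)
result
```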
When we work with operations and functions, we must be mindful of what inputs the operation or function takes in, and what outputs it gives, no matter how “intuitive” the operation or function name is.
Lastly, here’s a function you can use on vectors: length() gives you the length of the vector:
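For instance, assuming chrNum holds three values:

```r
chrNum = c(2, 3, 1)  # assumed contents

# Number of elements in the vector
length(chrNum)
```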
## [1] 3
3.3.2 Subsetting vectors explicitly
In the exercise this past week, you looked at a new operation to subset elements of a vector using brackets. Let’s look at all the possible ways to subset vectors carefully:
We subset vectors using the bracket [ ] operation.
Inside the bracket can be:
- A single numeric value
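A sketch with an assumed staff vector (only its first two names appear in the text):

```r
staff = c("chris", "shasta", "jeff")  # third name is a placeholder

# Pick out the second element
staff[2]
```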
## [1] "shasta"
This returns the second element of staff.
- A numerical indexing vector containing numerical values. They dictate which elements of the vector to subset.
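Sketched with the same assumed staff vector; small_staff is a hypothetical variable name used to show that the result can be stored:

```r
staff = c("chris", "shasta", "jeff")  # third name is a placeholder

# Keep elements 1 and 2
staff[c(1, 2)]

# The same subset, stored as a new variable
small_staff = staff[c(1, 2)]
small_staff
```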
## [1] "chris"  "shasta"
Alternatively, you can also store the subsetted vector as a new variable:
## [1] "chris"  "shasta"
- A logical indexing vector with the same length as the vector to be subsetted. The TRUE values indicate which elements to keep, and the FALSE values indicate which elements to drop.
If we want the first element:
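A sketch of logical indexing on the assumed three-element staff vector. The second form, which keeps the first two elements, is included for comparison, and first_two is a hypothetical variable name.

```r
staff = c("chris", "shasta", "jeff")  # third name is a placeholder

# TRUE keeps an element, FALSE drops it
staff[c(TRUE, FALSE, FALSE)]

# Keep the first and second elements and store the result
first_two = staff[c(TRUE, TRUE, FALSE)]
first_two
```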
## [1] "chris"
If we want the first and second elements:
## [1] "chris"  "shasta"
If we want the first and second elements and store the result as a variable:
## [1] "chris"  "shasta"
3.3.3 A trick: When subsetting large vectors
Suppose you have a large vector age with 100 elements:
set.seed(123) #don't worry about this function
age = round(runif(100, 1, 100)) #don't worry about these functions
age
##   [1] 29 79 41 88 94  6 53 89 56 46 96 46 68 58 11 90 25  5 33 95 89 70 64 99 66
##  [26] 71 55 60 30 16 96 90 69 80  3 48 76 22 32 24 15 42 42 38 16 15 24 47 27 86
##  [51]  6 45 80 13 57 21 14 76 90 38 67 10 39 28 82 45 81 81 80 45 76 63 71  1 48
##  [76] 23 39 62 36 12 25 67 42 79 11 44 99 89 89 18 14 66 35 66 33 20 78 10 47 52
Suppose you want the first 20 elements of this vector using a numerical indexing vector. Writing out c(1, 2, 3, 4, … for the numerical indexing vector is a pain. We can generate a numerical vector from 1 to 20 via the following trick:
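The trick is presumably the colon operator, which builds a sequence of whole numbers; a sketch (idx is a hypothetical variable name):

```r
# start:end generates every whole number from start to end
idx = 1:20
idx
```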
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Then, you just use it to help subset:
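A sketch of the subsetting step, recreating age with the code given earlier; first_20 is a hypothetical variable name.

```r
# Recreate the 100-element age vector from earlier in this section
set.seed(123)
age = round(runif(100, 1, 100))

# Use the sequence 1:20 as the numerical indexing vector
first_20 = age[1:20]
first_20
```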
##  [1] 29 79 41 88 94  6 53 89 56 46 96 46 68 58 11 90 25  5 33 95
3.4 Dataframes
Before we dive into dataframes, check that the tidyverse package is properly installed by loading it in your R Console:
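Loading is normally just library(tidyverse). The sketch below adds a guard so that it also runs when the package is missing; the guard and the variable name tidyverse_installed are additions for illustration, not part of the lesson.

```r
# requireNamespace() returns TRUE when the package is installed
tidyverse_installed = requireNamespace("tidyverse", quietly = TRUE)

if (tidyverse_installed) {
  # Attach the package; this is the step the lesson asks for
  library(tidyverse)
} else {
  message("tidyverse is not installed; try install.packages(\"tidyverse\")")
}
```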
Here is the data structure you have been waiting for: the Dataframe. A dataframe is like a spreadsheet in which all values within a column must have the same data type. Think of a bunch of vectors organized as columns, and you get a dataframe.
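To make the "vectors as columns" idea concrete, here is a sketch that builds a tiny dataframe by hand with base R's data.frame(). The name toy and the choice of three columns are assumptions; the values are the first five rows shown later in this chapter.

```r
# Three vectors of equal length, each with a single data type
model_id = sprintf("ACH-%06d", 1:5)   # "ACH-000001" ... "ACH-000005"
cell_line = c("NIH:OVCAR-3", "HL-60", "CACO2", "HEL", "HEL 92.1.7")
age = c(60, 36, 72, 30, 30)

# Each vector becomes one column of the dataframe
toy = data.frame(
  ModelID = model_id,
  CellLineName = cell_line,
  Age = age,
  stringsAsFactors = FALSE
)
toy
```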
Below is some code to load in a Dataframe. Notice that the file extension here is in .RData, which is a format specifically for R. In the last week of class we will talk about how to load and save spreadsheets from CSVs or Excel.
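The actual file path is not shown here, so the sketch below does not guess it. Instead it writes a small stand-in dataframe to a temporary .RData file and reads it back, to illustrate how base R's save() and load() pair up. The temporary path and the two-row dataframe are assumptions for illustration only.

```r
# Stand-in dataframe (not the real dataset)
metadata = data.frame(
  ModelID = c("ACH-000001", "ACH-000002"),
  Age = c(60, 36),
  stringsAsFactors = FALSE
)

# Save the object to a temporary .RData file
rdata_path = tempfile(fileext = ".RData")
save(metadata, file = rdata_path)

# Remove it, then restore it under its original name with load()
rm(metadata)
load(rdata_path)
metadata
```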
3.4.1 Using functions and operations on Dataframes
We can run some functions on dataframes to get useful properties, similar to how we used length() for vectors:
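A sketch of these functions, assuming the loaded dataframe is named metadata (the name used later in the text). A five-row stand-in is built here so the code runs on its own; on the full dataset the same calls give the larger values shown in the output.

```r
# Stand-in for the loaded dataframe (values taken from rows shown in this chapter)
metadata = data.frame(
  ModelID = sprintf("ACH-%06d", 1:5),
  CellLineName = c("NIH:OVCAR-3", "HL-60", "CACO2", "HEL", "HEL 92.1.7"),
  Age = c(60, 36, 72, 30, 30),
  stringsAsFactors = FALSE
)

nrow(metadata)      # number of rows
ncol(metadata)      # number of columns
dim(metadata)       # rows and columns together
colnames(metadata)  # character vector of column names
```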
## [1] 1864
## [1] 30
## [1] 1864   30
##  [1] "ModelID"                "PatientID"              "CellLineName"          
##  [4] "StrippedCellLineName"   "Age"                    "SourceType"            
##  [7] "SangerModelID"          "RRID"                   "DepmapModelType"       
## [10] "AgeCategory"            "GrowthPattern"          "LegacyMolecularSubtype"
## [13] "PrimaryOrMetastasis"    "SampleCollectionSite"   "Sex"                   
## [16] "SourceDetail"           "LegacySubSubtype"       "CatalogNumber"         
## [19] "CCLEName"               "COSMICID"               "PublicComments"        
## [22] "WTSIMasterCellID"       "EngineeredModel"        "TreatmentStatus"       
## [25] "OnboardedMedia"         "PlateCoating"           "OncotreeCode"          
## [28] "OncotreeSubtype"        "OncotreePrimaryDisease" "OncotreeLineage"
The last function, colnames(), returns a character vector of the column names of the dataframe. This is an important property of dataframes that we will use to subset them.
We introduce an operation for dataframes: the dataframe$column_name operation selects for a column by its column name and returns the column as a vector. For instance:
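A sketch using a five-row stand-in for metadata, built from values shown in this chapter. The two columns selected here are a guess at what the outputs correspond to.

```r
# Stand-in for the loaded dataframe
metadata = data.frame(
  ModelID = sprintf("ACH-%06d", 1:5),
  Age = c(60, 36, 72, 30, 30),
  OncotreeLineage = c("Ovary/Fallopian Tube", "Myeloid", "Bowel",
                      "Myeloid", "Myeloid"),
  stringsAsFactors = FALSE
)

# Each call returns one column as a vector
metadata$OncotreeLineage
metadata$Age
```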
## [1] "Ovary/Fallopian Tube" "Myeloid"              "Bowel"               
## [4] "Myeloid"              "Myeloid"
## [1] 60 36 72 30 30
The bracket operation [ ] on a dataframe can also be used for subsetting rows and columns at once. dataframe[row_idx, col_idx] subsets the dataframe by a row indexing vector row_idx and a column indexing vector col_idx.
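A sketch with numerical indexing vectors for both rows and columns, on a stand-in dataframe. In this stand-in, CellLineName is column 2; in the full dataframe it is column 3, according to the colnames() output, so the column index would differ there.

```r
# Stand-in for the loaded dataframe
metadata = data.frame(
  ModelID = sprintf("ACH-%06d", 1:5),
  CellLineName = c("NIH:OVCAR-3", "HL-60", "CACO2", "HEL", "HEL 92.1.7"),
  Age = c(60, 36, 72, 30, 30),
  stringsAsFactors = FALSE
)

# Rows 1 to 5, columns 1 and 2
subset_by_position = metadata[1:5, c(1, 2)]
subset_by_position
```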
##      ModelID CellLineName
## 1 ACH-000001  NIH:OVCAR-3
## 2 ACH-000002        HL-60
## 3 ACH-000003        CACO2
## 4 ACH-000004          HEL
## 5 ACH-000005   HEL 92.1.7
We can refer to the column names directly:
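The same subset, sketched with a character vector of column names as the column index (again on a stand-in dataframe):

```r
# Stand-in for the loaded dataframe
metadata = data.frame(
  ModelID = sprintf("ACH-%06d", 1:5),
  CellLineName = c("NIH:OVCAR-3", "HL-60", "CACO2", "HEL", "HEL 92.1.7"),
  Age = c(60, 36, 72, 30, 30),
  stringsAsFactors = FALSE
)

# Columns are selected by name, so their positions do not matter
subset_by_name = metadata[1:5, c("ModelID", "CellLineName")]
subset_by_name
```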
##      ModelID CellLineName
## 1 ACH-000001  NIH:OVCAR-3
## 2 ACH-000002        HL-60
## 3 ACH-000003        CACO2
## 4 ACH-000004          HEL
## 5 ACH-000005   HEL 92.1.7
We can leave the column index or row index empty to just subset columns or rows.
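A sketch of both cases on a stand-in dataframe; the names first_rows and name_columns are hypothetical. The comma stays in place even when one side is empty.

```r
# Stand-in for the loaded dataframe
metadata = data.frame(
  ModelID = sprintf("ACH-%06d", 1:5),
  CellLineName = c("NIH:OVCAR-3", "HL-60", "CACO2", "HEL", "HEL 92.1.7"),
  Age = c(60, 36, 72, 30, 30),
  stringsAsFactors = FALSE
)

# Empty column index: keep all columns for rows 1 to 5
first_rows = metadata[1:5, ]

# Empty row index: keep all rows for the two named columns
name_columns = metadata[, c("ModelID", "CellLineName")]

first_rows
name_columns
```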
##      ModelID PatientID CellLineName StrippedCellLineName Age SourceType
## 1 ACH-000001 PT-gj46wT  NIH:OVCAR-3            NIHOVCAR3  60 Commercial
## 2 ACH-000002 PT-5qa3uk        HL-60                 HL60  36 Commercial
## 3 ACH-000003 PT-puKIyc        CACO2                CACO2  72 Commercial
## 4 ACH-000004 PT-q4K2cp          HEL                  HEL  30 Commercial
## 5 ACH-000005 PT-q4K2cp   HEL 92.1.7              HEL9217  30 Commercial
##   SangerModelID      RRID DepmapModelType AgeCategory GrowthPattern
## 1     SIDM00105 CVCL_0465           HGSOC       Adult      Adherent
## 2     SIDM00829 CVCL_0002             AML       Adult    Suspension
## 3     SIDM00891 CVCL_0025            COAD       Adult      Adherent
## 4     SIDM00594 CVCL_0001             AML       Adult    Suspension
## 5     SIDM00593 CVCL_2481             AML       Adult         Mixed
##   LegacyMolecularSubtype PrimaryOrMetastasis               SampleCollectionSite
## 1                                 Metastatic                            ascites
## 2                                    Primary haematopoietic_and_lymphoid_tissue
## 3                                    Primary                              Colon
## 4                                    Primary haematopoietic_and_lymphoid_tissue
## 5                                                                   bone_marrow
##      Sex SourceDetail  LegacySubSubtype CatalogNumber
## 1 Female         ATCC high_grade_serous        HTB-71
## 2 Female         ATCC                M3       CCL-240
## 3   Male         ATCC                          HTB-37
## 4   Male         DSMZ                M6        ACC 11
## 5   Male         ATCC                M6       HEL9217
##                                     CCLEName COSMICID PublicComments
## 1                            NIHOVCAR3_OVARY   905933               
## 2    HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE   905938               
## 3                      CACO2_LARGE_INTESTINE       NA               
## 4     HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE   907053               
## 5 HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE       NA               
##   WTSIMasterCellID EngineeredModel TreatmentStatus OnboardedMedia PlateCoating
## 1             2201                                     MF-001-041         None
## 2               55                                     MF-005-001         None
## 3               NA                         Unknown     MF-015-009         None
## 4              783                  Post-treatment     MF-001-001         None
## 5               NA                                     MF-001-001         None
##   OncotreeCode                  OncotreeSubtype    OncotreePrimaryDisease
## 1        HGSOC High-Grade Serous Ovarian Cancer  Ovarian Epithelial Tumor
## 2          AML           Acute Myeloid Leukemia    Acute Myeloid Leukemia
## 3         COAD             Colon Adenocarcinoma Colorectal Adenocarcinoma
## 4          AML           Acute Myeloid Leukemia    Acute Myeloid Leukemia
## 5          AML           Acute Myeloid Leukemia    Acute Myeloid Leukemia
##        OncotreeLineage
## 1 Ovary/Fallopian Tube
## 2              Myeloid
## 3                Bowel
## 4              Myeloid
## 5              Myeloid
##      ModelID CellLineName
## 1 ACH-000001  NIH:OVCAR-3
## 2 ACH-000002        HL-60
## 3 ACH-000003        CACO2
## 4 ACH-000004          HEL
## 5 ACH-000005   HEL 92.1.7
## 6 ACH-000006   MONO-MAC-6
The bracket operation on a dataframe can be difficult to interpret because multiple expressions for the row and column indices are a lot of information for one line of code. You will see easier-to-read functions for dataframe subsetting in the next lesson.
Lastly, try running View(metadata) in RStudio Console…whew, a nice way to examine your dataframe like a spreadsheet program!
3.5 Exercises
You can find exercises and solutions on Posit Cloud, or on GitHub.
