- Data structures
- Extra metadata
- Factoring
- Inspecting data structures
- Column renaming
- Re-ordering columns
- Selecting multiple columns at once
Data structures
R uses several kinds of data structures:
- Variables - A variable contains a single value. It can contain a number (integer or double), characters or a logical value.
- Vectors - A vector is a sequence of values that all share one data type.
- Lists - A list is a list (surprise, surprise) that can contain a different data type for each element. A list entry could also contain a vector, another list or even a table.
- Tables - A table is usually a combination of vectors, in which each vector functions as a column. R has different implementations of tables.
  - The data frame is the standard R table solution.
  - tbl (tibble) is a more user-friendly implementation of a data frame from the tibble library, which is part of the tidyverse.
  - A data frame can be converted to a tibble using the as_tibble(data_frame) function (the older as_data_frame() is deprecated).
  - data.table is most commonly used for very large data sets, but even when working with the BIS data of a whole country I did not need it.
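The differences between these structures are easiest to see side by side. The sketch below uses only base R; the object names (answer, ages, mixed, person, people) are made up for illustration.

```r
# A "variable" holding a single value is stored as a vector of length 1
answer <- 42
length(answer)            # 1

# A vector holds one data type; mixing types coerces everything to one type
ages  <- c(31, 45, 27)
mixed <- c(1, "a", TRUE)
class(ages)               # "numeric"
class(mixed)              # "character"

# A list can hold a different type in every element, including vectors
person <- list(name = "Ann", scores = c(7.5, 8), passed = TRUE)

# A data frame combines equal-length vectors as columns
people <- data.frame(name = c("Ann", "Bob", "Cy"), age = ages)
class(people$age)         # "numeric"
```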
Extra metadata
You can use ‘Attributes’ to store all kinds of extra metadata about any kind of object, be it a list, vector or table. You’ve probably seen them already in the Environment window of RStudio, or when using the str() function. You can recognise them there because they have a prefix that looks something like
- attr(*, "spec")
In this case the attribute is called spec. Let’s say this attribute is part of the data frame ‘df’. When we want to see what is in that attribute, we can get a listing like this:
attr(df, "spec")
As is pretty much expected, you can set an attribute like this:
attr(df, "spec") <- "This is an attribute text"
In this example I only assigned a value to the attribute ‘spec’, but you can assign any type of data to it.
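As a minimal sketch of storing other data types, the example below attaches a date and a list to a small data frame. The data frame and the attribute names (source, loaded_at, units) are hypothetical and only serve the illustration.

```r
# Hypothetical data frame used only for this illustration
df <- data.frame(id = 1:3, amount = c(10.5, 20, 7.25))

# Attributes can hold any R object, not just text
attr(df, "source")    <- "manual entry"
attr(df, "loaded_at") <- as.Date("2020-01-31")
attr(df, "units")     <- list(amount = "EUR")

attr(df, "units")$amount      # "EUR"
names(attributes(df))         # the built-in attributes plus the three added above

# Assigning NULL removes an attribute again
attr(df, "source") <- NULL
```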
Factoring
Factoring tells R that you are using a variable to distinguish between groups of data, rather than as a name or identifier. Factors can be categorical (unordered) or ordinal (ordered) variables. When you factor a variable, R replaces each actual value with an internal integer code. It is especially important to realize this when treating a numeric value as a factor: after factoring a numeric variable, arithmetic on that variable may not do what you expect. For example, when adding 1 to a factored numeric variable:
library(dplyr)

mtcars %>%
  mutate(fac_gear = factor(gear, ordered = TRUE)) %>%     # gear as an ordered factor
  mutate(gear_plus = gear + 1) %>%                        # numeric value + 1
  mutate(fac_gear_plus = as.numeric(fac_gear) + 1) %>%    # internal level code + 1
  select(gear, gear_plus, fac_gear, fac_gear_plus)
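The gear column contains the values 3, 4 and 5, which R stores internally as the level codes 1, 2 and 3. A car with 4 gears therefore gets gear_plus = 5, but fac_gear_plus = 3.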
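To get the original numbers back from a factor, one common approach is to convert through the labels first. The sketch below uses a short made-up vector in the style of mtcars$gear rather than the full data set.

```r
gear <- c(4, 4, 3, 5, 3)           # sample values, made up for illustration
fac_gear <- factor(gear, ordered = TRUE)

levels(fac_gear)                   # "3" "4" "5"
as.numeric(fac_gear)               # 2 2 1 3 1  (internal level codes)

# Convert via the labels to recover the original numbers
original <- as.numeric(as.character(fac_gear))
original + 1                       # 5 5 4 6 4
```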
By default the order of factor levels is determined by sorting the values. If you want to specify your own ordering, you can define a factored variable like this:
rating_pd <- factor(rating_pd, levels=c("AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "D"), ordered=TRUE)
If the variable was already factored and the order is not as it should be, the ordering can be adjusted as in this example:
rating_pd = gdata::reorder.factor(rating_pd, new.order=c("AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "D"))
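If gdata is not available, a similar result can be had in base R by re-creating the factor with explicit levels. This is a sketch with made-up sample ratings, not a drop-in replacement for the gdata call.

```r
# Sample ratings, made up for illustration
rating_pd <- factor(c("BB", "AAA", "D", "A"))
levels(rating_pd)                  # "A" "AAA" "BB" "D"  (default: sorted values)

rating_order <- c("AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "D")

# Re-create the factor with explicit levels; the values themselves stay the same
rating_pd <- factor(as.character(rating_pd), levels = rating_order, ordered = TRUE)

as.character(sort(rating_pd))      # "AAA" "A" "BB" "D"
rating_pd < "BBB"                  # FALSE TRUE FALSE TRUE
```

With an ordered factor, `<` means "earlier in the level order", so here it selects the ratings that rank above BBB.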
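Dropping levels from factors
Levels that no longer occur in the data can be dropped with drop.levels() from the gdata library (base R offers droplevels() for the same purpose):
gdata::drop.levels(tbl_revenue$sector)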
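Unused levels typically appear after filtering. The sketch below shows the effect with base R's droplevels() on a made-up sector vector.

```r
sector <- factor(c("retail", "energy", "retail", "tech"))

# Filtering keeps all original levels, even the ones no longer present
retail_only <- sector[sector == "retail"]
levels(retail_only)                # "energy" "retail" "tech"

# droplevels() from base R removes the unused ones
levels(droplevels(retail_only))    # "retail"
```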
Inspecting data structures
The glimpse() function is a great alternative to the str() function: it shows data types in a more compact way, and it takes the width of the screen into account for the output.
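A quick way to compare the two is to run them on the same table. This sketch assumes the dplyr library is installed; the width of 60 is an arbitrary value chosen for illustration.

```r
library(dplyr)

str(mtcars)                  # base R listing of the structure
glimpse(mtcars)              # one line per column, truncated to the console width
glimpse(mtcars, width = 60)  # the width can also be set explicitly
```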
Column renaming
After importing files I usually rename the columns so they adhere to the in-script conventions, to prevent messy data joining and searching for column names forever. You can rename all column names at once by using:
names(table) <- c("column_a", "column_b", "column_c", "column_d", "column_e")
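As a small worked example in base R, the sketch below renames the columns of a made-up import. The data frame raw and all its column names are hypothetical.

```r
# Hypothetical import with awkward column names
raw <- data.frame(`Cust ID` = 1:2, `Amt (EUR)` = c(9.5, 12), check.names = FALSE)
names(raw)                         # "Cust ID" "Amt (EUR)"

# Replace all names at once; the vector needs one entry per column
names(raw) <- c("customer_id", "amt_eur")

# Base R can also rename a single column by matching its current name
names(raw)[names(raw) == "amt_eur"] <- "amt_total"
names(raw)                         # "customer_id" "amt_total"
```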
To rename a single column or a selection of columns, use rename() from the dplyr library, with the new name on the left and the old name on the right:
rename(mtcars, spam_mpg = mpg, cylinders = cyl)
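Note that rename() returns a new data frame rather than changing the original, so the result has to be assigned to be kept. A minimal sketch, with cars_renamed as a made-up name:

```r
library(dplyr)

# rename() takes new_name = old_name pairs and returns a new data frame
cars_renamed <- rename(mtcars, spam_mpg = mpg, cylinders = cyl)

names(cars_renamed)[1:2]           # "spam_mpg" "cylinders"
names(mtcars)[1:2]                 # "mpg" "cyl"  (the original is unchanged)
```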
Re-ordering columns
You can reorder all columns by selecting them one by one, but with the everything() function you can put the re-ordered columns first and add all remaining columns at once:
# %<>% comes from the magrittr library and overwrites mtcars with the result
mtcars %<>% select(cyl, disp, hp, everything())
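The same reordering can be written without magrittr by assigning the result explicitly. This sketch works on a copy (cars, a made-up name) so the built-in mtcars stays untouched.

```r
library(dplyr)

cars <- mtcars                     # work on a copy, named for illustration

# Same effect as: cars %<>% select(cyl, disp, hp, everything())
cars <- cars %>% select(cyl, disp, hp, everything())

names(cars)[1:4]                   # "cyl" "disp" "hp" "mpg"
ncol(cars) == ncol(mtcars)         # TRUE, no columns were lost
```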
Selecting multiple columns at once
If you adhere to certain column naming conventions (like using the prefix amt for currency columns), you can use certain functions to select multiple columns in one statement.
- starts_with() - starts with a prefix
- ends_with() - ends with a suffix
- contains() - contains a literal string
- matches() - matches a regular expression
- num_range() - a numerical range like x01, x02, x03.
- one_of() - variables named in a character vector (superseded by any_of() and all_of() in recent versions).
- everything() - all variables.
An example with the iris data set (which ships with base R) is:
select(iris, starts_with("Petal"))
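To show several of these helpers against a naming convention like the amt prefix, here is a sketch on a made-up table. The table tbl_revenue and its columns are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical table following an amt_ prefix convention
tbl_revenue <- data.frame(
  sector      = c("retail", "tech"),
  amt_revenue = c(100, 250),
  amt_cost    = c(80, 120),
  x01         = c(1, 2),
  x02         = c(3, 4)
)

names(select(tbl_revenue, starts_with("amt_")))             # "amt_revenue" "amt_cost"
names(select(tbl_revenue, ends_with("cost")))               # "amt_cost"
names(select(tbl_revenue, matches("^x[0-9]+$")))            # "x01" "x02"
names(select(tbl_revenue, num_range("x", 1:2, width = 2)))  # "x01" "x02"
```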