With increased computing power comes increased access to large amounts of freely accessible data. People are tracking their lives with productivity, calorie, fitness and sleep trackers. Governments are publishing survey data left and right, and companies conduct audience testing that needs analyzing. There’s a lot of data out there even now, ready to be grabbed and looked at.
In this tutorial, we’ll look at the basics of the R programming language — a language built solely for statistical computing. I won’t bore you with Wikipedia definitions. Instead, let’s dive right into it. In this introduction, we’ll cover the installation of the default IDE and language, and its data types.
R is both a programming language and a software environment, which means it’s fully self-contained. There are two steps to getting it installed:
- download and install the latest R: www.r-project.org
- download and install RStudio, the R IDE: www.rstudio.com
Both are free, both open source. R will be installed as the underlying engine that powers RStudio’s computations, while RStudio will provide sample data, command autocompletion, help files, and an effective interface for getting things done quickly. You could write R code in simple text files as in most other languages, but that’s really not recommended given how many commands there are and how complex things can quickly get.
After you’ve installed the tools, launch R Studio.
Let’s briefly explain the GUI. There are four main parts. I’ll explain the default order, though note that this can be changed in Settings/Preferences > Pane Layout.
The top left quadrant is the editor. It’s where you write R code you want to keep for later — functions, classes, packages, etc. This is, for all intents and purposes, identical to every other code editor’s main window. Apart from some self-explanatory buttons, and others that needn’t concern you at this starting point, there’s also a “Source on Save” checkbox. This means “Load the contents of the file into my console’s runtime every time I save the file”. You should have this on at all times, as it makes your development flow faster by one click.
The lower left quadrant is the console. It’s a REPL for R in which you can test out your ideas, datasets, filters, and functions. This is where you’ll be spending most of your time in the beginning. Here’s where you verify that an idea you had works before copying it over into the editor above. This is also the environment into which your R files will be sourced on save (see above), so whenever you develop a new function in an R file above, it automatically becomes available in this REPL. We’ll be spending a lot of time in the REPL in the remainder of this tutorial.
History / Environment
The top right quadrant has two tabs: Environment and History.
Environment refers to the console environment (see above) and will list, in detail, every single symbol you defined in the console (whether via sourcing or directly). That is, if you have a function available in the REPL, it will be listed in the environment. If you have a variable, or a dataset, it will be listed there. This is where you can also import custom datasets manually and make them instantly available in the console, if you don’t feel like typing out the commands to do so. You can also inspect the environment of other packages you installed and loaded. (More on packages at a later time.) Go ahead and play around with it; you can’t break anything.
History lists every single console command you executed since the last project started. It’s saved into a hidden
.Rhistory file in your project’s folder. If you don’t choose to save your environment after a session, the history won’t be saved.
The bottom right panel is the misc panel, and contains five separate tabs. The first one, Files, is self-explanatory.
The Plots tab will contain the graphs you generated with R. It’s here that you can zoom, export, configure and inspect your charts and plots.
The Packages tab lets you install additional packages into R. A brief description is next to each available package, though there are many more than those listed there. We’ll go through package repositories in a later post.
The Help tab lets you search the incredibly extensive help directory and will automatically open whenever you call help on a command in the console. (Help is called by prepending a command name with a question mark, like so:
Finally, the Viewer is essentially RStudio’s built-in browser. Yes, you can develop web apps with R and even launch locally hosted web apps within it.
In the text below, whenever I mention using a command, assume this means punching it into the console. So, if I say “We look at the help for DataFrames with
?data.frame”, you do this:
RStudio comes with some datasets for new users to play around with. To use a built-in dataset, we load it with the
data function, and supply an argument corresponding to the set we want. To see all the available built-in sets, punch in
data(), without an argument.
Looking at the list of available datasets, let’s load a very small one for starters:
You should see the
women variable appear in the Environment panel, though its second field says
<Promise>. A promise in this case merely means “The data will be there when you actually need it”. We told R to load this set, but we haven’t actually used it anywhere, so it didn’t feel the need to load it fully into memory. Let’s tell R we need it. In the console, print out the entire set by simply calling this:
This is equivalent to:
Note: we’ll be using the former approach, simply because it’s less typing. Remember: in R, the last value that’s typed out without being an expression (like assigning or summing something) is what gets auto-printed to the console.
The numbers will be produced in the console, and the Environment entry for
women should change. You should be able to see the data in the environment panel now, too, by clicking the blue expand arrow next to the variable name.
This set only has 15 entries, and as such offers nothing of value, but it’s good enough for playing around in.
To further study the set you’re dealing with, there are several functions to keep in mind (a demonstration of each can be seen below explanations):
ncolwill list the number of rows/columns respectively.
summarywill output a summary about the set’s columns. In the case of the
womenset, we have two numeric columns (both columns are numeric, or in other words, each column is a numeric vector; more on data types and vectors later). And R knows that, when you ask it for an analysis of a numeric vector, it should give you the typical values for such collections: the minimum value in the set, the mean (average) between the minimum and the mean, the mean (average of all values), the mean between the mean and the maximum, and the maximum, the largest number in the column. It does this for both height and width. For different types of vectors (like ones where every element is a word instead of a number) the output is different.
stris a different kind of summary. In fact,
strstands for “structure” and it outputs a summary of a dataset’s structure. In our case, it will tell us that it’s a “data.frame” (a special data type we’ll explain later) with 15 obs (observations or rows) and two variables (or columns). It then proceeds to list all the columns in the DataFrame with some (but not all) of their values, just so we get a grasp on the kind of values we’re dealing with.
dimgives you the dimensions of a dataset. Calling
15 2, which means 15 rows and two columns.
lengthcan be used to count the number of vertical elements in a set. In vectors (see below), this is the number of elements; in data sets like
women, this is the number of columns:
> nrow(women)  15 > ncol(women)  2 > summary(women) height weight Min. :58.0 Min. :115.0 1st Qu.:61.5 1st Qu.:124.5 Median :65.0 Median :135.0 Mean :65.0 Mean :136.7 3rd Qu.:68.5 3rd Qu.:148.0 Max. :72.0 Max. :164.0 > str(women) 'data.frame': 15 obs. of 2 variables: $ height: num 58 59 60 61 62 63 64 65 66 67 ... $ weight: num 115 117 120 123 126 129 132 135 139 142 ... > dim(women)  15 2
You’ll be using these functions a lot, so I recommend you get familiar with them. Load some of the other datasets and inspect them like this. There’s no need to know them by heart. This tutorial and the help files will always be around for reference, but it’s nice to be fluent in them anyway.
R has some typical atomic data types you already know about from other languages, but it also provides some more statistics-inclined ones. Let’s briefly go through them. While explaining these types, I’ll talk about assigning them. Assigning in R is done with the “left arrow” operator or
<-, as in:
myString <- 'Hello, World!'
R is, however, very forgiving and will let you use the
= assignment operator in top level environments like the console, if you don’t feel like typing out the arrow every time:
myString = 'Hello, World!'
I suggest you get used to the arrow, though, as you won’t get very far without it.
To check the type (or class) of a variable, the
class function can be used (though
str from above does almost the same thing):
Atomic classes are basic types from which others are constructed.
character class is your typical string, a set of one or more letters:
> myString <- 'Hello, World!' > class(myString)  'character'
 will be explained below, in the “Vectors” section.
numeric class corresponds to
float in other languages. It indicates numeric values like 10, 15.6, -48792.5498982749879 and so on:
> myNum <- 5.983904798274987298 > class(myNum)  'numeric'
You can coerce (change type of) numeric string values into numeric types, like so:
> myString <- '5.60' > class(myString)  'character' > myNumber <- as.numeric(myString) > myNumber  5.6 > class(myNumber)  'numeric'
There’s also a special number
Inf which represents infinity. It can be used in calculations:
> 1/0  Inf
Another “number” is
NaN, which stands for “Not a Number”. This is what you get when you do something like
Integers are whole numbers, though they get autocoerced (changed) into numerics when saved into variables:
> myInt <- 209173987 > class(myInt)  'numeric'
To actually force them to be integers, we need to invoke a function that manually coerces them, called
> myInt <- as.integer(myInt) > class(myInt)  'integer'
You can prevent autocoercion by setting integers with an
> myInt = 5L > class(myInt)  'integer'
Note that, if you give R a number that’s greater than what its memory can store, it autocoerces it into a real number, even if you put
L at the end:
> myInt <- 2479827498237498723498729384 > class(myInt)  'numeric' > myInt  2.479827e+27
But if you then try to coerce that number into an integer, R will discard it because it simply can’t make integers that big. Instead of a number, you get “NA”, which is a special type in R indicating “Not Available”, also known as a missing value:
> myIntCoerced <- as.integer(myInt) Warning message: NAs introduced by coercion > myIntCoerced  NA > class(myIntCoerced)  'integer'
The NA is still a type of “integer”, but one without value.
Note that, when coercing numerics into integers, decimal places get lost. The same applies to coercing from numeric decimal strings:
> myString <- '5.60' > myNumeric <- 5.6 > myInteger1 <- as.integer(myString) > myInteger2 <- as.integer(myNumeric) > myInteger1 == myInteger2  TRUE > myInteger1  5
Explaining complex numbers is a bit outside the scope of this tutorial, particularly if you weren’t exposed to them in school, but if you’re curious, you can find out more about complex numbers on Wikipedia. They take the form of
a + bi where
b are real numbers and
i is imaginary. In R, they’re constructed with a special
> myComplex <- complex(1, 3292, 8974892) > myComplex  3292+8974892i > class(myComplex)  'complex'
You won’t be needing those nearly as often as you might need the other types, but if you want to know more about the
complex function, just call for help on it:
Logical types (booleans) are the same as in most other languages and can be two things: either true, or false. True can be represented with
T, while false is, predictably,
> TRUE == T  TRUE > myBool <- TRUE > myBool == T  TRUE > myComparison <- 5 > 6 > myComparison == FALSE  TRUE > class(myComparison)  'logical'
Whenever you create an expression, the result of which is a “yes” or “no” value, you get a
FALSE — like in the case of
5 > 6 above. 5 is not greater than 6, so the expression becomes
FALSE thus yields
TRUE because the
myComparison variable indeed contains a value of
When needed by a function, logical values will be coerced into numerics. This means that, if I write
1 + TRUE, the console will produce
1 + FALSE gives
1. Likewise, we can easily coerce other types into logicals (
as.logical(myVariable)). Any numeric or integer with a value not equal to 0 or NA will give
TRUE. 0 and 0L will give
FALSE. Strings like “True”, “TRUE”, “true” and “T” will be turned into TRUE, “False”, “FALSE”, “false” and “F” will be
FALSE. Any other string will coerce into a logical NA value.
Higher types are types composed of the lower ones.
Vectors and Lists
The most essential of all, the vector, is a collection of elements of the same type. In our earlier dataset example, one column of the
women dataset was a numeric vector, meaning it was a collection with only numeric values in it. A vector can only have elements of the exact same type. Vectors can be created with the
vector function, but are usually created with the shorthand
c (concatenate) function:
> myVector <- c('Hello', 'World', 'Third Element') > class(myVector)  'character' > myVector  'Hello' 'World' 'Third Element'
We can see here that the vector’s class is “character”, meaning that it contains only character type values. If we print it out by just calling its variable name, we get all three elements and a
 means literally: “I am outputting the contents of your vector. The first element on this line is element number 1 in the vector”.
What you see in
 depends entirely on the size of your console panel and the length of the array. For example:
> myVector <- c('One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine', 'Ten', 'Eleven', 'Twelve', 'Thirteen', 'Fourteen', 'Fifteen') > myVector  'One' 'Two' 'Three' 'Four' 'Five' 'Six' 'Seven'  'Eight' 'Nine' 'Ten' 'Eleven' 'Twelve' 'Thirteen' 'Fourteen'  'Fifteen'
The number in square brackets simply means “The element after me is Nth in the set”. It’s there purely to make the output more readable, and doesn’t affect actual data.
Note that vectors are strictly one-dimensional. You can’t add another vector as an element inside an existing vector; their elements get merged into one:
> v1 <- c('a', 'b', 'c') > v2 <- c('d', 'e', 'f') > v3 <- c(v1, v2) > v3  'a' 'b' 'c' 'd' 'e' 'f'
You can generate entire numeric vectors by specifying a range:
> myRange <- c(1:10) > myRange  1 2 3 4 5 6 7 8 9 10
Lists are just like vectors, only they don’t have the limitation of being able to hold elements of the same type exclusively. They’re built with the
list function or with the
c function if one of the elements you’re adding is a list:
> myList <- list(5, 'Hello', 'Worlds', TRUE) > class(myList)  'list' > myList []  5 []  'Hello' []  'Worlds' []  TRUE
As with vectors, the one-dimensionality rule applies. Adding a list into another will merge their elements.
[[N]] output means “The first element of this list is a vector with a single element 5, the second element of this list is a vector with a single element Hello… etc.” Appending the
[[N]] to the variable that holds the list actually returns the element. We can check this easily:
> class(myList[])  'numeric' > class(myList[])  'character' > class(myList[])  'character' > class(myList[])  'logical'
DataFrames are essentially tables with rows and columns, much like spreadsheets. The
women dataset we loaded above was a DataFrame. You can access individual columns of DataFrames by using the
$ operator on the variable, followed by the column name:
> women$height  58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
The result gets returned as a numeric vector. Check its class with
If the column name contains spaces, you can wrap it in quotation marks (
women$"Female Height"), but you can also access a column by its numeric position in the list of columns. For example, we know
height is the first column:
> head(women) height 1 58 2 59 3 60 4 61 5 62 6 63 > head(women[])  58 59 60 61 62 63
head function tells R to only return the first six results. This is so we keep our console nice and scroll-free, which is excellent for brief looks into datasets without printing them out in their entirety. (The opposite can be achieved with
tail, which prints out the last six results.)
We can see here that, when accessing the column with single-square-bracket
, we get a DataFrame but with one less column. If, however, we access it with the double-square-bracket
[], we get a numeric vector of heights. The data returned by using the single bracket stays the same type as the parent data, while the double bracket accessor targets the specific values in that column and returns them in their most rudimentary form, a numeric vector.
We create DataFrames with the
> men <- data.frame(height = c(50:65), weight = c(150:165)) > head(men) height weight 1 50 150 2 51 151 3 52 152 4 53 153 5 54 154 6 55 155
Here, we created a sample
men dataset not unlike the
women set from before.
If we want to get just the names of the columns, we use the
names function. We can even assign a value to it and thus change the column names:
> names(men)  'height' 'weight' > names(men) <- c('Male Height', 'Male Weight') > head(men) Male Height Male Weight 1 50 150 2 51 151 3 52 152 4 53 153 5 54 154 6 55 155
Matrices are multi-dimensional vectors. They are like DataFrames, but can only contain values of the same type. They’re created with the
matrix function and need the number of rows and columns as parameters, and the values to place into these slots:
> m <- matrix(nrow = 4, ncol = 5, 1:20) > m [,1] [,2] [,3] [,4] [,5] [1,] 1 5 9 13 17 [2,] 2 6 10 14 18 [3,] 3 7 11 15 19 [4,] 4 8 12 16 20
What happens if the number of values provided doesn’t match the number of cells?
> m <- matrix(nrow = 4, ncol = 5, 1:25) Warning message: In matrix(nrow = 4, ncol = 5, 1:25) : data length  is not a sub-multiple or multiple of the number of rows  > m [,1] [,2] [,3] [,4] [,5] [1,] 1 5 9 13 17 [2,] 2 6 10 14 18 [3,] 3 7 11 15 19 [4,] 4 8 12 16 20
We get a warning, but the matrix is created anyway, with all the extra values simply discarded. If there are fewer values than cells, the values provided get recycled until the matrix is full:
> m <- matrix(nrow = 4, ncol = 5, 1:16) Warning message: In matrix(nrow = 4, ncol = 5, 1:16) : data length  is not a sub-multiple or multiple of the number of columns  > m [,1] [,2] [,3] [,4] [,5] [1,] 1 5 9 13 1 [2,] 2 6 10 14 2 [3,] 3 7 11 15 3 [4,] 4 8 12 16 4
Much as DataFrames have the
names attribute/function, matrices have a
dim (dimension) one. Changing this property can mutate the form of a matrix:
> m <- 1:15 > m  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > dim(m) NULL > dim(m) <- c(3,5) > m [,1] [,2] [,3] [,4] [,5] [1,] 1 4 7 10 13 [2,] 2 5 8 11 14 [3,] 3 6 9 12 15 > dim(m)  3 5 > dim(m) <- c(5,3) > m [,1] [,2] [,3] [1,] 1 6 11 [2,] 2 7 12 [3,] 3 8 13 [4,] 4 9 14 [5,] 5 10 15 > dim(m)  5 3 > dim(m) <- NULL > m  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Here, we’ve created a numeric vector of 15 sequential numbers (in such cases, we can omit the
dim attribute didn’t exist on it, as evident by the
dim function returning NULL, so we changed it by assigning a numeric vector of
3, 5 to it. This resulted in a reshuffling of the elements to fit into the newly constructed matrix. We then changed the dimensions again by inverting the number of columns and rows to
5, 3, which again produced a different matrix. Finally, nullifying the dimensions produced the numeric vector from the beginning.
It’s also possible to combine/expand/alter vectors, DataFrames and matrices by using
> v <- 1:5 > x <- 6:10 > bound <- cbind(v, x) > bound v x [1,] 1 6 [2,] 2 7 [3,] 3 8 [4,] 4 9 [5,] 5 10 > bound <- rbind(v, x) > bound [,1] [,2] [,3] [,4] [,5] v 1 2 3 4 5 x 6 7 8 9 10
Factors are vectors with labels. This is different from character vectors, or even numeric ones, because they’re what R folks call “self describing”, allowing R’s functions to automatically make more sense of them than they would of the other types. They’re built with the
factor function, which needs to be fed a vector as an argument:
> f <- factor(c('Hello', 'World', 'Hello', 'Annie', 'Hello', 'World')) > f  Hello World Hello Annie Hello World Levels: Annie Hello World > table(f) f Annie Hello World 1 3 2
Levels lists out the unique elements in the factor. The
table function grabs a factor and builds a table consisting of the various levels in the factor, and the number of their occurrences in the factor.
I personally haven’t found a use for factors yet in my projects, but I’m learning about them.
In this tutorial, we’ve covered the basic data types in R and the essentials of using RStudio. You’re now armed with all the knowledge you need to start some basic data operations. Remember that all the functions we covered above are fully searchable in the help files with
?function, where “function” is the function name.
When learning a programming language, it’s customary to teach data types first, logical operators second, and control structures third, before moving into advanced things like functions and classes. But in this tutorial, I believe we’ve laid a decent enough foundation to jump straight into the fire and learn by example.