With increased computing power comes increased access to large amounts of freely accessible data. People are tracking their lives with productivity, calorie, fitness and sleep trackers. Governments are publishing survey data left and right, and companies conduct audience testing that needs analyzing. There’s a lot of data out there even now, ready to be grabbed and looked at.
In this tutorial, we’ll look at the basics of the R programming language — a language built solely for statistical computing. I won’t bore you with Wikipedia definitions. Instead, let’s dive right into it. In this introduction, we’ll cover the installation of the default IDE and language, and its data types.
Installing
R is both a programming language and a software environment, which means it’s fully self-contained. There are two steps to getting it installed:
- download and install the latest R: www.r-project.org
- download and install RStudio, the R IDE: www.rstudio.com
Both are free, both open source. R will be installed as the underlying engine that powers RStudio’s computations, while RStudio will provide sample data, command autocompletion, help files, and an effective interface for getting things done quickly. You could write R code in simple text files as in most other languages, but that’s really not recommended given how many commands there are and how complex things can quickly get.
After you’ve installed the tools, launch R Studio.
IDE Areas
Let’s briefly explain the GUI. There are four main parts. I’ll explain the default order, though note that this can be changed in Settings/Preferences > Pane Layout.
The Editor
The top left quadrant is the editor. It’s where you write R code you want to keep for later — functions, classes, packages, etc. This is, for all intents and purposes, identical to every other code editor’s main window. Apart from some self-explanatory buttons, and others that needn’t concern you at this starting point, there’s also a “Source on Save” checkbox. This means “Load the contents of the file into my console’s runtime every time I save the file”. You should have this on at all times, as it makes your development flow faster by one click.
The Console
The lower left quadrant is the console. It’s a REPL for R in which you can test out your ideas, datasets, filters, and functions. This is where you’ll be spending most of your time in the beginning. Here’s where you verify that an idea you had works before copying it over into the editor above. This is also the environment into which your R files will be sourced on save (see above), so whenever you develop a new function in an R file above, it automatically becomes available in this REPL. We’ll be spending a lot of time in the REPL in the remainder of this tutorial.
History / Environment
The top right quadrant has two tabs: Environment and History.
Environment refers to the console environment (see above) and will list, in detail, every single symbol you defined in the console (whether via sourcing or directly). That is, if you have a function available in the REPL, it will be listed in the environment. If you have a variable, or a dataset, it will be listed there. This is where you can also import custom datasets manually and make them instantly available in the console, if you don’t feel like typing out the commands to do so. You can also inspect the environment of other packages you installed and loaded. (More on packages at a later time.) Go ahead and play around with it; you can’t break anything.
History lists every single console command you executed since the last project started. It’s saved into a hidden .Rhistory
file in your project’s folder. If you don’t choose to save your environment after a session, the history won’t be saved.
Misc
The bottom right panel is the misc panel, and contains five separate tabs. The first one, Files, is self-explanatory.
The Plots tab will contain the graphs you generated with R. It’s here that you can zoom, export, configure and inspect your charts and plots.
The Packages tab lets you install additional packages into R. A brief description is next to each available package, though there are many more than those listed there. We’ll go through package repositories in a later post.
The Help tab lets you search the incredibly extensive help directory and will automatically open whenever you call help on a command in the console. (Help is called by prepending a command name with a question mark, like so: ?data.frame
.)
Finally, the Viewer is essentially RStudio’s built-in browser. Yes, you can develop web apps with R and even launch locally hosted web apps within it.
Built-in datasets
In the text below, whenever I mention using a command, assume this means punching it into the console. So, if I say “We look at the help for DataFrames with ?data.frame
”, you do this:
RStudio comes with some datasets for new users to play around with. To use a built-in dataset, we load it with the data
function, and supply an argument corresponding to the set we want. To see all the available built-in sets, punch in data()
, without an argument.
Looking at the list of available datasets, let’s load a very small one for starters:
data('women')
You should see the women
variable appear in the Environment panel, though its second field says <Promise>
. A promise in this case merely means “The data will be there when you actually need it”. We told R to load this set, but we haven’t actually used it anywhere, so it didn’t feel the need to load it fully into memory. Let’s tell R we need it. In the console, print out the entire set by simply calling this:
women
This is equivalent to:
print(women)
Note: we’ll be using the former approach, simply because it’s less typing. Remember: in R, the last value that’s typed out without being an expression (like assigning or summing something) is what gets auto-printed to the console.
The numbers will be produced in the console, and the Environment entry for women
should change. You should be able to see the data in the environment panel now, too, by clicking the blue expand arrow next to the variable name.
This set only has 15 entries, and as such offers nothing of value, but it’s good enough for playing around in.
To further study the set you’re dealing with, there are several functions to keep in mind (a demonstration of each can be seen below explanations):
nrow
/ncol
will list the number of rows/columns respectively.summary
will output a summary about the set’s columns. In the case of thewomen
set, we have two numeric columns (both columns are numeric, or in other words, each column is a numeric vector; more on data types and vectors later). And R knows that, when you ask it for an analysis of a numeric vector, it should give you the typical values for such collections: the minimum value in the set, the mean (average) between the minimum and the mean, the mean (average of all values), the mean between the mean and the maximum, and the maximum, the largest number in the column. It does this for both height and width. For different types of vectors (like ones where every element is a word instead of a number) the output is different.str
is a different kind of summary. In fact,str
stands for “structure” and it outputs a summary of a dataset’s structure. In our case, it will tell us that it’s a “data.frame” (a special data type we’ll explain later) with 15 obs (observations or rows) and two variables (or columns). It then proceeds to list all the columns in the DataFrame with some (but not all) of their values, just so we get a grasp on the kind of values we’re dealing with.dim
gives you the dimensions of a dataset. Callingdim(women)
gives us15 2
, which means 15 rows and two columns.length
can be used to count the number of vertical elements in a set. In vectors (see below), this is the number of elements; in data sets likewomen
, this is the number of columns:
> nrow(women)
[1] 15
> ncol(women)
[1] 2
> summary(women)
height weight
Min. :58.0 Min. :115.0
1st Qu.:61.5 1st Qu.:124.5
Median :65.0 Median :135.0
Mean :65.0 Mean :136.7
3rd Qu.:68.5 3rd Qu.:148.0
Max. :72.0 Max. :164.0
> str(women)
'data.frame': 15 obs. of 2 variables:
$ height: num 58 59 60 61 62 63 64 65 66 67 ...
$ weight: num 115 117 120 123 126 129 132 135 139 142 ...
> dim(women)
[1] 15 2
You’ll be using these functions a lot, so I recommend you get familiar with them. Load some of the other datasets and inspect them like this. There’s no need to know them by heart. This tutorial and the help files will always be around for reference, but it’s nice to be fluent in them anyway.
Data Types
R has some typical atomic data types you already know about from other languages, but it also provides some more statistics-inclined ones. Let’s briefly go through them. While explaining these types, I’ll talk about assigning them. Assigning in R is done with the “left arrow” operator or <-
, as in:
myString <- 'Hello, World!'
R is, however, very forgiving and will let you use the =
assignment operator in top level environments like the console, if you don’t feel like typing out the arrow every time:
myString = 'Hello, World!'
I suggest you get used to the arrow, though, as you won’t get very far without it.
To check the type (or class) of a variable, the class
function can be used (though str
from above does almost the same thing): class(myString)
.
Atomics
Atomic classes are basic types from which others are constructed.
Character
The character
class is your typical string, a set of one or more letters:
> myString <- 'Hello, World!'
> class(myString)
[1] 'character'
The [1]
will be explained below, in the “Vectors” section.
Numeric
The numeric
class corresponds to float
in other languages. It indicates numeric values like 10, 15.6, -48792.5498982749879 and so on:
> myNum <- 5.983904798274987298
> class(myNum)
[1] 'numeric'
You can coerce (change type of) numeric string values into numeric types, like so:
> myString <- '5.60'
> class(myString)
[1] 'character'
> myNumber <- as.numeric(myString)
> myNumber
[1] 5.6
> class(myNumber)
[1] 'numeric'
There’s also a special number Inf
which represents infinity. It can be used in calculations:
> 1/0
[1] Inf
Another “number” is NaN
, which stands for “Not a Number”. This is what you get when you do something like 0/0
.
Integer
Integers are whole numbers, though they get autocoerced (changed) into numerics when saved into variables:
> myInt <- 209173987
> class(myInt)
[1] 'numeric'
To actually force them to be integers, we need to invoke a function that manually coerces them, called as.integer
:
> myInt <- as.integer(myInt)
> class(myInt)
[1] 'integer'
You can prevent autocoercion by setting integers with an L
suffix:
> myInt = 5L
> class(myInt)
[1] 'integer'
Note that, if you give R a number that’s greater than what its memory can store, it autocoerces it into a real number, even if you put L
at the end:
> myInt <- 2479827498237498723498729384
> class(myInt)
[1] 'numeric'
> myInt
[1] 2.479827e+27
But if you then try to coerce that number into an integer, R will discard it because it simply can’t make integers that big. Instead of a number, you get “NA”, which is a special type in R indicating “Not Available”, also known as a missing value:
> myIntCoerced <- as.integer(myInt)
Warning message:
NAs introduced by coercion
> myIntCoerced
[1] NA
> class(myIntCoerced)
[1] 'integer'
The NA is still a type of “integer”, but one without value.
Note that, when coercing numerics into integers, decimal places get lost. The same applies to coercing from numeric decimal strings:
> myString <- '5.60'
> myNumeric <- 5.6
> myInteger1 <- as.integer(myString)
> myInteger2 <- as.integer(myNumeric)
> myInteger1 == myInteger2
[1] TRUE
> myInteger1
[1] 5
Complex
Explaining complex numbers is a bit outside the scope of this tutorial, particularly if you weren’t exposed to them in school, but if you’re curious, you can find out more about complex numbers on Wikipedia. They take the form of a + bi
where a
and b
are real numbers and i
is imaginary. In R, they’re constructed with a special complex
function:
> myComplex <- complex(1, 3292, 8974892)
> myComplex
[1] 3292+8974892i
> class(myComplex)
[1] 'complex'
You won’t be needing those nearly as often as you might need the other types, but if you want to know more about the complex
function, just call for help on it: ?complex
.
Logical
Logical types (booleans) are the same as in most other languages and can be two things: either true, or false. True can be represented with TRUE
or T
, while false is, predictably, FALSE
or F
:
> TRUE == T
[1] TRUE
> myBool <- TRUE
> myBool == T
[1] TRUE
> myComparison <- 5 > 6
> myComparison == FALSE
[1] TRUE
> class(myComparison)
[1] 'logical'
Whenever you create an expression, the result of which is a “yes” or “no” value, you get a TRUE
or FALSE
— like in the case of 5 > 6
above. 5 is not greater than 6, so the expression becomes FALSE
. Comparing myComparison
to FALSE
thus yields TRUE
because the myComparison
variable indeed contains a value of FALSE
.
When needed by a function, logical values will be coerced into numerics. This means that, if I write 1 + TRUE
, the console will produce 2
, whereas 1 + FALSE
gives 1
. Likewise, we can easily coerce other types into logicals (as.logical(myVariable)
). Any numeric or integer with a value not equal to 0 or NA will give TRUE
. 0 and 0L will give FALSE
. Strings like “True”, “TRUE”, “true” and “T” will be turned into TRUE, “False”, “FALSE”, “false” and “F” will be FALSE
. Any other string will coerce into a logical NA value.
Higher Types
Higher types are types composed of the lower ones.
Vectors and Lists
The most essential of all, the vector, is a collection of elements of the same type. In our earlier dataset example, one column of the women
dataset was a numeric vector, meaning it was a collection with only numeric values in it. A vector can only have elements of the exact same type. Vectors can be created with the vector
function, but are usually created with the shorthand c
(concatenate) function:
> myVector <- c('Hello', 'World', 'Third Element')
> class(myVector)
[1] 'character'
> myVector
[1] 'Hello' 'World' 'Third Element'
We can see here that the vector’s class is “character”, meaning that it contains only character type values. If we print it out by just calling its variable name, we get all three elements and a [1]
. The [1]
means literally: “I am outputting the contents of your vector. The first element on this line is element number 1 in the vector”.
What you see in []
depends entirely on the size of your console panel and the length of the array. For example:
> myVector <- c('One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine', 'Ten', 'Eleven', 'Twelve', 'Thirteen', 'Fourteen', 'Fifteen')
> myVector
[1] 'One' 'Two' 'Three' 'Four' 'Five' 'Six' 'Seven'
[8] 'Eight' 'Nine' 'Ten' 'Eleven' 'Twelve' 'Thirteen' 'Fourteen'
[15] 'Fifteen'
The number in square brackets simply means “The element after me is Nth in the set”. It’s there purely to make the output more readable, and doesn’t affect actual data.
Note that vectors are strictly one-dimensional. You can’t add another vector as an element inside an existing vector; their elements get merged into one:
> v1 <- c('a', 'b', 'c')
> v2 <- c('d', 'e', 'f')
> v3 <- c(v1, v2)
> v3
[1] 'a' 'b' 'c' 'd' 'e' 'f'
You can generate entire numeric vectors by specifying a range:
> myRange <- c(1:10)
> myRange
[1] 1 2 3 4 5 6 7 8 9 10
Lists are just like vectors, only they don’t have the limitation of being able to hold elements of the same type exclusively. They’re built with the list
function or with the c
function if one of the elements you’re adding is a list:
> myList <- list(5, 'Hello', 'Worlds', TRUE)
> class(myList)
[1] 'list'
> myList
[[1]]
[1] 5
[[2]]
[1] 'Hello'
[[3]]
[1] 'Worlds'
[[4]]
[1] TRUE
As with vectors, the one-dimensionality rule applies. Adding a list into another will merge their elements.
The [[N]]
output means “The first element of this list is a vector with a single element 5, the second element of this list is a vector with a single element Hello… etc.” Appending the [[N]]
to the variable that holds the list actually returns the element. We can check this easily:
> class(myList[[1]])
[1] 'numeric'
> class(myList[[2]])
[1] 'character'
> class(myList[[3]])
[1] 'character'
> class(myList[[4]])
[1] 'logical'
Data.Frame
DataFrames are essentially tables with rows and columns, much like spreadsheets. The women
dataset we loaded above was a DataFrame. You can access individual columns of DataFrames by using the $
operator on the variable, followed by the column name:
> women$height
[1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
The result gets returned as a numeric vector. Check its class with class(women$height)
.
If the column name contains spaces, you can wrap it in quotation marks (women$"Female Height"
), but you can also access a column by its numeric position in the list of columns. For example, we know height
is the first column:
> head(women[1])
height
1 58
2 59
3 60
4 61
5 62
6 63
> head(women[[1]])
[1] 58 59 60 61 62 63
The head
function tells R to only return the first six results. This is so we keep our console nice and scroll-free, which is excellent for brief looks into datasets without printing them out in their entirety. (The opposite can be achieved with tail
, which prints out the last six results.)
We can see here that, when accessing the column with single-square-bracket [1]
, we get a DataFrame but with one less column. If, however, we access it with the double-square-bracket [[1]]
, we get a numeric vector of heights. The data returned by using the single bracket stays the same type as the parent data, while the double bracket accessor targets the specific values in that column and returns them in their most rudimentary form, a numeric vector.
We create DataFrames with the data.frame
function:
> men <- data.frame(height = c(50:65), weight = c(150:165))
> head(men)
height weight
1 50 150
2 51 151
3 52 152
4 53 153
5 54 154
6 55 155
Here, we created a sample men
dataset not unlike the women
set from before.
If we want to get just the names of the columns, we use the names
function. We can even assign a value to it and thus change the column names:
> names(men)
[1] 'height' 'weight'
> names(men) <- c('Male Height', 'Male Weight')
> head(men)
Male Height Male Weight
1 50 150
2 51 151
3 52 152
4 53 153
5 54 154
6 55 155
Matrix
Matrices are multi-dimensional vectors. They are like DataFrames, but can only contain values of the same type. They’re created with the matrix
function and need the number of rows and columns as parameters, and the values to place into these slots:
> m <- matrix(nrow = 4, ncol = 5, 1:20)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
What happens if the number of values provided doesn’t match the number of cells?
> m <- matrix(nrow = 4, ncol = 5, 1:25)
Warning message:
In matrix(nrow = 4, ncol = 5, 1:25) :
data length [25] is not a sub-multiple or multiple of the number of rows [4]
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
We get a warning, but the matrix is created anyway, with all the extra values simply discarded. If there are fewer values than cells, the values provided get recycled until the matrix is full:
> m <- matrix(nrow = 4, ncol = 5, 1:16)
Warning message:
In matrix(nrow = 4, ncol = 5, 1:16) :
data length [16] is not a sub-multiple or multiple of the number of columns [5]
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 1
[2,] 2 6 10 14 2
[3,] 3 7 11 15 3
[4,] 4 8 12 16 4
Much as DataFrames have the names
attribute/function, matrices have a dim
(dimension) one. Changing this property can mutate the form of a matrix:
> m <- 1:15
> m
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> dim(m)
NULL
> dim(m) <- c(3,5)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
> dim(m)
[1] 3 5
> dim(m) <- c(5,3)
> m
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
> dim(m)
[1] 5 3
> dim(m) <- NULL
> m
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Here, we’ve created a numeric vector of 15 sequential numbers (in such cases, we can omit the c
). The dim
attribute didn’t exist on it, as evident by the dim
function returning NULL, so we changed it by assigning a numeric vector of 3, 5
to it. This resulted in a reshuffling of the elements to fit into the newly constructed matrix. We then changed the dimensions again by inverting the number of columns and rows to 5, 3
, which again produced a different matrix. Finally, nullifying the dimensions produced the numeric vector from the beginning.
It’s also possible to combine/expand/alter vectors, DataFrames and matrices by using cbind
and rbind
:
> v <- 1:5
> x <- 6:10
> bound <- cbind(v, x)
> bound
v x
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> bound <- rbind(v, x)
> bound
[,1] [,2] [,3] [,4] [,5]
v 1 2 3 4 5
x 6 7 8 9 10
Factors
Factors are vectors with labels. This is different from character vectors, or even numeric ones, because they’re what R folks call “self describing”, allowing R’s functions to automatically make more sense of them than they would of the other types. They’re built with the factor
function, which needs to be fed a vector as an argument:
> f <- factor(c('Hello', 'World', 'Hello', 'Annie', 'Hello', 'World'))
> f
[1] Hello World Hello Annie Hello World
Levels: Annie Hello World
> table(f)
f
Annie Hello World
1 3 2
Levels lists out the unique elements in the factor. The table
function grabs a factor and builds a table consisting of the various levels in the factor, and the number of their occurrences in the factor.
I personally haven’t found a use for factors yet in my projects, but I’m learning about them.
Conclusion
In this tutorial, we’ve covered the basic data types in R and the essentials of using RStudio. You’re now armed with all the knowledge you need to start some basic data operations. Remember that all the functions we covered above are fully searchable in the help files with ?function
, where “function” is the function name.
When learning a programming language, it’s customary to teach data types first, logical operators second, and control structures third, before moving into advanced things like functions and classes. But in this tutorial, I believe we’ve laid a decent enough foundation to jump straight into the fire and learn by example.
Frequently Asked Questions (FAQs) about R and RStudio
What are the key differences between R and RStudio?
R is a programming language and a software environment for statistical computing and graphics. It is widely used among statisticians and data miners for developing statistical software and data analysis. On the other hand, RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. So, R is the engine, while RStudio is the car that allows you to interact with R in a more user-friendly manner.
How can I install RStudio on my computer?
To install RStudio, you first need to install R on your computer. After installing R, you can download RStudio from the official website. Choose the version that matches your operating system (Windows, Mac, or Linux). After downloading, run the installer and follow the instructions.
Is RStudio free to use?
Yes, RStudio provides an open-source version that is free to use. They also offer a commercial version for professional use, which includes additional features and support.
Can I use R without RStudio?
Yes, you can use R without RStudio. RStudio is just an IDE that makes using R easier and more user-friendly. You can run R from the command line or use other IDEs.
What are the main features of RStudio?
RStudio offers a wide range of features that make it easier to use R. These include a console with syntax-highlighting, code completion, and smart indentation. It also provides integrated tools for plotting, debugging, and workspace management. Additionally, it supports direct code execution, allowing you to run R code directly from the source editor.
How can I learn to use R and RStudio?
There are many resources available to learn R and RStudio. These include online tutorials, books, and courses. The official R and RStudio websites also provide comprehensive documentation.
What is the purpose of the RStudio console?
The RStudio console is where you can enter and run R commands directly. It supports syntax-highlighting, code completion, and smart indentation, making it easier to write and debug R code.
Can I use RStudio for data visualization?
Yes, RStudio provides integrated tools for data visualization. You can create a wide range of static, dynamic, and interactive graphics in RStudio.
How can I debug R code in RStudio?
RStudio provides several tools for debugging R code. These include breakpoints, stepping functions, and a debugging console. You can set breakpoints in your code, step through the execution of your code, and inspect the values of variables at each step.
Can I use RStudio for machine learning?
Yes, RStudio supports a wide range of machine learning capabilities. You can use R packages for machine learning in RStudio, and it also supports interfaces to other machine learning software.
Bruno is a blockchain developer and technical educator at the Web3 Foundation, the foundation that's building the next generation of the free people's internet. He runs two newsletters you should subscribe to if you're interested in Web3.0: Dot Leap covers ecosystem and tech development of Web3, and NFT Review covers the evolution of the non-fungible token (digital collectibles) ecosystem inside this emerging new web. His current passion project is RMRK.app, the most advanced NFT system in the world, which allows NFTs to own other NFTs, NFTs to react to emotion, NFTs to be governed democratically, and NFTs to be multiple things at once.