Introduction to R and RStudio

With increased computing power comes increased access to large amounts of freely accessible data. People are tracking their lives with productivity, calorie, fitness and sleep trackers. Governments are publishing survey data left and right, and companies conduct audience testing that needs analyzing. There’s a lot of data out there even now, ready to be grabbed and looked at.

In this tutorial, we’ll look at the basics of the R programming language — a language built solely for statistical computing. I won’t bore you with Wikipedia definitions. Instead, let’s dive right into it. In this introduction, we’ll cover the installation of the default IDE and language, and its data types.

Key Takeaways

R is a programming language specifically designed for statistical computing, and RStudio is its preferred integrated development environment (IDE).
Installation of R and RStudio involves downloading from their respective websites, with both being open-source and free to use.
The RStudio interface is divided into four main areas: the editor, the console, the environment/history, and the miscellaneous panel which includes tabs like files, plots, packages, help, and viewer.
Built-in datasets in RStudio are readily available for practice, and can be loaded using simple commands to explore data manipulation and visualization.
R has several data types including vectors, lists, matrices, data frames, and factors, each serving different purposes in data analysis.
Functions such as `nrow`, `ncol`, `summary`, `str`, and `dim` are essential for data exploration in R, helping users understand dataset dimensions and summary statistics.
For effective learning and usage of R and RStudio, familiarization with the console operations, data types, and basic functions is crucial, as these form the foundation of statistical programming in R.

Installing

R is both a programming language and a software environment, which means it’s fully self-contained. There are two steps to getting it installed:

download and install the latest R: www.r-project.org
download and install RStudio, the R IDE: www.rstudio.com

Both are free, both open source. R will be installed as the underlying engine that powers RStudio’s computations, while RStudio will provide sample data, command autocompletion, help files, and an effective interface for getting things done quickly. You could write R code in simple text files as in most other languages, but that’s really not recommended given how many commands there are and how complex things can quickly get.

After you’ve installed the tools, launch R Studio.

RStudio IDE

IDE Areas

Let’s briefly explain the GUI. There are four main parts. I’ll explain the default order, though note that this can be changed in Settings/Preferences > Pane Layout.

The Editor

The top left quadrant is the editor. It’s where you write R code you want to keep for later — functions, classes, packages, etc. This is, for all intents and purposes, identical to every other code editor’s main window. Apart from some self-explanatory buttons, and others that needn’t concern you at this starting point, there’s also a “Source on Save” checkbox. This means “Load the contents of the file into my console’s runtime every time I save the file”. You should have this on at all times, as it makes your development flow faster by one click.

The Console

The lower left quadrant is the console. It’s a REPL for R in which you can test out your ideas, datasets, filters, and functions. This is where you’ll be spending most of your time in the beginning. Here’s where you verify that an idea you had works before copying it over into the editor above. This is also the environment into which your R files will be sourced on save (see above), so whenever you develop a new function in an R file above, it automatically becomes available in this REPL. We’ll be spending a lot of time in the REPL in the remainder of this tutorial.

History / Environment

History/Environment Pane

The top right quadrant has two tabs: Environment and History.

Environment refers to the console environment (see above) and will list, in detail, every single symbol you defined in the console (whether via sourcing or directly). That is, if you have a function available in the REPL, it will be listed in the environment. If you have a variable, or a dataset, it will be listed there. This is where you can also import custom datasets manually and make them instantly available in the console, if you don’t feel like typing out the commands to do so. You can also inspect the environment of other packages you installed and loaded. (More on packages at a later time.) Go ahead and play around with it; you can’t break anything.

History lists every single console command you executed since the last project started. It’s saved into a hidden .Rhistory file in your project’s folder. If you don’t choose to save your environment after a session, the history won’t be saved.

Misc

Misc pane

The bottom right panel is the misc panel, and contains five separate tabs. The first one, Files, is self-explanatory.

The Plots tab will contain the graphs you generated with R. It’s here that you can zoom, export, configure and inspect your charts and plots.

The Packages tab lets you install additional packages into R. A brief description is next to each available package, though there are many more than those listed there. We’ll go through package repositories in a later post.

The Help tab lets you search the incredibly extensive help directory and will automatically open whenever you call help on a command in the console. (Help is called by prepending a command name with a question mark, like so: ?data.frame.)

Finally, the Viewer is essentially RStudio’s built-in browser. Yes, you can develop web apps with R and even launch locally hosted web apps within it.

Built-in datasets

In the text below, whenever I mention using a command, assume this means punching it into the console. So, if I say “We look at the help for DataFrames with ?data.frame”, you do this:

A gif showing ?data.frame beign run in the console

RStudio comes with some datasets for new users to play around with. To use a built-in dataset, we load it with the data function, and supply an argument corresponding to the set we want. To see all the available built-in sets, punch in data(), without an argument.

Data Sets in RStudio

Looking at the list of available datasets, let’s load a very small one for starters:

data('women')

You should see the women variable appear in the Environment panel, though its second field says <Promise>. A promise in this case merely means “The data will be there when you actually need it”. We told R to load this set, but we haven’t actually used it anywhere, so it didn’t feel the need to load it fully into memory. Let’s tell R we need it. In the console, print out the entire set by simply calling this:

women

This is equivalent to:

print(women)

Note: we’ll be using the former approach, simply because it’s less typing. Remember: in R, the last value that’s typed out without being an expression (like assigning or summing something) is what gets auto-printed to the console.

The numbers will be produced in the console, and the Environment entry for women should change. You should be able to see the data in the environment panel now, too, by clicking the blue expand arrow next to the variable name.

Women dataset, expanded

This set only has 15 entries, and as such offers nothing of value, but it’s good enough for playing around in.

To further study the set you’re dealing with, there are several functions to keep in mind (a demonstration of each can be seen below explanations):

nrow/ncol will list the number of rows/columns respectively.
summary will output a summary about the set’s columns. In the case of the women set, we have two numeric columns (both columns are numeric, or in other words, each column is a numeric vector; more on data types and vectors later). And R knows that, when you ask it for an analysis of a numeric vector, it should give you the typical values for such collections: the minimum value in the set, the mean (average) between the minimum and the mean, the mean (average of all values), the mean between the mean and the maximum, and the maximum, the largest number in the column. It does this for both height and width. For different types of vectors (like ones where every element is a word instead of a number) the output is different.
str is a different kind of summary. In fact, str stands for “structure” and it outputs a summary of a dataset’s structure. In our case, it will tell us that it’s a “data.frame” (a special data type we’ll explain later) with 15 obs (observations or rows) and two variables (or columns). It then proceeds to list all the columns in the DataFrame with some (but not all) of their values, just so we get a grasp on the kind of values we’re dealing with.
dim gives you the dimensions of a dataset. Calling dim(women) gives us 15 2, which means 15 rows and two columns. length can be used to count the number of vertical elements in a set. In vectors (see below), this is the number of elements; in data sets like women, this is the number of columns:

> nrow(women)
[1] 15
> ncol(women)
[1] 2
> summary(women)
     height         weight     
 Min.   :58.0   Min.   :115.0  
 1st Qu.:61.5   1st Qu.:124.5  
 Median :65.0   Median :135.0  
 Mean   :65.0   Mean   :136.7  
 3rd Qu.:68.5   3rd Qu.:148.0  
 Max.   :72.0   Max.   :164.0  
> str(women)
'data.frame':    15 obs. of  2 variables:
 $ height: num  58 59 60 61 62 63 64 65 66 67 ...
 $ weight: num  115 117 120 123 126 129 132 135 139 142 ...
> dim(women)
[1] 15  2

You’ll be using these functions a lot, so I recommend you get familiar with them. Load some of the other datasets and inspect them like this. There’s no need to know them by heart. This tutorial and the help files will always be around for reference, but it’s nice to be fluent in them anyway.

Data Types

R has some typical atomic data types you already know about from other languages, but it also provides some more statistics-inclined ones. Let’s briefly go through them. While explaining these types, I’ll talk about assigning them. Assigning in R is done with the “left arrow” operator or <-, as in:

myString <- 'Hello, World!'

R is, however, very forgiving and will let you use the = assignment operator in top level environments like the console, if you don’t feel like typing out the arrow every time:

myString = 'Hello, World!'

I suggest you get used to the arrow, though, as you won’t get very far without it.

To check the type (or class) of a variable, the class function can be used (though str from above does almost the same thing): class(myString).

Atomics

Atomic classes are basic types from which others are constructed.

Character

The character class is your typical string, a set of one or more letters:

> myString <- 'Hello, World!'
> class(myString)
[1] 'character'

The [1] will be explained below, in the “Vectors” section.

Numeric

The numeric class corresponds to float in other languages. It indicates numeric values like 10, 15.6, -48792.5498982749879 and so on:

> myNum <- 5.983904798274987298
> class(myNum)
[1] 'numeric'

You can coerce (change type of) numeric string values into numeric types, like so:

> myString <- '5.60'
> class(myString)
[1] 'character'
> myNumber <- as.numeric(myString)
> myNumber
[1] 5.6
> class(myNumber)
[1] 'numeric'

There’s also a special number Inf which represents infinity. It can be used in calculations:

> 1/0
[1] Inf

Another “number” is NaN, which stands for “Not a Number”. This is what you get when you do something like 0/0.

Integer

Integers are whole numbers, though they get autocoerced (changed) into numerics when saved into variables:

> myInt <- 209173987
> class(myInt)
[1] 'numeric'

To actually force them to be integers, we need to invoke a function that manually coerces them, called as.integer:

> myInt <- as.integer(myInt)
> class(myInt)
[1] 'integer'

You can prevent autocoercion by setting integers with an L suffix:

> myInt = 5L
> class(myInt)
[1] 'integer'

Note that, if you give R a number that’s greater than what its memory can store, it autocoerces it into a real number, even if you put L at the end:

> myInt <- 2479827498237498723498729384
> class(myInt)
[1] 'numeric'
> myInt
[1] 2.479827e+27

But if you then try to coerce that number into an integer, R will discard it because it simply can’t make integers that big. Instead of a number, you get “NA”, which is a special type in R indicating “Not Available”, also known as a missing value:

> myIntCoerced <- as.integer(myInt)
Warning message:
NAs introduced by coercion
> myIntCoerced
[1] NA
> class(myIntCoerced)
[1] 'integer'

The NA is still a type of “integer”, but one without value.

Note that, when coercing numerics into integers, decimal places get lost. The same applies to coercing from numeric decimal strings:

> myString <- '5.60'
> myNumeric <- 5.6
> myInteger1 <- as.integer(myString)
> myInteger2 <- as.integer(myNumeric)
> myInteger1 == myInteger2
[1] TRUE
> myInteger1
[1] 5

Complex

Explaining complex numbers is a bit outside the scope of this tutorial, particularly if you weren’t exposed to them in school, but if you’re curious, you can find out more about complex numbers on Wikipedia. They take the form of a + bi where a and b are real numbers and i is imaginary. In R, they’re constructed with a special complex function:

> myComplex <- complex(1, 3292, 8974892)
> myComplex
[1] 3292+8974892i
> class(myComplex)
[1] 'complex'

You won’t be needing those nearly as often as you might need the other types, but if you want to know more about the complex function, just call for help on it: ?complex.

Logical

Logical types (booleans) are the same as in most other languages and can be two things: either true, or false. True can be represented with TRUE or T, while false is, predictably, FALSE or F:

> TRUE == T
[1] TRUE
> myBool <- TRUE
> myBool == T
[1] TRUE
> myComparison <- 5 > 6
> myComparison == FALSE
[1] TRUE
> class(myComparison)
[1] 'logical'

Whenever you create an expression, the result of which is a “yes” or “no” value, you get a TRUE or FALSE — like in the case of 5 > 6 above. 5 is not greater than 6, so the expression becomes FALSE. Comparing myComparison to FALSE thus yields TRUE because the myComparison variable indeed contains a value of FALSE.

When needed by a function, logical values will be coerced into numerics. This means that, if I write 1 + TRUE, the console will produce 2, whereas 1 + FALSE gives 1. Likewise, we can easily coerce other types into logicals (as.logical(myVariable)). Any numeric or integer with a value not equal to 0 or NA will give TRUE. 0 and 0L will give FALSE. Strings like “True”, “TRUE”, “true” and “T” will be turned into TRUE, “False”, “FALSE”, “false” and “F” will be FALSE. Any other string will coerce into a logical NA value.

Higher Types

Higher types are types composed of the lower ones.

Vectors and Lists

The most essential of all, the vector, is a collection of elements of the same type. In our earlier dataset example, one column of the women dataset was a numeric vector, meaning it was a collection with only numeric values in it. A vector can only have elements of the exact same type. Vectors can be created with the vector function, but are usually created with the shorthand c (concatenate) function:

> myVector <- c('Hello', 'World', 'Third Element')
> class(myVector)
[1] 'character'
> myVector
[1] 'Hello'         'World'         'Third Element'

We can see here that the vector’s class is “character”, meaning that it contains only character type values. If we print it out by just calling its variable name, we get all three elements and a [1]. The [1] means literally: “I am outputting the contents of your vector. The first element on this line is element number 1 in the vector”.

What you see in [] depends entirely on the size of your console panel and the length of the array. For example:

> myVector <- c('One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight', 'Nine', 'Ten', 'Eleven', 'Twelve', 'Thirteen', 'Fourteen', 'Fifteen')
> myVector
 [1] 'One'      'Two'      'Three'    'Four'     'Five'     'Six'      'Seven'   
 [8] 'Eight'    'Nine'     'Ten'      'Eleven'   'Twelve'   'Thirteen' 'Fourteen'
[15] 'Fifteen'

The number in square brackets simply means “The element after me is Nth in the set”. It’s there purely to make the output more readable, and doesn’t affect actual data.

Note that vectors are strictly one-dimensional. You can’t add another vector as an element inside an existing vector; their elements get merged into one:

> v1 <- c('a', 'b', 'c')
> v2 <- c('d', 'e', 'f')
> v3 <- c(v1, v2)
> v3
[1] 'a' 'b' 'c' 'd' 'e' 'f'

You can generate entire numeric vectors by specifying a range:

> myRange <- c(1:10)
> myRange
 [1]  1  2  3  4  5  6  7  8  9 10

Lists are just like vectors, only they don’t have the limitation of being able to hold elements of the same type exclusively. They’re built with the list function or with the c function if one of the elements you’re adding is a list:

> myList <- list(5, 'Hello', 'Worlds', TRUE)
> class(myList)
[1] 'list'
> myList
[[1]]
[1] 5

[[2]]
[1] 'Hello'

[[3]]
[1] 'Worlds'

[[4]]
[1] TRUE

As with vectors, the one-dimensionality rule applies. Adding a list into another will merge their elements.

The [[N]] output means “The first element of this list is a vector with a single element 5, the second element of this list is a vector with a single element Hello… etc.” Appending the [[N]] to the variable that holds the list actually returns the element. We can check this easily:

> class(myList[[1]])
[1] 'numeric'
> class(myList[[2]])
[1] 'character'
> class(myList[[3]])
[1] 'character'
> class(myList[[4]])
[1] 'logical'

Data.Frame

DataFrames are essentially tables with rows and columns, much like spreadsheets. The women dataset we loaded above was a DataFrame. You can access individual columns of DataFrames by using the $ operator on the variable, followed by the column name:

> women$height
 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72

The result gets returned as a numeric vector. Check its class with class(women$height).

If the column name contains spaces, you can wrap it in quotation marks (women$"Female Height"), but you can also access a column by its numeric position in the list of columns. For example, we know height is the first column:

> head(women[1])
  height
1     58
2     59
3     60
4     61
5     62
6     63
> head(women[[1]])
[1] 58 59 60 61 62 63

The head function tells R to only return the first six results. This is so we keep our console nice and scroll-free, which is excellent for brief looks into datasets without printing them out in their entirety. (The opposite can be achieved with tail, which prints out the last six results.)

We can see here that, when accessing the column with single-square-bracket [1], we get a DataFrame but with one less column. If, however, we access it with the double-square-bracket [[1]], we get a numeric vector of heights. The data returned by using the single bracket stays the same type as the parent data, while the double bracket accessor targets the specific values in that column and returns them in their most rudimentary form, a numeric vector.

We create DataFrames with the data.frame function:

> men <- data.frame(height = c(50:65), weight = c(150:165))
> head(men)
  height weight
1     50    150
2     51    151
3     52    152
4     53    153
5     54    154
6     55    155

Here, we created a sample men dataset not unlike the women set from before.

If we want to get just the names of the columns, we use the names function. We can even assign a value to it and thus change the column names:

> names(men)
[1] 'height' 'weight'
> names(men) <- c('Male Height', 'Male Weight')
> head(men)
  Male Height Male Weight
1          50         150
2          51         151
3          52         152
4          53         153
5          54         154
6          55         155

Matrix

Matrices are multi-dimensional vectors. They are like DataFrames, but can only contain values of the same type. They’re created with the matrix function and need the number of rows and columns as parameters, and the values to place into these slots:

> m <- matrix(nrow = 4, ncol = 5, 1:20)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

What happens if the number of values provided doesn’t match the number of cells?

> m <- matrix(nrow = 4, ncol = 5, 1:25)
Warning message:
In matrix(nrow = 4, ncol = 5, 1:25) :
  data length [25] is not a sub-multiple or multiple of the number of rows [4]
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20

We get a warning, but the matrix is created anyway, with all the extra values simply discarded. If there are fewer values than cells, the values provided get recycled until the matrix is full:

> m <- matrix(nrow = 4, ncol = 5, 1:16)
Warning message:
In matrix(nrow = 4, ncol = 5, 1:16) :
  data length [16] is not a sub-multiple or multiple of the number of columns [5]
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13    1
[2,]    2    6   10   14    2
[3,]    3    7   11   15    3
[4,]    4    8   12   16    4

Much as DataFrames have the names attribute/function, matrices have a dim (dimension) one. Changing this property can mutate the form of a matrix:

> m <- 1:15
> m
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
> dim(m)
NULL
> dim(m) <- c(3,5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15
> dim(m)
[1] 3 5
> dim(m) <- c(5,3)
> m
     [,1] [,2] [,3]
[1,]    1    6   11
[2,]    2    7   12
[3,]    3    8   13
[4,]    4    9   14
[5,]    5   10   15
> dim(m)
[1] 5 3
> dim(m) <- NULL
> m
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

Here, we’ve created a numeric vector of 15 sequential numbers (in such cases, we can omit the c). The dim attribute didn’t exist on it, as evident by the dim function returning NULL, so we changed it by assigning a numeric vector of 3, 5 to it. This resulted in a reshuffling of the elements to fit into the newly constructed matrix. We then changed the dimensions again by inverting the number of columns and rows to 5, 3, which again produced a different matrix. Finally, nullifying the dimensions produced the numeric vector from the beginning.

It’s also possible to combine/expand/alter vectors, DataFrames and matrices by using cbind and rbind:

> v <- 1:5
> x <- 6:10
> bound <- cbind(v, x)
> bound
     v  x
[1,] 1  6
[2,] 2  7
[3,] 3  8
[4,] 4  9
[5,] 5 10
> bound <- rbind(v, x)
> bound
  [,1] [,2] [,3] [,4] [,5]
v    1    2    3    4    5
x    6    7    8    9   10

Factors

Factors are vectors with labels. This is different from character vectors, or even numeric ones, because they’re what R folks call “self describing”, allowing R’s functions to automatically make more sense of them than they would of the other types. They’re built with the factor function, which needs to be fed a vector as an argument:

> f <- factor(c('Hello', 'World', 'Hello', 'Annie', 'Hello', 'World'))
> f
[1] Hello World Hello Annie Hello World
Levels: Annie Hello World
> table(f)
f
Annie Hello World
    1     3     2

Levels lists out the unique elements in the factor. The table function grabs a factor and builds a table consisting of the various levels in the factor, and the number of their occurrences in the factor.

I personally haven’t found a use for factors yet in my projects, but I’m learning about them.

Conclusion

In this tutorial, we’ve covered the basic data types in R and the essentials of using RStudio. You’re now armed with all the knowledge you need to start some basic data operations. Remember that all the functions we covered above are fully searchable in the help files with ?function, where “function” is the function name.

When learning a programming language, it’s customary to teach data types first, logical operators second, and control structures third, before moving into advanced things like functions and classes. But in this tutorial, I believe we’ve laid a decent enough foundation to jump straight into the fire and learn by example.

Frequently Asked Questions (FAQs) about R and RStudio

What are the key differences between R and RStudio?

R is a programming language and a software environment for statistical computing and graphics. It is widely used among statisticians and data miners for developing statistical software and data analysis. On the other hand, RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. So, R is the engine, while RStudio is the car that allows you to interact with R in a more user-friendly manner.

How can I install RStudio on my computer?

To install RStudio, you first need to install R on your computer. After installing R, you can download RStudio from the official website. Choose the version that matches your operating system (Windows, Mac, or Linux). After downloading, run the installer and follow the instructions.

Is RStudio free to use?

Yes, RStudio provides an open-source version that is free to use. They also offer a commercial version for professional use, which includes additional features and support.

Can I use R without RStudio?

Yes, you can use R without RStudio. RStudio is just an IDE that makes using R easier and more user-friendly. You can run R from the command line or use other IDEs.

What are the main features of RStudio?

RStudio offers a wide range of features that make it easier to use R. These include a console with syntax-highlighting, code completion, and smart indentation. It also provides integrated tools for plotting, debugging, and workspace management. Additionally, it supports direct code execution, allowing you to run R code directly from the source editor.

How can I learn to use R and RStudio?

There are many resources available to learn R and RStudio. These include online tutorials, books, and courses. The official R and RStudio websites also provide comprehensive documentation.

What is the purpose of the RStudio console?

The RStudio console is where you can enter and run R commands directly. It supports syntax-highlighting, code completion, and smart indentation, making it easier to write and debug R code.

Can I use RStudio for data visualization?

Yes, RStudio provides integrated tools for data visualization. You can create a wide range of static, dynamic, and interactive graphics in RStudio.

How can I debug R code in RStudio?

RStudio provides several tools for debugging R code. These include breakpoints, stepping functions, and a debugging console. You can set breakpoints in your code, step through the execution of your code, and inspect the values of variables at each step.

Can I use RStudio for machine learning?

Yes, RStudio supports a wide range of machine learning capabilities. You can use R packages for machine learning in RStudio, and it also supports interfaces to other machine learning software.