A Guide to the Ruby CSV Library, Part II

Darko Gjorgjievski
Tweet

65vHPcY

Good job, son! – Uncle Bob (if you’re confused who Uncle Bob is you might want to read Part 1 of this tutorial)

Uncle Bob is happy with the work we’ve done so far. But there’s one thing left…Remember this file?

New employees are confused, what does each column mean? What is 34 or 2548? What we need to do is add headers to our CSV file. Let’s update our file (below is the plain text format of the above picture with headers included):

Name,Times arrived,Total $ spent,Food feedback
Dan,34,2548,Lovin it!
Maria,55,5054,"Good, delicious food"
Carlos,22,4352,"I am ""pleased"", but could be better"
Stephany,34,6542,I want bigger steaks!!!!!

Now, let’s go and import our file into Ruby (suppose its name is ‘guests.csv’):

require 'csv' # since CSV is part of the standard library, we do need to 'require' it, remember?
guests = CSV.read('guests.csv') # read the entire file into a 'guests' variable

The CSV library does not possess magic powers so it has no idea that the first row here is a header. Therefore, if we run the code above , the “guests” variable will be a double array of 5 array elements, the first one being the ‘header’ treated just like the rest of the rows:

["Name", "Times arrived", "Total $ spent", "Food feedback"]

Let’s tell the CSV library to NOT treat the first row like any of the rest. How do we do this? Remember, every CSV method that makes something “csv-like” in Ruby can be given a list of options (we’ve used converters: numeric in the first part) which basically instructs the library to treat the file (or the string) differently than how it treats it by default. Among the many options, there’s one to let the CSV library know that the first row is a header, which is :headers => true. Let’s modify our second line in our 2-line code above by adding this option:

guests = CSV.read('guests.csv', headers:true)

So, you might wonder, what’s different now? Let’s try and inspect what the guests variable currently contains (hint: it isn’t a double array anymore!):

#<CSV::Table mode:col_or_row row_count:5>

Don’t worry about what all of this means for now, we’ll explain it step-by-step below.

Files with Headers are Special

Instead of a (double) array, our guests variable is now a CSV::Table type of object. As you probably know, every object in Ruby has its own separate methods. The same is true for CSV::Table.

Fortunately, this new object type behaves very similarly to our double array. Just like you could iterate over the double array to get each row with each, you can use the same method for CSV::Table to achieve the same thing:

guests.each do |guest_row|
  p guest_row #<CSV::Row "Name:"Dan"...
end

You should see 4 lines as the output (yep 4, instead of 5, it now recognizes that our first row is an actual header). Here are the first 2 rows:

#<CSV::Row "Name":"Dan" "Times arrived":"34" "Total $ spent":"2548" "Food feedback":"Lovin it!">
#<CSV::Row "Name":"Maria" "Times arrived":"55" "Total $ spent":"5054" "Food feedback":"Good, delicious food">

As you can see, instead of getting a bunch of arrays, you now get a new CSV::Row. Each of these objects, as its name says, is referencing one row as well.

Basically, if we include headers:true we get :

  • a CSV::Table instead of a double array and
  • CSV::Row objects representing the rows.

If you want to get the double array back, just call the CSV::Table#to_a method. Let’s see what happens:

guests = CSV.read('guests.csv', headers:true) #=> <CSV::Table mode:col_or_row row_count:5>
p guests.to_a # will output...
#[["Name", "Times arrived", "Total $ spent", "Food feedback"], ["Dan", "34", "2548", "Lovin it!"]...........

You can also use #to_s instead of to_a to get a string representation of the file:

p guests.to_s
#=> "Name,Times arrived,Total $ spent,Food feedback\nDan,34,2548.........

Why the Different Objects?

What’s the point of having these weird CSV::Row type of objects for every CSV row instead of a good ol’ array? Let’s take Dan, our second row from the guests.csv file, as an example. Without any options, Dan would be represented in this format in Ruby:

["Dan", "34", "2548", "Lovin it!"]

Things change when we add the headers:true option:

#<CSV::Row "Name":"Dan" "Times arrived":"34" "Total $ spent":"2548" "Food feedback":"Lovin it!">

Imagine you wanted to print only the first column of guests.csv, containing the guest names. If Dan is represented as an ordinary array, you’ll have to call dan_array[0] to get that data. Our code would look something like this (this time I’ll use the foreach method to read from the CSV file instead of read):

guests = CSV.foreach('guests.csv') do |row|
  puts row[0]
end #=> outputs "Name", "Dan", "Maria", "Carlos", "Stephany" on separate lines

If someone else was reading your code, he would probably wonder what row[0] means. Also, we have ‘Name’ as the first line in the output which is definitely NOT what we want. We want to print the names of the guests without the headers! Better yet, use the headers as pointers to a specific cell under that column in the row. Witness the power of CSV::Row:

guests = CSV.foreach('guests.csv', headers:true) do |row|
  puts row['Name'] # For each row, give me the cell that is under the 'Name' column
end #=> outputs "Dan", "Maria", "Carlos", "Stephany" on separate lines. "Name" is not printed.

Perfect! CSV::Row has its own convenient methods, as you can see. In this case, the second row is just syntactic sugar for row.[]('Name') which will loop through all rows and output only those cells that are under the ‘Name’ column. You can replace ‘Name’ with ‘Times arrived’, ‘Total $ spent’, or ‘Food feedback’ to get the corresponding values from the other columns. One important thing to remember is that the header names are case sensitive. This will not work:

guests = CSV.foreach('guests.csv', headers:true) do |row|
  puts row['Food Feedback'] #=> Will print nil for all rows, the correct column name is 'Food feedback'
end

If I had to describe CSV::Row, I’d describe it as the love child of an array and a hash. Unlike with an array, you can reference its elements by name and, unlike a hash, you can have duplicate “keys” as we’ll see later on.

If we want to push another item to the CSV::Table object, we can use the same method we’d use with a regular array (either push or < < ) with an array as the argument. To get only the headers, we use the headers method. Here are some examples:

guests = CSV.read('guests.csv',headers:true) #<CSV::Table mode:col_or_row row_count:5>
guests << ['Eve', 24, 54, 'Delicious'] #<CSV::Table mode:col_or_row row_count:6>
print guests.headers #=> ["Name", "Times arrived", "Total $ spent", "Food feedback"]

The CSV::Table#delete method will delete an entire column from the CSV::Table object and return the deleted entries as an array.

guests.delete('Name') #=> returns ["Dan", "Maria", "Carlos", "Stephany"]

What if we wanted to delete a row instead of an array? We can use the same .delete method, this time, providing a numeric index (0 based, so 0=the first row, 1=the second row). The return value, just like with the above example, will be the deleted row.

guests.delete(0)
# This method returns #<CSV::Row "Name":"Dan" " Times arrived":"34" "Total $ spent":"2548" "Food feedback":"Lovin it!">

Here’s where things get a bit weird if you’re familiar with arrays and hashes. When you have CSV::Table in a variable, to access the values in a column you use the column name as the index and to access a row you use numeric values starting from 0 (0 – first row, 1 – second row etc.):

guests = CSV.read('guests.csv', headers:true)
guests['Times arrived'] #=> ["34", "55", "22", "34"]
guests[0] #=> #<CSV::Row "Name":"Carlos" "Times arrived":"22" "Total $ spent":"4352" "Food feedback":"I am \"pleased\", but could be better">

We can even combine these 2 notations access only a specific cell:

guests['Times arrived'][1] #=> Returns '55'
guests[1]['Times arrived'] #=> Also returns 55

What you’re telling Ruby with this code is: “Give me the value of the cell which is under the “Times arrived” column on the second row”. Neat!

Practical Use of CSV Headers

Let’s try taking our guests.csv file and changing the “Total $ spent” column so it contains decimals instead of whole numbers:

new_guests_csv = [] # We create an array to gold the new CSV data
CSV.foreach('guests.csv',headers:true) do |guest| # Iterate over each row of our CSV file
  guest['Total $ spent'] = guest['Total $ spent'].to_f #
  new_guests_csv << guest # Add the new row into new_guests_csv
end

I’ve decided to use the .foreach method to read from the file here. I could have also used CSV#read as well:

new_guests_csv  = [] # We create an array to gold the new CSV data
old_guests_csv = CSV.read('guests.csv', headers:true) # Reads the entire content of the CSV into the variable
old_guests_csv.each do |guest| # old_guests_csv is CSV::Table object which has #each method to iterate over its rows
  guest['Total $ spent'] = guest['Total $ spent'].to_f # Same thing as with our previous code
  new_guests_csv << guest # Add the new row into new_guests_csv
end

We could do this if we wanted to save our data in a new file:

CSV.open('updated_guests.csv', 'w') do |csv| # Create a new file updated_guests.csv
  csv << ['Name', 'Times arrived', 'Total $ spent (decimals)', 'Food feedback'] # Add new headers 
  new_guests_csv.each do |row| 
    # Since we now have the entire updated CSV file in this variable as a double array,
    # we iterate over each (array) element

    csv.puts row
  end
end

Often, when writing data to a new CSV file, you’ll want to change your header names. What most people do is, first, add the new headers (like we did on the second line above) and, then, loop through the modified variable (in our case new_guests_csv) and add each row separately.

Quick tip: What if you had a file with 2 similar column names? Suppose our guests.csv file contained another ‘Name’ column, how would we access it? As you already know, the first column can be accessed with row_object['Name']. To obtain access to the second one, write row_object['Name', 2]

How to Read a BIG CSV File Without Losing Your Sanity

Ruby is a language primarily optimized for people, not computers. Thus, performance is secondary and convenience is first on the priority list. Unfortunately, this can be a problem when you’re trying something like reading a big file into memory. If your file is 300 megabytes and the machine that’s running your code has, say, 512MB of RAM, there’s going to be a problem.

Let’s get back to Uncle Bob. Fast forward into the future, Uncle Bob has a franchise and a multi-million dollar business. He still keeps his CSV file though, although now there’s over 70.000 rows! If we try to read and manipulate this file with CSV.read (which is going to read the ENTIRE file into memory), our program size in RAM will grow severalfold. This happens because there’s an object in memory (an array of CSV::Row if we have headers) for every single row our CSV file. That’s 70.000 objects! Let’s say we want to return only those rows with people who arrived more than 10 times at the restaurant:

#...doing some regular stuff, the program memory size is 12MB
guests = CSV.read('guests.csv',headers:true) # a big CSV file is read into memory, the size is now over 100MB!

guests.select do |row| 
  row['Times arrived'].to_i > 10  # if we put the return value of the select block in another variable, we'll have even bigger size
end

Sure, we can avoid this with using CSV.foreach which instead of reading the entire file into memory, will iterate row-by-row and and won’t affect the overall program memory size after it finishes its iteration. But there’s a slight problem here, we’ll have to re-structure our program:

guests_who_visited_more_than_ten_times = Array.new
CSV.foreach('guests.csv', headers:true) do |guest|
  guests_who_visited_more_than_ten_times << guest if guest['Times arrived'] > 10
end

Here we trade expressiveness for efficiency, the first example where we’ve read from memory was better because our data was an actual object on which we could use Array#select. Fortunately, you can have best of both worlds, saving both memory and expressiveness using an enumerator (props to Avdi from RubyTapas where I’ve learned this awesome trick):

CSV.open('guests.csv', headers:true) do |guest|
  guests = guest.each
  guests.select do |row| 
    row['Times arrived'].to_i > 10
  end
end

In case you’re not familiar with the second line, we basically create an enumerator object and store it into the guests variable. Enumerators allow us to iterate “on-demand”, say, if I had code that counted from 1 to 100:

1.upto(10) { |x| p x }

Turning this into an enumerator object will allow to me to, for example, count from 1 to 5 in one part of my program and then, later on, finish the counting to 10:

enum = 1.upto(10)
p enum.next #=> 1
p 'bunch of other code here'
p enum.next #=> 2
p 'and here'
p enum.next #=> 3

The ‘on-demand’ nature by enumerators makes them memory-friendly. Compared to the first example with using CSV.read, we won’t have thousands of objects into RAM, and unlike CSV.foreach, we don’t have to re-factor the code by modifying the logic.

Well, this is the end of the Ruby CSV series. I hope you’ve learned a lot!

Free book: Jump Start HTML5 Basics

Grab a free copy of one our latest ebooks! Packed with hints and tips on HTML5's most powerful new features.

  • Vimal Raj

    Very useful topic… I love it!

  • cubsker

    awesome stuff, Darko. Really appreciate these two CSV articles.

  • Илья Беринчик

    Awesome! Thanks!