Data migrations are the pinnacle of any web development project. It feels like climbing to the top of a mountain just to look back down and jump. On a typical project, you can test everything against the new schema. Throw caution to the wind and make radical changes at a hundred miles per hour. But, what about legacy data that is still in production?
In this article, I’ll delve deep into what it takes to complete a successful data migration in Oracle. Modern web developers have many tools at their disposal. Here, I’ll explain why data migration scripts can be a joy to work with in Ruby. I’ll be dealing with Oracle instead of migrations in Rails.
I will be using OCI8 in my examples, which is an Oracle adapter for Ruby. It allows you write Ruby code to talk to your Oracle database server.
To begin, data migrations involve a database normalization. This Wikipedia article explains:
Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database using the defined relationships.
I have just completed a project in which my legacy database was full of duplicate data. This led to data quality issues and some strange code. A bloated and ugly database guarantees code that is full of wrong and weird hacks. In short, my project was in dire need of a beauty salon.
Setup
Let’s define a couple of tables:
CREATE TABLE answer (
answer_id NUMBER PRIMARY KEY,
question_id NUMBER,
answer_sequence NUMBER,
answer_type VARCHAR2(7),
answer_text VARCHAR2(31),
updt_date DATE
);
CREATE TABLE question (
question_id NUMBER PRIMARY KEY,
question_text VARCHAR2(31),
placement NUMBER,
updt_date DATE
);
Looking at this example, we are dealing with your basic questionnaire. Every answer belongs to a question and answers have a specific type. Answers can either be radioboxes or checkboxes depending on the question.
This is what our data looks like.
First question:
INSERT INTO question (
question_id, -- why not just id?
question_text, -- question gets implied
placement, -- good, I like it!
updt_date -- where is create date?
)
VALUES (
1,
'A question',
1,
sysdate
);
INSERT INTO answer (
answer_id,
question_id,
answer_sequence, -- how about placement?
answer_type,
answer_text,
updt_date
)
VALUES (
1,
1,
1,
'MLT', -- duplicate
'First answer',
sysdate
);
INSERT INTO answer (
answer_id,
question_id,
answer_sequence,
answer_type,
answer_text,
updt_date
)
VALUES (
2,
1,
2,
'MLT', -- duplicate
'Second answer',
sysdate
);
INSERT INTO answer (
answer_id,
question_id,
answer_sequence,
answer_type,
answer_text,
updt_date
)
VALUES (
3,
1,
3,
'MULTI', -- what? that’s, messed up!
'Third answer',
sysdate
);
Second question:
INSERT INTO question (
question_id,
question_text,
placement,
updt_date
)
VALUES (
2,
'Another question',
2,
sysdate
);
INSERT INTO answer (
answer_id,
question_id,
answer_sequence,
answer_type,
answer_text,
updt_date
)
VALUES (
4,
2,
1,
'CHK', -- duplicate
'First answer',
sysdate
);
INSERT INTO answer (
answer_id,
question_id,
answer_sequence,
answer_type,
answer_text,
updt_date
)
VALUES (
5,
2,
2,
'CHK', -- duplicate
'Second answer',
sysdate
);
Let’s see that data:
SQL> SELECT answer_id, question_id, answer_sequence, answer_type, updt_date FROM answer;
ANSWER_ID QUESTION_ID ANSWER_SEQUENCE ANSWER_ UPDT_DATE
---------- ----------- --------------- ------- ---------
1 1 1 MLT 23-JUN-14
2 1 2 MLT 23-JUN-14
3 1 3 MULTI 23-JUN-14
4 2 1 CHK 23-JUN-14
5 2 2 CHK 23-JUN-14
SQL> SELECT question_id, placement, updt_date FROM question;
QUESTION_ID PLACEMENT UPDT_DATE
----------- ---------- ---------
1 1 23-JUN-14
2 2 23-JUN-14
Yuck! I have data duplication, inconsistent and unimaginative column names and half baked time stamps. As a direct consequence, my data now has nasty data quality issues. Oh boy.
We can do better:
CREATE TABLE answers (
id NUMBER CONSTRAINT answers_pk PRIMARY KEY,
question_id NUMBER,
placement NUMBER,
text VARCHAR2(31),
created_at DATE,
updated_at DATE
);
CREATE TABLE questions (
id NUMBER CONSTRAINT questions_pk PRIMARY KEY,
type_id NUMBER,
placement NUMBER,
text VARCHAR2(31),
created_at DATE,
updated_at DATE
);
CREATE TABLE question_types (
id NUMBER CONSTRAINT question_types_pk PRIMARY KEY,
name VARCHAR(7),
created_at DATE,
updated_at DATE
);
My data model here is heavily influenced by Rails. Answer types get ripped completely out of the answer table and placed in question. Instead of putting repeated question type names in the table, I have placed it in a separate table and added a relationship. Column names are simple and intuitive.
Pitfalls
Presumably you have working code against the new schema, along with passing tests. Let’s now turn our attention towards the production data.
Be sure to install the gem:
gem install ruby-oci8
Let’s start with this simple migration script:
require 'oci8'
src = OCI8.new('username/password@schema')
tgt = OCI8.new('username/password@schema')
tgt.autocommit = false
src.exec("SELECT question_id,
(
SELECT answer.answer_type
FROM answer
WHERE answer.question_id = question.question_id
AND ROWNUM = 1
) question_type,
question_text,
placement,
updt_date
FROM question") do |qr|
src.exec("SELECT answer_id,
answer_sequence,
answer_text,
updt_date
FROM answer
WHERE question_id = :1", qr[0].to_i) do |ar|
puts ar
end
end
src.logoff
tgt.logoff
OCI8 is not intuitive in its error messages, so I recommend incremental steps. Here, I am querying the src
data to make sure everything works. Notice how I can pass in parameters to the exec
method using basic Oracle. I convert types to Ruby types to make sure nothing gets misinterpreted in OCI8. The tgt.autocommit = false
lets me test out my migration code until I’m ready to commit changes. The .logoff
invocation gracefully kills the connection to Oracle. Since I am attaching answer types to questions, I must sub query it.
More Examples
With this sound foundation in place, let’s start populating the questions
table. Assuming we are inside the qr
callback:
QUESTION_TYPE_IDS = { "CHK" => 1, "MLT" => 2 }
tgt.exec("SELECT questions_seq.nextval FROM dual") do |qid|
tgt.exec("INSERT INTO questions (
id,
type_id,
placement,
text,
created_at,
updated_at
)
VALUES (
:1,
:2,
:3,
:4,
:5,
:6
)",
qid[0].to_i,
QUESTION_TYPE_IDS[qr[1]].to_i,
qr[3].to_i,
qr[2].to_s[0..30],
qr[4].to_date,
qr[4].to_date)
end
end
I am using a simple hash called QUESTION_TYPE_IDS
to transform question types into ids. Oracle enforces sequences for primary keys, which forces me to place the INSERT
code inside the qid
callback. In my hashes, I am type casting to make sure no nil
values creep into the insert. OCI8 will throw nasty and incomprehensible errors if nothing matches the hash.
For the question text, the VARCHAR2
types has a max length we don’t want to exceed. To do this, just convert to a string and do [0..30]
to truncate the value. This works even if the string is shorter than the limit.
To finish, let’s focus our attention towards the answers
table. This block of code goes inside the ar
callback.
tgt.exec("SELECT answers_seq.nextval FROM dual") do |aid|
tgt.exec("INSERT INTO answers (
id,
question_id,
placement,
text,
created_at,
updated_at
)
VALUES (
:1,
:2,
:3,
:4,
:5,
:6
)",
aid[0].to_i,
qid[0].to_i,
ar[1].to_i,
ar[2].to_s[0..30],
ar[3].to_date,
ar[3].to_date)
end
This is much of what we saw in the questions
migration. Here, we are mapping the question_id
from qid
from the questions block. This way, my data migration respects the relationships between objects.
With all this out the way, flip the switch and watch the magic unfold.
tgt.autocommit = true
Now for the most difficult feat yet: sit back, throw your hands behind your head, and let it do its thing. For my particular project, this migration took over 15 minutes. It felt like I was plunging into the depths of the abyss, wroght with unspeakable anxiety, time ceasing to exist. But, it completed successfully and I landed safely on the other side.
Wrapping Up
Time to check our final answers
and questions
tables and see how everything looks. I’ll omit VARCHAR2
fields for simplicity. I’m using sqlplus
in Oracle:
SQL> SELECT id, question_id, placement, created_at, updated_at FROM answers;
ID QUESTION_ID PLACEMENT CREATED_A UPDATED_A
---------- ----------- ---------- --------- ---------
1 1 1 10-JUN-14 10-JUN-14
2 1 2 10-JUN-14 10-JUN-14
3 1 3 10-JUN-14 10-JUN-14
4 2 1 10-JUN-14 10-JUN-14
5 2 2 10-JUN-14 10-JUN-14
SQL> SELECT id, type_id, placement, created_at, updated_at FROM questions;
ID TYPE_ID PLACEMENT CREATED_A UPDATED_A
---------- ---------- ---------- --------- ---------
1 2 1 10-JUN-14 10-JUN-14
2 1 2 10-JUN-14 10-JUN-14
Beautiful.
If interested, you may download all code samples from GitHub.
Happy Hacking!
Husband, father, and software engineer from Houston, Texas. Passionate about JavaScript and cyber-ing all the things.