The Accelerate HR Blog

Bulk imports into Rails: validate if you must   (Sun Feb 03 2008)

I've been importing data from a legacy database, and noticing how much time Rails spends on validation.

With just over 900 employee absence records to bring into Accelerate HR, I started by converting the original table into a csv file and then used a standard looped Rake task.

desc "Load archived absences into database."
task :load_absences => ["#{RAILS_ROOT}/lib/absences.csv", :environment] do | t | before_count = Absence.count

File.read(t.prerequisites.first).each do | line |

employee_id, first_day, last_day, days_not_counted, total_days, absencetype_id,
percent_to_pay, docs_provided, payroll_month, payroll_year, note = line.split(/,/)

Absence.create(:employee_id => employee_id,
:first_day => first_day,
:last_day => last_day,
:days_not_counted => days_not_counted,
:total_days => total_days,
:absencetype_id => absencetype_id,
:percent_to_pay => percent_to_pay,
:docs_provided => docs_provided,
:payroll_month => payroll_month,
:payroll_year => payroll_year,
:note => note
)
end
puts "Loaded #{Absence.count - before_count} entries."
end


The result? 5 minutes 25 seconds to complete the download.


So then I tried the ar_extensions method that I've been talking about recently. This is how the rake file looked:

desc "Load archived absences into database."
task (:load_absences => :environment) do

columns = [:employee_id, :first_day, :last_day, :days_not_counted, :total_days,
:absencetype_id , :percent_to_pay, :docs_provided, :payroll_month, :payroll_year, :note]

values = FasterCSV.read("#{RAILS_ROOT}/lib/absences.csv")
#Comment out this line when validation is on:
#options = { :validate => false }

before_count = Absence.count

#Comment out this line when validation is off
Absence.import columns, values

#Comment out this line when validation is on
#Absence.import columns, values, options

puts "Loaded #{Absence.count - before_count} entries."

end


A little better at 4 minutes 40 seconds - but actually I was pretty disappointed. After the spectacular success I'd had using ar-extensions to create new tables within the database, I was expecting a better gain.

So then I tried using ar-extensions with its handy validation off option. This was safe: the records had already been validated in the original database. (Sometimes it helps if you're maintaining the legacy database too!)

Now there was a quite dramatic increase: just 9 seconds for the task to complete, 36 times faster than the original rake task, and over 30 times faster than the same ar-extensions method with validation on.

So it seems that while working with a native sql statement rather than an ActiveRecord loop certainly improves speed, it's validation that's slowing everything down. Why should this be? Well, it looks as though the research just published by Nimble Method could be pointing us in the right direction, and beginning to give the right answers.

I got a further improvement of a second and a half by turning off the record count - but the importance of the count outweighs the small gain. It's my guarantee that all the records have been successfully imported.


So on to my next import task - nearly 40,000 overtime records. That's the monthly figure coming in from my Middle East construction client, working to a very tight schedule on a major project. And it's why I need efficient data import methods. Lots more records ... but a simpler 10-field table table this time - and just 24 seconds for all 40,000 items..

Now this isn't always going to work. More often than not, the data we import will need validation. But if speed is important and you're really sure it's not necessary to validate your source data, then I'd suggest you just don't do it.

Filed under: Ruby on Rails




Posted 4 months ago by kjetilge

This looks great. Perfect for importing my CSV list of schools for mass mailing. However this code seems not to work the way it's supposed to. I keep getting the error: Array can't be coerced into Fixnum.

I've printed the columns and values to screen and they look fine. Something goes wrong when I call School.import columns , values.

Have I forgotten something ?

This is my code:

namespace :database do
require 'faster_csv'
require 'ar-extensions'

desc "Load schools into database."
task (:load_schools => :environment) do
columns = [:navn,:care_off,:postadresse,:postnr,:poststed,:kommunenr,:kommune,:telefon,
:telefon_2,:telefaks,:klasser,:elever,:maalform,:epost,:internett,:adressetype]
values = FasterCSV.read("#{RAILS_ROOT}/lib/skoler.csv")
#Comment out this line when validation is on:
#options = { :validate => false }
p values
before_count = School.count
#Comment out this line when validation is off
#School.import_data columns, values
School.import columns , values

#Comment out this line when validation is on
#School.import columns, values, options

#puts "Loaded #{School.count - before_count} entries."

end

end
Posted 4 months ago by alanmiles

Hi kjetilge

Sorry not to have responded sooner. I've been having so much designing that I've neglected to look after the shop. Sorry too that your comment came out looking so awful. I really must add text formatting to the comment box.

From the error message, it looks as if Ruby's moaning about .csv trying to send text into a number field. You've got validation ON in your code, and that gives Ruby license to moan.

My first guess is that you might have set up your phone or fax fields to numerics, but then tried to send a number with the phone code separated from the number - or with a hyphen between the numbers - which would make it a string. It's my first guess because that's the kind of mistake I might have made.

Or, if not that, then could the ' p values ' line be the culprit? - I can't really see how that fits anyway.

You could clean up the rest a bit too to make it easier to read. For example, if you're not going to output the number of records imported at the completion of your task (the line's commented out), then you also don't need the ' before_count = School.count ' line.

Hope this helps.


More notes on super-fast data insertion in Rails   (Sat Feb 02 2008)

In my last post I explained how I've been using Zach Dennis's ar-extensions to achieve exponential speed increases when importing large amounts of data into my Rails HR database - Accelerate HR.

Now, thanks to the folks at Nimble Method, I understand better why bulk data imports and updates in Rails normally take so long to execute, and why ar-extensions works so well. It seems that we have a problem with garbage collection.

"Every time you allocate 8M of memory GC runs. Complex Rails requests can allocate hundreds or even thousands of megabytes of memory, making GC runs dozens of times. Each GC pass takes 50-150ms. You do the math."

OK, you say, then, let's stop collecting the garbage. Or let's send the garbage collectors round less often. But that's not such a good idea if you want your Rails site smelling sweet. In fact if you left the garbage out there long enough, civilized society on your site would just break down.

No, instead Nimble Method are suggesting a green approach. Generate less garbage.

And the way you do that is to patch your Ruby code. Just go to the post and you'll see a number of smart suggestions, together with an impressive set of before-and-after performance measurements.

This is a great piece of work, and I'm glad to read that the guys have been submitting these patches at enhancements to core Rails and Ruby. But for any medium- or large-scale data import or update job, it now makes even more sense to take the ar-extensions approach and to avoid altogether the need to loop through ActiveRecord::Base#create or ActiveRecord::Base#save. By creating a single sql statement for the whole collection of data, we have just one large task instead of several thousand small ones. And the result is much less garbage. And exponentially faster imports and updates.

Don't get me wrong. The Nimble Method patches are important. Sometimes, ar-extensions is not going to do the job for you. For example, although you can use model validation to check the data, callbacks are not supported. So when in my database I wanted to import a set of several hundred employee records from an external source, and simultaneously update employee benefit entitlements depending on contract-status, grade, length of service, I found myself going back to the standard ActiveRecord approach (which I described here). In circumstances like this, the patches are certainly going to be useful. And the Nimble Methods patches aren't restricted only to data import and update issues.

* * * * *

Finally following on from last time, you might be interested to see how I used ar-extensions to import from a CSV file using a rake task. I've installed two gems: ar-extensions and FasterCSV.

1. Create the rake task ... /lib/tasks/load_employees.rake

desc "Load new employees for location into database."
task (:load_employees => :environment) do
columns = [:first_name, :last_name, :staff_number, :job_id, :date_of_hire]
values = FasterCSV.read("#{RAILS_ROOT}/lib/employees.csv")
before_count = Employee.count
Employee.import columns, values
puts "Loaded #{Employee.count - before_count} entries."
end

2. Save the required data as /lib/employees.csv - making sure the columns in the CSV file are in the correct order, and omitting column headers. Note the number of records in the file.

3. Make sure these lines are at the end of the /config/environment file:

require 'fastercsv'
require 'ar-extensions'

4. Run the Rake task - and then check that the reported number of records imported is the same as in the original file.

(The next post in this series takes a closer look at validation when importing bulk data. It seems that whether validation is on or off makes a huge difference to your import speed.)

Filed under: Ruby on Rails






Super-fast bulk data imports in Rails with ar-extensions   (Sat Jan 19 2008)

Most Rails tutorials assume that when you import or move data it's going to be a record at a time. Yes, there are methods for importing bulk data. You can use migrations for initial data entry, and I've found FasterCSV useful for comma-delimited files.

But these are always painfully slow. Why? Because they iterate Create statements within a block. Like this:

FasterCSV.foreach("#{RAILS_ROOT}/lib/nationality.csv") do |row| Nationality.create(:nationality => row[1])
end

For each iteration, a separate SQL statement is being generated. And when you're importing thousands of records, that gives you plenty of time to go and get a coffee .. or two .. or three.

In the Accelerate HR database, we're going to be bulk-importing data frequently, notably when we capture and then interpret records from an attendance reader. Take a largish business - with a couple of thousand employees. They may clock in and out several times each day. We'll need to pick up the data once a day, and that could mean 10 - 20,000 records. Even if I ran the import in the background, that's quite a strain on the system - and delays of more than a second or two just aren't acceptable.

Well today, I solved the problem, thanks to a great piece of work from Zach Dennis. A process that was taking more than 10 minutes came down to less than 1 second.

Let me put you in the picture. I was working on the payroll setup. Each employee is assigned to a cost center, and in the Employee model we have the fields costcenter_id and costcenter_date, (this being the date when the cost center was assigned). But people may be moved from one cost center to another. The client I'm dealing with now supplies some of its staff to other businesses, so we use the cost center to determine how much to charge for these services. This means we also need a cost center history, showing all the assignments and the change-dates.

To get this started, I'd set the current cost centers for 1150 employees (using another automated process ... but that's another story.) When the user confirms that everyone's been assigned and the set-up is complete, I wanted to initialize the Costcenterhistory table sending the Employee id, costcenter_id and costcenter_date from Employee to Costcenterhistory. That's what was taking 10 minutes, using the normal looped Create method.

So I Googled around to find a better way, and came across Zach Dennis's ActiveRecord::Extensions - that's ar-extensions. (A quick heads-up here to Ilya Grigorik, who pointed me in the right direction.)

The way that Zach achieves exponential increases in speed is by adding methods to ActiveRecord that allow you to create a single SQL INSERT request instead of looping through a block. To go even faster, you can turn validation off - if you're sure that your import data has already been validated elsewhere of course.

I should add that ar-extensions isn't limited to the import feature. Zach also provides similar SQL native methods to accelerate your Finds, to benchmark your queries, and to create CSV files. You'll find all this and more in the docs.

But before I show you how it all worked for me, let me talk about the docs - and the hours of frustration trying to work it all out before I reached the wonderful Aha moment. I don't know if it's just me, but so much of the Rails documentation I come across takes me halfway there and then leaves me stranded. And Zach, I love your gem, but I wish you'd given me just a bit more help.

Installing the gem, no problem. And then right at the end of the documentation an intro to mass import/insertion of data. Open it up, and there's a description of what the gem does. I'm fine with that too. But then:

Basic Example

class Person < ActiveRecord::Base ; end

columns = [ :first_name, :last_name ]
values = [ [ 'Zach', 'Dennis' ], [ 'Mark', 'Van Holstyn' ], [ 'John', 'Doe' ] ]
Person.import column, values

And that's it.

Now, I'm new to extensions. I've never done this before. Do I add something like this at the bottom of my Costcenterhistory model, after the last end? So I try it.

class Costcenterhistory < ActiveRecord::Base
..
end

columns = [ :employee_id, :costcenter_id, :start_date ]
values = [ [ 1, 12, '2007-12-1' ], [2, 15, '2007-11-25' ] ]
Costcenterhistory.import column, values

And then I guess I need to call this in my Controller:

def setup_ok
Costcenterhistory.import
end

Well it doesn't work. Ah, there's a note in Ilya's post, telling me I need to require 'ar-extensions' in my controller. So I try that.

OK, it works. Or at least the 2 records are added to the database. But there's an error message: Invalid arguments! And a reference to the line number in the gem.

And I spend the next 4-5 hours trying adding arguments to the controller reference, reading anything useful I can find on ActiveRecord extensions, working my way through Zach's code. Nothing works. So I give up. And anyway there's a match I'm planning to watch on TV.

...With two hours away from it all, the answer is blindingly obvious. You don't need to add anything to the model at all! Zach's given us a new method - Model.import - that we can use immediately in the controller. I tried it and it worked immediately.

So my plea, not just to Zach but to everyone who writes all the wonderful Rails material, is to remember that accessibility is an issue for programmers: some of us are only partially sighted and we need our instructions bold and clear. Even if it's the obvious to you, state it for us.

Rant over. Let's see how it worked out.

A. Installed the gem.
B. Added require 'ar-extensions' to the end of /config/environment.rb
C. Model - no change
D. Controller:

def setup_ok

# find the session location - because I want to deal only with
# the employees in a single location.
@location = Location.find(session[:location])

#define the columns I want to import to
columns = [:employee_id, :costcenter_id, :date_from]

# Build the values statement with a Ruby array map
@employees = @location.employees
values = @employees.map { |e| [e.id, e.costcenter_id, e.costcenter_date.to_s] }

# Turn validation off to increase the speed.
# The original data has already been validated
options = { :validate => false }

# Call the ar-extensions method
Costcenterhistory.import columns, values, options

#return to the cost center menu page
redirect_to :controller => "cost_center"
end

Let's look at how I constructed the values statement.

Zach's Basic Example looks like this:
values = [ [ 'Zach', 'Dennis' ], [ 'Mark', 'Van Holstyn' ], [ 'John', 'Doe' ] ]

But clearly, with 1150 records to insert, I don't want to type in the values manually like this. Instead I use a Ruby array map to build the array of arrays. OK, it uses a block - it's iterative - but as we run through the loop we're not updating the database, only creating the SQL statement. So no performance hit.

Time to try it out. I was absolutely delighted - no flabbergasted would be a better word - when in an instant the process was completed. In fact, I didn't believe it had worked and wasn't convinced until I checked the database and found that all the records had been successfully imported. And then I had to try it again to convince myself.

Zach, thank you for a brilliant addition to my Rails tool-set. And I hope this little contribution helps to show others the way.


POSTSCRIPT

I later discovered that turning validation off during data imports - when possible - makes a huge difference to speed. Test details are here. In another post, I look at an outstanding piece of research from Nimble Method published in the last few days, helping to explain why some Rails processes can seem slow - and suggesting other fixes.

Filed under: Ruby on Rails




Posted 6 months ago by Mark Van Holstyn

Just thought I would throw out a refactoring for building the array of values...

@employees = Employee.find(:all, :conditions => ["location_id = ?", @location.id])
values = @employees.map { [e.id, e.costcenter_id, e.costcenter_date.to_s] }
Posted 6 months ago by alanmiles

Mark - thanks for pointing me towards array mapping. Great suggestion - and I've updated the post.

Just a small thing. Your values line is missing the block definition. It should read:

values = @employees.map { |e| [e.id, e.costcenter_id, e.costcenter_date.to_s] }
Posted 5 months ago by Ivan Vega

Nice! I didn't knew I needed this until this post. Only one question. Why are you querying Employee instead of just calling @location.employees ? Thanks!
Posted 5 months ago by alanmiles

Ivan - Because I didn't know any better. But I do now. Many thanks. I've updated the post, that was easy. Now to refactor the database ...
Posted 5 months ago by Timw

Lifesaver... thank you very much for this tutorial/blog entry I heard about this from the Rail Envy podcast on my way to work this morning. Knowing I was going to have to migrate a legacy database onto our new app in the next few weeks I was worried it was going to take forever. I wrote a rake task the standard way first, it was going to take the 62,000 rows about 4 or 5 hours. I then ported it to ar-extensions import feature and it took less then 20 minutes. Considering I have about 30 tables to do, some bigger some smaller, this is truly a lifesaver.
Posted 5 months ago by AllenH

Can someone show me what the raw SQL looks like that is generated by this batch extension? My understanding is Oracle doesn't support multiple-row inserts in a single INSERT statement. I'm wondering if this extension works with Oracle or only with MySQL and other RDBMs that support that INSERT syntax.
Posted 5 months ago by Anthony Eden

http://activewarehouse.rubyforge.org/adapter_extensions/ provides similar extensions that use native bulk loading.
Posted 4 months ago by btetampa

Much thanks to you and Zach for the library and instructions. I have an import process that was taking over an hour for 50K+ records which now takes less than 90 seconds. I'm using it from a rake task, not a controller, but it works just as well.


End-of-year HR reports   (Fri Dec 28 2007)

At one of my clients, the head of HR has taken a two-week vacation. While he's away, he's set his team a task - to prepare all the end-of-year HR reports. This week the team called me in to help them assemble the data.

This is a wonderful opportunity to do something important.

I've seen what they've produced before. It's the normal fare. Number of joiners, number of leavers, incidence of absence and lateness, appraisal scores etc. How does management use the material? I'm guessing, but judging from the data and from the way most businesses think around here, I'd say that it's going to be used mainly to calculate the end-of-year bonus. Cut the bonus if people were absent or late, or if the appraisal score was low, or if they only worked part of the year.

Which will make some people happy, some people sad, and do virtually nothing to improve the business.

My cunning plan is to give the boss a little surprise when he gets back. Oh, he'll get all the data he wants all right - consultants always need to do as they're told. But he's also going to get a set of 10 business-driver reports. (10 is entirely arbitrary, but seems to be about right: I don't want to flood him with data.)

So which 10 reports? I don't know yet. First I want to brainstorm with the team. And this promises to be an interesting session. I'll be asking them how they think HR contributes to the business, and why. There are some assumptions to be challenged here. For example, they've done a great job of recording all the training events. But has training actually made a difference to performance? If so, how is it measured? What's the return on investment in training? Or is it just enough to say that training's 'a good thing' - so the more, the better?

I've also posted a question at LinkedIn. 'What end-of-year reports can HR give management that will make a real contribution to the business's performance next year?' Already the responses have been pretty interesting. Here are a few highlights:

* Re-fills (positions filled more than once because the first person didn't work out)
* Internal hires/promotions/intercompany transfers (i.e. Do you really promote from within?)
* ... the strength of your branding and reputation as an employer to attract and retain talent ...
* Employee Return on Investment
* Cost of vacancies / recruitment

And in one response, a warning:

I have the funny feeling that particularly in HR quite a few people consider a job done with a report sent .... no report will add anything to your performance, if not acted upon.

If we can answer questions like these, and make sure they do lead to a course of action, then HR starts becoming a business-critical activity - a valuable profit-center rather than a tolerated cost-center.

In the next week or two, I'll be telling you how we got on, and which 10 reports we generated. But in the meantime, if you'd like to contribute your own ideas here, I'd love to hear them.

Filed under: HR






Payslips: Customizing tractor-feed page-size in Windows printouts   (Wed Dec 26 2007)

Most of my clients produce sealed payslips. You've seen them before. The top copy shows only the employee's name and department. When you tear open the paylip you get the full details on the carbon copy. And usually there's a bottom copy for the file.

On the whole, the sealed payslip is a good thing. It's great for payroll security. It cuts out the mindless task of stuffing hundreds of envelopes each month. And, if you're into antique technology, it means you can still use your noisy old impact printer.

But until today, I've always hated them. Why? Because the tractor-feed, pre-printed paper is always a non-standard size. Sort of, but not quite A5. And never one of the size options offered as standard by the printer manufacturer. So what happens? You print the first payslip and its fine. You print the second and it's a few centimetres lower on the sheet, and the employee's name slips out of the carbonated area. Print three and your totals are beginning to appear on the next page.

In the end, I always found a way to fix it, juggling with the settings on my page design, setting column spacing in the printer set-up. And we always took careful note of the settings, in case anything went wrong later. As it invariably did.

It went wrong today because one of my clients client had installed a new Windows version on all computers over the holiday period, and wiped out all the settings. So what were the settings? Nobody knew. The person who was keeping them safe left the business 3 years ago.

It all began to get pretty frustrating. I did what I could with the design. No good. I tried to check the printer manual. It was a pdf - and there was no Adobe Acrobat installed. So I downloaded it, and then of course found that IT had denied access to downloaded material. And there was no-one around with the authority to grant permission.

So I did what I should have done 10 years ago. Checked the web to find out whether anyone else was having the same problem.

They were, and the fix is really simple.

It turns out that all you have to do - on Windows anyway - is open up Printers and Faxes on the Control Panel. Then without selecting a printer, select Server Properties from the File sub-menu. In the dialog box that pops up, check the Create a new form box. Give your new form a name - 'Payslip' would do the trick - and enter the dimensions you want. Apply, and you're done.

So then go back to the payslip again in your database, and when you check the available printer settings, your newly created Payslip format will appear as one of the options. Print 500 perfectly-sized payslips, and rejoice.

Filed under: HR






List all blog entries filed under:

Employment Politics
HR
Implementation
Just thoughts
Ruby on Rails
Web 2.0

Can't find what you're looking for? Try this: -

Search blog for a word or phrase



 Subscribe to an RSS feed

Or get an email copy every time we post something new. Nothing new? Nothing mailed

Enter your email address:

Delivered by FeedBurner


If you're enjoying our blog, why not find out more, and maybe get involved?

ACCELERATE HR is a website built on Rails and designed for the enterprise. And we're building it live on the Web, right here.

Check out our home page HERE, or sign up for free HERE.


DESERT ISLAND BLOGS

Sharing a few of my favorites

HR Stimuli
McArthur's Rant

Jon Ingham's Strategic Human Capital Management Blog

The Rails Track
Railscasts

Web Power
The Technology Edge

Window on the Gulf
Mahmood's Den