RubyGems.org has made it much easier for all of us to contribute Ruby gems
Nick Quaranto (@qrush) revolutionized gem authoring in 2009 by launching a new gem repository called Gemcutter.org. Suddenly, for the first time, any Ruby developer could publish a new gem simply by running “gem push my_awesome_gem.” The speed and simplicity of this new process caused an explosion of Ruby gem development and publishing. Gemcutter.org was later moved to RubyGems.org and became the Ruby community’s default gem repository.
I enjoyed listening to Nick chat with the RubyRogues about RubyGems.org a couple of weeks ago, especially the stories about how Nick got started developing Gemcutter and its early history. Then last week I had the opportunity to chat with Nick about how RubyGems.org actually works. I was curious to know more about what happens on the server when I push a new gem file, how it serves gems to everyone so quickly, and how it works with the new Bundler 1.1 dependency API. Here are a few highlights of our conversation…
Q: Hi Nick, thanks for your time… I really appreciate it!
Heyo – no problem.
Q: I was thinking of writing about how RubyGems.org works internally, and I decided… why not ask you first? So today I have a bunch of technical questions for you, and a few diagrams as well. You can set me straight and correct the mistakes in my diagrams :)
Pushing a New Gem to RubyGems.org
Q: What happens on the RubyGems.org server when someone runs “gem push”?
That’s a really good question. I’ve actually considered doing a talk on this subject. First, a controller picks up the request and then we need to figure out what the client gave us. I think it’s the Pusher class that handles most of that. Using a four-step process, it figures out:
- What did you give us?
- Have we seen it before?
- Are you someone who’s allowed to mess with this gem?
- Then we actually save it.
Given all of those things are cool: we’ve found the gem, it’s valid, you’re on the owners list, we then need to send both the gem and the gemspec out to S3, and kick off a job to refresh the gem indexes.
Q: You just mentioned “gem indexes” – what are the gem index files and why does RubyGems need them?
The index is what gem fetch, gem install, and gem list use to figure out what’s available. Refreshing that index takes a while, because there are almost 200,000 gems now.
Even though we can put each gem in a place where everyone can download it immediately, it’s still important that we update the index frequently. This is because the RubyGems clients actually don’t know how to download new gems until they appear in the index.
Q: How do you create the index files?
I used to rebuild the index immediately when you pushed, and it would take about a minute, and the user would just have to wait while it was churning away on the server. So, when I moved it to Heroku, they told me this was a bad idea since Heroku kills requests after 30 seconds. I had to figure out a better way to do it. So that’s how I learned about delayed_job.
Q: How many gems are there now? How long does it take the delayed_job background job to rebuild the index?
Right now there are 177,000 indexed gems, and it takes around a minute or so to generate the entire index. This is how long you have to wait before your new gem can be installed. And that’s basically the whole process.
I think most people just assume that, once you push, it’s done – but it’s actually not. I don’t think we do a good job right now of telling them: “Hey you’ve got at least a minute to wait.” I think people are used to the instantaneousness of it now.
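To make that delay concrete, here is a minimal sketch of the "push now, refresh the index later" idea. It uses a plain Ruby Queue and Thread rather than delayed_job, and the IndexStore class and its method names are hypothetical, invented only for this illustration.

```ruby
# Hypothetical stand-in: IndexStore is an illustrative name,
# not part of RubyGems.org or delayed_job.
class IndexStore
  attr_reader :published

  def initialize
    @pending = Queue.new   # jobs waiting for the background worker
    @uploaded = []         # gems saved, but not yet visible to clients
    @published = []        # gems listed in the index
    @lock = Mutex.new
  end

  # Called inside the request: store the gem, enqueue a job, return at once.
  def push_gem(name)
    @lock.synchronize { @uploaded << name }
    @pending << :refresh_index
    "Successfully registered #{name}"
  end

  # Runs outside the request cycle; rebuilds the index from everything uploaded.
  def start_worker
    Thread.new do
      until @pending.pop == :stop
        @lock.synchronize { @published = @uploaded.sort }
      end
    end
  end

  def stop
    @pending << :stop
  end
end

store = IndexStore.new
message = store.push_gem("my_awesome_gem")  # returns immediately
visible_before = store.published.dup        # index has not been refreshed yet
worker = store.start_worker
store.stop
worker.join
visible_after = store.published
```

The push returns right away, but the gem only shows up in `published` once the worker has processed the queued job. That gap is the minute or so Nick describes.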
Q: How does RubyGems.org serve the actual gem files when developers run “gem install” or “bundle install?”
The app actually started out as two little Sinatra apps. One was for reimplementing the gem server; that’s on everyone’s RubyGems install and that’s how you can serve gems off of your own machine. I had to reimplement it because it had to be backed by a database and because it had to work outside of a file system. And the other Sinatra app was the UI. Although eventually I got to the point where I needed Rails, not Sinatra, for the UI.
I couldn’t call the Sinatra app “Gem Server” so I thought: What’s the closest thing to a server in a restaurant… and I said oh: “Hostess!” The “Hostess” is still in there, and that’s still Sinatra. And I think it’s fine for a Sinatra app, because the routes are really weird, because you’re sitting on a file system serving up actual files, because that’s how it was built, originally. And I think if they were Rails routes, and they certainly could be, they would be weird and gross.
Here’s one of the routes from the Hostess Sinatra app that Nick was talking about, which redirects clients to download the gem index files from S3:
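As a rough illustration of that kind of route (not the actual Hostess code), here is a minimal sketch written as a bare Rack-style lambda rather than with Sinatra, so it runs without any gems installed. The bucket host is a placeholder assumption, and `hostess_sketch` is a made-up name.

```ruby
# Placeholder host: the real storage URL is an assumption, not taken from RubyGems.org.
storage_host = "https://example-bucket.s3.amazonaws.com"

# Index files that RubyGems clients request.
index_files = %w[/specs.4.8.gz /latest_specs.4.8.gz /prerelease_specs.4.8.gz]

# A Rack-style app: takes the request env, returns [status, headers, body].
hostess_sketch = lambda do |env|
  path = env["PATH_INFO"]
  if index_files.include?(path)
    # Send the client to the stored copy instead of serving the file ourselves.
    [302, { "Location" => "#{storage_host}#{path}" }, []]
  else
    [404, { "Content-Type" => "text/plain" }, ["Not found"]]
  end
end

status, headers, _body = hostess_sketch.call("PATH_INFO" => "/specs.4.8.gz")
missing_status, = hostess_sketch.call("PATH_INFO" => "/nope")
```

The app never streams the index itself; it only answers with a redirect, which keeps the web process free for other requests.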
Q: What is CloudFront? Why does RubyGems.org use that?
CloudFront is used to serve the gems and gemspecs because they never really change. CloudFront is a CDN so there are nodes all over the world: there’s some in Japan, 1 or 2 in Europe, a few in the US, China and I think there’s one in South America. Those serve the big downloads for us: the gems and the gemspecs. The nice thing is that when you push those they are atomic and never change, you never need to update one.
Redis and the Bundler 1.1 API
One of the most exciting new features of RubyGems.org is the fast API it provides to obtain gem dependency information. To learn more about the API and how Bundler 1.1 uses it, see my article from October: Why Bundler 1.1 will be much faster.
Q: I heard once that you used Redis to help implement the dependency API that Bundler 1.1 uses. Why did you use Redis for this?
We wanted it to be really fast, and I think when Matt Mongeau from Thoughtbot and I were originally playing with it, we may have tried to do it inside of Postgres first and the queries were looking really gross and long. And we knew the data wasn’t going to change – well I guess “the data isn’t going to change that much” isn’t a good argument. I really enjoy Redis and I like messing around with it, and we were using it pretty heavily for counting downloads. I think I wanted to just try it out and see how fast it would be, and it ended up being really damn quick to resolve that first tree for the dependencies.
If we were to use Postgres for this, it would be a big join across three tables. And they’re big tables too, especially the dependency table. We wanted this to be fast – I don’t know maybe when Bundler 1.1 comes out it will all burn down.
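To picture the difference, here is a minimal sketch of a key-value layout for dependency data. A plain Hash stands in for Redis, and the key format, gem names, and versions are all invented for illustration; the real RubyGems.org schema isn't described in this conversation.

```ruby
# A plain Hash stands in for Redis; keys and data are made up.
store = {
  "deps:app-1.0"  => ["web-2.0", "json-1.1"],
  "deps:web-2.0"  => ["json-1.1", "util-0.3"],
  "deps:json-1.1" => [],
  "deps:util-0.3" => []
}

# Walk the dependency tree with one key lookup per gem version, no joins.
def dependency_tree(store, gem_version, seen = [])
  return seen if seen.include?(gem_version)  # already visited
  seen << gem_version
  store.fetch("deps:#{gem_version}", []).each do |dep|
    dependency_tree(store, dep, seen)
  end
  seen
end

tree = dependency_tree(store, "app-1.0")
```

Each step is a single lookup by key, which is the sort of access pattern a key-value store handles well, whereas the relational version would need the three-table join Nick mentions.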
Some Code Details
Q: Let’s take a closer look at the Pusher class, which you mentioned earlier. Pusher is the code that catches new gems and pushes them to S3 and CloudFront. I noticed that it’s not based on ActiveRecord and does not correspond to a database table. Is this a good idea in general?
I think it’s OK to have models that are ActiveRecord-ish. However, when I wrote this, it was before ActiveModel was around, so I would love for it to use validations and it kind of already uses “save”.
Here’s a small part of the Pusher class that we are discussing; this is the code that processes gems that are pushed to RubyGems.org:
Q: What about the “process” method – why did you decide to write it that way, calling out to 4 other methods? Is that a pattern that other people should learn from and emulate?
I see this kind of thing a lot – you split methods into discrete steps and call them one at a time, instead of having one giant method. I think that’s a fairly common pattern. I think the difference here is that I need them to be chained together. If I remember they used to be logical ANDs and not binary ANDs. I needed this to be a chain of things that happened and nothing with ActiveRecord kind of fits that.
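The chaining idea can be sketched in a few lines. This is not the real Pusher class: PushSketch, its step names, and its messages are hypothetical, loosely following the four questions listed earlier (what did you give us, have we seen it, are you allowed, save it).

```ruby
# Hypothetical class; the step bodies are invented for illustration.
class PushSketch
  attr_reader :message, :saved

  def initialize(gem_name, user, owners)
    @gem_name = gem_name
    @user = user
    @owners = owners   # gem name => list of owner names
    @saved = false
  end

  # Each step returns true or false; && stops at the first failure.
  def process
    parse && find && authorize && save
  end

  private

  def parse
    return true unless @gem_name.to_s.empty?
    notify("That doesn't look like a gem.")
  end

  def find
    @existing_owners = @owners[@gem_name]   # nil when the gem is new
    true
  end

  def authorize
    return true if @existing_owners.nil? || @existing_owners.include?(@user)
    notify("You do not have permission to push to this gem.")
  end

  def save
    @saved = true
    @message = "Successfully registered gem: #{@gem_name}"
    true
  end

  # Record why the chain stopped and return false so && short-circuits.
  def notify(text)
    @message = text
    false
  end
end

owners = { "my_awesome_gem" => ["alice"] }
ok = PushSketch.new("my_awesome_gem", "alice", owners).process
denied = PushSketch.new("my_awesome_gem", "mallory", owners)
denied_result = denied.process
```

Because every step returns a boolean, a failed authorization means `save` never runs, and the caller can read `message` to find out where the chain stopped.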
Q: Who else helped you out on RubyGems.org? Are there any other contributors?
The first two people to mention are Tom Copeland and Evan Phoenix. They’ve been helping out a lot! Evan’s on the RubyGems core team, so he knows a lot about what’s going on in the client side that I have no idea about. Tom’s been our sysadmin for a while; he’s been the sysadmin for RubyForge as well.
Those two guys have helped out a lot. Beyond that, pretty recently I put out a call for committers, and three guys have stepped up to help me:
They’ve been helping out a lot; Erik wrote a command line wrapper for the API called “gems” and he basically got inside of my brain when implementing the API and needed to discuss things with me that resolve the mental issues he was having with my brain :) So I decided: if you want to fix all these things then you might as well help me out with this! He’s been helping out a lot with pull requests and keeping the site updated.
Gabe is a Boston guy; he’s helped out a bunch with pull requests and such, and I’ve bounced a lot of ideas off of him. I’m not a sysadmin at all; I know how to do some things, but not when it comes to actual hardware and machines.
And Chris, who works at Swipely.com in Providence, has been great with not only fixing bugs but he wanted to help out making our infrastructure better. I’d say that those 5 have helped out a ton.
What Else Do You Need Help With?
Q: Aside from mirroring and the download graph API, both of which you mentioned on the RubyRogues podcast, what else do you need help with?
Anything on the issues list is open; if you want to work on it, we have plenty of people who will merge your stuff. If you’re looking for more “soft” things to do, things that don’t involve code, we have a help site (help.rubygems.org) that is filled with a lot of issues that can often be answered without having someone with server access involved. There are also a lot of questions on StackOverflow.
Another thing is that we have a “guides” site (guides.rubygems.org). This has tutorials – how would I do certain things? If you think there are things that people are running into a lot, you can throw that kind of thing here. For example, I have a little list going of talks I’ve found, tutorials and whatnot on a Resources page there.
Don’t be afraid to bug any of the contributors, especially if you’re looking for something to do. I’m more than happy to help people get contributing!
Thanks for all your time Nick – it was a real pleasure looking at your code and trying to get some understanding of what’s going on. Thanks a lot Nick! Bye…
I’m glad you didn’t run away screaming! That’s a good sign… thanks dude.