Headless WebKit and PhantomJS
If you’re reading this article, you most likely know what a browser is. Now take away the GUI, and you have what’s called a headless browser. Headless browsers can do most of the things that normal browsers do, and they are often faster because there is no GUI to manage. They’re great for automating and testing web pages programmatically. There are a number of headless browsers in existence, and PhantomJS is, in my opinion, the best.
Built on top of WebKit, the engine behind Safari (and, until 2013, Chrome), PhantomJS gives you a ton of browser power without the heavy GUI. Getting started with PhantomJS is easy: just download the executable. Next, create a file named hello.js and add the following lines.
console.log("Hello World!");
phantom.exit();
To execute the script, run the command shown below. Note that the phantomjs executable must either be in your current directory or somewhere in your environment’s PATH. If everything is configured properly, PhantomJS will print Hello World! to the console, and then terminate when phantom.exit() is called.
phantomjs hello.js
Working With Web Pages
Once PhantomJS is up and running, you can begin automating the Web. The following example loads the Google home page, and then saves a screenshot to a file. Line 1 creates a new instance of a web page. On line 3, the web page loads google.com. Once the page finishes loading, the onLoadFinished() callback function is executed. The callback receives a single argument, status, a string that is either "success" or "fail" depending on whether the page loaded. The URL of the loaded page is available in page.url. This property can be particularly useful when pages contain redirects and you want to know exactly where you landed. A screenshot is taken on line 8 using the page’s render() method. render() can create PNG, GIF, JPEG, and PDF files.
var page = require("webpage").create();
var homePage = "http://www.google.com/";
page.open(homePage);
page.onLoadFinished = function(status) {
  var url = page.url;
  console.log("Status: " + status);
  console.log("Loaded: " + url);
  page.render("google.png");
  phantom.exit();
};
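The example above renders a screenshot even when the load fails. As a rough sketch, you could branch on status before rendering and report the outcome through the exit code. The helper name exitCodeFor is made up for illustration, and the typeof phantom guard is only there so the helper can be loaded outside PhantomJS without side effects.

```javascript
// Hypothetical helper: map the status string passed to onLoadFinished
// ("success" or "fail") to a process exit code.
function exitCodeFor(status) {
  return status === "success" ? 0 : 1;
}

// The `phantom` global only exists inside PhantomJS.
if (typeof phantom !== "undefined") {
  var page = require("webpage").create();
  var homePage = "http://www.google.com/";

  page.onLoadFinished = function(status) {
    if (status === "success") {
      // Only take the screenshot when there is a page to capture.
      page.render("google.png");
    } else {
      console.log("Failed to load " + homePage);
    }
    phantom.exit(exitCodeFor(status));
  };
  page.open(homePage);
}
```

A non-zero exit code makes failures visible to whatever shell script or build tool launched PhantomJS.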
Page Settings
Page objects have a number of settings which can be customized based on your application’s needs. For example, if you’re only interested in downloading source code, you can speed up your application by ignoring image files and turning off JavaScript. The previous example is rewritten below to reflect these changes. The changed settings are shown on lines 3 and 4. Note that any settings changes must take place before the call to open(). If you view the screenshot from this example, you will notice that the Google logo image is missing, but the rest of the page is intact.
var page = require("webpage").create();
var homePage = "http://www.google.com/";
page.settings.javascriptEnabled = false;
page.settings.loadImages = false;
page.open(homePage);
page.onLoadFinished = function(status) {
  var url = page.url;
  console.log("Status: " + status);
  console.log("Loaded: " + url);
  page.render("google.png");
  phantom.exit();
};
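If you find yourself changing several settings at once, one option is to apply them from a single object before calling open(). This is only a sketch: applySettings is a hypothetical helper, not part of PhantomJS, and it simply copies values onto page.settings. The typeof phantom guard exists so the helper can be loaded outside PhantomJS.

```javascript
// Hypothetical helper: copy each override onto page.settings.
function applySettings(page, overrides) {
  for (var name in overrides) {
    if (Object.prototype.hasOwnProperty.call(overrides, name)) {
      page.settings[name] = overrides[name];
    }
  }
  return page;
}

// The `phantom` global only exists inside PhantomJS.
if (typeof phantom !== "undefined") {
  var page = applySettings(require("webpage").create(), {
    javascriptEnabled: false,
    loadImages: false
  });

  page.onLoadFinished = function(status) {
    console.log("Status: " + status);
    phantom.exit();
  };
  // The settings are already in place, so loading can start now.
  page.open("http://www.google.com/");
}
```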
Accessing the File System
So far, our examples have loaded pages and saved screenshots as image files. While this is undoubtedly cool, many applications would rather save a page’s source code to the file system. PhantomJS makes this possible by providing an extensive file system API. The following example uses the FileSystem module to write the google.com source code to a file. First, the FileSystem module is imported on line 2. On line 6, the output file is opened for writing. On line 7, the data is written to the file using the write() method. The actual source code is available via the page’s content property. Finally, the file is closed and PhantomJS is terminated.
var page = require("webpage").create();
var fs = require("fs");
var homePage = "http://www.google.com/";
page.open(homePage);
page.onLoadFinished = function(status) {
  var file = fs.open("output.htm", "w");
  file.write(page.content);
  file.close();
  phantom.exit();
};
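For a one-shot write like the one above, the fs module also has a write(path, content, mode) function that opens, writes, and closes in a single call. The sketch below wraps it in a hypothetical helper, savePageSource, which takes the fs object as a parameter so it is easy to try with a stand-in. The typeof phantom guard is only there so the helper can be loaded outside PhantomJS.

```javascript
// Hypothetical helper: write a page's source to `path` in one call.
// `fs` is any object with a write(path, content, mode) function.
function savePageSource(fs, page, path) {
  fs.write(path, page.content, "w");
  return path;
}

// The `phantom` global only exists inside PhantomJS.
if (typeof phantom !== "undefined") {
  var page = require("webpage").create();
  var fileSystem = require("fs");

  page.onLoadFinished = function(status) {
    if (status === "success") {
      console.log("Saved " + savePageSource(fileSystem, page, "output.htm"));
    }
    phantom.exit();
  };
  page.open("http://www.google.com/");
}
```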
Executing JavaScript
One of the most powerful features of PhantomJS is the ability to interact with a page via JavaScript. This makes it extremely easy to automate tasks such as clicking buttons and submitting forms. Our next example performs a Web search by loading the Google home page, entering a query, and then submitting the search form. The beginning of the example should look familiar. The new stuff begins on line 8, where we determine which page has been loaded. If this is the home page, the page’s evaluate() method is called. evaluate() executes the code you provide in the context of the page. This essentially gives you the same power as the page’s original developer. How cool is that?
var page = require("webpage").create();
var homePage = "http://www.google.com/";
page.open(homePage);
page.onLoadFinished = function(status) {
  var url = page.url;
  console.log("Status: " + status);
  console.log("Loaded: " + url);
  if (url === homePage) {
    page.evaluate(function() {
      var searchBox = document.querySelector(".lst");
      var searchForm = document.querySelector("form");
      searchBox.value = "JSPro";
      searchForm.submit();
    });
  } else {
    page.render("results.png");
    phantom.exit();
  }
};
Inside of evaluate(), we locate the search box and form. We set the value of the search box to “JSPro”, and then submit the form. This will cause the page’s onLoadFinished() callback to be triggered again. However, this time a screenshot is taken of the search results, and PhantomJS exits.
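One thing to keep in mind is that the function you hand to evaluate() runs inside the page, so it cannot see variables from your PhantomJS script. Values have to be passed in as extra arguments to evaluate(), and simple results come back as its return value. Here is a minimal sketch of that pattern. The helper name fillSearchBox is made up for illustration, the .lst selector is the one from the example above, and the typeof phantom guard is only there so the helper can be loaded outside PhantomJS.

```javascript
// Runs inside the page, so it can only use the page's globals (such as
// `document`) and the arguments handed to it. Hypothetical helper name.
function fillSearchBox(selector, query) {
  var box = document.querySelector(selector);
  if (!box) {
    return false;
  }
  box.value = query;
  return true;
}

// The `phantom` global only exists inside PhantomJS.
if (typeof phantom !== "undefined") {
  var page = require("webpage").create();

  page.onLoadFinished = function(status) {
    // Extra arguments to evaluate() are passed to the function, and the
    // function's return value is handed back to the PhantomJS script.
    var found = page.evaluate(fillSearchBox, ".lst", "JSPro");
    console.log("Search box found: " + found);
    phantom.exit();
  };
  page.open("http://www.google.com/");
}
```

Returning a boolean instead of throwing lets the PhantomJS side decide what to do when the page layout changes and the selector no longer matches.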
PhantomJS also provides two methods, includeJs() and injectJs(), which allow you to add external script files to a page. includeJs() is used to include any script file that is accessible from the page. For example, you can include jQuery in our previous example using the following code. Notice the call to includeJs() on line 9, as well as the jQuery syntax inside of evaluate().
var page = require("webpage").create();
var homePage = "http://www.google.com/";
page.open(homePage);
page.onLoadFinished = function(status) {
  var url = page.url;
  console.log("Status: " + status);
  console.log("Loaded: " + url);
  if (url === homePage) {
    page.includeJs("https://code.jquery.com/jquery-1.8.3.min.js", function() {
      console.log("Loaded jQuery!");
      page.evaluate(function() {
        var searchBox = $(".lst");
        var searchForm = $("form");
        searchBox.val("JSPro");
        searchForm.submit();
      });
    });
  } else {
    page.render("results.png");
    phantom.exit();
  }
};
The injectJs() method is similar to includeJs(). The difference is that the injected script file does not need to be reachable from the page being loaded. This allows you to, for example, inject scripts from your local file system.
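As a rough sketch of injectJs() in action, the script below injects a local copy of jQuery and then uses it inside evaluate(). Unlike includeJs(), injectJs() takes no callback; it returns true if the injection succeeded and false otherwise. The file name jquery.min.js is an assumption (any local script would do), injectLocalScript is a hypothetical helper, and the typeof phantom guard is only there so the helper can be loaded outside PhantomJS.

```javascript
// Hypothetical helper: inject a local script and report whether it worked.
function injectLocalScript(page, path) {
  var ok = page.injectJs(path);
  if (!ok) {
    console.log("Could not inject " + path);
  }
  return ok;
}

// The `phantom` global only exists inside PhantomJS.
if (typeof phantom !== "undefined") {
  var page = require("webpage").create();

  page.onLoadFinished = function(status) {
    // "jquery.min.js" is an assumed file next to this script.
    if (injectLocalScript(page, "jquery.min.js")) {
      var title = page.evaluate(function() {
        return $("title").text();
      });
      console.log("Title: " + title);
    }
    phantom.exit();
  };
  page.open("http://www.google.com/");
}
```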
PhantomJS and Node.js
Sadly, PhantomJS does not integrate particularly well with Node.js. A few projects have been created which try to control PhantomJS from Node.js, but they are all a bit of a kludge. Existing projects use the child process module to spawn instances of PhantomJS. Next, PhantomJS loads a special web page, which uses WebSockets to communicate with Node.js. It might not be ideal, but it works.
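To give a feel for the first half of that approach, here is a minimal Node.js sketch (not a PhantomJS script) that runs a program through the child process module and collects its output. It only covers launching a process; the WebSocket bridge those projects add on top is not shown. The helper name runScript is made up, and the commented usage assumes phantomjs is on your PATH and that hello.js is the file from the start of this article.

```javascript
var childProcess = require("child_process");

// Hypothetical helper: run `binary` with `args` and pass whatever it
// printed to standard output to `callback(err, output)`.
function runScript(binary, args, callback) {
  childProcess.execFile(binary, args, function(err, stdout) {
    if (err) {
      callback(err);
      return;
    }
    callback(null, stdout);
  });
}

// Example use, assuming `phantomjs` is on the PATH and hello.js exists:
//
// runScript("phantomjs", ["hello.js"], function(err, output) {
//   if (err) {
//     console.error("PhantomJS failed: " + err.message);
//     return;
//   }
//   console.log(output);
// });
```

Launching a fresh process per script is simple but slow, which is one reason the existing modules keep a single PhantomJS instance alive and talk to it instead.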
Two of the more popular PhantomJS Node modules are node-phantom and phantomjs-node. I recently started working on my own PhantomJS Node module named ghostbuster. Ghostbuster is similar to node-phantom, but attempts to reduce callback nesting by providing more powerful commands. Making fewer calls to PhantomJS also means less time is wasted communicating over WebSockets. Another alternative is zombie.js, a lightweight headless browser built on top of jsdom. Zombie isn’t as powerful as PhantomJS, but it is a true Node.js module.
Conclusion
After reading this article, you should have a basic grasp of PhantomJS. One of the nicest things about PhantomJS is how simple it is to use. If you’re already familiar with JavaScript, the learning curve is minimal. PhantomJS also supports a variety of other features that weren’t covered in this article. As always, I encourage you to check out the documentation. There are also a number of examples which show off PhantomJS in all its glory!