Spidering JavaScript-manipulated HTML with PhantomJS - Stefaan Lippens inserts content here

Even if you are in an automated context, not using a typical browser, fetching the data/webpage from a URL is easy. There are command line tools like wget and curl, and every programming language has its share of libraries to build on.

But what if you need the HTML markup of a page after it has been manipulated with JavaScript?

PhantomJS is a headless WebKit browser that is scriptable with a JavaScript API, useful for automated testing, screen capturing, page load time monitoring and the like. It's also handy for our problem: "downloading" JavaScript-manipulated page markup.

While there are quite a few pointers and examples on the internet, it was not immediately obvious how to get things working as desired. The biggest confusion came from the fact that the scriptable PhantomJS browser and the webpage run in separate "sandboxes". Apart from having separate variable pools, console.log() messages from the webpage context are not visible by default, while the ones from the PhantomJS context work normally, which was puzzling at first while debugging.
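To make the sandbox split a bit more concrete, here is a minimal sketch of how console messages from the webpage context can be relayed to the PhantomJS context through the page.onConsoleMessage callback. The helper name forwardConsole, the '[page] ' prefix and the example URL are just my choices for illustration, and the check on the phantom global is only there so the PhantomJS-specific part is skipped when the file is loaded elsewhere.

```javascript
// Hypothetical helper: relay console messages from the webpage context
// to the PhantomJS context, tagged with a prefix so they can be told apart.
function forwardConsole(page, prefix) {
    page.onConsoleMessage = function (msg) {
        console.log(prefix + msg);
    };
    return page;
}

// Only run the PhantomJS part when the `phantom` global is available.
if (typeof phantom !== 'undefined') {
    var page = forwardConsole(require('webpage').create(), '[page] ');

    // Example URL, swap in your own.
    page.open('http://www.google.com', function (status) {
        page.evaluate(function () {
            // Runs in the webpage context; only visible because of the relay.
            console.log('title is: ' + document.title);
        });

        // Runs in the PhantomJS context; visible without any extra setup.
        console.log('load status: ' + status);
        phantom.exit();
    });
}
```

Without the relay, only the 'load status' line would show up on standard output.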

Anyway, here is some boilerplate to solve the problem. It uses page.evaluate() to run a function in the webpage's context and return the document's outerHTML to the PhantomJS context, where it is printed on standard output.

// The URL to load.
var url = 'http://www.google.com';

// Start a PhantomJS "page" and point it to the desired URL.
var page = require('webpage').create();
page.open(url, function(status) {

    if (status === 'success') {

        // Run a function in the webpage's context and catch what it returns.
        var html = page.evaluate(function() {
            // Optionally, do some more page manipulation here.
            // ...

            // Return the HTML from the loaded and JS-manipulated page.
            // Note that a console.log() here in this context won't be visible (by default).
            return document.documentElement.outerHTML;
        });

        // Print the HTML to standard output.
        console.log(html);
    }

    // Make sure we quit PhantomJS, otherwise it keeps waiting for nothing.
    phantom.exit();
});

Update

Apparently there is a shorter way to do this, as pointed out by Ariya Hidayat, one of the guys behind PhantomJS:

var url = 'http://www.google.com';

var page = require('webpage').create();
page.open(url, function () {
    console.log(page.content);
    phantom.exit();
});

Great!

Still, in case you want to do some extra manipulation of the page before dumping it, the former boilerplate snippet might be useful.
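
As an illustration of such extra manipulation, here is a sketch that strips all script elements from the page before dumping the markup. The function name removeScripts and the choice of removing script tags are just an example of mine, not something PhantomJS prescribes. The function is handed to page.evaluate(), so it runs in the webpage's context.

```javascript
// Hypothetical example manipulation: remove all <script> elements and
// return the remaining markup. Meant to run in the webpage's context.
function removeScripts() {
    var scripts = document.getElementsByTagName('script');
    // The collection is live, so keep removing the first item until it is empty.
    while (scripts.length > 0) {
        scripts[0].parentNode.removeChild(scripts[0]);
    }
    return document.documentElement.outerHTML;
}

// Only run the PhantomJS part when the `phantom` global is available.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();

    // Example URL, swap in your own.
    page.open('http://www.google.com', function (status) {
        if (status === 'success') {
            // Evaluate the manipulation in the page and print the result.
            console.log(page.evaluate(removeScripts));
        }
        phantom.exit();
    });
}
```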