Diffbot: Repeated Collections and Merged APIs


In the previous post, Analyzing SitePoint Authors’ Profiles with Diffbot, we built a Custom API that automatically paginates an author’s list of work and extracts their name, bio, and a list of posts with basic data (URL, title, and date stamp). In this post, we’ll extract the links to the author’s social networks.

Introduction

If you look at the social network icons inside an author’s bio frame on their profile page, you’ll notice they vary. There can be none, or there can be eight, or anything in between. What’s worse, the links aren’t classed in any semantically meaningful way – they’re just links with an icon and an href attribute.

This makes turning them into an extractable pattern difficult, and yet that’s exactly what we’ll be doing here because hey, who doesn’t love a challenge?

To get set up, please read and go through the first part. When you’re done, re-enter the dev dashboard.

Repeated Collections Problem

The logical approach would be to define a new collection just like for posts, but one that targets the social network links. Then, just target the href attribute on each and we’re set, right? Nope.

Observe below:

As you can see, we get all the social links. But we get them all X times, where X is the number of pages in an author’s profile. This happens because the Diffbot API concatenates the HTML of all the pages into a single big one, and our collection rule finds several sets of these social network icon-links.

Intuition might lead you to use a :first-child pseudo-class on the parent of the collection to single out the first page, but the API doesn’t work like that. The rules are executed on each page separately, and only their results are concatenated, so there is never one combined document in which the first page is a first child. This is why it isn’t possible to use main:first-child to target the first page only. Likewise, at this moment the Diffbot API does not have a custom :first-page pseudo-class, though one appearing at a later stage is not out of the question. How, then, do we do this?
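To make the duplication concrete, here is a minimal Python sketch of the behavior described above. It is a simulation only, not Diffbot code: the `pages` data, the example links, and the `run_collection_rule` and `crawl` helpers are all hypothetical.

```python
# Hypothetical stand-in for an author profile spread over three pages.
# Every page repeats the same bio frame, so every page carries the same
# social links, while the posts differ from page to page.
SOCIAL = ["https://twitter.com/example", "https://github.com/example"]
pages = [
    {"social": list(SOCIAL), "posts": ["Post 1", "Post 2"]},
    {"social": list(SOCIAL), "posts": ["Post 3", "Post 4"]},
    {"social": list(SOCIAL), "posts": ["Post 5", "Post 6"]},
]


def run_collection_rule(page, name):
    """Apply one collection rule to a single page and return its matches."""
    return list(page[name])


def crawl(pages, name):
    """Run the rule on each page separately, then concatenate the results."""
    merged = []
    for page in pages:
        merged.extend(run_collection_rule(page, name))
    return merged


social = crawl(pages, "social")
posts = crawl(pages, "posts")

print(len(posts), len(set(posts)))    # six posts, all distinct
print(len(social), len(set(social)))  # six entries, but only two distinct links
```

Because the rule never sees more than one page at a time, no selector inside it can tell the first page from the rest. The fix has to happen at the level of which URLs the rule runs on.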

Custom Domain Regex and API Dupes

Diffbot allows you to define several custom rulesets for the same API endpoint, differing by domain regex. When an API endpoint is called, all the rulesets that match the URL are executed, the results are concatenated, and you get a unique set back, as if it were all defined in a single API. This is what we’re going to do, too.
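The merging behavior can be pictured with a small Python sketch. This is an illustration under assumptions, not Diffbot’s implementation: the `RULESETS` list, the first pattern, the placeholder field values, the pagination URL, and the choice to match patterns against the whole URL are all hypothetical.

```python
import re

# Two hypothetical rulesets registered under the same API name.
RULESETS = [
    {
        # Placeholder pattern for the ruleset from the first part:
        # matches every page of an author profile.
        "pattern": r"(http(s)?://)?(.*\.)?sitepoint.com/author/.*",
        "extract": lambda url: {"posts": ["placeholder post"]},
    },
    {
        # The pattern used later in this post: first page only.
        "pattern": r"(http(s)?://)?(.*\.)?sitepoint.com/author/[^/]+/",
        "extract": lambda url: {"social": ["placeholder link"]},
    },
]


def call_api(url):
    """Run every ruleset whose pattern matches the URL and merge the fields."""
    result = {}
    for ruleset in RULESETS:
        # Assumption: a pattern has to match the whole URL.
        if re.fullmatch(ruleset["pattern"], url):
            result.update(ruleset["extract"](url))
    return result


first_page = call_api("https://www.sitepoint.com/author/bskvorc/")
# Hypothetical pagination URL; the real path scheme may differ.
second_page = call_api("https://www.sitepoint.com/author/bskvorc/page/2/")

print(sorted(first_page))   # ['posts', 'social']
print(sorted(second_page))  # ['posts']
```

Only the first page triggers both rulesets, so the social links are collected once while the posts are still collected from every page.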

New Old API

Start off by going to “Create a rule” and selecting a Custom API, so you get asked for a name. Enter the same name as the one in the first part (in my case, AuthorFolio). Enter the typical test URL (https://www.sitepoint.com/author/bskvorc/) and run the Test. Then, change the domain regex to this:

(http(s)?://)?(.*\.)?sitepoint.com/author/[^/]+/

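This tells the API to target only the first page of any author profile. The pattern ends right after the author’s slug and its trailing slash, so paginated URLs, which continue past that point, are ignored completely.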
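You can check the pattern locally with Python’s `re` module. The sketch below uses the regex exactly as written above; the pagination URL is a hypothetical example, and whether Diffbot matches a pattern against the whole URL or just looks for it somewhere inside is an assumption here, which is why both variants are shown.

```python
import re

# The domain regex from this post, copied verbatim.
DOMAIN_REGEX = r"(http(s)?://)?(.*\.)?sitepoint.com/author/[^/]+/"

FIRST_PAGE = "https://www.sitepoint.com/author/bskvorc/"
# Hypothetical pagination URL; the real path scheme may differ.
PAGE_TWO = "https://www.sitepoint.com/author/bskvorc/page/2/"


def matches_whole_url(url):
    """True if the pattern accounts for the entire URL."""
    return re.fullmatch(DOMAIN_REGEX, url) is not None


def matches_anywhere(url):
    """True if the pattern is found somewhere inside the URL."""
    return re.search(DOMAIN_REGEX, url) is not None


print(matches_whole_url(FIRST_PAGE))  # True
print(matches_whole_url(PAGE_TWO))    # False
print(matches_anywhere(PAGE_TWO))     # True
```

The first-page-only behavior holds when the pattern has to cover the whole URL. If you reuse the pattern in a tool that searches for a match anywhere in the string, anchor it with `^` and `$` to get the same effect.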

Define a Collection

Next, define a new collection. Call it “social” and give it the selector .contributor_social li. Then add a custom field to it: name the field “link”, and give it a selector of “a” with an attribute filter of href. Save, wait for the reload, and notice that you now have the four links extracted:
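For readers who want to see what this rule amounts to, here is a rough local equivalent using only Python’s standard library. The `SAMPLE_HTML` markup and the `SocialCollector` class are hypothetical; the real profile markup may differ, and Diffbot applies the rule for you, so nothing like this is needed in practice.

```python
from html.parser import HTMLParser

# Hypothetical markup modeled on the selectors used above.
SAMPLE_HTML = """
<ul class="contributor_social">
  <li><a href="https://twitter.com/example"><i class="icon"></i></a></li>
  <li><a href="https://github.com/example"><i class="icon"></i></a></li>
</ul>
<a href="https://www.sitepoint.com/">not a social link</a>
"""


class SocialCollector(HTMLParser):
    """Collect the href of each <a> found inside `.contributor_social li`."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # nesting depth inside .contributor_social (0 = outside)
        self.li_open = 0   # number of <li> elements currently open inside it
        self.items = []    # one {"link": ...} dict per extracted link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.depth or "contributor_social" in classes:
            self.depth += 1
        if self.depth and tag == "li":
            self.li_open += 1
        if self.li_open and tag == "a" and attrs.get("href"):
            self.items.append({"link": attrs["href"]})

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "li" and self.li_open:
                self.li_open -= 1
            self.depth -= 1


collector = SocialCollector()
collector.feed(SAMPLE_HTML)
print(collector.items)
```

The link outside the list is skipped, which mirrors how the collection selector scopes the “link” field to the social list items only.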

Social Network Names

But having just the links there kind of sucks, doesn’t it? It would be nice if we had a social network name, too. SitePoint’s design, however, doesn’t class them in any semantically meaningful way, so there’s no easy way to get the network name. How can we tackle this?

Regex Rewrite Filters to the rescue!

Custom fields have three available filters:

  • attribute: extracts an HTML element’s attribute
  • ignore: ignores certain HTML elements based on a CSS selector
  • replace: replaces the content of the output with the given content if a regex pattern matches

We’ll be using the third one – read more about them here.

Add a new field to our “social” collection. Give it the name “network”, the selector a, and an attribute filter of href so it extracts the link just like the “link” field. Then, add a new “replace” filter.

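SitePoint author profiles can have the following social networks attached: Google+, Twitter, Facebook, Reddit, YouTube, Flickr, GitHub and LinkedIn. Luckily, each of those has pretty straightforward URLs with full domain names, so regexing the names out is a piece of cake. The correct regex is ^.*KEYWORD.*$, where KEYWORD is a distinctive part of the network’s domain (twitter, for example) and the replacement is the network’s name. Because the pattern spans the whole value, a match swaps the entire URL for the name.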
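The effect of these replace filters can be reproduced in a few lines of Python. This is a sketch of the idea only: the `NETWORKS` keyword table and the `network_name` helper are hypothetical, and the keywords are guesses based on each network’s domain, not values taken from the dashboard.

```python
import re

# Hypothetical keyword-to-name table, one entry per replace filter.
NETWORKS = {
    "google": "Google+",
    "twitter": "Twitter",
    "facebook": "Facebook",
    "reddit": "Reddit",
    "youtube": "YouTube",
    "flickr": "Flickr",
    "github": "GitHub",
    "linkedin": "LinkedIn",
}


def network_name(url):
    """Return the network name for a profile URL, or the URL if none matches."""
    for keyword, name in NETWORKS.items():
        pattern = rf"^.*{re.escape(keyword)}.*$"
        # The pattern covers the whole string, so a match replaces all of it.
        replaced, count = re.subn(pattern, lambda _match: name, url)
        if count:
            return replaced
    return url


print(network_name("https://twitter.com/example"))       # Twitter
print(network_name("https://plus.google.com/+Example"))  # Google+
print(network_name("https://example.org/profile"))       # unchanged
```

One caveat of keyword matching: a URL such as a Twitter profile whose handle contains “github” would match two keywords, and the result then depends on the order in which the filters are applied.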

Save, wait for the reload, and notice that you now have a well formed collection of an author’s social links.

Bringing the APIs together

Finally, let’s fetch all this data at once. According to what we said above, executing a call to an author page with the AuthorFolio API should now give us a single JSON response containing the sum of everything we’ve defined so far, including the fields from the first post. Let’s see if that’s true. Visit the following link in your browser:

http://diffbot.com/api/AuthorFolio?token=xxxxxxxxx&url=https://www.sitepoint.com/author/bskvorc/

This is the result I get:

As you can see, we successfully merged the two APIs and got back a single result of everything we wanted. We can now consume this API URL at will from any third-party application, and pull in the portfolio of an author, easily grouping by date, detecting changes in the bio, registering newly added social networks, and much more.
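As a starting point for consuming the endpoint from your own code, here is a minimal Python sketch built on the standard library. The `build_url` and `fetch_author` helpers are hypothetical names, the token is the same placeholder used in the link above, and the exact shape of the JSON response is not assumed beyond it being valid JSON.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "http://diffbot.com/api/AuthorFolio"


def build_url(token, author_url):
    """Build the request URL, percent-encoding the author page URL."""
    query = urllib.parse.urlencode({"token": token, "url": author_url})
    return f"{API_BASE}?{query}"


def fetch_author(token, author_url):
    """Call the Custom API and return the decoded JSON response."""
    with urllib.request.urlopen(build_url(token, author_url)) as response:
        return json.load(response)


request_url = build_url("xxxxxxxxx", "https://www.sitepoint.com/author/bskvorc/")
print(request_url)

# With a real token in place of the placeholder:
# data = fetch_author("your-token", "https://www.sitepoint.com/author/bskvorc/")
```

Encoding the `url` parameter keeps any query string or special characters in the author page URL from being confused with the API’s own parameters.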

Conclusion

In this post we looked at some trickier aspects of visual crawling with Diffbot like repeated collections and duplicate APIs on custom domain regexes. We built an endpoint that allows us to extract valuable information from an author’s profile, and we learned how to apply this knowledge to any similar situation.

Did you crawl something interesting using these techniques? Did you run into any trouble? Let us know in the comments below!

Bruno Skvorc

Bruno is a blockchain developer and technical educator at the Web3 Foundation, the foundation that's building the next generation of the free people's internet. He runs two newsletters you should subscribe to if you're interested in Web3.0: Dot Leap covers ecosystem and tech development of Web3, and NFT Review covers the evolution of the non-fungible token (digital collectibles) ecosystem inside this emerging new web. His current passion project is RMRK.app, the most advanced NFT system in the world, which allows NFTs to own other NFTs, NFTs to react to emotion, NFTs to be governed democratically, and NFTs to be multiple things at once.
