Bruno is a professional web developer from Croatia with Master's degrees in Computer Science and English Language and Literature. After having left his position as lead developer for a large online open access publisher, he now works as the PHP editor for SitePoint and on various freelance projects. When picking them, he makes sure they all involve new and exciting web technologies. In his free time, he writes tutorials on his blog and stalks Google's job vacancy boards.
In this Quick Tip we’ll be using Phalcon 2.0 – a pre-release. If you’re reading this when Phalcon is already in a mature 2.x stage, let us know and we’ll update the post. To install the stable 1.x version, just run

sudo apt-get install php5-phalcon

and it should work.
In the previous post on Harvesting SitePoint Authors’ Profiles with Diffbot we built a Custom API that automatically paginates an author’s list of work and extracts their name, bio and a list of posts with basic data (URL, title and date stamp). In this post, we’ll extract the links to the author’s social networks.
If you look at the social network icons inside an author’s bio frame on their profile page, you’ll notice they vary. There can be none, or there can be eight, or anything in between. What’s worse, the links aren’t classed in any semantically meaningful way – they’re just links with an icon and a href attribute.
This makes turning them into an extractable pattern difficult, and yet that’s exactly what we’ll be doing here because hey, who doesn’t love a challenge?
To get set up, please read and go through the first part. When you’re done, re-enter the dev dashboard.
Repeated Collections Problem
The logical approach would be to define a new collection just like for posts, but one that targets the social network links. Then, just target the href attribute on each and we’re set, right? Nope.
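Conceptually, such a collection definition pairs a CSS selector matching the repeated element with the attribute to extract. A rough sketch of the idea follows – note that the field names are illustrative, not Diffbot’s actual schema, and the selector is an assumption about SitePoint’s markup:

```json
{
  "collection": "socialLinks",
  "selector": ".contributor-bio a",
  "fields": [
    { "name": "profileUrl", "attribute": "href" }
  ]
}
```

The catch, as we’ll see, is that nothing in the markup tells us which network each extracted href belongs to.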
As the managing editor of the PHP channel for SitePoint, I deal with dozens of authors, hundreds of topics and a constantly full inbox. Filtering out inactive authors and pushing the prolific ones to the top of the queue is hard when the channel is this big and a one-man operation, so enlisting the help of bots only makes sense.
I recently started building an in-depth work-analysis tool that helps me with social spread, reviews, activity tracking, personality profiling, language editing and more, hopefully automating a large portion of my work soon. A key component is author activity: specifically, tracking how much an author publishes in any given week, month or season.
Each SitePoint author has a profile page which lists their bio, their social network links, and their published posts. For example, here’s mine and here’s Peter’s. Each post snippet has the relevant information I need in order to track activity: a date, a title and a URL. By grabbing all of an author’s posts, we can group them by date and extract some statistics.
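Once the posts are grabbed, the grouping itself is simple. Here’s a minimal sketch in PHP, using hypothetical post data of the shape described above (date, title, URL):

```php
<?php

// Hypothetical post data, shaped like the snippets on a profile page
$posts = [
    ['date' => '2014-06-02', 'title' => 'Post A', 'url' => 'https://example.com/a'],
    ['date' => '2014-06-17', 'title' => 'Post B', 'url' => 'https://example.com/b'],
    ['date' => '2014-07-01', 'title' => 'Post C', 'url' => 'https://example.com/c'],
];

// Group posts by year-month to get a per-month activity count
$activity = [];
foreach ($posts as $post) {
    $month = date('Y-m', strtotime($post['date']));
    $activity[$month] = isset($activity[$month]) ? $activity[$month] + 1 : 1;
}

print_r($activity); // 2014-06 => 2 posts, 2014-07 => 1 post
```

The same array could just as easily be grouped by week or season by changing the `date()` format string.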
Granted, the publication time depends on a variety of factors – from my own ability to squeeze reviews into the current work queue, to sponsors and other channel preferences. Still, any insight is good insight, and as my tool helps me automate parts of my workflow, reviews will happen sooner.
That said, how can we fetch this author data reliably?
To API or not to API
The logical approach would be to consume an API. Something like a call to
https://api.sitepoint.com/v1/author/bskvorc?area=posts

would make the entire task a breeze. Alas, SitePoint has no API and we’re forced to crawl it, unless we have database access (for the purpose of this demo, let’s assume we don’t).
Diffbot to the rescue! We’ve written about Diffbot before, so give our introductory post a read if you haven’t already to get familiar with it. In a nutshell, we’ll use Diffbot to automatically crawl all the pages of an author’s profile, extract the data we need, and get it back in JSON format.
As per this post:
Processing is an environment/programming language that is meant to make visual, interactive applications extremely easy to write. It can be used for everything from teaching children how to code to visualizing scientific data.
It’s the language that’s partially behind wizardry like this:
and, of course, everything you can find here.
But, if we had processing.js before, what’s P5.js?
What is P5.js?
to make coding accessible for artists, designers, educators, and beginners, and reinterprets this for today’s web
So, it sounds like Processing itself. But what is it really?
Ease up, confused reader, we’ll get to it! First, watch their amazingly enthusiastic introduction here, then come back.
Did it click? Get it now? No? Ok. Let’s break it down.
Far too often, I see people shying away from the newest technologies in the name of backwards compatibility. “We can’t move the minimum PHP requirement to 5.5 because we still have 50% of our users on 5.4!”, they say. “There’s no way for us to move to Guzzle 4+; our back end is built on version 3 and it would take too much time and money.” I like the common argument from WordPress best: “We can’t go full OOP and logic/presentation decoupling, because most of our users are running shared hosts with PHP 5.1, or don’t know OOP and/or MVC”.
Legacy Code – a big NO
This might come off as controversial, but I firmly believe there is no room for legacy code in modern systems. Allow me to elaborate before you sharpen your pitchfork and light your torch. What I mean is: there should be absolutely zero reason to retroactively implement the functions you’re adding to the new version into the old version, just because some people are still using it – even if those people are a vast majority.
To clarify: bugfixing legacy versions until their long term support contract runs out or you feel like it if you’re in charge, yes. Adding new features you think up for version X into version X-1 in order not to make the X-1 users mad, absolutely and 100% not. Likewise, adding X-1 code into version X just because it can “serve the purpose” should be illegal. If you’re still charging people for X-1 and basing your upgrades on that, your business plan is bad and you should feel bad.
Who am I to spout such nonsense, though, right? I’ve never had to maintain a large project with stakeholders and boards to please – a project that moves super slowly and keeps everyone happy as long as it works, never mind that it could, potentially, work 100 times more safely and 1000 times faster – right? Not exactly.

My biggest baby was a big publisher site with a complex back end, built on ZF1. If you’ve ever done anything in ZF1, you know what a vortex of painful antipatterns it is. When the application started showing signs of deterioration due to increased traffic, I rebuilt the front end of the back end (the most heavily used part of the app) in its entirety on an Ajax interface and API calls, lightening the load drastically and buying enough time to rebuild our entire suite of applications on the only thing the higher-ups allowed – Zend Framework 2. If you’ve done anything in ZF2, you know it’s a slightly less dense vortex of antipatterns and bloat, but a vortex nonetheless. What I’m trying to say is this: huge upgrades and total rewrites can happen, if capable people are behind them. If all you’re doing is agile meetings and brainstorming, no amount of LTS contracts can stop you from looking stupid in five years.
Even if you’re doing free and/or open source work, you shouldn’t break your back for X-1 users, because you’re only doing them a favor by doing a major version increment, and with it, a major upgrade with a potential BC break. They should either adapt, or wither away.
So why should we exile legacy code from modern systems?
A handful of news items cropped up again that didn’t really get the attention they deserved, so I’ll use this opportunity to rehash some of them and explain others. The “news” here is usually less than brand new – rather, it’s bits of information you should pay attention to if you’re even the least bit interested in the PHP community and environment.
The Zend Rush
Zend, the company behind anything that has “Zend” in its name (Framework, Server, Studio, Engine…) has been very aggressive in product updates of late. They started the year off with a new release of their Zend Certification exam, continued with a huge update to the Zend Server, which we’ve covered in another post, and wrapped things up by updating Zend Studio to a new major version – it now goes to 11! We’ll be taking a more in-depth look at it in another post.
Have you ever wondered how social networks do URL previews so well when you share links? How do they know which images to grab, whom to cite as an author, or which tags to attach to the preview? Is it all crawling with complex regexes over source code? Actually, more often than not, it isn’t. Meta information defined in the source can be unreliable, and sites with less-than-stellar reputations often use it as a keyword carrier, attempting to get search engines to rank them higher. Isn’t what we, the humans, see in front of us what matters anyway?
If you want to build a URL preview snippet or a news aggregator, there are many automatic crawlers available online, both proprietary and open source, but you seldom find something as niche as visual machine learning. This is exactly what Diffbot is – a “visual learning robot” which renders a URL you request in full and then visually extracts data, helping itself with some metadata from the page source as needed.
After covering some theory, in this post we’ll do a demo API call at one of SitePoint’s posts.
The PHP library for Diffbot is somewhat out of date, and as such we won’t be using it in this demo. We’ll be performing raw API calls, and in some future posts we’ll build our own library for API interaction.
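A raw call to Diffbot’s Article API is just an HTTP GET with the token and target URL as query parameters. Here’s a sketch – the token is a placeholder you’d replace with your own developer token, and the decoded response is a trimmed-down illustration of the kind of JSON Diffbot returns:

```php
<?php

// Placeholder token – substitute your own Diffbot developer token
$token = 'demo_token';
$target = 'https://www.sitepoint.com/some-post/';

// Build the raw Article API request URL
$apiCall = 'http://api.diffbot.com/v2/article?token=' . $token
         . '&url=' . urlencode($target);

// In a live call, you would fetch the response with:
//   $json = file_get_contents($apiCall);
// For illustration, here's a trimmed-down response of the kind
// Diffbot returns, decoded into an associative array:
$json = '{"title":"Sample Post","author":"Bruno Skvorc","url":"https://www.sitepoint.com/some-post/"}';
$article = json_decode($json, true);

echo $article['title']; // Sample Post
```

Notice that the target URL must be passed through `urlencode()`, since it’s embedded in another URL’s query string.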
Zend Server 7 is an excellent tool for managing, deploying and monitoring your PHP applications. We’ve covered its installation in this quick tip, and we’ve given it a somewhat thorough review in this post.
In this Quick Tip, we’ll go through the procedure of installing a custom PHP extension into it. We’ll be installing Phalcon, but the procedure is identical for nearly all extensions out there.
Step 1: Install Zend Server
Have an instance of ZS up and running. Follow this quick tip to do that.
Step 2: Modify the $PATH
To use the command line PHP tools that come bundled with Zend Server, we need to add the path to them to the system $PATH variable:
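On a default Linux install, Zend Server keeps its command line tools (including its bundled PHP binary) under /usr/local/zend/bin – a sketch of the change, assuming that default location (adjust the path if your install differs):

```shell
# Append Zend Server's bin directory to the PATH permanently.
# /usr/local/zend is the default install location on Linux.
echo 'export PATH=$PATH:/usr/local/zend/bin' >> ~/.bashrc

# Apply the change to the current session as well
export PATH=$PATH:/usr/local/zend/bin
```

Open a new terminal (or source `~/.bashrc`) and the bundled tools should be reachable by name.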
Zend Technologies is the company which funds the development of the Zend Engine (the engine PHP is based on), as well as Zend Framework and some other projects like Apigility. They also build proprietary software of high professional caliber, designed for high intensity PHP work in large companies – software like Zend Studio and Zend Server – though both are excellent tools for solo devs as well. In this post, we’ll be taking a look at the latter.
What is Zend Server?
Zend Server is, essentially, a locally-run web application which helps you run, deploy, debug and production-prepare other applications you write. It’s more than a developer helper, though – you can install it on your production servers and have it take care of hosting, clustering, file distribution and more.
It automatically installs Zend Framework (both version 1 and 2 for some reason) and Symfony 2, and supports GUI-based management of other libraries and projects for total ease of use. All operating systems and platforms are supported, and you can install it alongside Apache or Nginx – your choice. You can have it pull in PHP version 5.4 or 5.5, and it will do the rest on its own once you run the installation script.
The latest version of ZS, version 7, comes in several licenses and flavors, so give those a read if you’d like to know about the differences.
The concept of Zend Server might be a bit too abstract to grasp right now if you’ve never encountered it before, so let’s just walk through it instead.
I recently took a look at Zend Server 7, the latest version of the powerful application monitor/manager suite. This quick tip will show you how to get it installed on a Vagrant box so you too can experiment with its features.
Step 1: Install Prerequisites
Make sure you have Virtualbox and Vagrant installed – the newer the better.
Step 2: Clone and Boot
Clone this repository. Adapted from Homestead Improved and originally Homestead, this setup will boot up a bare-bones Trusty (Ubuntu 14.04 LTS 64bit) VM. The only real difference from a truly bare-bones Trusty box is that we’ve forwarded port 10081, which is what Zend Server uses by default. After the cloning is complete, boot it with vagrant up:

git clone https://github.com/Swader/trustead
cd trustead
vagrant up
After the booting is done, enter the VM with