By Bruno Skvorc

In contrast to our “mainstream” paid course about PHP, here I like to explore the weird and forgotten areas of the language.

Recently, I ventured into the section of the PHP manual that lists extensions for Human Language and Character Encoding support. I had never looked at them as a whole; when dealing with gettext, for example, I always landed directly on its page and ignored the rest. Of those others, one caught my eye, especially in this day and age given the various controversies: the Gender extension.

This extension, in short, tries to guess the gender of first names. As its introduction says:

Gender PHP extension is a port of the gender.c program originally written by Joerg Michael. The main purpose is to find out the gender of firstnames. The current database contains >40000 firstnames from 54 countries.

This is interesting beyond the fact that the author is kinda called George Michael. In fact, there are many aspects of this extension that are quite baffling.

While its last stable release was in 2015, the extension uses namespaces, which clearly indicates that it’s not some long-lost remnant of the past: a relatively recent effort was made to bring it in line with modern coding standards. Even the example code uses namespaces:

<?php

namespace Gender;

$gender = new Gender;

$name = "Milene";
$country = Gender::FRANCE;

$result = $gender->get($name, $country);
$data = $gender->country($country);

switch($result) {
    case Gender::IS_FEMALE:
        printf("The name %s is female in %s\n", $name, $data['country']);
    break;

    case Gender::IS_MOSTLY_FEMALE:
        printf("The name %s is mostly female in %s\n", $name, $data['country']);
    break;

    case Gender::IS_MALE:
        printf("The name %s is male in %s\n", $name, $data['country']);
    break;

    case Gender::IS_MOSTLY_MALE:
        printf("The name %s is mostly male in %s\n", $name, $data['country']);
    break;

    case Gender::IS_UNISEX_NAME:
        printf("The name %s is unisex in %s\n", $name, $data['country']);
    break;

    case Gender::IS_A_COUPLE:
        printf("The name %s is both male and female in %s\n", $name, $data['country']);
    break;

    case Gender::NAME_NOT_FOUND:
        printf("The name %s was not found for %s\n", $name, $data['country']);
    break;

    case Gender::ERROR_IN_NAME:
        echo "There is an error in the given name!\n";
    break;

    default:
        echo "An error occurred!\n";
    break;

}

While we have this code here, let’s take a look at it.

Some really confusing constant names in there – how does a name contain an error? What’s the difference between unisex and couple names? Digging deeper, we see some more curious constants.

For example, the class has short names of countries as constants (e.g. BRITAIN). Passing one of them to the country method returns an array containing both a short code for the country (UK) and the full country name (Great Britain).

$gender = new Gender\Gender;
var_dump($gender->country(Gender\Gender::BRITAIN));

array(2) {
  'country_short' =>
  string(2) "UK"
  'country' =>
  string(13) "Great Britain"
}

Only, UK isn’t the international code one would expect here. The ISO 3166-1 alpha-2 code for the United Kingdom is GB. Why they chose this route rather than rely on an existing package of geonames or even just an accurate list of constants is anyone’s guess.
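If you wanted to accept standard country codes in your own code, a small lookup table could bridge the gap. This is only a minimal sketch: the table and the function name genderConstantNameForIso() are hypothetical, not part of the extension, and only the four countries used in this article are mapped.

```php
<?php
// Hypothetical lookup table: ISO 3166-1 alpha-2 code => name of a Gender class constant.
// Only the four countries that appear in this article are listed.
const ISO_TO_GENDER = [
    'GB' => 'BRITAIN',
    'FR' => 'FRANCE',
    'HR' => 'CROATIA',
    'US' => 'USA',
];

/**
 * Returns the Gender constant name for an ISO code, or null if it is not mapped.
 */
function genderConstantNameForIso(string $isoCode): ?string
{
    return ISO_TO_GENDER[strtoupper(trim($isoCode))] ?? null;
}

var_dump(genderConstantNameForIso('gb')); // string(7) "BRITAIN"
var_dump(genderConstantNameForIso('TN')); // NULL
```

With the extension loaded, the returned name could then be turned into the actual constant value with PHP’s constant() function.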

The get method returns the gender of a name, given the name and, optionally, a country (it searches across all countries if the country is omitted). But the country has to be one of the class constants, so you need to know them by heart or expose their values in your UI, because they won’t match any standard country code list. The return value is also an integer, another constant defined in the class, like so:

const integer IS_FEMALE = 70 ;
const integer IS_MOSTLY_FEMALE = 102 ;
const integer IS_MALE = 77 ;
const integer IS_MOSTLY_MALE = 109 ;
const integer IS_UNISEX_NAME = 63 ;
const integer IS_A_COUPLE = 67 ;
const integer NAME_NOT_FOUND = 32 ;
const integer ERROR_IN_NAME = 69 ;

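At first glance there’s no rhyme or reason to these values, though they do line up with ASCII character codes: 70 is F, 102 is f, 77 is M, 109 is m, 63 is ?, 67 is C, 32 is a space and 69 is E. Mnemonic, perhaps, but hardly self-explanatory.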
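A quick way to inspect these numbers is to pass them through PHP’s built-in chr(), which returns the character for a given byte value. The sketch below copies the values from the list above into a plain array so it runs without the extension; the helper name decodeCodes() is made up for illustration.

```php
<?php
// Values copied from the constant list above, so this runs without the extension.
$codes = [
    'IS_FEMALE'        => 70,
    'IS_MOSTLY_FEMALE' => 102,
    'IS_MALE'          => 77,
    'IS_MOSTLY_MALE'   => 109,
    'IS_UNISEX_NAME'   => 63,
    'IS_A_COUPLE'      => 67,
    'NAME_NOT_FOUND'   => 32,
    'ERROR_IN_NAME'    => 69,
];

/**
 * Maps each integer to its single-byte character using chr().
 * array_map() keeps the string keys because only one array is passed.
 */
function decodeCodes(array $codes): array
{
    return array_map('chr', $codes);
}

foreach (decodeCodes($codes) as $name => $char) {
    printf("%-16s => '%s'\n", $name, $char);
}
```

Read as characters, the values come out as F, f, M, m, ?, C, a space and E.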

Another method, isNick, checks if a name is a nickname or alias for another name. This makes sense in cases like Bob vs Robert or Dick vs Richard, but can it really scale past these predictable English values? The method is doubly confusing because it says it returns an array in the signature, whereas the description says it’s a boolean.

Wrong description of method return type

Finally, the similarNames method will return an array of names similar to the one provided, given the name and a country (if country is omitted, then it compares names across all countries). Does this include aliases? What’s the basis for similarity? Are Mario and Maria similar despite being opposite genders? Or is Mario just similar to Marek? Is Mario similar to Marek at all? There’s no information.
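For a reference point, here is what one common notion of similarity, edit distance, says about those three names. It uses PHP’s built-in levenshtein() function. Nothing in the manual suggests the extension works this way, so treat it purely as a comparison; the helper nameDistances() is hypothetical.

```php
<?php
/**
 * Pairwise Levenshtein distances between names, compared case-insensitively.
 * Keys look like "Mario/Maria"; values are the number of single-character edits.
 */
function nameDistances(array $names): array
{
    $names = array_values($names);
    $distances = [];
    $count = count($names);

    for ($i = 0; $i < $count; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            $key = $names[$i] . '/' . $names[$j];
            $distances[$key] = levenshtein(
                strtolower($names[$i]),
                strtolower($names[$j])
            );
        }
    }

    return $distances;
}

print_r(nameDistances(['Mario', 'Maria', 'Marek']));
// Mario/Maria => 1, Mario/Marek => 2, Maria/Marek => 2
```

By that measure, Mario is closer to Maria than to Marek, which is exactly the kind of result that would need a documented policy on gender.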

I just had to find out for myself, so I installed it and tested the thing.

Installation

I tested this in an isolated environment via Homestead Improved with PECL pre-installed.

sudo pecl install gender
echo "extension=gender.so" | sudo tee /etc/php/7.1/mods-available/gender.ini
sudo phpenmod gender
pear run-scripts pecl/gender

The last command will ask where to put a dictionary. I assume this is there for the purposes of extending it. I selected ., as in “current folder”. Let’s try it out by making a simple index.php file with the example content from above and testing that first.

Milene is female in France

Sure enough, it works. Okay, let’s change the country to $country = Gender::CROATIA;.

Milene is female in Croatia

Okay, sure, it’s not a common name, and not with that spelling, but it’s most similar to Milena, which is a female name in Croatia. Let’s see what’s similar to Milena via similar.php:

<?php
namespace Gender;

$gender = new Gender;
$similar = $gender->similarNames("Milena", Gender::CROATIA);

var_dump($similar);

Milena has no similar names?

Not what I expected. Let’s see the original, Milene.

Milene has odd similarities

So Milena is listed as a name similar to Milene, but Milene isn’t similar to Milena? Additionally, there seem to be some encoding issues on two of them. And since the Croatian alphabet doesn’t even have the letter “y”, we definitely have neither of those similar names, regardless of what’s hiding under the question mark.

Okay, let’s try something else. Let’s see if Bob is an alias of Robert in alias.php:

<?php
namespace Gender;
$gender = new Gender;
var_dump($gender->isNick('Bob', 'Robert', Gender::USA));

Bob is an alias of Robert

Indeed, that does seem to be true. Low-hanging fruit, though. Let’s see a local one.

var_dump($gender->isNick('Tea', 'Dorotea', Gender::CROATIA));

Tea is not an alias of Dorotea

Oh come on.

What about the Mario / Maria / Marek issue from the beginning? Let’s see similarities for them in order.

Mario is similar to himself and a misencoded version of himself
Maria is similar to herself and a misencoded version of herself
Marek is similar only to himself

Not good.

A couple more tries. To make testing easier, let’s change the $name and $country lines in index.php to:

$name = $argv[1];
$country = constant(Gender::class.'::'.strtoupper($argv[2]));

Now we can test from the CLI without editing the file.
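One caveat with this shortcut: constant() fails if the requested constant doesn’t exist (an Error in PHP 8, a warning in PHP 7). A slightly more defensive variant checks with defined() first. This is a sketch only: resolveCountry() is a hypothetical helper, and FakeGender is a stand-in class with made-up values so the example runs without the extension.

```php
<?php
// Stand-in for Gender\Gender so the sketch runs without the extension.
// The values here are made up.
final class FakeGender
{
    const FRANCE = 1;
    const CROATIA = 2;
}

/**
 * Hypothetical helper: resolve a country name to a class constant,
 * returning null instead of failing when the constant is missing.
 */
function resolveCountry(string $class, string $country): ?int
{
    $constant = $class . '::' . strtoupper($country);

    return defined($constant) ? constant($constant) : null;
}

var_dump(resolveCountry(FakeGender::class, 'croatia')); // int(2)
var_dump(resolveCountry(FakeGender::class, 'tunisia')); // NULL
```

With the extension installed, the same helper could be called as resolveCountry(\Gender\Gender::class, $argv[2]), with a friendly message printed when it returns null.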

Final few tries. I have a female friend from Tunisia called Manel. I would assume her name would be read as male in most of the world because it ends with a consonant. Let’s test hers and some other names.

Cannot find Tunisia

No Tunisia? Maybe it just isn’t documented in the manual. Let’s output all the defined constants and check.

<?php
// constants.php

$oClass = new ReflectionClass(Gender\Gender::class);
var_dump($oClass->getConstants());

Missing countries in list

No, looks like those docs are spot on. At this point, I stopped playing around with this tool.


The whole situation is made even more interesting by the fact that this is a simple class, and definitely doesn’t need to be an extension. No one will call this often enough to care about the performance boost of an extension vs. a package, and a package can be installed by non-sudo users, and people can contribute to it more easily.

How this extension, which is both inaccurate and incomplete and could be a simple class, ended up in the PHP manual is unclear. It goes to show that there’s still a lot of cleaning up to be done in the PHP core (I count the manual as part of the “core”) before PHP’s reputation can improve. In the 9 years (nine!) since development on this port started, not even all countries have been added to the internal list, and yet someone decided this extension should be in the manual.

Do you have more information about this extension? Do you see a point to it? Which other oddball extensions or built-in features did you find in the manual or in PHP in general?
