In contrast to our “mainstream” paid course about exploring PHP, here I like to explore the weird and forgotten areas of the language.
Recently, I ventured into a section of the PHP manual which lists extensions that are used to help with Human Language and Character Encoding. I had never looked at them as a whole – while dealing with gettext, for example, I always kind of landed directly on it and ignored the rest. Well, of those others, there’s one that caught my eye – especially in this day and age given the various controversies – the Gender extension.
This extension, in short, tries to guess the gender of first names. As its introduction says:
Gender PHP extension is a port of the gender.c program originally written by Joerg Michael. The main purpose is to find out the gender of firstnames. The current database contains >40000 firstnames from 54 countries.
This is interesting beyond the fact that the author is kinda called George Michael. In fact, there are many aspects of this extension that are quite baffling.
While its last stable release was in 2015, the extension uses namespaces, which clearly indicates that it’s not some long-lost remnant of the past – a relatively recent effort was made to make it conform to modern coding standards. Even the example code uses namespaces:
<?php
namespace Gender;

$gender = new Gender;
$name = "Milene";
$country = Gender::FRANCE;
$result = $gender->get($name, $country);
$data = $gender->country($country);

switch($result) {
    case Gender::IS_FEMALE:
        printf("The name %s is female in %s\n", $name, $data['country']);
        break;
    case Gender::IS_MOSTLY_FEMALE:
        printf("The name %s is mostly female in %s\n", $name, $data['country']);
        break;
    case Gender::IS_MALE:
        printf("The name %s is male in %s\n", $name, $data['country']);
        break;
    case Gender::IS_MOSTLY_MALE:
        printf("The name %s is mostly male in %s\n", $name, $data['country']);
        break;
    case Gender::IS_UNISEX_NAME:
        printf("The name %s is unisex in %s\n", $name, $data['country']);
        break;
    case Gender::IS_A_COUPLE:
        printf("The name %s is both male and female in %s\n", $name, $data['country']);
        break;
    case Gender::NAME_NOT_FOUND:
        printf("The name %s was not found for %s\n", $name, $data['country']);
        break;
    case Gender::ERROR_IN_NAME:
        echo "There is an error in the given name!\n";
        break;
    default:
        echo "An error occurred!\n";
        break;
}
While we have this code here, let’s take a look at it.
Some really confusing constant names in there – how does a name contain an error? What’s the difference between unisex and couple names? Digging deeper, we see some more curious constants.
For example, the class has short names of countries as constants (e.g. BRITAIN) which reference an array containing both an international code for the country (UK) and the full country name (GREAT BRITAIN).
$gender = new Gender\Gender;
var_dump($gender->country(Gender\Gender::BRITAIN));
array(2) {
'country_short' =>
string(2) "UK"
'country' =>
string(13) "Great Britain"
}
Only, UK isn’t the international code one would expect here – it’s GB. Why they chose this route rather than rely on an existing package of geonames or even just an accurate list of constants is anyone’s guess.
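If an application already stores ISO 3166-1 alpha-2 codes, one possible workaround is a small lookup table that translates them to the extension's constant names. The following is only a sketch: `ISO_TO_GENDER_CONSTANT`, `genderConstantNameForIso()` and `genderCountryForIso()` are hypothetical names of my own, not part of the extension, and the table lists just the four countries that appear in this article.

```php
<?php
// Hypothetical lookup table: ISO 3166-1 alpha-2 code => Gender class constant name.
// Only the countries mentioned in this article are listed; extend as needed.
const ISO_TO_GENDER_CONSTANT = [
    'GB' => 'BRITAIN',
    'FR' => 'FRANCE',
    'HR' => 'CROATIA',
    'US' => 'USA',
];

/**
 * Returns the name of the Gender\Gender constant for an ISO code,
 * or null when the code is not in the table.
 */
function genderConstantNameForIso(string $isoCode): ?string
{
    return ISO_TO_GENDER_CONSTANT[strtoupper($isoCode)] ?? null;
}

/**
 * Resolves an ISO code to the constant's integer value.
 * Returns null if the code is unknown or the extension is not loaded.
 */
function genderCountryForIso(string $isoCode): ?int
{
    $name = genderConstantNameForIso($isoCode);
    if ($name === null) {
        return null;
    }
    $fqn = 'Gender\\Gender::' . $name;
    return defined($fqn) ? constant($fqn) : null;
}

var_dump(genderConstantNameForIso('gb')); // string(7) "BRITAIN"
var_dump(genderConstantNameForIso('tn')); // NULL
```

Keeping the table in userland means the UI can use standard codes while only this one function needs to know about the extension's own naming.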
Once in use, the class uses the get method to return the gender of a name, provided we’ve given it the name and the country (optional – searches across all countries if omitted). But the country has to be the constant of the class (so you need to know it by heart or use their values when adding it to the UI because it won’t match any standard country code list) and it also returns an integer – another constant defined in the class, like so:
const integer IS_FEMALE = 70 ;
const integer IS_MOSTLY_FEMALE = 102 ;
const integer IS_MALE = 77 ;
const integer IS_MOSTLY_MALE = 109 ;
const integer IS_UNISEX_NAME = 63 ;
const integer IS_A_COUPLE = 67 ;
const integer NAME_NOT_FOUND = 32 ;
const integer ERROR_IN_NAME = 69 ;
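At first glance there’s no rhyme or reason to any of these values, although each one happens to be the ASCII code of a character (70 is F, 77 is M, 63 is ?), so they are probably single-character codes carried over from the original C program.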
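A quick way to see what those integers correspond to is to run them through PHP's built-in `chr()`. The sketch below needs no extension: the values are copied by hand from the list above, and the idea that they are leftover character codes from gender.c is my assumption, not something the manual states.

```php
<?php
// Integer values copied from the constant list in the manual.
$resultCodes = [
    'IS_FEMALE'        => 70,
    'IS_MOSTLY_FEMALE' => 102,
    'IS_MALE'          => 77,
    'IS_MOSTLY_MALE'   => 109,
    'IS_UNISEX_NAME'   => 63,
    'IS_A_COUPLE'      => 67,
    'NAME_NOT_FOUND'   => 32,
    'ERROR_IN_NAME'    => 69,
];

// chr() turns each integer into its single-byte ASCII character;
// array_map() keeps the constant names as keys.
$decoded = array_map('chr', $resultCodes);

foreach ($decoded as $name => $char) {
    printf("%s => '%s'\n", $name, $char);
}
```

The characters come out as F, f, M, m, ?, C, a space and E, in that order, which reads like upper case for a definite result and lower case for a "mostly" result.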
Another method, isNick, checks if a name is a nickname or alias for another name. This makes sense in cases like Bob vs Robert or Dick vs Richard, but can it really scale past these predictable English values? The method is doubly confusing because it says it returns an array in the signature, whereas the description says it’s a boolean.
Finally, the similarNames method will return an array of names similar to the one provided, given the name and a country (if country is omitted, then it compares names across all countries). Does this include aliases? What’s the basis for similarity? Are Mario and Maria similar despite being opposite genders? Or is Mario just similar to Marek? Is Mario similar to Marek at all? There’s no information.
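For a point of reference, here is what a plain edit-distance measure says about those three names. This is only a baseline using PHP's built-in `levenshtein()`; I have no evidence that the extension computes similarity this way.

```php
<?php
// Edit distance between each pair of names from the paragraph above.
// levenshtein() is a PHP built-in; it is NOT known to be what the extension uses.
$pairs = [
    ['Mario', 'Maria'],
    ['Mario', 'Marek'],
    ['Maria', 'Marek'],
];

foreach ($pairs as [$a, $b]) {
    printf("%s / %s: %d\n", $a, $b, levenshtein($a, $b));
}
// Mario / Maria: 1
// Mario / Marek: 2
// Maria / Marek: 2
```

By this measure Mario is closer to Maria than to Marek, so a purely spelling-based notion of similarity would happily cross genders.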
I just had to find out for myself, so I installed it and tested the thing.
Installation
I tested this in an isolated environment via Homestead Improved with PECL pre-installed.
sudo pecl install gender
echo "extension=gender.so" | sudo tee /etc/php/7.1/mods-available/gender.ini
sudo phpenmod gender
pear run-scripts pecl/gender
The last command will ask where to put a dictionary. I assume this is there for the purposes of extending it. I selected ., as in “current folder”. Let’s try it out by making a simple index.php file with the example content from above and testing that first.
Sure enough, it works. Okay, let’s change the country line to $country = Gender::CROATIA; instead.
Okay, sure, it’s not a common name, and not in that format, but it’s most similar to Milena, which is a female name in Croatia. Let’s see what’s similar to Milena via similar.php:
<?php
namespace Gender;
$gender = new Gender;
$similar = $gender->similarNames("Milena", Gender::CROATIA);
var_dump($similar);
Not what I expected. Let’s see the original, Milene.
So Milena is listed as a name similar to Milene, but Milene isn’t similar to Milena? Additionally, there seem to be some encoding issues in two of them. And the Croatian alphabet doesn’t even have the letter “y”, so we definitely have neither of those similar names, regardless of what’s hiding under the question mark.
Okay, let’s try something else. Let’s see if Bob is an alias of Robert in alias.php:
<?php
namespace Gender;
$gender = new Gender;
var_dump($gender->isNick('Bob', 'Robert', Gender::USA));
Indeed, that does seem to be true. Low hanging fruit, though. Let’s see a local one.
var_dump($gender->isNick('Tea', 'Dorotea', Gender::CROATIA));
Oh come on.
What about the Mario / Maria / Marek issue from the beginning? Let’s see similarities for them in order.
Not good.
A couple more tries. To make testing easier, let’s change the $name and $country lines in index.php to:
$name = $argv[1];
$country = constant(Gender::class.'::'.strtoupper($argv[2]));
Now we can test from the CLI without editing the file.
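One caveat with that shortcut: `constant()` complains (a warning or an Error, depending on the PHP version) when the requested constant does not exist, so a typo in the country argument kills the script. A slightly more defensive sketch checks with `defined()` first. Here `resolveClassConstant()` is a hypothetical helper of my own and `CountryStub` is a stand-in class with placeholder values, used only so the example runs without the extension installed.

```php
<?php
/**
 * Looks up a class constant by name without triggering an error when
 * it does not exist. Returns null for unknown names.
 * (Hypothetical helper, not part of the Gender extension.)
 */
function resolveClassConstant(string $class, string $name)
{
    $fqn = $class . '::' . strtoupper($name);
    return defined($fqn) ? constant($fqn) : null;
}

// Stand-in class with placeholder values so the sketch runs anywhere;
// in index.php you would pass Gender::class instead.
final class CountryStub
{
    const CROATIA = 1;
    const FRANCE = 2;
}

var_dump(resolveClassConstant(CountryStub::class, 'croatia')); // int(1)
var_dump(resolveClassConstant(CountryStub::class, 'tunisia')); // NULL
```

In index.php the null case can then be turned into a readable message such as "unknown country" instead of a stack trace.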
Final few tries. I have a female friend from Tunisia called Manel. I would assume her name would go for male in most of the world because it ends with a consonant. Let’s test hers and some other names.
No Tunisia? Maybe it just isn’t documented in the manual. Let’s output all the defined constants and check.
<?php
// constants.php
$oClass = new ReflectionClass(Gender\Gender::class);
var_dump($oClass->getConstants());
No, looks like those docs are spot on. At this point, I stopped playing around with this tool.
The whole situation is made even more interesting by the fact that this is a simple class, and definitely doesn’t need to be an extension. No one will call this often enough to care about the performance boost of an extension vs. a package, and a package can be installed by non-sudo users, and people can contribute to it more easily.
How this extension, which is both inaccurate and incomplete, and could be a simple class, ended up in the PHP manual is unclear, but it goes to show that there’s a lot of cleaning up to be done yet in the PHP core (I include the manual as the “core”) before we get PHP’s reputation up. In the 9 years (nine!) since development on this port started, not even all countries have been added to the internal list and yet someone decided this extension should be in the manual.
Do you have more information about this extension? Do you see a point to it? Which other oddball extensions or built-in features did you find in the manual or in PHP in general?
Frequently Asked Questions about PHP’s Gender Extension
What is the PHP Gender Extension?
The PHP Gender Extension is a PECL extension that tries to guess the gender of first names. It is a port of the gender.c program originally written by Joerg Michael, and its database contains more than 40,000 first names from 54 countries, along with their associated genders.
How do I install the PHP Gender Extension?
To install the PHP Gender Extension, you need to use the PECL extension installation command. This command is “pecl install gender”. After running this command, you should add “extension=gender.so” to your php.ini file. Remember to restart your server after making these changes.
How does the PHP Gender Extension work?
The PHP Gender Extension works by comparing the input name with its database of names and their associated genders. It then returns the gender associated with that name. The extension can also return a status of ‘unisex’ if the name is commonly used for both males and females.
What is the difference between IS_UNISEX_NAME and the gendered results in PHP’s Gender Extension?
These are class constants returned by the get method, not functions. IS_UNISEX_NAME means the name is used for both males and females, while IS_FEMALE, IS_MOSTLY_FEMALE, IS_MALE and IS_MOSTLY_MALE mean the name is associated, fully or predominantly, with one gender.
Can the PHP Gender Extension handle non-English names?
Yes, the PHP Gender Extension can handle non-English names. The gender.c library, which the extension uses, contains a database of names from various countries and cultures. However, the accuracy of gender determination may vary depending on the name’s origin.
Is the PHP Gender Extension always accurate?
While the PHP Gender Extension is a powerful tool, it’s not always 100% accurate. The accuracy depends on the comprehensiveness of the database and the cultural context of the name. For instance, a name considered unisex in one culture might be predominantly male or female in another.
How can I use the PHP Gender Extension to personalize user experiences?
You can use the PHP Gender Extension to personalize user experiences by tailoring content based on the user’s presumed gender. For example, you could use it to send personalized emails or display gender-specific content on your website.
Are there any ethical considerations when using the PHP Gender Extension?
Yes, there are ethical considerations when using the PHP Gender Extension. It’s important to remember that gender identity is complex and personal. Therefore, it’s crucial to use this tool responsibly and considerately, avoiding assumptions or stereotypes.
Can I contribute to the PHP Gender Extension’s database?
Currently, there’s no official way to contribute to the gender.c library used by the PHP Gender Extension. However, you can always contribute to the PHP community by sharing your experiences, insights, and suggestions on various forums and platforms.
Is the PHP Gender Extension available in all versions of PHP?
The PHP Gender Extension is not a built-in feature and needs to be installed separately. It’s compatible with PHP versions 5.3.0 and above. Always check the official PHP documentation for the most accurate and up-to-date information.
Bruno is a blockchain developer and technical educator at the Web3 Foundation, the foundation that's building the next generation of the free people's internet. He runs two newsletters you should subscribe to if you're interested in Web3.0: Dot Leap covers ecosystem and tech development of Web3, and NFT Review covers the evolution of the non-fungible token (digital collectibles) ecosystem inside this emerging new web. His current passion project is RMRK.app, the most advanced NFT system in the world, which allows NFTs to own other NFTs, NFTs to react to emotion, NFTs to be governed democratically, and NFTs to be multiple things at once.