DataMapper vs ActiveRecord

I know this is an old debate and ActiveRecord does have issues, which is why I was looking at switching to datamapper.

The problem is, there doesn’t seem to be any easy (efficient) way to get data with joins.

With my AR class, I can do:


$user = new User(123);
echo $user->blogs[4]->title;

and it will run the query:


SELECT *
FROM user
Inner Join blogs ON blogs.userId = user.id
WHERE user.id = 123
LIMIT 1 offset 4

(which gets performed at array access)

Or I for 1:1 mappings I can do unlimited chaining which results in a single query:


$blog = new Blog(21);
$blog->user->address->country();

Which automatically sorts out all the joins.

This produces readable, efficient code.

With 1 query, all the chained objects are filled out, and the query starts from the last filled out object, and it will ONLY join the table requested. With datamapper it looks like it would fetch all the related tables (looking at http://phpdatamapper.com/documentation/usage/table-relationships/ as an example)

For example, my user table joins to about 20 other tables. Joining all these every time I wanted a piece of information about the user would be horribly inefficient.

How can I do the same with datamapper? Is it even possible?

If not I’ll have to stick with AR, and I don’t know how anything other than a small system could use DataMapper…

First off - thanks for considering phpDataMapper as a possible solution. It has been a constant work-in-progress, and is still in beta for now.

Second, I have to say that I disagree with the approach of building a single query with multiple joins. This comes with the inherent assumption that the data source is the same for all your related data (MySQL or some other SQL-based RDBMS).

The approach I am taking, v1.0 release included (currently the dev/new branch), is that each relationship is handled by it’s own mapper. This results in a few more queries, but keeps the design much simpler and decoupled. It also opens up the possibility of relating data in your RDBMS to data from a NoSQL database, XML, or even a remote webservice.

Right now the main issue with the beta code is the N+1 query problem, which I am working to solve in the v1.0 release that I am working on pretty much every day now.

I don’t have an issue with the assumption that all the data comes from the same source, as this is the case in every project I’ve ever worked with. It is one of the problems with AR too. (and tehnically i could work around it because my AR class does obviously have an end-point where it fetches the data, then more data can be fetched too)

The issue I have with the approach of “a few more queries” is that over the course of the application it’s a lot more queries. It just seems incredibly inefficient. If you’re doing one query instead of one join that’s already a LOT of queries in some cases.

Unfortunately DB performance is not a sacrifice I can justify, as at least for one of my larger clients who run their entire business systems through their website, the load on their database server is often a bottleneck due to lots of large reporting queries, and being a very busy site with hundreds of thousands of page views per day (all of which interact with the database in some way). I could do it this way for other clients who have smaller sites, but I’d rather all my clients share the same code base.

Unless a DataMapper can give a similar level of performance I really can’t justify using it.

It depends on what “a few more queries” is actually doing, and the real performance cost to do so vs. a join. I work with databases every day that have millions of records per table, and most of the time additional queries are actually much faster than joins, especially with sorting. The number of queries is irrelevant - the performance and execution time of those queries is what you pay attention to.

I would fully expect a separate query to be issued for retrieving a user’s blog posts - they may be in an entirely separate database on a different network attached server. It makes no sense to join the two, because they are entirely different data sets or entirely different types.

Databases are always the bottleneck. You optimize the slow queries by whatever means you can (adding indexes, re-arranging joins, hand-tuning queries, etc), and add multiple caching layers to your application and hardware setup. This by nature is a manual process. No ORM will solve these issues for you, joins or not.

If you don’t think it will, then just don’t use it. Obviously what you have works for you and suits your needs. Why the arguments?

I agree, for complex queries which deal with a lot of data you have to use plain SQL if only for the sake of optimisation.

But for standard CRUD queries, where you’re generally dealing with less tables and far fewer records I’d rather use an ORM. It keeps the code cleaner and makes refactoring much easier. Performance is slightly less, but when you’re dealing with tiny data set’s it not nearly as much of an issue.

If you don’t think it will, then just don’t use it. Obviously what you have works for you and suits your needs. Why the arguments?

I suppose you’re right, but I do strive to improve my code and make it as good as possible. AR works, but there are flaws with it.

I used to think that the less queries the better and always tried to fetch as much data as possible with joins (MySQL). But after some time I observed that join performance is poorly optimized in many cases - at least in MySQL - and sending separate queries is faster. Sometimes a join will require setting up an additional index to run efficiently (which wouldn’t be needed otherwise) but sometimes even that doesn’t help.

I used to think that the less queries the better and always tried to fetch as much data as possible with joins (MySQL). But after some time I observed that join performance is poorly optimized in many cases - at least in MySQL - and sending separate queries is faster. Sometimes a join will require setting up an additional index to run efficiently (which wouldn’t be needed otherwise) but sometimes even that doesn’t help.

For starters, most developers use WHERE to perform JOIN’s which causes a performance hit on it’s own. Second, Any query with more than 2 JOIN’s is almost always over complicated, if you don’t mind writing a little more aggregate/constraint code at the PHP level, which is all round better for everyone.

I always write single queries, faster, easier to maintain and just plain sexier. :stuck_out_tongue:

Cheers,
Alex

When you’re grouping on columns from more than one table a single query is always going to be faster simply due to less data being passed around.

As for “Any query with more than 2 JOIN’s is almost always over complicated”… I hugely disagree for anything more than incredibly simple systems. I used to work at a marketing agency, one part of which was online surveys.

To display the result of a single survey, all these tables are needed:
-survey
-client [who’s running the survey] (joined to survey)
-user [the user who filled in the survey] (joined to survey)
-survey question (joined to survey)
-survey question answers [the possible answers for each question] (joined to survey question)
-user answers [the answers for each question by the user] (joined to user and question answers)

Are you honestly trying to say it would be better to do all the grouping and filtering in PHP?

Then there’s the ones which aggregated survey results for use in generating charts…

With data mapper we refer to “finders” and “load” methods. Finders query some database and get an array of data back from the database…

A finder method uses a loader method. A loader method basically takes a “blank” instance of your domain Model object, and the array from the finder. The loader then sets up the object with the data from the array.

Also in my implementation of data mapper I use “mappings”. By default each mapping does it’s own loading. You can set up all sorts of “relations” which does the equivalent of your first example. By the way this is called ripple loading.

In real world of using ORM you let the ripple loading happen, as you profile your app with real data you find the bottle necks and write finders for them (that would presumably do a JOIN instead of doing a SELCT statement on each object). Multiple cardinality joins are a bit more confusing and best described in POEAA.

Ideally all types of SQL should be encapsulated within a finder, the rest of your application should exclusively use a ‘finder’ method. The idea is all sql for a given module is contained within it’s mappers.

Most of the use I have found from data mapper has been from doing embedded value. Eg. I have a ‘concept’ like salary that could use it’s own object, but would not have it’s own database table since the ‘amount’ could be any arbitrary value. It is concepts like these that call for database schema to differ from the object schema.

The only time I ever had any performance issues is when I experimented having things be joined as the default strategy. I quickly went back to ripple loading as a default strategy. It got out of control because it was joining stuff you didn’t need. The ripple loading is not bad to deal with it all, and since it is data mapper, if it ever does become a problem I can just override the finder.

Another place data mapper really helped is doing stuff related to EAV for example. With data mapper its cool because the sky is the limit. If you want to use the value of a certain row’s field to derive the name of the class to instantiate, its no problem… just override the load() method. With active Record thats a little more difficult since you already choose the class name before you even executed the query :wink: Just one example of active record’s limitations

And the down side, you have an extra layer in the application. Everytime you need to add a new CRUD screen you have to write a mapper.

Grouping and filtering are operations of the database, so no - those do not belong in PHP. But I still disagree with your JOINS and your point that it’s faster. It’s not. Have you benchmarked it?

The main problem here to me is that you’re dumping completely different data sets together in the same bucket in the name of optimization. You’re thinking like a programmer instead of a DBA. You’re thinking about all the aggregate data you need instead of thinking of the different sets of data you need.

Consider this scenario:

  • Sort the questions by order set by admin (fairly obvious here)
  • Sort the answers to those questions randomly to avoid bias

With a JOIN, that scenario above will take FOREVER because random sorting in SQL sucks and is horribly inefficient and nasty looking, especially with large data sets. With separate queries, you can just array_shuffle the result set you get back for the answers and you’re done. More queries yes - but much faster and much easier both to build and to understand.

Don’t keep getting hung up on this whole “number of queries” concept. It’s completely irrelevant. Simple will always be fastest.

Agreed. I have had a 2 table join bring down a server. I have also had large #s of queries take down servers. It all comes down to what is involved. How well your indexes are, etc…

I can exceed 1,000 queries per page and still have sub-second response times. Mainly because they are all simple select statements using a primary key.

A join is called for in 2 places

  1. you have identified a real performance bottleneck that must be fixed. You have measured and are sure these queries are causing the slow load and must be joined
  2. you write procedural “top down” code, and you write stuff like $sql1, $sql2, $result1, $result2. In this case a JOIN is used to clean up the legacy code, but NOT for performance reasons

It is conceptually more difficult to write joins, especially one involving many tables. Also the more orthogonal your system is the more joins there could be. In some systems if you try to use joins everywhere you’d be creating 100s of times more joins then there are screens in the application. Kind of pointless if there is no performance issue to start with.

Ok ignoring query performance, with AR, I can do unlimited chaining:

For example:


$user = $this->data->createRecord('User', $id);
echo $user->orders[0]->products[0]->manufacturer->address->country;

Which produces readable, concise code.

Whether this is doing joins, or feching each record one at a time is irrelevant.

How do I achieve the same with DataMapper?

If it was:


$user = $this->userMapper->getById($id);
echo $user->orders[0]->products[0]->manufacturer->address->country;

This means the mapper, when it does the query has to fetch pretty much the entire database into the user object. It needs all the users orders, all the products related to all the users orders, all the manufacturers associated with those products, etc.

Obviously that’s feasible in the real world and would eat memory fast.

It seems to me the only way to do this with DataMapper is:


$user = $this->userMapper->getById($id);
$order = $this->orderMapper->getByUser($user->id)[0];
$product = $this->productMapper->getByOrder($order->id)[0];
$manufacturer = $this->manufacturerMapper->getById($product->manufacturerId);
$address = $this->manufacturerAddressMapper->getById($manufacturer->addressId);
echo $address->country;

Is this correct or am I missing something?

That produces horrible verbose code, which is clearly a OO interface to a relational back end.

Thanks :slight_smile:

Here’s the difference between the 2 patterns.

Data Mapper gives you a seam for where you can change the way the loading strategy works

Active Record would give you an “all or nothing” approach (If I change the load() method on a class it changes the way that class is loaded EVERYWHERE).

With data mapper I can use different strategies for different scenarios.

====================================================

Both patterns may implement lazy loading. You would only instantiate the mapper for the “root” object and could then traverse it’s related objects. This root object’s mapper gives you the encapsulation that you’d want to be able to go back and change the way the mapping for it’s component objects are wired up. Therefore a fluent “Active Record-ish” interface to it’s component objects is not a bad thing. Even though it seems like using an Active Record it is not because the root object’s mapper may decide to force a new loading strategy at any time.

Fowler outlines multiple lazy load patterns. The one my data mapper implementation uses is the one where I wrap a null object in a data aware “proxy wrapper”. When you call any of the methods on that wrapper it goes behind the scenes, finds the right mapper and calls the finder on it.

Now heres what you CANT do with Active Record. With Active Record you can’t go back and change that so it’s a join. Lets say you had your nice lazy loading traversal, right… what do you do when that slows to a halt from rippling?

While both patterns allow you to use a JOIN instead of using the lazy loading, it is more natural with data mapper (thats just my opinion).

See following for example:

<?php
class Shuffler_Collection_Lazy extends Shuffler_Collection
{
    /** @var Shuffler_Mapper_Mapping_Collection */
    protected $mapping;
    
    /** @var Shuffler_Model */
    protected $model;
    
    /** @var bool wether this lazycollection has been loaded ( initialized ) */
    protected $loaded = false;
    
    /** @var count of records */
    protected $count;
    
    /** @var integer */
    protected $added_before_load;
    
    protected $db_count;
    
    /**
    * @param Shuffler_Mapper_Mapping_Collection
    * @param Shuffler_Model
    */
    public function __construct( Shuffler_Mapper_Mapping_Collection $mapping, Shuffler_Model $model )
    {
        parent::__construct( array(), $mapping->getClass() );
        $this->mapping = $mapping;
        $this->model = $model;
    }
    
    /**
    * Adds a certain model from the collection
    * @param Shuffler_Model to add
    */
    public function addModel( Shuffler_Model $model )
    {
        if( !$this->loaded && !$this->hasModel( $model ) )
        {
            $this->added_before_load++;
        }
        return parent::addModel( $model );
    }
    
    /** @return ArrayIterator */
    public function getModels()
    {
        $this->load();
        return parent::getModels();
    }
    
    /**  @return bool */
    public function contains( $compare )
    {
        $this->load();
        return parent::contains( $compare );
    }
    
    /**
    * Get the model at a specific index in the results array
    * 
    * @param integer index
    * 
    * @return Shuffler_Model
    */
    public function getModelAtIndex( $index )
    {
        $this->load();
        return parent::getModelAtIndex( $index );
    }
    
    /**
    * @return Shuffler_Collection
    */
    public function getModel()
    {
        return $this->model;
    }
    
    /**
    * Removes a certain model from the collection
    * @param Shuffler_Model to remove
    */
    public function removeModel( Shuffler_Model $model )
    {
        $this->load();
        return parent::removeModel( $model );
    }
    
    /**
    * Get # of models this collection currently has
    * 
    * @return int count
    */
    public function count()
    {
        if( false === $this->loaded && isset( $this->count ) )
        {
            return $this->added_before_load + $this->count;
        }
        if( false === $this->loaded )
        {
            return $this->added_before_load + $this->getDbCount();
        }
        else
        {
            $this->count = parent::count();
        }
        return $this->count;
    }
    
    public function load()
    {
        unset($this->count);
        if( false === $this->loaded )
        {
            $this->loaded = true; 
            $this->merge( $this->mapping->loadCollection( $this->getModel() ) );
        }
    }
    
    protected function getDbCount()
    {
        if( !isset( $this->db_count ) )
        {
            $this->db_count = $this->mapping->countCollection( $this->getModel() );
        }
        return $this->db_count;
    }
}

This class just wraps my normal collections. Instead of loading all the data when a collection is needed, this “lazy” one is injected instead. When you try to call any of its methods it triggers the loading with $this->load() which loads the models in that collection.

A more efficient way, like you said would be to load only $orders[0] instead of the whole $orders array. At least this way you aren’t loading every order’s item, and each of those item’s customers, and all those customers orders, and all those order’s items, and all those items customers, ad infintum :wink:

The “count” logic is a little complex because one of it’s features is that I can add models to it before it has done its lazy loading, then later when It does its lazy loading it takes into account both models that I added before load, and the models it pulled from the database post load.

Of course if it ever did become a problem I could override specific finders as needed, to aggressively JOIN instead of lazily inject these ghost collections.

If your active record framework uses “finders” and supports these features, you should ask yourself if the framework’s maintainers know what patterns they are using :wink: Nothing wrong with a hybrid approach but there are some Apples out there pretending to be Oranges if you get what I mean, and vice versa

Pretty much the exact same way. I don’t see how there would need to be a difference.

You’re making assumptions about what a DataMapper can do and what it can’t. Nothing says a DataMapper couldn’t return a relation object that implements ArrayAccess that would just return the ‘0’ indexed row that you asked for. It would not have to (and shouldn’t) return the whole set.

And again with the joins - Yes, the DM in this case would be executing 5 queries. But I would rather take 5 simple queries over a single query with 5 joins any day. Again, the simpler the better. 5 simple queries on indexed columns scales far better than a multi-table join.

That would be one way to do it, yes. But like I said above, nothing is stopping the DataMapper from letting you chain in the same way. It’s all about the specific implementation. With the mapper relations are defined, the loading and chaining would be automatic, just like with your ActiveRecord.

http://phpdatamapper.com/documentation/usage/table-relationships/

Czaries, I agree with what you wrote, but the link you pasted is not data mapper. It shows the fields/mappings are being set up on the model.

Reminds me of the MVC thread someone posted where the view and controller were in the same file (calling something somethings its not). The idea of data mapper is not to change the interface you use, its to uncouple the model from the persistence framework :wink: Your solution gives me a $mapper object to use but the mappings are still coded into the Model’s class, making impossible to have multiple mappers per model. Not the point of data mapper. Although a neat idea I do admit.

The goal of data mapper is to be able to “cleanly” re-use your models with different persistence frameworks, different “mappers” within each of those frameworks, different tables, different loading strategies, etc… that with active record would result in an explosion of sub-classes or deeply nested if statements in the loading code.

It’s just an issue of naming. Both of our DataMapper implementations are setup very similarly, but mine only has a Mapper, not a Model + Mapper like yours. In my setup, the ‘phpDataMapper_Model’ is actually the Mapper, and the name has already been updated in the new development branch I have been working on (now ‘phpDataMapper_Base’). I know what a DataMapper is, what it means, and I too follow the Fowler definition.

In my implementation, I just provide the mappers to use - the actual “Model” layer is left completely up to the developer, so there can certainly be multiple mappers per model if the developer chooses. The only current limitation is a 1:1 Mapper-Table mapping, which I see is a limitation of your implementation as well.

True. Fair enough. You can can certainly fault me for bad naming too, lol. Another gripe I have with yours is the term “row object”. There should be no such concept. By the way no download link on your website tsk tsk tsk. Not very “open” :slight_smile:

That has been fixed as well. It’s now an “Entity”. :slight_smile:

And there is a download link, it’s in the getting started section. And yes, I have a lot of work to do both on the website and documentation. I have plans to launch the site on a new platform here in about a month or two that will make all that easier. Using WordPress as a CMS is very limiting.