  1. #1
    SitePoint Zealot
    Join Date
    Feb 2008
    Posts
    109

    Modelling a SERP tracker

    Hi all,

    Bit of a broad question I'm afraid, but I'm looking for some advice on how to go about modelling a search engine results tracker.

    I'm mainly interested in the relationship between collection / item objects and data mappers. I'm assuming that the data in this example can effectively have two sources: direct search engine results pages and local database storage. Is this an appropriate use case for the data mapper pattern?

    How would you go about getting two data sources to interact?

    Currently I'm thinking along the lines of:

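    PHP Code:
    [HIGHLIGHT="PHP"]
    class ResultsScraper {

        public function get(Url $url) {
            $resultsDOM = new DOMDocument();

            // Grab the results page and load it into $resultsDOM

            return new ResultsCollection($resultsDOM);
        }
    }

    class ResultsCollection implements SeekableIterator {

        private $mapper;
        private $nodes;
        private $position = 0;

        public function __construct(DOMDocument $results) {
            $this->mapper = new DOMMapper();

            // Placeholder: select whichever nodes hold the individual results
            $this->nodes = $results->getElementsByTagName('li');
        }

        public function current() {
            return $this->mapper->get($this->nodes->item($this->position));
        }

        public function key() {
            return $this->position;
        }

        public function next() {
            $this->position++;
        }

        public function rewind() {
            $this->position = 0;
        }

        public function valid() {
            return $this->position < $this->nodes->length;
        }

        public function seek($position) {
            if ($position < 0 || $position >= $this->nodes->length) {
                throw new OutOfBoundsException("Invalid position: $position");
            }
            $this->position = $position;
        }
    }

    class DOMMapper {

        public function get(DOMNode $node) {
            $resultObject = new Result();

            // Parse node and map to $resultObject

            return $resultObject;
        }
    }

    class DatabaseMapper {

        public function insert(Result $result) {
            // Parse through $result and add to database

            // Populate a "storage ID" property within the Result object

            return $result;
        }
    }

    // Client code ($url is a Url object for the results page)

    $databaseMapper = new DatabaseMapper();

    $scraper = new ResultsScraper();

    $results = $scraper->get($url);

    foreach ($results as $result) {
        $databaseMapper->insert($result);
    }
    [/HIGHLIGHT]
    Rather crude code I'm afraid, but hopefully it serves as an illustration. Any suggestions / tips / advice appreciated.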

    Cheers,

    DM

  2. #2
    SitePoint Addict
    Join Date
    Feb 2007
    Posts
    251
    What's throwing me is the DOM stuff in the collection class. Why should a collection have that? Why not give the Result object specific properties like $title, $url, $searchEngineId, $searchDate, $appearedOnPage, $atPosition, etc., and then have a mapper populate those properties from a DOM tree or database rows?
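
    A minimal sketch of that suggestion, assuming a Result class with a few of the properties listed above. ResultHydrator, its two methods, the row keys ('title', 'url', 'position') and the markup (one link inside each result node) are all hypothetical and only there to illustrate one object being filled from two sources.

```php
<?php
// Hypothetical domain object: plain properties, no knowledge of DOM or SQL.
class Result {
    public $title;
    public $url;
    public $atPosition;
}

// Hypothetical helper that fills a Result from either source.
class ResultHydrator {

    // Assumes $row uses the keys 'title', 'url' and 'position'.
    public function fromRow(array $row) {
        $result = new Result();
        $result->title = $row['title'];
        $result->url = $row['url'];
        $result->atPosition = (int) $row['position'];
        return $result;
    }

    // Assumes each result node contains a single <a href="..."> link.
    public function fromNode(DOMElement $node, $position) {
        $link = $node->getElementsByTagName('a')->item(0);
        $result = new Result();
        $result->title = trim($link->textContent);
        $result->url = $link->getAttribute('href');
        $result->atPosition = (int) $position;
        return $result;
    }
}

// Build a tiny document standing in for a scraped results page.
$doc = new DOMDocument();
$doc->loadHTML('<ul><li><a href="http://example.com/">Example</a></li></ul>');
$node = $doc->getElementsByTagName('li')->item(0);

$hydrator = new ResultHydrator();
$fromWeb = $hydrator->fromNode($node, 1);
$fromDb = $hydrator->fromRow(array(
    'title' => 'Example',
    'url' => 'http://example.com/',
    'position' => '1',
));
```

    Either way the client code ends up holding the same kind of Result object, and nothing in it reveals where the data came from.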

  3. #3
    SitePoint Addict
    Join Date
    Feb 2007
    Posts
    251
    Actually, after looking over the code again, that's not what confused me. I'm actually rather sick and woozy at the moment, so I'll abandon this thread for the time being.

  4. #4
    SitePoint Zealot
    Join Date
    Feb 2008
    Posts
    109
    Quote Originally Posted by cuberoot View Post
    What's throwing me is the DOM stuff in the collection class. Why should a collection have that? Why not give the Result object specific properties like $title, $url, $searchEngineId, $searchDate, $appearedOnPage, $atPosition, etc., and then have a mapper populate those properties from a DOM tree or database rows?
    That's what I was planning on doing, hence the dilemma over whether the Collection object should be aware of the Mapper.

    As you suggest, I want to maintain a single "domain" object, that contains the properties you mention - this will either be populated from a database or from a DOMDocument (derived from a Web get).

    I don't really want to couple collections with mappers - but I'm not sure how to avoid it efficiently.

    I'd also appreciate any suggestions on managing domain objects once they've been instantiated - if the collection has no knowledge of the mapper how would you go about storing an instantiated object (somewhere) to save loading it again?

    Quote Originally Posted by cuberoot
    Actually, after looking over the code again, that's not what confused me. I'm actually rather sick and woozy at the moment, so I'll abandon this thread for the time being.
    Don't forget to pop back, I'm feeling woozy through confusion! Just kidding, take the day off and feel better soon!

    Cheers,

    DM

  5. #5
    SitePoint Addict
    Join Date
    Feb 2007
    Posts
    251
    Confused, would we?

    I'd like to take the day off, but alas, my supply of sick days is running low. Argh! (And suddenly I'm a pirate?)

    First strategy I can think of is having your mappers return collections. The mappers could also maintain identity maps to track objects in memory. The mapper could simply check the identity map for an object prior to querying the datasource. It would then go about updating the identity map, building the collection, etc.

    But then you run into the problem with lazy loading, since you obviously don't want to load 10,000 objects into memory just to process them one at a time. I'm assuming this is why you've implemented the iterator interface in the collection class.

    However, this really doesn't apply in the case of a DOM document, since the whole thing presumably resides in memory from the get go.

    Also, Result objects coming from scraping sessions differ from results in the database in that they fluctuate with time. Given the same URL, the results are most likely going to be different.

    You can think of the scraper as a type of factory. It instantiates new objects based on some criteria, but it's entirely unrelated to persistence.

    The mapper on the other hand is in charge of persisting new objects and loading them from a persisted state.

    Considering all this, I'd probably have the scraper return a simple array of Result objects and then pass them over to the mapper for persistence. The mapper can dress up the incoming objects however it sees fit.
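
    Roughly what I have in mind, as a sketch rather than a finished design. The in-memory $rows array stands in for a real database table, the scraper takes already-extracted URLs instead of fetching a page, and every class and method name here is an assumption made for the example.

```php
<?php
class Result {
    public $id;  // storage ID, null until the mapper has persisted the object
    public $url;

    public function __construct($url) {
        $this->url = $url;
    }
}

// The scraper acts as a factory: it only creates new, unsaved objects.
class ResultsScraper {

    // Stand-in for real scraping: receives URLs already pulled from a page.
    public function get(array $scrapedUrls) {
        $results = array();
        foreach ($scrapedUrls as $url) {
            $results[] = new Result($url);
        }
        return $results;
    }
}

// The mapper owns persistence and the identity map.
class ResultMapper {
    private $rows = array();        // fake table: id => url
    private $identityMap = array(); // id => Result already in memory
    private $nextId = 1;

    public function insert(Result $result) {
        $result->id = $this->nextId++;
        $this->rows[$result->id] = $result->url;
        $this->identityMap[$result->id] = $result;
        return $result;
    }

    public function find($id) {
        // Check the identity map before touching the data source.
        if (isset($this->identityMap[$id])) {
            return $this->identityMap[$id];
        }
        if (!isset($this->rows[$id])) {
            return null;
        }
        $result = new Result($this->rows[$id]);
        $result->id = $id;
        $this->identityMap[$id] = $result;
        return $result;
    }
}

$scraper = new ResultsScraper();
$mapper = new ResultMapper();

$saved = array();
foreach ($scraper->get(array('http://example.com/a', 'http://example.com/b')) as $result) {
    $saved[] = $mapper->insert($result);
}
```

    The scraper never sees the mapper and the mapper never sees a DOM node, so the only thing the two share is the Result class.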

  6. #6
    SitePoint Zealot
    Join Date
    Feb 2008
    Posts
    109
    Quote Originally Posted by cuberoot View Post
    First strategy I can think of is having your mappers return collections. The mappers could also maintain identity maps to track objects in memory. The mapper could simply check the identity map for an object prior to querying the datasource. It would then go about updating the identity map, building the collection, etc.

    But then you run into the problem with lazy loading, since you obviously don't want to load 10,000 objects into memory just to process them one at a time. I'm assuming this is why you've implemented the iterator interface in the collection class.
    Definitely, this is what I'm planning. If anything, wouldn't the identity map be best kept by the collection?

    Quote Originally Posted by cuberoot View Post
    However, this really doesn't apply in the case of a DOM document, since the whole thing presumably resides in memory from the get go.

    ...

    You can think of the scraper as a type of factory. It instantiates new objects based on some criteria, but it's entirely unrelated to persistence.

    The mapper on the other hand is in charge of persisting new objects and loading them from a persisted state.
    Interesting perspective. I was thinking of the scraping source, i.e. search results, as a form of persistent 'web storage', which is what led me to think it needed its own mapper, albeit a one-way one that only instantiates the Result domain object.

    Quote Originally Posted by cuberoot View Post
    Considering all this, I'd probably have the scraper return a simple array of Result objects and then pass them over to the mapper for persistence. The mapper can dress up the incoming objects however it sees fit.
    This is basically what I intended for the DOMResultCollection outlined in the code, but rather than use an array I simply parsed the scraped HTML into a DOM object and then used the DOMResultCollection, implementing Iterator, to parse out each Result (as a DOM element) when it's needed.

    It's at this point that it gets confusing to me - I have two representations of one domain object effectively i.e. a Result object can be populated from data from a DOMNode, or from a database row. I think I wrongly assumed this meant two mappers?

    Maybe a simple alternative is to populate the Result object from the DOMResultCollection's current() method, so that the collection looks after its own 'mapping' or object creation. It could even implement its own identity map if the collection is looped several times.

    Ok, public ramblings over - does this make even the slightest bit of sense?

    DM

  7. #7
    SitePoint Addict
    Join Date
    Feb 2007
    Posts
    251
    What about this?

    PHP Code:
    $result = new Result('my', 'fake', 'result');
    You can do that, and it doesn't mean you have another representation of your domain model running around hitting old ladies in the head. What I mean is, this seems to be less about representation than it is about instantiation.

    A mapper probably isn't the word to use for getting the results from web pages into the application, since a mapper really is all about persistence, and persistence is a two way street.

    I think your DOMMapper class is really a factory. It just takes a DOM node and translates it into a Result object. This object doesn't really have an identity within the domain model. It could be a throwaway object for all anybody knows.

    Another point to consider is that there is no true mapping between your object and the URL in question, since the content at the URL changes in time. Unless of course you can pass a time parameter in the URL and get the same version of the document.

    Regarding the identity map, until the object has been persisted by the application, it doesn't make much sense to have one. As for where to put it, I'd really try for the mapper first. A collection isn't really about persistence, so the more distance, the better.

    Then again, since this is a factory, you don't really need to worry about tracking identities, right? Or is there a good reason?

  8. #8
    SitePoint Zealot
    Join Date
    Feb 2008
    Posts
    109
    Quote Originally Posted by cuberoot View Post
    What about this?
    PHP Code:
    $result = new Result('my', 'fake', 'result');
    You can do that, and it doesn't mean you have another representation of your domain model running around hitting old ladies in the head. What I mean is, this seems to be less about representation than it is about instantiation.
    I can see that you're right, it is all about instantiation, and the more I think about it the more this looks like what a factory is for. I'm not entirely clear what you mean by "less about representation", though?

    I considered setting up the Result object via the constructor, maybe passing in an array of values, but for some reason it didn't 'feel' right. Sounds a bit irrational thinking back now, lol.

    Quote Originally Posted by cuberoot View Post
    A mapper probably isn't the word to use for getting the results from web pages into the application, since a mapper really is all about persistence, and persistence is a two way street.

    I think your DOMMapper class is really a factory. It just takes a DOM node and translates it into a Result object. This object doesn't really have an identity within the domain model. It could be a throwaway object for all anybody knows.
    Completely agreed on the first point. As you mention, it's not a true two-way mapping relationship, and I didn't really consider the temporal factors.

    How do you mean it's not part of the domain model though? Which part isn't? If you mean the DOM node then sure, it's just a convenient temporary storage format, much like an array.

    Quote Originally Posted by cuberoot View Post
    Regarding the identity map, until the object has been persisted by the application, it doesn't make much sense to have one. As for where to put it, I'd really try for the mapper first. A collection isn't really about persistence, so the more distance, the better.

    Then again, since this is a factory, you don't really need to worry about tracking identities, right? Or is there a good reason?
    I take your point about the collection not being concerned with persistence. I don't really have a good reason or use case, so that's probably a good sign that it's a bad idea!

    When I was thinking of an identity map, I was really just thinking of a way to avoid instantiating result objects multiple times when looping collections. i.e. a reference to the Result object is stored within the collection once instantiated and returned in preference to creating a new one.

    Thanks for your comments, much appreciated. They've really helped me see how I was overcomplicating things.

    Cheers,

    DM

  9. #9
    SitePoint Addict
    Join Date
    Feb 2007
    Posts
    251
    Quote Originally Posted by DangerMouse1981 View Post
    I can see that you're right, it is all about instantiation, and the more I think about it the more this looks like what a factory is for. I'm not entirely clear what you mean by "less about representation", though?
    It was related to something you said previously...

    Quote Originally Posted by DangerMouse1981
    It's at this point that it gets confusing to me - I have two representations of one domain object effectively i.e. a Result object can be populated from data from a DOMNode, or from a database row.
    I was just saying that the records and nodes aren't "true" representations of the domain object. The representation is the Result object itself, which is obviously one thing, although you can instantiate it in a variety of ways.

    Quote Originally Posted by DangerMouse1981
    How do you mean it's not part of the domain model though? Which part isn't? If you mean the DOM node then sure, it's just a convenient temporary storage format, much like an array.
    Yeah, that's all I meant. I chose my words poorly.

    Quote Originally Posted by DangerMouse1981
    When I was thinking of an identity map, I was really just thinking of a way to avoid instantiating result objects multiple times when looping collections. i.e. a reference to the Result object is stored within the collection once instantiated and returned in preference to creating a new one.
    If there's no value outside of saving a little instantiation time, then I'd just put it off until you notice a problem with performance. You'd have to profile the code and make sure it was this bit of code that was the bottleneck of course.

    Then again, saving time in instantiation will also increase memory consumption. It may be the case that you'd want to take the performance hit to save memory instead. It all depends on the size of your collections obviously.
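
    To make the trade-off concrete, here is a small sketch of a collection that can either cache the objects it builds or rebuild them on every pass. The class, the $useCache flag and the $instantiations counter are hypothetical and exist only for illustration; a plain array of URLs stands in for the scraped data.

```php
<?php
class Result {
    public $url;

    public function __construct($url) {
        $this->url = $url;
    }
}

// Hypothetical collection over raw values; builds Result objects on demand.
class ResultsCollection implements IteratorAggregate {
    private $raw;
    private $useCache;
    private $cache = array();
    public $instantiations = 0; // counter for illustration only

    public function __construct(array $raw, $useCache) {
        $this->raw = $raw;
        $this->useCache = $useCache;
    }

    private function build($index) {
        if ($this->useCache && isset($this->cache[$index])) {
            return $this->cache[$index]; // reuse: costs memory, saves time
        }
        $this->instantiations++;
        $result = new Result($this->raw[$index]);
        if ($this->useCache) {
            $this->cache[$index] = $result;
        }
        return $result;
    }

    // Generator, so objects are created one at a time while looping.
    public function getIterator(): Traversable {
        foreach (array_keys($this->raw) as $index) {
            yield $index => $this->build($index);
        }
    }
}

$urls = array('http://example.com/a', 'http://example.com/b');
$cached = new ResultsCollection($urls, true);
$uncached = new ResultsCollection($urls, false);

// Loop over each collection twice.
foreach (array($cached, $uncached) as $collection) {
    for ($pass = 0; $pass < 2; $pass++) {
        foreach ($collection as $result) {
            // process $result here
        }
    }
}
```

    After two passes the cached collection has built each object once and still holds them all in memory, while the uncached one has built each object twice and holds none. Which is preferable depends on what profiling shows for your collection sizes.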

