SitePoint Sponsor

User Tag List

Page 1 of 2 12 LastLast
Results 1 to 25 of 36
  1. #1
    SitePoint Zealot ZangBunny's Avatar
    Join Date
    Jul 2003
    Location
    Mainz, Germany
    Posts
    119
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Lightbulb Finding Duplicated Code (Targets for Refactoring)

    A chapter in "Object-Oriented Reengineering Patterns" mentioned a tool named "Duploc" written in and for Smalltalk that helps you locate duplicated code by comparing source files in a "dotplot".

    If found the concept intriguing, so I threw together a PHP version of Duploc. It works by first cleaning up the code, removing "noise" like whitespace, variable names, and string constants. Then it does a plot of the comparison by putting a dot wherever two lines look identical. Here is an example of a short PHP file compared to itself:

    The diagonal line of course means that every line is identical to itself.
    Now look at the following example, showing the file "assertion.php" from lastcraft's Simpletest package:

    The lose rectangular pattern in the top left quadrant points to a control structure, like a list of if/elseifs or a switch statement which could lend itself to a "replace conditional with polymorphism" refactoring, so it's worth looking into.

    Much more interesting is the bottom right quadrant: The diagonal lines show sequences of lines that are virtually identical. Broken lines are almost identical, with only some lines changed between the two. These point us to duplicated behaviour or even cut'n'paste programming. If I where refactoring, I would definitely have a look at this.

    The last example comes from voostind's "Eclipse" library. This is normally what I point people to for an example of nicely factored OOPHP code, but comparing any of the database classes, you find this: (The example show PgDatabase.php and MyDatabase.php)

    The broken diagonal shows us a case of cut'n'paste programming: Obviously one file was just copied over and changed to match the other database. It becomes even more obvious comparing MsDatabase and MyDatabase:


    Currently, you have to drop two files into the directory you want to work in and the tool gives you a large NxN table, comparing each file in the directory with every other. If there's any interest in this, I might consider putting some more work in and releasing it. So what do you think?
    Last edited by ZangBunny; Nov 15, 2003 at 09:34.

  2. #2
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

  3. #3
    SitePoint Zealot ZangBunny's Avatar
    Join Date
    Jul 2003
    Location
    Mainz, Germany
    Posts
    119
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Selkirk
    Yep, that was the tool I mentioned. Sadly, the Smalltalk version needed to run it isn't availlable any more, so I could not run it. I don't know smalltalk, so studying the source doesn't do much for me...
    Quote Originally Posted by Selkirk
    Finds copy'n'paste code even in PHP, but is unwieldy for large numbers of files and doesn't handle slight variations in the code well (like different variable names.)
    Quote Originally Posted by Selkirk
    Text output doesn't lend itself to a quick overview. Charging 150$ per project for this is outragious. Most of the fuzzy matching rules they use are already in my Duploc version (without knowing simian.)

  4. #4
    SitePoint Enthusiast
    Join Date
    Aug 2003
    Location
    Watford, UK
    Posts
    62
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi,

    This seems like a nice tool - I'd definitely be interested in having a play with it. A tool for PHP that can help identify targets for refactoring automatically could be really useful (now that I'm getting more of a handle on refactoring).

    p.s. the images make me think of reading tealeaves .

    Cheers,

    Jon

  5. #5
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi...

    Quote Originally Posted by ZangBunny
    Now look at the following example, showing the file "assertion.php" from lastcraft's Simpletest package:

    The lose rectangular pattern in the top left quadrant points to a control structure, like a list of if/elseifs or a switch statement which could lend itself to a "replace conditional with polymorphism" refactoring, so it's worth looking into.
    Fantastic!

    I have never seen such a tool. What is also interesting is that assertion.php was the very next file that got refactored. You got it spot on. If you check the latest CVS you will see that I am currently splitting it into expectation.php and dumper.php, that is one for comparison and one for displaying the results. It was indeed a cut and paste hack as it was mid-refactor.

    Other points...

    1) The core problem in the code was a static method call preventing polymorphism. This was enough to get the feature in quickly and the tests passed. Of course, just because tests pass doesn't mean you are finished. You still have to clean up the code. This tool could have been used to generate a metric that told me to keep going.

    2) The case statement (actually if...else) should only really be replaced with polymorphism if you get two. A single switch is clearer than lot's of clases with only one method. I think this is a false positive. If there were two such blocks, generating four of these boxes in a square formation, I would be more worried. That scenario is going to be hard to spot unless you include "very close" lines as well as exact matches.

    3) The diagram is not going to be so useful over multiple files. A matrix of closeness of all the files in the project compared with each other could highlight danger points. You could then generate the diagram for these files only.

    I am really interested in this. Are you going to Sourceforge it?

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  6. #6
    SitePoint Zealot ZangBunny's Avatar
    Join Date
    Jul 2003
    Location
    Mainz, Germany
    Posts
    119
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by lastcraft
    I think this is a false positive. If there were two such blocks, generating four of these boxes in a square formation, I would be more worried. That scenario is going to be hard to spot unless you include "very close" lines as well as exact matches.
    The tool is intended only as a "quick overview" to find places worth looking at. I haven't figured out all "interesting patterns" yet, but I guess it's best used like "tea leaves". If you know the code, and stare at the picture long enough, you'll come up with something useful. And yes, it does include "close matches" not only exact.

    Quote Originally Posted by lastcraft
    3) The diagram is not going to be so useful over multiple files. A matrix of closeness of all the files in the project compared with each other could highlight danger points. You could then generate the diagram for these files only.
    On the contrary. I find it very useful to get a quick overview over a large number of files. I'll attach a screenshot displaying a plot of the complete eclipse library. In "real life" you'll see the filenames when you mouse over the individual images.
    Quote Originally Posted by lastcraft
    I am really interested in this. Are you going to Sourceforge it?
    I guess so. In the meantime I'll clean up the code and post it here, since its only 150 lines at present.
    Attached Images Attached Images

  7. #7
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi...

    Quote Originally Posted by ZangBunny
    And yes, it does include "close matches" not only exact.
    Could you colour code the matches, brilliant white for exact and then shades of grey for proximity?

    Quote Originally Posted by ZangBunny
    On the contrary. I find it very useful to get a quick overview over a large number of files.
    The project I am currently employed on is 30000 lines of code and 200+ files. Also a metric would be really helpful as it could be built into the testing/release mechanism.

    Could you do a plot of the whole of SimpleTest . Just curious...

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  8. #8
    SitePoint Zealot ZangBunny's Avatar
    Join Date
    Jul 2003
    Location
    Mainz, Germany
    Posts
    119
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by lastcraft
    Could you colour code the matches, brilliant white for exact and then shades of grey for proximity?
    Not at the moment. I have an idea, but that would slow the thing way down.

    Quote Originally Posted by lastcraft
    The project I am currently employed on is 30000 lines of code and 200+ files.
    Lucky you. I have about 4x as much to dig through.

    Quote Originally Posted by lastcraft
    Also a metric would be really helpful as it could be built into the testing/release mechanism.
    What kind of "metric" did you have in mind? Number of duplicated lines? Largest continous duplication? I'm stumped.
    Quote Originally Posted by lastcraft
    Could you do a plot of the whole of SimpleTest . Just curious...
    Yes, did that already, but I can't get a screenshot as it is about 6 screenfulls big. You can do it yourself, though, as I'm attaching a pre-alpha version to tide you over until SF approves the project.
    Attached Files Attached Files

  9. #9
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi...

    Quote Originally Posted by ZangBunny
    What kind of "metric" did you have in mind? Number of duplicated lines? Largest continous duplication? I'm stumped.
    Actually so am I . Number of duplicated lines (minus the diagonal in the middle) might be useful and reasonably quick. Have you tried this? Also the length of the matching segment is crucial. If it is threshholded at two lines minimum maybe. Hm...I am going to have a play...

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  10. #10
    SitePoint Enthusiast
    Join Date
    Aug 2003
    Location
    Watford, UK
    Posts
    62
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi,

    Before I lose the link, this PDF (the paper is cited on the Smalltalk Duploc site) may help in interpreting the output:

    http://imagebeat.com/dotplot/tapos.pdf

    Cheers,

    Jon

    edit:
    It took until I saw the link printed in this post before I thought to try the directory, rather than just the pdf, looks like there's some interesting stuff there.

  11. #11
    SitePoint Zealot ZangBunny's Avatar
    Join Date
    Jul 2003
    Location
    Mainz, Germany
    Posts
    119
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Nice. Pretty scientific, but with a little efford I can probably get something useful out of it

    I applied for Sourceforge hosting for PHP:uploc last night, so I should hear back from them early next weak, I hope.

  12. #12
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi...

    Quote Originally Posted by ZangBunny
    I applied for Sourceforge hosting for PHP::Duploc last night
    You'll get it somewhen on monday. They are usually very quick to approve projects. Want some help?

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  13. #13
    SitePoint Zealot ZangBunny's Avatar
    Join Date
    Jul 2003
    Location
    Mainz, Germany
    Posts
    119
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by lastcraft
    Want some help?
    Sure. Should I just add you as a project member?

  14. #14
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi...

    Quote Originally Posted by ZangBunny
    Should I just add you as a project member?
    Yeah, please.

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  15. #15
    SitePoint Enthusiast
    Join Date
    Aug 2003
    Location
    Watford, UK
    Posts
    62
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi,

    I've been keeping distracted with this for most of the afternoon now . I'm attaching a simple viewer app that lets you display the relevant source code lines of the files by clicking on the dotplot (it only works for file against file). It's JavaScript based, and I'm sure it's pretty brittle - it's working for me in Mozilla Firebird, but I haven't got enough JavaScript to know how it'll do in other browsers.

    It should work by unzipping into a web dir and hitting dotplotviewer.html, entering the relative path(s) to the file(s), and submitting the form. It's a bit ugly, but it works for me.

    Unfortunately I kind of got a little carried away w/ mods to dotplot.php - I added scaling (it's kind of important for clicking on the right dot... ), but I also made a few other changes that weren't that necessary and could well have slowed it right down/outright broken it. Memory use is a problem w/ the scaling.

    Anyway, hopefully the attached will give you some ideas if nothing else.

    Cheers,

    Jon
    Attached Files Attached Files

  16. #16
    SitePoint Guru
    Join Date
    Nov 2002
    Posts
    841
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I like where this thread is going.

    I've always had a fondness for alternate tools for viewing/analyzing source code: cross referencers, metrics calculators, lint, etc.

    I would love to see a useable PHP lint. With messages like:

    WARNING: Variable $error (1 occurrence) is a variant capitalization of variable $Error (23 occurrences).

    I despise tools that create source code from diagrams. I love tools that create diagrams from source code.

    My brother (David Moore) is a top ranked corewar player. The various corewar red code simulators have visual representations of the running programs, which David has learned to interpret and debug with. This tool reminds me a bit of the red code simulator display. Very cool stuff.

  17. #17
    SitePoint Zealot ZangBunny's Avatar
    Join Date
    Jul 2003
    Location
    Mainz, Germany
    Posts
    119
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by JonR
    I've been keeping distracted with this for most of the afternoon now ;). I'm attaching a simple viewer app that lets you display the relevant source code lines of the files by clicking on the dotplot (it only works for file against file). It's JavaScript based...
    That was actually the next step I was thinking about, although I was more along the lines of the old "server side image maps" ("ismap" attribute in the <img>-tag) to get the click coordinates.
    What I'd use JS for is displaying the two source files side by side in individual frames. Alternatively, you can take some context and put them side-by-side in a table...

    Quote Originally Posted by JonR
    Unfortunately I kind of got a little carried away w/ mods to dotplot.php - I added scaling (it's kind of important for clicking on the right dot... ;)), but I also made a few other changes that weren't that necessary and could well have slowed it right down/outright broken it. Memory use is a problem w/ the scaling.
    Scaling can be done with imagecopyresized() after all the actual work is done and the memory is freed. I really like the different coloring for exact matches. Haven't looked at the performance impact yet.

    The paper mentioned above suggests a weighing algorithm, where matches between very common constructs (like "break" or "}") have less weight (darker pixel) than lines that appear very rarely. They also point out some more patterns to look for.

    If you're interested, PM me your Sourceforge handle so I can put you on the roster.

  18. #18
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi.

    I had a little play before Sunday telly and before my cold got the better of me . 30000 lines of source takes nearly ten minutes on my twin P3-600. This is quite acceptable with progressive painting of teh images. It was a bit hit and miss on finding bad files though. I think the shading/metric factor is actually quite important and I was a little dissapointed after the direct hit on SimpeTest. I am going to play with some alternates and repost.

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  19. #19
    SitePoint Enthusiast
    Join Date
    Aug 2003
    Location
    Watford, UK
    Posts
    62
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi,

    Quote Originally Posted by ZangBunny
    I was more along the lines of the old "server side image maps" ("ismap" attribute in the <img>-tag) to get the click coordinates.
    That sounds like the more sensible option.

    Quote Originally Posted by ZangBunny
    Scaling can be done with imagecopyresized() after all the actual work is done and the memory is freed.
    Ahhh, I'd never used the image stuff before. I had a quick scan for something similar, but it didn't jump out at me. Oh well, learn something new everyday (well several things today actuallly ).

    Quote Originally Posted by ZangBunny
    The paper mentioned above suggests a weighing algorithm, where matches between very common constructs (like "break" or "}") have less weight (darker pixel) than lines that appear very rarely. They also point out some more patterns to look for.
    There's a fair bit of cool stuff at that site, unfortunately a lot of it is straight over my head on the first reading .

    I don't know if the inverse frequency thing that they talk about will work on a micro-scale, file vs. file (or vs. itself). Having said that, I guess that macro-scale runs might be a good way to build up weightings for use in the micro-scale.

    Cheers,

    Jon

  20. #20
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi...

    Did a bit of refactoring prior to adding other display types. I was hoping to do more dramatic stuff, but it's getting late. I have optionally removed the central line (and speeded up the symmetric images) and added links as a decorator.

    yours, Marcus
    Attached Files Attached Files
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  21. #21
    SitePoint Zealot ZangBunny's Avatar
    Join Date
    Jul 2003
    Location
    Mainz, Germany
    Posts
    119
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Just got approved by Sourceforge.

    Find everything related to this project at http://www.sourceforge.net/projects/phpduploc

  22. #22
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi...

    I've started using it at work today.

    One problem is that unifying/normalising the source shrinks the number of lines. This is going to make it difficult to highlight in the original source.

    Quote Originally Posted by ZangBunny
    Just got approved by Sourceforge[/url]
    Excellent. Done some metric experiments and came up with the version below.

    This brings up the thorny question of whose contribution to use as the base release . Please at least have a look at this version as I would hate do have to redo everything I've just done. To change the bahaviour of the one below you have to edit duploc.php to assemble the objects differently.

    yours, Marcus
    Attached Files Attached Files
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  23. #23
    SitePoint Enthusiast
    Join Date
    Aug 2003
    Location
    Watford, UK
    Posts
    62
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi,

    Quote Originally Posted by lastcraft
    I've started using it at work today.
    Likewise, it's cool that it's so immediately useful.

    Quote Originally Posted by lastcraft
    One problem is that unifying/normalising the source shrinks the number of lines. This is going to make it difficult to highlight in the original source.
    I altered this bit for the viewer thing for that very reason.

    Quote Originally Posted by lastcraft
    Done some metric experiments and came up with the version below.
    I had a quick look at your original posting this morning, it seemed very fast already . I'll have a look at the new one later.


    I was thinking about other possibly useful targets that might be fairly easy to check for (separately from the dotplot):

    Long Method (and/or cyclomatic complexity)

    Large Class

    Long Parameter List

    Switch Statements (should be caught by the dotplot)

    Comments (those that aren't for PHPDoc anyway)

    Sprinklings of TODO, FIXME, HACK etc.

    I guess that most of these would probably be fairly easy to spot. Weighting their importance might be another matter...

    Cheers,

    Jon

  24. #24
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi.

    As for speed, I had a version that just listed the diagonal plots against file name. A kind of vertical report. I didn't include it with the source I posted, but that is the one I have now taken to using. For that, fast symmetric plots were very useful.

    As for more features, you could add extra modules as links from the grid pattern or as modules for a vertical report.

    I think the most useful module is a source code listing with the lines in different colours and shades. Say shades of red for intrafile repetition and shades of orange for extrafile repetition. You really want to see some kind of report (including the dot plot) when you click on a file grid report.

    Also I am now not so sure about taking out the central diagonal. It is a useful hook for a beginner to understand what is going on.

    yours, Marcus.
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things

  25. #25
    ********* Victim lastcraft's Avatar
    Join Date
    Apr 2003
    Location
    London
    Posts
    2,423
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hi...

    Quote Originally Posted by ZangBunny
    Find everything related to this project at http://www.sourceforge.net/projects/phpduploc
    What's the module name?

    yours, Marcus
    Marcus Baker
    Testing: SimpleTest, Cgreen, Fakemail
    Other: Phemto dependency injector
    Books: PHP in Action, 97 things


Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •