How to remove a dynamic frame buster

I’m trying to crawl a page that offers no API by fetching its markup and inserting my crawling script into it, but the page author has some frames that dynamically inject a frame-busting snippet. I know this because I’ve tried the code in this SO answer, but the frame busting still happens. I want to block the frame buster and load my crawling script, but I don’t know why the following code doesn’t remove any of the elements. First, I followed the outline in the answer I mentioned above, like so:

<script type="text/javascript" id=busty>
function ignore_next_redirect() {
  var redirect_timer;
  var prevent_bust = 0;
  window.onbeforeunload = function() { prevent_bust++; };
  redirect_timer = setInterval(function() {
    if (prevent_bust > 0) {
      window.top.location = 'stable.php'; // sets status code to 204
      window.onbeforeunload = function() {};
      clearInterval(redirect_timer);
      var bad = [document.getElementsByTagName('iframe'), document.getElementsByTagName('script')];
      for (var i = 0; i < bad.length; i++) {
        Array.prototype.filter.call(bad[i], (e) => !e.matches('#busty')).forEach(function(e) {
          e.parentNode.removeChild(e);
        });
      } // remove generating iframes

      var jq = document.createElement('script');
      jq.setAttribute('src', 'jquery.js');
      document.getElementsByTagName('body')[0].appendChild(jq);

      $.get('resume.php'); // revert to 200
      $.getScript('newCrawler.js'); // add the crawling script
      $('#busty').remove();
    }
  }, 1);
}
</script> 

But the elements are still sitting there comfortably. Then I tried this variant:

<script type="text/javascript" id=busty>
    var prevent_bust = 0,
        stable = 0;

    window.onbeforeunload = function() {
        prevent_bust++;
    };

    setInterval(function() {
        if (prevent_bust > 0) {
            prevent_bust -= 2;
            if (!stable) top.location = 'stable.php';
            else $.get('resume.php');

            $('iframe, script, noscript').not('#busty, [src="jquery.js"], [src="newCrawler.js"]').remove();
        }
    }, 1);

    var jq = document.createElement('script');
    jq.setAttribute('src', 'jquery.js');
    document.getElementsByTagName('body')[0].appendChild(jq);

    $.get('resume.php');
    stable = 1;
    $.getScript('newCrawler.js');
    $('#busty').remove();
</script> 

I don’t understand why

  • the elements are still not removed
  • the Ajax requests I’m making (i.e. to resume normal flow and fetch the crawling script) do not run

PS: The script tags are on purpose; they’re replacing the whitespace in the original markup.
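
One likely reason the Ajax calls never run, offered as a sketch rather than a confirmed diagnosis: a script element added with `appendChild` loads asynchronously, so unless the page already ships its own jQuery, `$` is still undefined when `$.get` is called on the very next line, and the resulting ReferenceError stops everything after it. In the first snippet as shown, `ignore_next_redirect` is also only defined and never called. The hypothetical `loadScript` helper below waits for the `load` event before anything touches jQuery; the `doc` parameter is an assumption added only so the helper can be exercised outside a browser.

```javascript
// Hypothetical helper: resolves once the script at `src` has finished loading.
// `doc` defaults to the page's document; it is a parameter only for testability.
function loadScript(src, doc) {
  doc = doc || document;
  return new Promise(function (resolve, reject) {
    var s = doc.createElement('script');
    s.onload = function () { resolve(s); };
    s.onerror = function () { reject(new Error('Failed to load ' + src)); };
    s.src = src;
    doc.body.appendChild(s);
  });
}

// Usage in the page (file names are the ones from the snippets above):
// loadScript('jquery.js').then(function () {
//   $.get('resume.php');                 // jQuery is defined by now
//   return $.getScript('newCrawler.js'); // wait for the crawler as well
// }).then(function () {
//   $('#busty').remove();
// });
```

Chaining on the returned promise keeps every jQuery call behind the point where the library actually exists, which the straight-line versions above do not guarantee.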

The site owner may not want people copying their content?

Who is copying their content? Do you not know what an API is? Their site contains some data that they put up for the benefit of visitors. What difference does it make whether the visitors access the data using a pen and paper or have a program do it for them (so they can use it wherever and however they see fit)? If they didn’t want anybody using their data, then why is it there?

Now, precious time that would have been used to remedy the situation at hand will be used to address an irrelevant affair. Great! :smiley:

If they provide an API, then don’t try to crawl the page without it; use the API.

You’re all going to make me seem like some unethical and hideous villain trying to frustrate some decent developer somewhere by illegally getting hold of his hard-earned data. I don’t understand it: there are widely used tools like Goutte and Apifier that no one condemns, and no one calls for them to be sanctioned by an internet regulator or blocked in certain countries. This is the first time I’m hearing other developers chide and scold another for attempting to access data in the form of markup. Or is it because I’m building my crawler from scratch instead of using those tools? If the site’s webmaster or administrator couldn’t take the time to prepare an API for his colleagues or, for one reason or another, doesn’t have the data in a portable format, that doesn’t mean another developer is not going to build what he, she, or society urgently needs. I asked a question earlier that no one answered:

If your data is so precious to you as a webmaster, you are overlooking the fact that it’s best to keep and use the data on your own machine, where you wouldn’t have to worry about anybody exporting it with anything other than a pen and paper. Just to clarify, I’m not hacking his site. I have done no wrong.

You can’t access the document/window properties of a cross-origin iframe (ifr); trying to raises an “access denied” type error.
To remove the iframe, place it in a div and use the same method.
Hello friend,
This would be the solution to your query:

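<script type="text/javascript">
// Remove the iframe once the page has loaded.
window.onload = function() {
  var el = document.getElementById("el");
  // firstElementChild skips the whitespace text node that precedes the iframe
  el.removeChild(el.firstElementChild);
};
</script>
<div id="el">
<iframe ...></iframe>
</div>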

OK. Good to know. Unfortunately, I’m unable to wrap the frames in those divs. This regex doesn’t help:

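$r = preg_replace_callback('/(<iframe|<\/iframe>)/im', function ($matches) {
	// Closing tag: end the wrapper div right after it.
	if (strpos($matches[0], '/') !== false) return $matches[1] . '</div>';

	// Opening tag: start the wrapper div right before it.
	return '<div class=remove-frame>' . $matches[1];
}, $r);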

I also tried wrapping the frames directly with DOMDocument before converting it to a string, but neither approach worked for me. Sad.
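
For comparison, the same wrapping idea can be sketched in JavaScript against a markup string. This is only an illustration under assumptions: `wrapIframes` is a hypothetical name, and the `remove-frame` class is the one from the PHP snippet above. Note that a string rewrite like this can only see iframes that already exist in the fetched markup, so frames injected later by script never pass through it, which may be why the wrappers seem to have no effect here.

```javascript
// Hypothetical helper: wraps every <iframe>...</iframe> pair found in a markup
// string in <div class="remove-frame">, so the wrapper can be removed later.
function wrapIframes(html) {
  return html.replace(/<iframe\b|<\/iframe\s*>/gi, function (tag) {
    // A closing tag starts with "</": close the wrapper right after it.
    if (tag.charAt(1) === '/') return tag + '</div>';
    // An opening tag: open the wrapper right before it.
    return '<div class="remove-frame">' + tag;
  });
}

// In the browser, the wrappers (and the frames inside them) could then be dropped:
// document.querySelectorAll('.remove-frame').forEach(function (d) { d.remove(); });
```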

There might be a small error you are not aware of. Do one thing: sit down with a cool mind and proofread your code; you might be missing something small. Just do it once, else I’ll help you!

My new post refuses to upload. It keeps saying “body too similar to your recent post”. This is annoying. @stwebonlinebranding I posted it here instead: https://pastebin.com/pFEsNF95

In addition, the issue is exacerbated by the fact that that help me array is multidimensional, so I can’t pass it in a query URL (using http_build_query) or as anything other than a bulk JSON string.
