How to build a bot that can login/submit forms/scrape information?

Hi all,

I’ve done countless Google searches on the topic, but I can’t find any information on building a bot that can submit forms and scrape information. What language(s) does this require (and which is easiest and quickest to learn)? Are there any good tutorials/other documentation that explain what I need to do to get started?

An example of what I need is a bot that I could tell to log in to my Google AdWords account every morning, pull information, and insert it into a MySQL DB.

Thanks for your help.

If you intend to use Perl, maybe WWW::Mechanize will work. You can find it on CPAN.

Please don’t.

If it is your own site then you should be able to do whatever processing is required without requiring a script to login and fill out forms.

If it is someone else’s site, then what you are trying to create is a spambot, and should you actually succeed in creating one I hope you get caught and thrown into prison on a food-free diet for a million years, which ought to be the minimum sentence for such troublemakers (it works out at about $1 a minute when you look at the cost of the resources they have stolen).

Thanks, KevinR. I’ll check that out.

I’m not really sure what language to use; I was hoping to get ideas from this thread about what’s best for my situation.

@ The New Guy. Why not? It would be very valuable to me to learn how to do this. No, I’m not spamming anyone, and no, I’m not doing anything illegal…just trying to automate a few tasks to save time. Did you even read my example?

@ felgall, instead of assuming that I’m creating a spambot and going off on a tirade, why don’t you offer whatever valuable input you may have? That’s all I was asking for when I created this thread…

PS. For the record, I hate spam with a passion. When I first learned PHP and started using it on my sites, I didn’t take the necessary precautions to stop spambots from abusing my mail server. This happened twice, and I eventually had to shut down both VPSs, which was the biggest pain in the world. So, please refrain from making assumptions and idiotic comments. If you don’t have anything nice to say (in this case, helpful), don’t say anything at all.

You have no idea what this guy’s intentions are. It could be that he is going to create a spambot, but it could just as well be that he is not. Maybe he just wants to automate a process. Your comments are totally without merit and should be removed. I am going to report your post, as I believe it is totally out of line.

Perl is really the only scripting language I am very familiar with, so that’s why I recommend you look into WWW::Mechanize. I have no idea if it will work for your particular situation, but it’s where I would start if I were going to do something like this.

As I said in my earlier post, if it is for automating something on his own site then it can be done far more simply without any need to fill out the form. All that is needed is to code the script so that it performs the same processing that happens when the form is submitted, without the form having to be filled out. Automating processes on your own site this way is always easier than writing a script to fill out the form first. Scripting to fill out a form is only necessary when you don’t have access to make a minor modification to the code that processes the form, so that your script can call it without the form. The only time you don’t have that access is when the script you are calling is on someone else’s site. If they mean for you to be able to automate processes that call their script, then they will provide the necessary code to hook into it without a form. The only time a form needs to be filled out by a script is when you are trying to break into someone else’s script where they do not want you to.

The simplest fix is to set up the script that processes the form so that it handles either posted fields or session fields (you may want an extra session field with a specific hard-to-guess value to make it more secure). The script can then run either with values posted from the form or with values passed in a session from your automated script (which therefore doesn’t need to touch the form). The other alternative is to pass the fields in a cookie or in the query string.
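For anyone who wants to see that idea in code, here is a minimal sketch in Python (the post above does not name a language). The function name `select_fields`, the field name `automation_token`, and the plain dictionaries standing in for posted and session data are all assumptions made for illustration, not part of any particular framework.

```python
import hmac


def select_fields(posted, session, expected_token):
    """Pick the input for the form handler.

    posted: dict of fields submitted through the real form (may be empty).
    session: dict of fields placed in the session by an automated script.
    expected_token: the hard-to-guess value the automated script must supply.
    """
    if posted:
        # Normal case: a person filled out and submitted the form.
        return dict(posted)

    # Automated case: accept session fields only if the secret token matches.
    token = str(session.get("automation_token", ""))
    if token and hmac.compare_digest(token, expected_token):
        return {k: v for k, v in session.items() if k != "automation_token"}

    raise PermissionError("no posted fields and no valid automation token")


# Hypothetical values, for illustration only.
fields = select_fields({}, {"automation_token": "s3cr3t", "report": "daily"}, "s3cr3t")
print(fields)
```

`hmac.compare_digest` is used instead of `==` so the comparison takes the same time whether or not the token matches; with `str` arguments it expects ASCII-only values.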

That’s what you should have said in your first post.

If it were for use on his own site, the discussion would be moot. But in reality, the scenario is the same for a remote site. A script can send and receive data (even POST data) without ever touching the remote site’s forms. All it needs to know is the form’s structure and where to send the data.
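To make that concrete, here is a rough sketch in Python using only the standard library. It builds the same POST request a browser would send, given nothing but the form's action URL and field names. The URL, the field names, and the helper name `build_form_post` are placeholders I made up; you would read the real ones from the form's HTML.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(action_url, fields):
    """Build the POST request a browser would send when the form is submitted."""
    body = urlencode(fields).encode("utf-8")
    return Request(
        action_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical form: the action URL and field names are placeholders.
request = build_form_post(
    "https://example.com/login",
    {"username": "me@example.com", "password": "secret"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Nothing is sent until the request is passed to `urllib.request.urlopen(request)`, so you can inspect the encoded body first.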

I end my participation in this thread.

OK, that statement nearly made me fall out of my chair.

How can you be so naive as to think that every site owner in the world has the resources to set up some sort of mechanism to receive information like this? I literally talked to 8 different owners of sites (where we’re currently advertising) last week and none of them did…most of them didn’t even know what PHP is (my original suggestion was that our IT department put files on their FTP and they create a script that would parse those files). However, they all thought it was a GREAT idea for us to automate the process on our end. I guess that means I’m doing something I shouldn’t. Shame on me.

Sorry, but that was just an ignorant statement.

It will be easier if the site stays the same and you can code up something for one specific site only, such as AdSense. As to the language, use whatever you’re comfortable with, though low-level languages might be overkill.

Selenium is a great debugging and automation tool that uses a proxy to issue form posts…

http://www.openqa.org/selenium-rc/perl/WWW-Selenium.html

I don’t see the big deal in asking to do this?

Sites such as Last.fm and Facebook (I believe) have scripts that automatically log in to other accounts.

I’ve made something similar using Python with its cookie and URL handling libraries. You could then call this from a cron job.
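Roughly, that approach might look like the sketch below. It is not the script I described above, just an illustration that assumes the standard-library `http.cookiejar`, `urllib.request`, and `html.parser` modules, and assumes the data you want sits in `<td>` table cells. The names `make_opener`, `login`, and `extract_cells` are made up for the example.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


def make_opener():
    """Return an opener that stores cookies, so a login persists across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login(opener, login_url, fields):
    """POST the login fields; any session cookie ends up in the opener's jar."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(login_url, data=body, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell in a page."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def extract_cells(html):
    """Return the stripped text of each table cell found in the HTML."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]
```

A run would call `make_opener()`, then `login(opener, url, fields)` with the real login URL and field names, then fetch the report page with `opener.open(...)` and pass its HTML to `extract_cells`. Saved as a script, it could be scheduled with a crontab entry along the lines of `0 6 * * * python3 /path/to/script.py` (the path is a placeholder).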

If you’re confident with PHP, you should use that. The SimpleTest unit-testing framework has a sub-component (SimpleBrowser), which is used for writing integration tests. It can also be used standalone, and is excellent for this kind of thing.
http://www.lastcraft.com/browser_documentation.php