OK guys, go easy, I’m new here. I’ve run into a problem and need some help. I’m building a crawler for my website’s Web Directory section. All crawlable URLs will be added by members who have websites. Building a crawler is nothing new to me; I’ve built a couple of basic ones before. However, this time I’m really going to stress security, since it will be used by members and not just me…but I digress…
Here’s the scenario.
John Doe decides to add his URL, example.com, to the directory for my crawler to crawl. But John Doe is trying to be bad and get a malicious file onto my server. The root folder of example.com contains only one file, index.exe. Are you with me so far? If he only enters (example.com) in the form, how can I use PHP to get the complete path behind that URL? I’ve tried parse_url and pathinfo, but they only break apart the string that was submitted; they can’t tell me which file the remote server will actually serve. Any guidance would be greatly appreciated. Thanks for your time.
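For reference, here is a minimal sketch of what parse_url can and cannot do with a bare domain. The helper name normalize_submitted_url is made up for illustration, and defaulting to http:// when no scheme is given is an assumption. It only works on the submitted string and never contacts the remote server.

```php
<?php
// Hypothetical helper: normalize whatever a member typed into the form.
// It inspects the string only; it cannot discover files on the remote host.
function normalize_submitted_url(string $input): ?array
{
    $input = trim($input);

    // Without a scheme, parse_url() treats "example.com" as a path, not a host,
    // so prepend one (assumption: plain http is an acceptable default).
    if (strpos($input, '://') === false) {
        $input = 'http://' . $input;
    }

    $parts = parse_url($input);
    if ($parts === false || empty($parts['host'])) {
        return null; // nothing crawlable in this input
    }

    $scheme = strtolower($parts['scheme'] ?? '');
    if (!in_array($scheme, ['http', 'https'], true)) {
        return null; // only crawl web URLs
    }

    return [
        'scheme' => $scheme,
        'host'   => strtolower($parts['host']),
        // A bare domain has no path component; the request target is then "/".
        'path'   => $parts['path'] ?? '/',
    ];
}
```

With example.com as input the path comes back as "/", which is all the string can tell you. Whatever the server decides to send for "/" is only visible in its response.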
Why would your crawler care? It shouldn’t execute anything. It should just fetch the file, possibly detect its type (the first red flag would be a content type that is not text/html or similar), retrieve the contents, and decide that they’re garbage characters.
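A rough sketch of that content-type check, assuming the response headers are available as an array of raw header lines. The function name and the default allow-list are made up for illustration.

```php
<?php
// Hypothetical helper: decide from raw response header lines whether the
// body looks like a page worth crawling. The allow-list is an assumption.
function is_crawlable_content_type(
    array $headerLines,
    array $allowed = ['text/html', 'application/xhtml+xml']
): bool {
    $type = null;
    foreach ($headerLines as $line) {
        // Header names are case-insensitive. If redirects produced several
        // Content-Type lines, the last one describes the final response.
        if (stripos($line, 'Content-Type:') === 0) {
            $value = substr($line, strlen('Content-Type:'));
            // Drop parameters such as "; charset=UTF-8".
            $type = strtolower(trim(explode(';', $value)[0]));
        }
    }

    // A missing Content-Type is treated as "do not crawl".
    return $type !== null && in_array($type, $allowed, true);
}
```

PHP's get_headers() returns header lines in this shape, so its result could be passed straight in. Keep in mind the header is whatever the remote server chooses to send, so it works as a first filter rather than proof of what the file really is.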
Good point. However, haven’t certain virus files been able to execute themselves once uploaded to a server, or am I incorrect in thinking this? I’m just trying to take every necessary precaution. Plus, it would be nice to judge the file type or path being submitted before wasting precious bandwidth scanning the file.
If he adds example.com then all you can request is the root, like:
GET / HTTP/1.1
…
There is no way of determining what files are in the root directory of a web server unless it has directory listing enabled. As a crawler, you shouldn’t need to know what files are there.
If he adds example.com/index.exe then you can refuse to request certain files based on the extension and content-type.
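A minimal sketch of the extension check, combining parse_url and pathinfo. The BLOCKED_EXTENSIONS list and the function name are hypothetical, and the listed extensions are only examples.

```php
<?php
// Hypothetical blocklist; these extensions are examples, not a complete set.
const BLOCKED_EXTENSIONS = ['exe', 'dll', 'bat', 'msi', 'scr'];

// Returns true when the path part of a submitted URL ends in a blocked extension.
function has_blocked_extension(string $url): bool
{
    // Look only at the path, so a query string cannot hide the extension.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        // Bare domain: there is no file name to judge at this point.
        return false;
    }

    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    return in_array($ext, BLOCKED_EXTENSIONS, true);
}
```

This only catches what is visible in the URL. A bare example.com passes here, so the content-type of the response still has to be checked once the server answers.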
Thanks for clearing that up for me, the182guy. I always check extension types; I just didn’t know whether the scenario I posted was feasible. I feel secure now. Thanks!