Process items from a set in multiple threads

Hi everyone,

I’m working on a spider (and I know there are several out there already). At present it works great and I’m happy with its performance, but I’d like to go further and make the program multithreaded.

I’m storing the list of URLs to crawl in a LinkedHashSet, and the list of URLs already crawled in another.

What I’d like is to take about 10 URLs from the to-crawl list and process them in a thread of their own, take another 10 and do the same at the same time, and so on, with about 3 threads running at once, all processing URLs from the same list.

Once each URL is processed, all the links retrieved from that page are added to the list to be crawled.

This might be simple to implement, but for some reason I just can’t get my head around how to go about it.

I can post sample code if needed, but you’ll have to tell me what you need to see, because it’s a massive program, almost 1.5MB when the jar is built. I would use one of the already-made spiders, but this one has been built from the ground up to match my particular needs: I’m not interested in creating a search index or anything of that nature, and it’s built specifically to work with my websites.

Thanks in advance.

If you truly want to go for the BEST solution, then I advise using JMS. Basically, this is exactly what you are reinventing: you drop a message into a queue (your list, in this case), a listener picks up the message and processes it, and if your worker finds more URLs it simply adds them to the queue. This does mean you need to run a JMS broker; I hear http://www.rabbitmq.com/ is a good candidate. If you haven’t tried JMS before, this would be a good opportunity. Also, with JMS you can limit the “thread workers”. Say you need to process 1,000,000,000 URLs. Would you create 1,000,000,000 threads all at the same time? With JMS you can limit the listener app in terms of how many URLs it processes at one time.
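If it helps to see the shape of it in code, here is a rough sketch against the javax.jms API. Treat it as an illustration only: creating the ConnectionFactory, Session and Queue is vendor specific, and extractLinks is a stand-in for your spider’s own link extraction.

import javax.jms.*;

public class UrlListener implements MessageListener {

    private final Session session;
    private final MessageProducer producer;

    public UrlListener(Session session, Queue urlQueue) throws JMSException {
        this.session = session;
        this.producer = session.createProducer(urlQueue);
        // Register this listener; the JMS provider calls onMessage
        // once for every URL message that arrives on the queue.
        session.createConsumer(urlQueue).setMessageListener(this);
    }

    public void onMessage(Message message) {
        try {
            String url = ((TextMessage) message).getText();
            // Crawl the page here, then drop any new links back
            // onto the same queue for whichever listener is free.
            for (String link : extractLinks(url)) {
                producer.send(session.createTextMessage(link));
            }
        } catch (JMSException e) {
            e.printStackTrace();
        }
    }

    // Placeholder for your spider's link extraction.
    private java.util.List<String> extractLinks(String url) {
        return java.util.Collections.emptyList();
    }
}

You limit the “thread workers” simply by how many of these listeners (each with its own session) you register.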

Thank you very much. JMS sounds like exactly what I’m doing, and trust me, if I’d known it existed I wouldn’t be reinventing it; most of my program is reusable parts and algorithms I got from different sources. The only reason I recreated a spider is that none of the free ones I found were flexible enough to be tailored to what I’m doing with my site.

There is a single MySQL object because of the way I’ve configured my MySQL server: all transfers are handled by this single object, which ensures there aren’t too many connections from my IP, otherwise connections start getting refused. That way the load on my server doesn’t spike. :slight_smile:
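For illustration, the shape of it is roughly this (the names are made up, but the idea is one shared gateway holding a semaphore that caps concurrent connections):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.concurrent.Semaphore;

// One shared gateway: caps how many connections are open at once,
// so the server never sees too many from my IP.
public class MySqlGateway {

    private static final MySqlGateway INSTANCE = new MySqlGateway(5);
    private final Semaphore permits;

    private MySqlGateway(int maxConnections) {
        permits = new Semaphore(maxConnections);
    }

    public static MySqlGateway getInstance() {
        return INSTANCE;
    }

    public void execute(String sql) throws SQLException, InterruptedException {
        permits.acquire(); // block if the cap has been reached
        Connection con = null;
        try {
            con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/mydb", "user", "pass");
            con.createStatement().executeUpdate(sql);
        } finally {
            if (con != null) {
                con.close();
            }
            permits.release();
        }
    }
}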

Again, thanks. I’m about to look at the JMS thing now…

Also, why would you run MySQL in another thread?

You can check this

If I were you, I’d make the threads stateless, meaning the threads don’t depend on each other; otherwise it could really get hairy.

Thanks for the reply.
I know how to create threads and such. Admittedly, I’ve never thought about whether they’re stateless or not, so that part I don’t know how to do. An example, or a link pointing to one, would be good.

What I can’t get is how I process some links from the set in separate threads while, at the same time, adding the new links to the same set from all the threads.

My first thought was that if I make the set static and synchronized, each thread can update it without messing things up, and that takes care of updating the set.
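To make that concrete, what I had in mind for the shared set was roughly this (just a sketch of the idea, not code from my spider; Frontier and takeBatch are made-up names):

import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class Frontier {

    // One shared set; synchronizedSet makes single calls such as
    // add() and remove() safe from any thread.
    private static final Set<String> toCrawlList =
            Collections.synchronizedSet(new LinkedHashSet<String>());

    // Taking "the next 10" is a compound action, so it still needs an
    // explicit lock on the set even though the set is synchronized.
    public static Set<String> takeBatch(int n) {
        Set<String> batch = new LinkedHashSet<String>();
        synchronized (toCrawlList) {
            Iterator<String> it = toCrawlList.iterator();
            while (it.hasNext() && batch.size() < n) {
                batch.add(it.next());
                it.remove();
            }
        }
        return batch;
    }

    public static void addAll(java.util.Collection<String> links) {
        toCrawlList.addAll(links);
    }
}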

Then there’s the issue of retrieving links and updating the set from inside the threads.

The way it currently works is: the start URL is fetched and the links retrieved from that page are added to the set. Once added, each link is processed in turn, and the process of getting all the links from each page is repeated.

Here’s the code for the crawl method:


private void doCrawl(String startUrl, int maxUrls, boolean limitHost,
        String searchString) {
    /* do all checks to see if it's a page to index, then perform the crawl */

    // Add the start URL to the to-crawl list.
    toCrawlList.add(startUrl);

    // Perform the actual crawling by looping through the to-crawl list.
    while (crawling && toCrawlList.size() > 0) {
        // This was my first idea for taking 10 links from the set to pass
        // to a thread for processing:
        // LinkedHashSet<String> pass = new LinkedHashSet<String>();
        // String pUrl;
        // // add ten URLs to the set to be processed in one thread
        // for (int j = 0; j < 10; j++) {
        //     pUrl = toCrawlList.iterator().next();
        //     pass.add(pUrl);
        //     toCrawlList.remove(pUrl);
        // }

        // Get the URL at the head of the list and remove it from the
        // to-crawl list.
        url = toCrawlList.iterator().next();
        toCrawlList.remove(url);

        // Convert the string URL to a URL object.
        URL verifiedUrl = verifyUrl(url);

        // Add the page to the crawled list.
        crawledList.add(url);

        // Download the page at the given URL; this populates the
        // pageContents field with the HTML for that URL.
        getPage(verifiedUrl);

        /*
         * If the page was downloaded successfully, retrieve all its links
         * and then see if it contains the search string.
         */
        if (pageContents != null && pageContents.length() > 0) {
            // Retrieve the list of valid links from the page.
            ArrayList<String> links = retrieveLinks(verifiedUrl, pageContents,
                    crawledList, limitHost);

            toCrawlList.addAll(links);

            // Check that the bad- and good-words criteria are met; some
            // pages on my site have user-submitted content, so I filter
            // adult language etc.
            getCriteria();

            // Get the meta content for this page: title, description,
            // keywords.
            getMeta();

            // Just print some info to the console window.
            updateStats(url, crawledList.size(), toCrawlList.size(), title,
                    "...", "...");

            // Finalisation handles all the MySQL stuff: cleaning strings
            // and making them ready to be inserted into the db.
            doFinalisation(verifiedUrl);
        } else {
            System.out.println("Page null or not downloaded properly so skipped...");
        }
    }
}

OK, that’s the method. I did some rough editing of the comments and removed some unnecessary notes I’d left for myself, but that should be everything.

Any pointers on how I go about turning this into what I described earlier?
The finalisation method runs the MySQL stuff in another thread, but there is no communication between the threads after the object is created, which I think is why I’m going blank on this: I’ve never actually written a program where info is passed to and from threads repeatedly.
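From the suggestions so far, I’m guessing the shape I need is something like the sketch below, using java.util.concurrent. I haven’t actually tried this in the spider yet; fetchAndExtract stands in for the getPage/retrieveLinks part of doCrawl, and a real version would still need a stop condition instead of the endless loop.

import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlerPool {

    // URLs waiting to be crawled; safe to add/take from any thread.
    private final BlockingQueue<String> toCrawl =
            new LinkedBlockingQueue<String>();

    // URLs already crawled, shared by all workers.
    private final Set<String> crawled =
            Collections.synchronizedSet(new LinkedHashSet<String>());

    private final ExecutorService pool = Executors.newFixedThreadPool(3);

    public void start(String startUrl) {
        toCrawl.add(startUrl);
        for (int i = 0; i < 3; i++) {
            pool.execute(new Worker());
        }
    }

    private class Worker implements Runnable {
        public void run() {
            try {
                while (true) {
                    // take() blocks until a URL is available.
                    String url = toCrawl.take();
                    if (!crawled.add(url)) {
                        continue; // already seen this one
                    }
                    // Every new link goes straight back onto the shared
                    // queue for whichever worker is free next.
                    for (String link : fetchAndExtract(url)) {
                        toCrawl.add(link);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shut down cleanly
            }
        }
    }

    // Stand-in for getPage() + retrieveLinks().
    private List<String> fetchAndExtract(String url) {
        return Collections.<String>emptyList();
    }
}

Each Worker keeps no state of its own and only touches the shared queue and set, so I think this is what was meant by stateless.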