Thanks for the reply.
I know how to create threads and such. Admittedly, I've never thought about whether they're stateless or not, so that part I don't know how to do. An example, or a link pointing to one, would be good.
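From the bit of reading I've done since, I think a stateless worker just means everything it uses is either passed in as a final field or local to run(), something like this (processPage is a hypothetical placeholder, so correct me if I've got the wrong idea):

    // my rough understanding of a "stateless" worker: the only field is a
    // final input, and every working variable is local to run(), so there
    // is nothing for two threads to trample on
    class CrawlWorker implements Runnable {

        private final String url; // fixed at construction, never changed

        CrawlWorker(String url) {
            this.url = url;
        }

        public void run() {
            String html = processPage(url); // hypothetical stand-in for getPage()
            // ...parse html for links and hand them to a shared thread-safe list
        }

        private String processPage(String url) {
            return ""; // placeholder so the sketch compiles
        }
    }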
What I can't get is how to process some links from the set in separate threads while, at the same time, adding the new links to the same set from all the threads.
My first thought was: if I make the set static and synchronized, then each thread can update it without messing things up, and that takes care of updating the set.
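Just so it's concrete, this is the kind of thing I had in mind for the shared set; I believe Collections.synchronizedSet is the standard wrapper, but correct me if not:

    import java.util.Collections;
    import java.util.LinkedHashSet;
    import java.util.Set;

    class SharedLists {
        // single calls like add()/remove()/contains() on this are thread safe;
        // compound actions (get-then-remove) still need an explicit
        // synchronized block -- see below
        static final Set<String> toCrawlList =
                Collections.synchronizedSet(new LinkedHashSet<String>());
    }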
Then there's the issue of retrieving links and updating the set from multiple threads at once.
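From what I've read, the catch is that "get the next url" and "remove it" are two separate calls, so even on a synchronized set two threads could grab the same url in between. Something like this (untested) looks like the way to make the pair atomic, locking on the set itself as the javadoc for synchronizedSet suggests for iteration:

    // take the next url atomically: hold the set's lock across the
    // next() + remove() pair so no two threads can get the same url
    static String takeNextUrl() {
        synchronized (SharedLists.toCrawlList) {
            java.util.Iterator<String> it = SharedLists.toCrawlList.iterator();
            if (!it.hasNext()) {
                return null; // nothing left to crawl
            }
            String url = it.next();
            it.remove();
            return url;
        }
    }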
The way it currently works is: the start URL is fetched and the links retrieved from that page are added to the set. Once added, each link is processed in turn and the process of getting all the links from each page is repeated.
Here's the code for the crawl method:
private void doCrawl(String startUrl, int maxUrls, boolean limitHost,
        String searchString) {
    /* do all checks to see if it's a page to index, and perform the crawl */

    // Add the start URL to the To Crawl list.
    toCrawlList.add(startUrl);

    /*
     * Perform the actual crawling by looping through the To Crawl list.
     */
    while (crawling && toCrawlList.size() > 0) {

        // This was my first idea for taking 10 links from the set
        // to pass to a thread for processing:
        // LinkedHashSet<String> pass = new LinkedHashSet<String>();
        // String pUrl;
        // // add ten urls to the list to be processed in one thread
        // for (int j = 0; j < 10; j++) {
        //     pUrl = toCrawlList.iterator().next();
        //     pass.add(pUrl);
        //     toCrawlList.remove(pUrl);
        // }

        // Get the URL at the bottom of the list.
        url = toCrawlList.iterator().next();

        // Remove the URL from the To Crawl list.
        toCrawlList.remove(url);

        // Convert the string url to a URL object.
        URL verifiedUrl = verifyUrl(url);

        // Add the page to the crawled list.
        crawledList.add(url);

        // Download the page at the given URL
        // (populates the pageContents var with the html for that url).
        getPage(verifiedUrl);

        /*
         * If the page was downloaded successfully, retrieve all its links
         * and then see if it contains the search string.
         */
        if (pageContents != null && pageContents.length() > 0) {

            // Retrieve the list of valid links from the page.
            ArrayList<String> links = retrieveLinks(verifiedUrl, pageContents,
                    crawledList, limitHost);
            toCrawlList.addAll(links);

            // Check that the bad/good words criteria are met; some pages on
            // my site have user-submitted content, so I filter adult language etc.
            getCriteria();

            // Get the meta content for this page: title, description, keywords.
            getMeta();

            // Just print some info to the console window.
            updateStats(url, crawledList.size(), toCrawlList.size(), title, "...", "...");

            // Finalisation handles all the mysql stuff: it cleans the strings
            // and makes them ready to be inserted into the db.
            doFinalisation(verifiedUrl);
        } else {
            System.out.println("Page null or not downloaded properly so skipped...");
        }
    }
}
OK, that's the method. I did some rough editing of the comments and removed some unnecessary ones I'd added for myself, but that should be everything.
Any pointers on how I go about turning this into what I described earlier?
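From the reading I've done so far, I'm guessing the shape is something like the sketch below: a fixed pool of workers, a thread-safe queue instead of the LinkedHashSet, a concurrent "seen" set for duplicates, and a counter of pending urls so the workers know when everything is finished. This is just my untested understanding (fetchAndExtractLinks is a made-up stand-in for getPage + retrieveLinks), so please shoot holes in it:

    import java.util.Collections;
    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    class CrawlerSketch {

        // thread-safe frontier replacing the LinkedHashSet
        private final BlockingQueue<String> toCrawl = new LinkedBlockingQueue<String>();

        // every url ever queued, so the same page is never crawled twice
        private final Set<String> seen =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        // urls queued or currently being processed; 0 means the crawl is done
        private final AtomicInteger pending = new AtomicInteger(0);

        void crawl(String startUrl, int nThreads) throws InterruptedException {
            seen.add(startUrl);
            pending.incrementAndGet();
            toCrawl.put(startUrl);

            ExecutorService pool = Executors.newFixedThreadPool(nThreads);
            for (int i = 0; i < nThreads; i++) {
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                // wait briefly for work; null means none right now
                                String url = toCrawl.poll(1, TimeUnit.SECONDS);
                                if (url == null) {
                                    if (pending.get() == 0) return; // all finished
                                    continue; // other workers may still add links
                                }
                                try {
                                    // made-up stand-in for getPage + retrieveLinks
                                    for (String link : fetchAndExtractLinks(url)) {
                                        // add() returns false if already seen,
                                        // and is atomic on this concurrent set
                                        if (seen.add(link)) {
                                            pending.incrementAndGet();
                                            toCrawl.put(link);
                                        }
                                    }
                                } finally {
                                    pending.decrementAndGet(); // this url is done
                                }
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        private List<String> fetchAndExtractLinks(String url) {
            // placeholder: download the page and return the links found on it
            return Collections.emptyList();
        }
    }

The pending counter is what I think replaces my while (toCrawlList.size() > 0) test: the queue being empty isn't enough on its own, because another thread might be mid-page and about to add more links.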
The finalisation method runs the MySQL stuff in another thread, but there is no communication between the threads after the object is created. I think that's why I'm going blank on this: I've never actually done a program where info is passed to and from threads repeatedly.
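On the passing-info-between-threads part, BlockingQueue seems to be the usual answer: one thread put()s, another take()s, and the queue does all the waiting. Here's how I imagine the mysql side could receive finished pages from the crawl threads (PageRecord and insertIntoDb are made up for the sketch):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class DbWriterSketch {

        // crawl threads put finished pages here; the single db thread takes them
        private final BlockingQueue<PageRecord> finished =
                new LinkedBlockingQueue<PageRecord>();

        // hypothetical holder for whatever doFinalisation needs
        static class PageRecord {
            final String url, title;
            PageRecord(String url, String title) {
                this.url = url;
                this.title = title;
            }
        }

        void startDbWriter() {
            Thread writer = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            PageRecord page = finished.take(); // blocks until a page arrives
                            insertIntoDb(page); // stand-in for the real mysql code
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // shut down when interrupted
                    }
                }
            });
            writer.setDaemon(true);
            writer.start();
        }

        // called from any crawl thread; the queue makes the handoff safe
        void submit(PageRecord page) throws InterruptedException {
            finished.put(page);
        }

        private void insertIntoDb(PageRecord page) {
            // placeholder for the actual prepared-statement work
        }
    }

Does that look like the right general direction, or am I overcomplicating it?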