Java...java... help with JAVA please

Haven’t been here for a couple of months, so I’m just hoping I remember how to post code.

Been on a big adventure trying to discover what it is I really want to do. For those that don’t know I started out on a Python adventure a few months ago but couldn’t find anything to do with it that really held my interest.

I somehow ended up playing with phone apps and discovered that Java was running on the back-end. My learning stopped when I got to the testing stage, as I don’t own a smartphone (a sign of my age). The good thing to come out of this is that I have decided to learn Java. I have done several tutorials, but things aren’t going as I had hoped. I am sick of writing “hello world” or “hello Bob” in PHP, Python, and now Java.

My goal here is to go back to my old way of learning, which is taking other people’s scripts apart and building something that works the way I want. As I go along, I’m hoping I will start to build a better picture in my head. I have done enough PHP to understand ‘for loops’, ‘arrays’, etc.

That’s the end of my intro…

I’m working on a scraper script. It’s more advanced than I am, but I like it.

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class SpiderLegs {
	
    // We'll use a fake USER_AGENT so the web server thinks the robot is a normal web browser.
    private static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
    private List<String> links = new LinkedList<String>();
    private Document htmlDocument;


    /**
     * This performs all the work. It makes an HTTP request, checks the response, and then gathers
     * up all the links on the page. Call searchForWord() after a successful crawl.
     * 
     * @param url
     *            - The URL to visit
     * @return whether or not the crawl was successful
     */
    public boolean crawl(String url)
    {
        try
        {
            Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
            Document htmlDocument = connection.get();
            this.htmlDocument = htmlDocument;
            if(connection.response().statusCode() != 200) // 200 is the HTTP OK status code
                                                          // indicating that everything is great.
            {
                System.out.println("**Failure** Received status code "
                        + connection.response().statusCode());
                return false;
            }
            System.out.println("\n**Visiting** Received web page at " + url);
            if(!connection.response().contentType().contains("text/html"))
            {
                System.out.println("**Failure** Retrieved something other than HTML");
                return false;
            }
            Elements linksOnPage = htmlDocument.select("a[href]");
            System.out.println("Found (" + linksOnPage.size() + ") links");
            for(Element link : linksOnPage)
            {
                this.links.add(link.absUrl("href"));
            }
            return true;
        }
        catch(IOException ioe)
        {
            // We were not successful in our HTTP request
            return false;
        }
    }


    /**
     * Performs a search on the body of the HTML document that is retrieved. This method should
     * only be called after a successful crawl.
     * 
     * @param searchWord
     *            - The word or string to look for
     * @return whether or not the word was found
     */
    public boolean searchForWord(String searchWord)
    {
        // Defensive coding. This method should only be used after a successful crawl.
        if(this.htmlDocument == null)
        {
            System.out.println("ERROR! Call crawl() before performing analysis on the document");
            return false;
        }
        System.out.println("Searching for the word " + searchWord + "...");
        String bodyText = this.htmlDocument.body().text();
        return bodyText.toLowerCase().contains(searchWord.toLowerCase());
    }


    public List<String> getLinks()
    {
        return this.links;
    }

}

The first part of the script is OK. It just counts the links on the page.
The second method (I think I have that term correct) starts at:

 public boolean searchForWord(String searchWord)

A plain word search isn’t really what I want, though. What I do want is to search for a price (or prices). I’m not sure if I need another method using:

Elements price = document.select(".zsg-photo-card-price:contains($)"); //Get price

or if I can somehow combine it with the first method. Or can I put both sets of Elements into variables or something?
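In case it helps frame the question: one idea I’ve been toying with is to pull the element text out with jsoup first, then pick the actual dollar amounts out of that text with a regex. Here’s a minimal, stdlib-only sketch of just the matching part (the `PriceFinder` class name and the sample text are made up for illustration; in the real script the input would come from something like `htmlDocument.select(".zsg-photo-card-price").text()`, and that CSS class is specific to Zillow’s markup, so it could change):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceFinder {

    // Matches a "$" followed by digits, with optional thousands separators,
    // e.g. "$1200" or "$450,000".
    private static final Pattern PRICE = Pattern.compile("\\$\\d+(,\\d{3})*");

    // Collects every price-looking token found in a chunk of text.
    public static List<String> findPrices(String text) {
        List<String> prices = new ArrayList<String>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            prices.add(m.group());
        }
        return prices;
    }

    public static void main(String[] args) {
        // Stand-in for text that jsoup would pull off the page.
        String sample = "3 bd, 2 ba $450,000 ... 4 bd, 3 ba $625,500";
        System.out.println(findPrices(sample)); // prints [$450,000, $625,500]
    }
}
```

If that approach makes sense, I’m guessing the method could live in SpiderLegs right next to searchForWord(), reading from this.htmlDocument instead of taking a parameter.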

Here are the other two classes that make up the package:

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class Spider {
  private static final int MAX_PAGES_TO_SEARCH = 20;
  private Set<String> pagesVisited = new HashSet<String>();
  private List<String> pagesToVisit = new LinkedList<String>();

  /**
   * Our main launching point for the Spider's functionality. Internally it creates spider legs
   * that make an HTTP request and parse the response (the web page).
   * 
   * @param url
   *            - The starting point of the spider
   * @param searchWord
   *            - The word or string that you are searching for
   */
  public void search(String url, String searchWord)
  {
      while(this.pagesVisited.size() < MAX_PAGES_TO_SEARCH)
      {
          String currentUrl;
          SpiderLegs leg = new SpiderLegs();
          if(this.pagesToVisit.isEmpty())
          {
              currentUrl = url;
              this.pagesVisited.add(url);
          }
          else
          {
              currentUrl = this.nextUrl();
          }
          boolean crawled = leg.crawl(currentUrl); // Lots of stuff happens here. Look at the
                                                   // crawl method in SpiderLegs
          boolean success = crawled && leg.searchForWord(searchWord);
          if(success)
          {
              System.out.println(String.format("**Success** Word %s found at %s", searchWord, currentUrl));
              break;
          }
          this.pagesToVisit.addAll(leg.getLinks());
      }
      System.out.println("\n**Done** Visited " + this.pagesVisited.size() + " web page(s)");
  }


  /**
   * Returns the next URL to visit (in the order that they were found). We also do a check to make
   * sure this method doesn't return a URL that has already been visited.
   * 
   * @return the next unvisited URL
   */
  private String nextUrl()
  {
      String nextUrl;
      do
      {
          nextUrl = this.pagesToVisit.remove(0);
      } while(this.pagesVisited.contains(nextUrl));
      this.pagesVisited.add(nextUrl);
      return nextUrl;
  }

}


public class SpiderTest {

    /**
     * This is our test. It creates a spider (which creates spider legs) and crawls the web.
     * 
     * @param args
     *            - not used
     */
    public static void main(String[] args)
    {
        Spider spider = new Spider();
        spider.search("http://www.zillow.com/denver-co/", "four bedroom");
    }

}
