
  1. #1
     SitePoint Member · Join Date: Feb 2006 · Posts: 7

    Apache LogFile parser, need help.

    I'm new to Ruby, or strictly speaking, new to programming. I'm currently working on a project that requires writing a parser for Apache web server log files in Ruby, to produce page-visit statistics, e.g. the most popular page (or top 10 pages) in a given day or week.

    As a beginner, I have read a few books and related docs again and again but still have no idea how to write the parser class. I'm posting the problem here hoping you experts can give me a hand.

    The daily log file contains entries, each following the standard format shown below. A hyphen "-" indicates the information is not available.

    clientIP identd userid time request statusCode objSize
    for example,
    124.61.45.136 - - [06/Oct/2005:17:03:08 +0100] "GET /interface/video-ipod.html HTTP/1.1" 200 5657
    128.61.45.136 - - [06/Oct/2005:17:03:10 +0100] "GET /php/adlog.htm" 200 43

    Each piece of information is separated by a space, each entry is on its own line, and the whole file consists of lines of entries in this format.

    To parse this file, I know some basic idea:

    1. to read through the *.log file, I use the code
    source = File.new("12-01-2005.log", "r")
    while (line = source.gets)
      # parse each line here
    end
    source.close

    2. for each line(or say, entry), some parsing expressions:
    for clientIP, eg. 124.61.45.136: the dots must be escaped, so /\d+\.\d+\.\d+\.\d+/
    for identd, it is always the hyphen "-", so simply /-/
    for userid, it can be arbitrarily many characters (or "-"), so /\S+/
    for time, eg. [06/Oct/2005:17:03:08 +0100], it starts with "[" and
    ends with "]", so /\[.+?\]/ (the brackets must be escaped)
    for the request piece, eg. "GET /interface/video-ipod.html HTTP/1.1", it is
    enclosed in double quotes, so /".+?"/
    the other two are simply digits: /[0-9]+/

    3. the result of the parser class could be an array for further use; that is, each parsed line becomes an "entry" object appended to the array.

    So, this is the first step I need to do. I have learnt a little, and this is what I designed. I suspect some of it is not correct and some of it needs improvement. Also, having no experience in Ruby, I can't put all of this together into a class. I'm hoping you experts could help me with a solution. Every little helps! Thanks very much!
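Putting the three steps together, here is a rough sketch of what I have in mind (untested; the class name, field names, and the combined regex are just my guesses):

```ruby
# Rough sketch: one class per log line, plus a helper that builds the array.
class Entry
  attr_reader :client_ip, :identd, :userid, :time, :request, :status_code, :obj_size

  # One regex for the whole line, with a capture group per field.
  LINE = /^(\S+) (\S+) (\S+) \[(.+?)\] "(.+?)" (\d{3}) (\d+|-)$/

  def initialize(line)
    m = LINE.match(line) or raise ArgumentError, "bad log line: #{line}"
    @client_ip, @identd, @userid, @time, @request, code, size = m.captures
    @status_code = code.to_i
    @obj_size = (size == "-" ? 0 : size.to_i)   # "-" means no size available
  end
end

# Step 3: read the whole file into an array of Entry objects.
def parse_log(path)
  entries = []
  File.foreach(path) { |line| entries << Entry.new(line) }
  entries
end
```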

  2. #2
     SitePoint Zealot · Join Date: Nov 2004 · Location: Yakima, WA · Posts: 100

    How about something like this?

    I have this log file parser lying around for experimentation. It's a combination of my code and a bit I lifted from somewhere else, though I don't remember where. This should get you where you're going: it prints out the top 20 IPs, URLs, referrers, and UA strings.

    You use it like this:

    ./ruby_log_parser.rb access.log

    And here's the code:

    Code:
    #!/usr/local/bin/ruby
    require 'date'
    
    class LogEntry
      attr_reader :host, :user, :auth, :date, :referrer, :ua, :rcode, :nbytes, :url
      @@epat = Regexp.new('^(\S+) (\S+) (\S+) \[(.+)\] "(.+)" (\d{3}) (\d+|-) "(.*?)" "(.*?)"$');
      @@rpat = Regexp.new('\A(\S+)\s+(\S+)\s+(\S+)\Z');
      def initialize(line)
        @host, @user, @auth, ds, request, code, bs, @referrer, @ua = @@epat.match(line).captures
        @date = DateTime.strptime(ds, "%d/%b/%Y:%H:%M:%S %z");
        @rcode = Integer(code)
        @nbytes = (bs == "-" ? 0 : Integer(bs))
      
        @method, @url, @proto = @@rpat.match(request).captures
      end
      def to_s()
        "LogEntry[host:" + host + ", date:" + date.to_s + ", referrer:" + referrer +
              ", url:" + url + ", ua:" + ua + ", user:" + user + ", auth:" + auth +
          ", rcode:" + rcode.to_s + ", nbytes:" + nbytes.to_s  + "]";
      end
    end
    
    if ARGV.length < 1
      puts "Usage: [ruby] webstat.rb <inpfile>"
      exit 1
    end
    inpfile = File.open(ARGV[0])
    t1 = Time.now
    nlines = 0
    start_date = end_date = nil
    le = nil
    hosts = Hash.new(0)
    urls = Hash.new(0)
    referrers = Hash.new(0)
    uastrings = Hash.new(0)
    st = Time.now
    while line = inpfile.gets
      begin
        le = LogEntry.new(line)
        start_date = le.date if !start_date
        hosts[le.host] += 1;
        urls[le.url] += 1;
        referrers[le.referrer] += 1;
        uastrings[le.ua] += 1;
      rescue
        print "Log entry parse failed at line: ", (nlines + 1), ", error: ", $!, "\n"
        print "LINE: ", line, "\n"
      end
      nlines += 1
      if nlines % 4096 == 0
        et = Time.now
        puts "processed " + nlines.to_s + " lines ... (" + (et - st).to_s + " seconds)"
        st = et
      end
    end
    end_date = le.date if le   # le is nil if no line parsed successfully
    t2 = Time.now
    
    printf("start_date:%s, end_date:%s\n", start_date.to_s, end_date.to_s);
    printf("lines:%d, hosts:%d, urls:%d, referrers:%d, uastrings:%d\n", 
      nlines, hosts.length, urls.length, referrers.length, uastrings.length);
    print "Processing time : ", (t2 - t1).to_s, " seconds\n"
    
    
    # Do the sorting and display of top 20
    def print_top20(label, h)
      arr = h.sort { |a,b| b[1] <=> a[1] }
      print "------------ " + label + " -------------\n"
      for i in 0...20
        printf("%2d. %s (%d)\n", i + 1, arr[i][0], arr[i][1]) rescue nil
      end
      puts
    end
    
    t1 = Time.now
    print_top20("Top 20 Hosts", hosts)
    print_top20("Top 20 URLs", urls)
    print_top20("Top 20 Referrers", referrers)
    print_top20("Top 20 UA Strings", uastrings)
    t2 = Time.now
    print "Sort and Display time: ", (t2 - t1).to_s, " seconds\n"

  3. #3
     SitePoint Member · Join Date: Feb 2006 · Posts: 7
    Thank you very much for your code. It really helps me a lot! This program works just perfectly.

    Going one step further, I'm wondering if anyone could tell me how to store the sorted results (not only the top 20, but all of them) in a MySQL database first, keeping only the URLs and their visit counts, and then generate such Top 20 lists from the database.

    Looking forward to anyone's reply!
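For the database side, the table layout and the Top 20 query might look something like this in MySQL (the table and column names here are only a suggestion, not tested against your setup):

```sql
-- One row per URL, with its accumulated visit count.
CREATE TABLE page_hits (
  url  VARCHAR(255) NOT NULL PRIMARY KEY,
  hits INT NOT NULL DEFAULT 0
);

-- For each parsed request, insert the URL or bump its count:
INSERT INTO page_hits (url, hits) VALUES ('/interface/video-ipod.html', 1)
  ON DUPLICATE KEY UPDATE hits = hits + 1;

-- Then the Top 20 comes straight from the database:
SELECT url, hits FROM page_hits ORDER BY hits DESC LIMIT 20;
```

The Ruby script above would run the INSERT once per log line (via a MySQL client library) instead of counting in the `urls` hash, and the SELECT replaces the in-memory sort.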

