  1. #1
    SitePoint Member
    Join Date: Feb 2006

    Apache LogFile parser, need help.

    I'm new to Ruby, or strictly speaking, new to programming altogether. I'm currently working on a project that requires writing a parser for Apache web server log files in Ruby, in order to produce page-visit statistics, e.g. the most popular (top 10) pages in a given day or week.

    As a newcomer, I have read a few books and related docs again and again but still have no idea how to write the parser class. I'm posting the problem here in the hope that you experts can give me a hand.

    The daily log file contains entries, each following the standard format shown below. A hyphen ("-") indicates the information is not available.

    clientIP identd userid time request statusCode objSize
    for example:
    - - [06/Oct/2005:17:03:08 +0100] "GET /interface/video-ipod.html HTTP/1.1" 200 5657
    - - [06/Oct/2005:17:03:10 +0100] "GET /php/adlog.htm" 200 43

    Each piece of information is separated by a space, each entry is on its own line, and the whole file consists of lines of entries in this format.

    To parse this file, I have a basic idea:

    1. To read through the *.log file, I use code like:
    source ="12-01-2005.log", "r")
    while (line = source.gets)

    2. For each line (i.e. each entry), some parsing expressions:
    for clientIP, e.g., something like /[0-9]+(\.[0-9]+){3}/
    for identd, which in these logs is always a hyphen "-", simply /-/
    for userid, which is arbitrarily many characters (or "-" when not available), /[a-zA-Z0-9-]+/
    for time, e.g. [06/Oct/2005:17:03:08 +0100], which starts with "[" and ends with "]", something like /\[[^\]]+\]/
    for the request piece, e.g. "GET /interface/video-ipod.html HTTP/1.1", which is wrapped in double quotes, /"[^"]+"/
    the other two fields are simply digits, /[0-9]+/ (though objSize can also be "-")

    3. The result of the parser class could probably be an array for further use; that is, we push each parsed entry into an array of "entry" objects.

    So, this is the first step I need to take; I have learnt a little and this is what I have designed so far. I suspect some of it is not correct and other parts could be improved. Also, as I have no experience in Ruby, I'm not sure how to put all of this together into a proper class (my rough attempt is sketched below). I'm hoping you experts can help me with the solution. Every little helps, thanks very much!
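
    Here is roughly what I have so far; it's untested, and the class and field names are only my own guesses:

    require 'date'

    # my rough idea of an "entry" object holding one parsed log line
    class Entry
      attr_reader :client_ip, :identd, :userid, :time, :request, :status_code, :obj_size

      # one pattern for the whole line, built from the pieces in step 2 above
      LINE = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]+)" ([0-9]+) ([0-9]+|-)$/

      def initialize(line)
        @client_ip, @identd, @userid, time_str, @request, code, size = LINE.match(line).captures
        @time = DateTime.strptime(time_str, "%d/%b/%Y:%H:%M:%S %z")
        @status_code = code.to_i
        @obj_size = (size == "-" ? 0 : size.to_i)
      end
    end

    # steps 1 and 3: read the file line by line and collect Entry objects in an array
    entries = []
    source ="12-01-2005.log", "r")
    while (line = source.gets)
      entries <<
    end
    source.close
    puts "parsed #{entries.length} entries"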

  2. #2
    SitePoint Zealot
    Join Date: Nov 2004
    Location: Yakima, WA

    How about something like this

    I have this log file parser lying around for experimentation. It's a combination of my code and a bit I lifted from somewhere else, though I don't remember where. This should get you where you're going. It prints out the top 20 IPs, URLs, referrers and UA strings.

    You use it like this:

    ./ruby_log_parser.rb access.log

    And here's the code:

    require 'date'

    class LogEntry
      attr_reader :host, :user, :auth, :date, :referrer, :ua, :rcode, :nbytes, :url

      # one regexp for the whole combined-format line, one for the request part
      @@epat ='^(\S+) (\S+) (\S+) \[(.+)\] "(.+)" (\d{3}) (\d+|-) "(.*?)" "(.*?)"$')
      @@rpat ='\A(\S+)\s+(\S+)\s+(\S+)\Z')

      def initialize(line)
        @host, @user, @auth, ds, request, code, bs, @referrer, @ua = @@epat.match(line).captures
        @date = DateTime.strptime(ds, "%d/%b/%Y:%H:%M:%S %z")
        @rcode = Integer(code)
        @nbytes = (bs == "-" ? 0 : Integer(bs))
        @method, @url, @proto = @@rpat.match(request).captures
      end

      def to_s()
        "LogEntry[host:" + host + ", date:" + date.to_s + ", referrer:" + referrer +
          ", url:" + url + ", ua:" + ua + ", user:" + user + ", auth:" + auth +
          ", rcode:" + rcode.to_s + ", nbytes:" + nbytes.to_s + "]"
      end
    end

    if ARGV.length < 1
      puts "Usage:: [ruby] webstat.rb <inpfile>"
      exit 1
    end

    inpfile =[0])
    t1 =
    nlines = 0
    start_date = end_date = nil
    le = nil
    # tally hashes default to 0 so we can just increment
    hosts =
    urls =
    referrers =
    uastrings =
    st =

    while line = inpfile.gets
      begin
        le =
        start_date = if !start_date
        hosts[] += 1
        urls[le.url] += 1
        referrers[le.referrer] += 1
        uastrings[] += 1
      rescue
        print "Log entry parse failed at line: ", (nlines + 1), ", error: ", $!, "\n"
        print "LINE: ", line, "\n"
      end
      nlines += 1
      if nlines % 4096 == 0
        et =
        puts "processed " + nlines.to_s + " lines ... (" + (et - st).to_s + " seconds)"
        st = et
      end
    end

    end_date = if le
    t2 =

    printf("start_date:%s, end_date:%s\n", start_date.to_s, end_date.to_s)
    printf("lines:%d, hosts:%d, urls:%d, referrers:%d, uastrings:%d\n",
      nlines, hosts.length, urls.length, referrers.length, uastrings.length)
    print "Processing time : ", (t2 - t1).to_s, " seconds\n"

    # Do the sorting and display of top 20
    def print_top20(label, h)
      arr = h.sort { |a, b| b[1] <=> a[1] }
      print "------------ " + label + " -------------\n"
      for i in 0...20
        printf("%2d. %s (%d)\n", i + 1, arr[i][0], arr[i][1]) rescue nil
      end
    end

    t1 =
    print_top20("Top 20 Hosts", hosts)
    print_top20("Top 20 URLs", urls)
    print_top20("Top 20 Referrers", referrers)
    print_top20("Top 20 UA Strings", uastrings)
    t2 =
    print "Sort and Display time: ", (t2 - t1).to_s, " seconds\n"

  3. #3
    SitePoint Member
    Join Date: Feb 2006
    Thank you very much for your code. It really helps me a lot! The program works perfectly.

    Going one step further, I'm wondering if anyone could tell me how to record the sorted results (not only the top 20, but all of them) into a database (MySQL) first, storing only the URLs and their visit counts, and then generate the Top 20 lists from the database.
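
    What I have in mind is something like the sketch below; the page_hits table, the connection details and even the library calls are just my guesses, and I haven't got it working:

    require 'mysql'

    # rough guess using the classic Ruby/MySQL API; the table would be something like
    #   CREATE TABLE page_hits (url VARCHAR(255) PRIMARY KEY, hits INT)
    # 'urls' is the tally hash built by the parser above (url => visit count)
    dbh = Mysql.real_connect("localhost", "dbuser", "dbpass", "weblog")

    urls.sort { |a, b| b[1] <=> a[1] }.each do |url, hits|
      dbh.query("INSERT INTO page_hits (url, hits) VALUES ('#{dbh.escape_string(url)}', #{hits.to_i})")
    end

    # later, the Top 20 would come straight from the database
    res = dbh.query("SELECT url, hits FROM page_hits ORDER BY hits DESC LIMIT 20")
    res.each { |url, hits| puts "#{url} (#{hits})" }
    dbh.close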

    Looking forward to anyone's reply!

