  1. #1
    SitePoint Member
    Join Date
    Feb 2006
    Posts
    7

    MatchData#captures error when dealing with nilclass

    Code:
    #!/usr/local/bin/ruby
    # Use it like this:
    #
    #   ruby LogParser.rb logFile.log
    
    require 'date'
    
    class LogEntry
      attr_reader :host, :user, :auth, :date, :referrer, :ua, :rcode, :nbytes, :url
      @@epat = Regexp.new('^(\S+) (\S+) (\S+) \[(.+)\] "(.+)" (\d{3}) (\d+|-) "(.*?)" "(.*?)"$');
      @@rpat = Regexp.new('\A(\S+)\s+(\S+)\s+(\S+)\Z');
      def initialize(line)
        @host, @user, @auth, ds, request, code, bs, @referrer, @ua = @@epat.match(line).captures
        @date = DateTime.strptime(ds, "%d/%b/%Y:%H:%M:%S %z");
        @rcode = Integer(code)
        @nbytes = (bs == "-" ? 0 : Integer(bs))
      
        @method, @url, @proto = @@rpat.match(request).captures
      end
      def to_s()
        "LogEntry[host:" + host + ", date:" + date.to_s + ", referrer:" + referrer +
              ", url:" + url + ", ua:" + ua + ", user:" + user + ", auth:" + auth +
          ", rcode:" + rcode.to_s + ", nbytes:" + nbytes.to_s  + "]";
      end
    end
    
    # Stop here when no input file is given; File.open(nil) would fail otherwise.
    abort "Usage: [ruby] LogParser.rb <inpfile>" if ARGV.length < 1
    inpfile = File.open(ARGV[0])
    t1 = Time.now
    nlines = 0
    start_date = end_date = nil
    le = nil
    hosts = Hash.new(0)
    urls = Hash.new(0)
    referrers = Hash.new(0)
    uastrings = Hash.new(0)
    st = Time.now
    while line = inpfile.gets
      begin
        le = LogEntry.new(line)
        start_date = le.date if !start_date
        hosts[le.host] += 1;
        urls[le.url] += 1;
        referrers[le.referrer] += 1;
        uastrings[le.ua] += 1;
      rescue
        print "Log entry parse failed at line: ", (nlines + 1), ", error: ", $!, "\n"
        print "LINE: ", line, "\n"
      end
      nlines += 1
      if nlines % 4096 == 0
        et = Time.now
        puts "processed " + nlines.to_s + " lines ... (" + (et - st).to_s + " seconds)"
        st = et
      end
    end
    end_date = le.date if le  # le stays nil when no line parsed successfully
    t2 = Time.now
    
    printf("start_date:%s, end_date:%s\n", start_date.to_s, end_date.to_s);
    printf("lines:%d, hosts:%d, urls:%d, referrers:%d, uastrings:%d\n", 
      nlines, hosts.length, urls.length, referrers.length, uastrings.length);
    print "Processing time : ", (t2 - t1).to_s, " seconds\n"
    
    
    # Do the sorting and display of top 20
    def print_top20(label, h)
      arr = h.sort { |a,b| b[1] <=> a[1] }
      print "------------ " + label + " -------------\n"
      # first(20) copes with fewer than 20 entries; ranks are printed from 1
      arr.first(20).each_with_index do |(key, count), i|
        printf("%2d. %s (%d)\n", i + 1, key, count)
      end
      puts
    end
    
    t1 = Time.now
    print_top20("Top 20 Hosts", hosts)
    print_top20("Top 20 URLs", urls)
    print_top20("Top 20 Referrers", referrers)
    print_top20("Top 20 UA Strings", uastrings)
    t2 = Time.now
    print "Sort and Display time: ", (t2 - t1).to_s, " seconds\n"
    This code parses an Apache log file. When I run it on a small file, with thousands of log entries, it works properly. However, when I try to parse a real file, about 10 MB with tens of thousands of lines, it no longer works and prints an error message for each line, for example:
    .......
    Log entry parse failed at line: 907, error: undefined method `captures' for nil:NilClass
    LINE: 66.249.72.138 - - [01/Feb/2006:08:18:49 +0000] "GET /hardware HTTP/1.1" 301 328 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
    Log entry parse failed at line: 908, error: undefined method `captures' for nil:NilClass
    LINE: 66.249.72.138 - - [01/Feb/2006:08:18:49 +0000] "GET /https/words/ HTTP/1.1" 301 328 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
    .......

    This error message shows up for every line from the very beginning (i.e. line 1). But when I break the log file up into smaller files, each containing a few thousand lines, it works again. I'm not sure why...

  2. #2
    SitePoint Evangelist
    Join Date
    Jun 2004
    Location
    California
    Posts
    440
    I'm not sure what the memory limitations of Ruby are or if that may be the problem... but if it works for the smaller files maybe you should tell Ruby to split the log file into files of x size each and then parse each of those. Then, at the end of the script, you can delete those smaller files.
    Happy switcher to OS X running on a MacBook Pro.

    Zend Certified Engineer

  3. #3
    SitePoint Member
    Join Date
    Feb 2006
    Posts
    7
    Yeah, I was thinking of this approach as well. However, I'm not sure how to split this text-like ".log" file in Ruby... More comments would be appreciated.
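
    For reference, a minimal sketch of one way to split a plain-text log into smaller files in Ruby. The helper name split_log, the ".partNNN" naming scheme and the default chunk size are all assumptions made for illustration; none of them come from the original script. Note that the script above already reads the log one line at a time with gets, so splitting may not be needed for memory reasons alone.

```ruby
# Hypothetical helper, not part of the original script.
# Copies every `lines_per_chunk` lines of `path` into a numbered part file
# (path.part001, path.part002, ...) and returns the list of part file names.
def split_log(path, lines_per_chunk = 5000)
  parts = []
  out = nil
  File.foreach(path).with_index do |line, i|
    if i % lines_per_chunk == 0
      # Start a new part file at every chunk boundary.
      out.close if out
      parts << format("%s.part%03d", path, parts.length + 1)
      out = File.open(parts.last, "w")
    end
    out.write(line)
  end
  parts
ensure
  # Close the last part file even if reading or writing failed.
  out.close if out && !out.closed?
end
```

    Each returned part could then be passed to the parser in turn and removed afterwards with File.delete.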

  4. #4
    SitePoint Zealot
    Join Date
    Jul 2004
    Location
    Oklahoma
    Posts
    119
    Instead of using class vars (@@) to hold your regexps try using global constants. I'm not sure WHY this would fix it, but it definitely seems a bit odd that it dies where it does.

    [edit] Also, I'd break that statement into multiple lines. You might just be running into a case where a given line doesn't match your regex, so match returns nil and calling captures on it crashes. Make sure that the regex works properly on a subset of the larger file.
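
    A minimal sketch of that check: keep the pattern in a constant, test the result of match before calling captures, and print the failing line with inspect so that invisible characters become visible. The helper name parse_fields and the sample line are made up for illustration. The "\r\n" case is only one possible reason for a file-wide mismatch (for example, a log saved with Windows line endings); the thread does not confirm that this is what happened here.

```ruby
# Same pattern as @@epat in the original post, held in a constant.
ENTRY_PATTERN = Regexp.new('^(\S+) (\S+) (\S+) \[(.+)\] "(.+)" (\d{3}) (\d+|-) "(.*?)" "(.*?)"$')

# Hypothetical helper: returns the captured fields,
# or nil when the line does not match the pattern.
def parse_fields(line)
  m = ENTRY_PATTERN.match(line)
  m && m.captures
end

# Made-up sample entry in the same layout the pattern expects.
sample = '1.2.3.4 - - [01/Feb/2006:08:18:49 +0000] "GET / HTTP/1.1" 200 328 "-" "Mozilla/5.0"'

# Try the same entry with Unix and with Windows line endings.
[sample + "\n", sample + "\r\n"].each do |line|
  fields = parse_fields(line)
  if fields
    puts "matched, ua=#{fields.last.inspect}"
  else
    # inspect shows characters such as \r that a plain print hides
    puts "no match: #{line.inspect}"
    puts "after chomp: #{parse_fields(line.chomp) ? 'matched' : 'still no match'}"
  end
end
```

    In Ruby, $ matches only at the end of the string or just before "\n", so a carriage return after the closing quote prevents the match; chomp removes a trailing "\r\n" as well as a bare "\n".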

  5. #5
    SitePoint Member
    Join Date
    Feb 2006
    Posts
    7
    Well, I tried changing @@ to $ to make them global variables. However, it doesn't help; I get the same error as before.

