Results 1 to 8 of 8
  1. #1
    SitePoint Enthusiast
    Join Date
    May 2006
    Posts
    48
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Loading large CSV Files - Good Design Principles ?

    Hello,


    I need to write a function that processes a large csv file (1,000,000 lines +) and I was wondering if there are any good design principles (and hopefully examples) of loading large CSV files without consuming vast amounts of CPU and memory.
    The design principles can of course be above the implementation language. I just posted here because I am a php developer.

    Cheers

    Marc

  2. #2
Follow Me On Twitter: @djg Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadelphia, PA
    Posts
    20,578
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Just read a few lines at a time. You can sleep() if you need to add some artificial delay between processing some number of lines.
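    Dan's line-at-a-time approach could be sketched in PHP like this (a minimal sketch; the file path, the per-row callback, and the pause interval are illustrative assumptions, not part of the thread):

    ```php
    <?php
    // Minimal sketch: stream a CSV one row at a time so memory use stays
    // roughly constant regardless of file size. The callback and the
    // pause interval are placeholders to be tuned by testing.
    function process_csv(string $path, callable $onRow, int $pauseEvery = 10000): int
    {
        $fh = fopen($path, 'r');
        if ($fh === false) {
            throw new RuntimeException("Cannot open $path");
        }
        $count = 0;
        while (($row = fgetcsv($fh)) !== false) {
            $onRow($row);          // handle one record
            $count++;
            if ($count % $pauseEvery === 0) {
                usleep(100000);    // brief 0.1 s pause every N rows to ease load
            }
        }
        fclose($fh);
        return $count;
    }

    // Demo with a small temporary file.
    $tmp = tempnam(sys_get_temp_dir(), 'csv');
    file_put_contents($tmp, "a,1\nb,2\nc,3\n");
    $n = process_csv($tmp, function (array $row) {
        // e.g. insert $row into a database here
    });
    echo "$n rows\n";
    unlink($tmp);
    ```

    Because `fgetcsv()` reads one line per call, only a single row is ever held in memory, which is the point of Dan's suggestion.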

  3. #3
    SitePoint Enthusiast
    Join Date
    May 2006
    Posts
    48
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Dan Grossman View Post
    Just read a few lines at a time. You can sleep() if you need to add some artificial delay between processing some number of lines.
    Thank you Dan,

I was also thinking about having the code get the number of lines (do a count) and then, by some sort of math, process the contents in chunks, sleeping maybe a few seconds between chunks so the server can handle other requests, since the server this code runs on will also be serving other HTTP requests.

I am not good at math, so if anyone knows the best way to work out the number of chunks (records to process between sleeps) based on the total rows of a CSV file, that would be great!

    Cheers

    marc
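    The arithmetic Marc is asking about is essentially one division; a sketch (the row count, chunk size, and sleep duration below are placeholder assumptions, to be tuned by measuring actual load):

    ```php
    <?php
    // Sketch of the chunking idea: pick a fixed chunk size, derive the number
    // of chunks from the total row count, and sleep between chunks. All the
    // numbers here are illustrative, not tuned values.
    $totalRows = 1000000;   // e.g. obtained by counting lines in the file
    $chunkSize = 10000;     // rows per burst; tune by watching server load
    $sleepSecs = 2;         // pause between bursts

    $chunks     = (int) ceil($totalRows / $chunkSize);
    $totalSleep = ($chunks - 1) * $sleepSecs;  // no sleep after the last chunk

    echo "$chunks chunks of up to $chunkSize rows, {$totalSleep}s total sleeping\n";
    ```

    As Dan points out in the next post, deriving the chunk size from the file size is the wrong way around: the chunk size should come from testing, and only the chunk *count* from the division.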

  4. #4
Follow Me On Twitter: @djg Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadelphia, PA
    Posts
    20,578
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Why would the size of the file determine the 'chunk' size? Wouldn't it be determined through some testing by how many lines you can process without load climbing unacceptably high?

    Some division of the file size would be rather arbitrary -- way too much time spent sleeping on small files, not enough on large.

That assumes there's any reason to break this up at all. Do you have reason to believe you can't just process the file as fast as your script can handle it? A million lines isn't all that large... if you're just doing basic text manipulation, or inserting the data straight into a database, it'd probably take a few minutes at most.
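    Dan's "do a test first" could look like this: generate a synthetic CSV and time a pass over it to see whether throttling is needed at all (the row count here is an arbitrary assumption):

    ```php
    <?php
    // Sketch: measure raw read throughput over a synthetic CSV to estimate
    // whether the real workload needs any throttling. 50,000 rows is an
    // arbitrary test size.
    $tmp = tempnam(sys_get_temp_dir(), 'csv');
    $fh  = fopen($tmp, 'w');
    for ($i = 0; $i < 50000; $i++) {
        fputcsv($fh, [$i, "value$i"]);
    }
    fclose($fh);

    $start = microtime(true);
    $rows  = 0;
    $fh = fopen($tmp, 'r');
    while (fgetcsv($fh) !== false) {
        $rows++;   // the real per-row work would go here
    }
    fclose($fh);
    unlink($tmp);

    $elapsed = microtime(true) - $start;
    printf("%d rows in %.2fs (%.0f rows/s)\n", $rows, $elapsed, $rows / $elapsed);
    ```

    Extrapolating the measured rows/s to a million-line file gives a rough idea of total runtime before deciding whether sleeps are worth adding.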

  5. #5
    SitePoint Enthusiast
    Join Date
    May 2006
    Posts
    48
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hello Dan,

My thought is that if I build some robust code now, I won't have to worry about the future, when there could be hundreds of CSV files to process with millions of records, and the code will be executing on the same server that also processes other HTTP requests.

The code will monitor a directory for files and, if any are found, will process each file in sequence. I am not sure, but if the PHP code cannot monitor the directory all the time, it will need to sleep a few seconds, wake up, and check whether any CSV files are in the directory; if not, sleep again, and so on. Maybe there is a way in PHP to register a handler with Linux so a file landing in the directory triggers processing, avoiding the sleep/wake up/sleep/wake up cycle.

So my logic is that when the code wakes up again there could be 20 CSV files to process, so without some control over the processing the script could, in theory, run the Linux box at 100% until all files are completed, and any new HTTP requests coming in would see slower response times.

    Cheers

    Marc

  6. #6
Follow Me On Twitter: @djg Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadelphia, PA
    Posts
    20,578
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by marc_online View Post
So my logic is that when the code wakes up again there could be 20 CSV files to process, so without some control over the processing the script could, in theory, run the Linux box at 100% until all files are completed, and any new HTTP requests coming in would see slower response times.
I still think it doesn't make sense to arbitrarily choose when to pause by dividing up the file size. Do a test first and find out how intensive this process actually is.

As for the whole sleep/wake/check thing... just add a crontab entry to run the script every X minutes.
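    Dan's cron suggestion might be sketched like this (the crontab schedule and directory paths are hypothetical); an exclusive lock keeps overlapping cron runs from double-processing files:

    ```php
    <?php
    // Sketch of a cron-driven worker, e.g. run via:
    //   */5 * * * * php worker.php
    // It scans a drop directory for CSVs under an exclusive, non-blocking
    // lock so a still-running previous invocation is left alone.
    // All paths here are illustrative.
    function run_worker(string $dir, string $lockPath): array
    {
        $lock = fopen($lockPath, 'c');
        if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
            return [];  // a previous run is still working; skip this one
        }
        $done = [];
        foreach (glob("$dir/*.csv") as $file) {
            // ... process $file line by line here ...
            $done[] = basename($file);
            unlink($file);  // or move it to a "done" directory instead
        }
        flock($lock, LOCK_UN);
        fclose($lock);
        return $done;
    }

    // Demo against a temporary directory.
    $dir = sys_get_temp_dir() . '/csv_demo_' . getmypid();
    mkdir($dir);
    file_put_contents("$dir/a.csv", "x,1\n");
    file_put_contents("$dir/b.csv", "y,2\n");
    $processed = run_worker($dir, "$dir/worker.lock");
    sort($processed);
    echo implode(',', $processed) . "\n";
    ```

    The non-blocking `flock()` means that if one cron invocation is still churning through a backlog, the next one simply exits instead of piling on, which addresses Marc's worry about 20 files arriving at once.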

  7. #7
    SitePoint Enthusiast
    Join Date
    May 2006
    Posts
    48
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hello Dan,

I will, and will also report my findings back here. It would be great if you could register some sort of handler in Linux from PHP, so the handler only gets called when a file is added to the directory, instead of running a cron every X seconds.

    Cheers

    Marc

  8. #8
Follow Me On Twitter: @djg Dan Grossman's Avatar
    Join Date
    Aug 2000
    Location
    Philadelphia, PA
    Posts
    20,578
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by marc_online View Post
    Hello Dan,

I will, and will also report my findings back here. It would be great if you could register some sort of handler in Linux from PHP, so the handler only gets called when a file is added to the directory, instead of running a cron every X seconds.

    Cheers

    Marc
You can always write one, but all it'd be is a program that checks very frequently. Triggers are a figment of us programmers' imaginations -- computers run on clock ticks.
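    The "program that checks very frequently" Dan describes is just a polling loop comparing directory snapshots; a minimal sketch (the directory path and polling interval are illustrative):

    ```php
    <?php
    // Sketch of a polling watcher: each call takes a snapshot of the
    // directory's CSVs and reports anything not seen on the previous pass.
    function watch_once(string $dir, array $seen): array
    {
        $now = glob("$dir/*.csv");
        foreach (array_diff($now, $seen) as $file) {
            // ... hand the newly arrived $file to the CSV processor ...
        }
        return $now;  // becomes $seen for the next pass
    }

    // In a long-running script this would loop forever, e.g.:
    //   $seen = [];
    //   while (true) { $seen = watch_once('/var/spool/incoming', $seen); sleep(2); }

    // Demo with a temporary directory.
    $dir = sys_get_temp_dir() . '/watch_' . getmypid();
    mkdir($dir);
    file_put_contents("$dir/a.csv", "x,1\n");
    $seen = watch_once($dir, []);          // first pass sees a.csv
    file_put_contents("$dir/b.csv", "y,2\n");
    $seen = watch_once($dir, $seen);       // second pass also sees b.csv
    echo count($seen) . "\n";
    ```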

