  1. #1
    SitePoint Addict Skookum
    Join Date
    Sep 2006
    Location
    Idaho
    Posts
    375

    Python 'rb'.read() Memory Error

    I am currently playing with a hashing program to go through our file server and dump all the file hashes into a database so that I can compare them later on and remove duplicates.

    But I have found a handful of files that I am unable to pull a hash from, or rather, files where the 'rb' .read() call fails.

    Here is my code
    Code:
    import os, hashlib, time, pyodbc, gc
    
    gc.enable()
    
    StartTime = time.localtime()
    
    cnxn = pyodbc.connect("DSN=Work2", autocommit=True)
    cursor = cnxn.cursor()
    
    for root, dirs, files in os.walk('K:\\'):
    	for name in files:
    		try:
    			FileInfo = os.stat(os.path.join(root, name))
    			FilePath = open(os.path.join(root, name), 'r')
    			FileHash = open(os.path.join(root, name), 'rb').read()
    			FileCorrectPath = os.path.join(root, name).replace('"', '`').replace("'", "`")
    			cursor.execute("INSERT INTO NetworkFileInfo (FileName, Hash, CreationDate, ModifiedDate, DateStamp) VALUES('" + FileCorrectPath + "', '" + hashlib.md5(FileHash).hexdigest() + "', '" + time.strftime('%Y-%m-%d', time.gmtime(FileInfo.st_ctime)) + "', '" + time.strftime('%Y-%m-%d', time.gmtime(FileInfo.st_mtime)) + "', '" + time.strftime('%Y-%m-%d') + "')")
    		except:
    			EndTime = time.localtime()
    			print time.mktime(EndTime) - time.mktime(StartTime)
    			cnxn.close()
    			print os.path.join(root, name)
    			raise
    		try:
    			del(FileInfo)
    			del(FilePath)
    			del(FileHash)
    			del(FileCorrectPath)
    			gc.collect()
    		except:
    			raise
    I am just starting again at Python, so be gentle.
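As a side note, the quote-swapping in FileCorrectPath would not be needed with parameterized queries; pyodbc's cursor.execute accepts ? placeholders in the same style as sqlite3. A minimal sketch (sqlite3 is used here only so it runs without a DSN; the table name is borrowed from the post, the path is made up):

```python
import sqlite3

# In-memory stand-in for the real DSN; pyodbc uses the same ? placeholders.
cnxn = sqlite3.connect(":memory:")
cursor = cnxn.cursor()
cursor.execute("CREATE TABLE NetworkFileInfo (FileName TEXT, Hash TEXT)")

# A path containing both quote characters, stored without any replace() tricks.
path = "K:\\docs\\Bob's \"final\" report.doc"
cursor.execute("INSERT INTO NetworkFileInfo (FileName, Hash) VALUES (?, ?)",
               (path, "d41d8cd98f00b204e9800998ecf8427e"))

stored = cursor.execute("SELECT FileName FROM NetworkFileInfo").fetchone()[0]
print(stored == path)  # True: the original quotes come back intact
```

With placeholders the driver handles the quoting, so the path stored in the database matches the path on disk exactly.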

    My problem comes from the
    FileHash = open(os.path.join(root, name), 'rb').read()
    On certain files when I run that I receive an error
    K:\Altium Support Files\Altium Designer 6 Updates\Build 6.8.1.11735\AltiumDesigner6Update(9346to11735).exe
    Traceback (most recent call last):
    File "C:\Python\Hash.py", line 16, in <module>
    FileHash = open(os.path.join(root, name), 'rb').read()
    MemoryError
    This particular file is about 500 MB in size, so I can see why it is dying, but so far I have not learned a way to gracefully handle these errors and keep on processing, nor have I learned how to fix them.

    I must admit, I don't really understand garbage collection or how to properly clear out all of my variables in Python, so that might be part of the problem.

    Is there any other way to pull file hashes without reading the whole file into memory? Or is there a better way to handle exceptions?
    Links, books, criticisms, and help are all welcome.
    Paranoia is no longer a mental illness it is a way of life - Me

  2. #2
    SitePoint Evangelist
    Join Date
    Aug 2007
    Posts
    566
    I'm no pro either, but something seems strange in your code:
    Code:
    FilePath = open(os.path.join(root, name), 'r')
    FileHash = open(os.path.join(root, name), 'rb').read()
    Here you open the same file twice: once in text mode ('r') and a second time in binary mode ('rb'). That seems a bit strange to me. Open your file once, and use it both times.

    If you want to get the file complete path, use
    Code:
    os.path.abspath(os.path.join(root, name))
    Second, why don't you use the hashlib.* functions?
    Code:
    import hashlib, sys
    f = open(r'c:\ggfil370.dat', 'rb')  # 'rb' so the exact bytes of the file are hashed
    hash=hashlib.sha224(f.read()).hexdigest()
    print>>sys.stdout, hash
    >>'047c3ecbf458bae728ef6c4314fe2bd6c3846224f9ad150ed71c9a08'
    It took less than 2 seconds to get the hash of a 490 MB file on my "poor" P4 3 GHz.
    But I can't explain why your script failed; I'm not fluent enough in Python for that. I only started using it 3 months ago and haven't looked in depth at the GC.

  3. #3
    SitePoint Addict Skookum
    Thanks for the reply.

    Sorry I copied and pasted the code before I changed a couple of things. I have already removed the FilePath open and am just using the FileHash open.
    I can't really remember what I was using the FilePath info for now though.

    After reading up on abspath, I don't really understand how it differs from the os.path.join(root, name) I am currently using. I mean, I could set it to a variable and drop the repeated os.path.join(root, name) calls, which would be cleaner code, but I haven't gone through and cleaned up my code that much yet.
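For what it's worth, the difference only shows up when the joined path is relative: os.path.join just concatenates the pieces, while os.path.abspath also resolves the result against the current working directory. A quick sketch:

```python
import os

rel = os.path.join("subdir", "file.txt")   # still a relative path
full = os.path.abspath(rel)                # anchored to the current directory

print(os.path.isabs(rel))    # False
print(os.path.isabs(full))   # True
# With an absolute walk root like 'K:\\', join already yields an
# absolute path, so abspath would change nothing here.
```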

    I just did a quick test with your hash code, and it appears to have worked.
    I will play with it a little more, and post back my findings.

    Thanks for the help!

  4. #4
    SitePoint Addict Skookum
    Okay, I have changed the script around a little more to handle some other errors I was getting.
    For example, I actually found a file with a negative date on it, so I had to wrap the date values in abs().

    At some point I will learn threading so that it can process quicker.

    But here is my code so far:
    Code:
    import os, hashlib, time, pyodbc, gc
    
    gc.enable()
    
    StartTime = time.localtime()
    
    cnxn = pyodbc.connect("DSN=Work2", autocommit=True)
    cursor = cnxn.cursor()
    
    for root, dirs, files in os.walk('K:\\'):
    	for name in files:
    		try:
    			FileCorrectPath = os.path.join(root, name).replace('"', '`').replace("'", "`")
    			if len(FileCorrectPath) < 255:
    				FileInfo = os.stat(os.path.join(root, name))
    				FileHashOpen = open(os.path.join(root, name))
    				FileHash = hashlib.md5(FileHashOpen.read()).hexdigest()
    				cursor.execute("INSERT INTO NetworkFileInfo (FileName, Hash, CreationDate, ModifiedDate, DateStamp) VALUES('" + FileCorrectPath + "', '" + FileHash + "', '" + time.strftime('%Y-%m-%d', time.gmtime(abs(FileInfo.st_ctime))) + "', '" + time.strftime('%Y-%m-%d', time.gmtime(abs(FileInfo.st_mtime))) + "', '" + time.strftime('%Y-%m-%d') + "')")
    				FileHashOpen.close()
    			else:
    				FileInfo = "None"
    				FileHash = "None"
    		except:
    			EndTime = time.localtime()
    			print time.mktime(EndTime) - time.mktime(StartTime)
    			cnxn.close()
    			print os.path.join(root, name)
    			FileHashOpen.close()
    			raise
    		try:
    			del(FileInfo)
    			del(FileHash)
    			del(FileCorrectPath)
    			gc.collect()
    		except:
    			raise
    But I ran into the same memory problem again, this time on a 1.3 GB TIFF file.
    Is there any way to break the file up, process it in pieces, and still pull the hash?

  5. #5
    SitePoint Evangelist
    Maybe try moving your del() calls up into the main loop.
    Putting them there will force the objects to be freed right after their use, not when all the processing is done.
    Maybe this will help:
    Code:
    StartTime = time.localtime()
    
    cnxn = pyodbc.connect("DSN=Work2", autocommit=True)
    cursor = cnxn.cursor()
    
    for root, dirs, files in os.walk('K:\\'):
    	for name in files:
    		try:
    			FileCorrectPath = os.path.join(root, name).replace('"', '`').replace("'", "`")
    			if len(FileCorrectPath) < 255:
    				FileInfo = os.stat(os.path.join(root, name))
    				FileHashOpen = open(os.path.join(root, name))
    				FileHash = hashlib.md5(FileHashOpen.read()).hexdigest()
    				cursor.execute("INSERT INTO NetworkFileInfo (FileName, Hash, CreationDate, ModifiedDate, DateStamp) VALUES('" + FileCorrectPath + "', '" + FileHash + "', '" + time.strftime('%Y-%m-%d', time.gmtime(abs(FileInfo.st_ctime))) + "', '" + time.strftime('%Y-%m-%d', time.gmtime(abs(FileInfo.st_mtime))) + "', '" + time.strftime('%Y-%m-%d') + "')")
    				FileHashOpen.close()
    
    				del(FileInfo)
    				del(FileHash)
    				gc.collect()
    
    			else:
    				FileInfo = "None"
    				FileHash = "None"
    		except:
    			EndTime = time.localtime()
    			print time.mktime(EndTime) - time.mktime(StartTime)
    			cnxn.close()
    			print os.path.join(root, name)
    			FileHashOpen.close()
    			raise

  6. #6
    SitePoint Addict Skookum
    Nope, still the same problem. So far I haven't found any other workarounds, but I am still searching.
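For anyone hitting the same wall: the usual way around the MemoryError is to feed the file to the hash in fixed-size chunks via hashlib's update() method, so only one chunk is ever held in memory. A sketch (the 1 MB chunk size is an arbitrary choice):

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file of any size using a bounded amount of memory."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)   # incremental update; never a full .read()
    return md5.hexdigest()
```

Updating incrementally yields exactly the same digest as hashing the whole file in one read(), so hashes already in the database stay comparable.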

