I am currently playing with a hashing program to go through our file server and dump all the file hashes into a database so that I can compare them later on and remove duplicates.
But I have found a handful of files that I am unable to pull the hash from, or rather, files on which open(path, 'rb').read() fails.
Here is my code.
I am just starting again at Python, so be gentle.
Code:
import os, hashlib, time, pyodbc, gc

gc.enable()

StartTime = time.localtime()

cnxn = pyodbc.connect("DSN=Work2", autocommit=True)
cursor = cnxn.cursor()

for root, dirs, files in os.walk('K:\\'):
    for name in files:
        try:
            FileInfo = os.stat(os.path.join(root, name))
            FilePath = open(os.path.join(root, name), 'r')
            FileHash = open(os.path.join(root, name), 'rb').read()
            FileCorrectPath = os.path.join(root, name).replace('"', '`').replace("'", "`")
            cursor.execute("INSERT INTO NetworkFileInfo (FileName, Hash, CreationDate, ModifiedDate, DateStamp) VALUES('"
                           + FileCorrectPath + "', '"
                           + hashlib.md5(FileHash).hexdigest() + "', '"
                           + time.strftime('%Y-%m-%d', time.gmtime(FileInfo.st_ctime)) + "', '"
                           + time.strftime('%Y-%m-%d', time.gmtime(FileInfo.st_mtime)) + "', '"
                           + time.strftime('%Y-%m-%d') + "')")
        except:
            EndTime = time.localtime()
            print time.mktime(EndTime) - time.mktime(StartTime)
            cnxn.close()
            print os.path.join(root, name)
            raise
        try:
            del(FileInfo)
            del(FilePath)
            del(FileHash)
            del(FileCorrectPath)
            gc.collect()
        except:
            raise
My problem comes from this line:
FileHash = open(os.path.join(root, name), 'rb').read()
On certain files, running that line raises an error.
The file below is about 500 MB in size, so I can see why it is dying, but so far I have not learned how to handle these errors gracefully and keep on processing, nor how to fix them.
K:\Altium Support Files\Altium Designer 6 Updates\Build 6.8.1.11735\AltiumDesigner6Update(9346to11735).exe
Traceback (most recent call last):
  File "C:\Python\Hash.py", line 16, in <module>
    FileHash = open(os.path.join(root, name), 'rb').read()
MemoryError
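To make the "keep on processing" goal concrete, here is a minimal sketch of a walk that records a failing file and moves on instead of re-raising. The names hash_tree, hash_func and failures are hypothetical and not part of the script above; hash_func stands for any callable that takes a path and returns a hash string. The sketch assumes Python 2.6 or later (it also runs on Python 3) and leaves the database insert out.

```python
import os


def hash_tree(top, hash_func):
    """Walk `top` and hash every file with hash_func(path).

    hash_func is a hypothetical callable supplied by the caller.
    Files that fail are recorded in `failures` and skipped, so one
    bad file does not stop the whole run.
    """
    hashes = {}
    failures = []
    for root, dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            try:
                hashes[path] = hash_func(path)
            except (IOError, OSError, MemoryError) as error:
                # Remember what went wrong, then carry on with the next file.
                failures.append((path, repr(error)))
                continue
    return hashes, failures
```

Catching only the named exception types, rather than using a bare except, lets things like KeyboardInterrupt still stop the script. The failures list can be printed or logged once the walk has finished.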
I must admit, I don't really understand garbage collection or how to properly clear out all of my variables in Python, so that might be part of the problem.
Is there any other way to pull file hashes without reading the whole file into memory? Or is there a better way to handle exceptions?
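For reference, hashlib hash objects can be fed data incrementally through update(), so one possible approach is to read the file in fixed-size chunks rather than all at once. The following is a minimal sketch under that assumption. The function name md5_of_file and the 1 MB default chunk size are illustrative choices, not taken from the script above.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    The file is read chunk_size bytes at a time, so memory use stays
    roughly constant no matter how large the file is.
    """
    digest = hashlib.md5()
    # The with block closes the file handle even if a read fails.
    with open(path, 'rb') as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                # read() returns an empty bytes object at end of file.
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file(os.path.join(root, name)) should give the same digest as hashlib.md5(data).hexdigest() on the full contents, without ever holding a 500 MB file in memory.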
Links, books, criticisms, and help are all welcome.