Need help parsing CSV file to prep for SQL table import

justinmason72 · March 17, 2015, 5:00pm

I posted this over in the Database forum, but the more I look at this problem and talk to people, the more it looks like Python is the answer. Disclosure: I’m new to Python but I’m running 3.4 and I’m using PyCharm IDE. My problem is this:
I have a CSV file, which contains 3 columns of CSV data “nested” within it. I want to split the contents of those CSV columns out to make a new row for each CSV item, add a sequence number to each item (which corresponds to the order of the item in the sequence within the CSV columns), and then add a corresponding ItemID number that corresponds to the CSV column that it came from. An example of my data will probably help at this point:

#Table1 (the original CSV file).  First row is column names:
StoreID,Date,StoreName,City,State,Category1CSV,Category2CSV,Category3CSV
1051,2/16/2014,Easton,Columbus,OH,"Flour,Yeast,Baking Powder","Milk,Water,Oil","Cinnamon,Sugar"
1425,1/14/2014,Crocker Park,Westlake,OH,"Baking Powder,Yeast,Four","Oil,Milk,Water"

#Table2 (after splitting the CSV column contents). First row is column names:
StoreID,Date,StoreName,City,State,ItemName,ItemRank,ItemCategory
1051,2/16/2014,Easton,Columbus,OH,Flour,1,1
1051,2/16/2014,Easton,Columbus,OH,Yeast,2,1
1051,2/16/2014,Easton,Columbus,OH,Baking Powder,3,1
1051,2/16/2014,Easton,Columbus,OH,Milk,4,2
1051,2/16/2014,Easton,Columbus,OH,Water,5,2
1051,2/16/2014,Easton,Columbus,OH,Oil,6,2
1051,2/16/2014,Easton,Columbus,OH,Cinnamon,7,3
1051,2/16/2014,Easton,Columbus,OH,Sugar,8,3
1425,1/14/2014,Crocker Park,Westlake,OH,Baking Powder,1,1
1425,1/14/2014,Crocker Park,Westlake,OH,Yeast,2,1
1425,1/14/2014,Crocker Park,Westlake,OH,Flour,3,1
1425,1/14/2014,Crocker Park,Westlake,OH,Oil,4,2
1425,1/14/2014,Crocker Park,Westlake,OH,Milk,5,2
1425,1/14/2014,Crocker Park,Westlake,OH,Water,6,2

The Table 1 columns labeled Category1CSV, Category2CSV, and Category3CSV contents map to Table 2 columns: ItemName, ItemRank, ItemCategory, where ItemName is the Item (example: Flour), ItemRank is the order of the item in the CSV list, and ItemCategory is either 1,2 or 3, depending on whether the data came from Category1CSV, Category2CSV or Category3CSV.

The most important aspect of this (other than splitting out the CSV column) is to maintain the order of items from within the CSV columns as they get split to new rows. for example, StroreID 1051 has Category1CSV contents of “Flour,Yeast,Baking Powder”. Those will map to the columns ItemName, ItemRank (the sequence number), and ItemCategory such that ItemName = Flour, it’s ItemRank = 1, and the ItemCategory = 1. This would be the first row in Table 2. The second row would be ItemName = Yeast, it’s ItemRank = 2, and the ItemCategory = 1, and so on until you end up with what looks like Table 2 above. Also, you’ll notice that the ItemRank numbering starts with the contents of the column Category1CSV, then continues to Category2CSV and finally Category3CSV.

Caveats:

not all rows in the original CSV file have items in all 3 CSV columns. For example, Store 1051 had items in all 3 Categories, but store 1425 only had items in Category 1 and 2. Other stores might only have items in 1 Category. Some logic would need to take that into consideration when it’s processing the file (I would think).
my CSV file contains a couple thousand rows. The resulting file will probably have tens-of-thousands of rows.

This will literally save me days of work because I’m doing this manually in Excel and it’s mind-numbing, so I’m hoping someone can help out! If you need any other info, please let me know!

THANK YOU!!!

system · June 17, 2015, 12:12am

This topic was automatically closed 91 days after the last reply. New replies are no longer allowed.