MS Excel 2007 .xlsx file parsing

So I have .xlsx Excel file and I would like to parse this with php. Is this parsing possible and if it is could someone give me a pointer where to start looking?

OFFTOPIC: Is it possible to open these files with OpenOffice ? Since I don’t have office.

IIRC, the X formats (xlsx, docx, etc) are XML-based, so what I’d recommend is that you learn how to read an XML file with PHP, if you haven’t done so already, and give it a go yourself.

Or, you can try this.

Are you saying that I need nothing more than just parsing XML file? Would this go then with SimpleXML for example?

Edit: btw what means IIRC ? :slight_smile:

Edit2: Would be also nice if I could open this file somehow, so I can see what kind of content it has and this would make it easier to parse it then and be sure I get the results I am expecting.

  1. IIRC means If I [Recall/Remember] Correctly.
  2. You can open the file in notepad or another text editor.

Tried opening in notepad and notepad++ but its not readable. Some wierd characters exploded all over the place.

Edit:
Now I was able to get the file open with OpenOffice. Installed the MS compatibility pack for office and after that updated windows from windows update. So now I am able to convert .xlsx to .xls and open it with OO.

It isn’t as straightforward as parsing the XML, because there’s zipped data within that XML.

However, there are PHP libraries around which will allow you to access (read and even write) Excel files. You may find it easier to use one of these

I am quite a lost here now… how to make this PHPExcel work properly. I have a test.php and I copied Classes dir content to same dir with test.php which includes files PHPExcel.php and a dir PHPExcel.


include('PHPExcel.php');

$objReader = new PHPExcel_Reader_Excel2007();
$objPHPExcel = $objReader->load("05featuredemo.xlsx");

And apparently it can’t find some required class… but I pretty confused what I should copy and where related to the file where from I want to process Excel files (in this case test.php). This is what I get…

Fatal error: Class ‘PHPExcel_Reader_Excel2007’ not found in C:\…\ est\ est\ est.php on line 4

Mkay… found some of your old post Mark Baker and it helped me over that error. Now the code looks like this


/** PHPExcel_IOFactory */
include 'PHPExcel/IOFactory.php';

//$excelFileName = '05featuredemo.xlsx';
$excelFileName = 'data_sheets.xlsx';


//$objReader = new PHPExcel_Reader_Excel2007();
//$objPHPExcel = $objReader->load("05featuredemo.xlsx");

if (file_exists($excelFileName)) {

    $objReader = PHPExcel_IOFactory::createReader('Excel2007');
    $objPHPExcel = $objReader->load($excelFileName);
}

Now I get another error… is this too big of a file to be read with PHPExcel or what is going on?

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 6648 bytes) in C:\wamp\…\ est\PHPExcel\Style\Color.php on line 1212

PHPExcel is currently quite memory hungry, and the larger the workbook, the more memory it demands.
The latest release (Version 1.6.7) needs approx. 10KB PHP memory per cell: the latest source code in SVN, and the next release (version 1.7.0) needs approx. 1KB PHP memory per cell. That means PHPExcel 1.7.0 will be able to handle around 10 times larger workbooks using the same amount of memory. The new calculation engine, when it’s fully integrated with the core code, will reduce that still further.

But we know that it is still not good enough, and it’s why we are looking at how to implement cell caching (disk / database), which we believe will solve essentially all memory problems with large-scale workbooks.

I don’t know how large your workbook is, but you can use those figures as a rule of thumb for calculating how much memory is needed.
Depending on what you want to do with the workbook once you’ve loaded it up into PHP, you can also set the loader to “read only”, or selectively load individual sheets rather than the entire workbook, which can also help make best use of your available memory

Yes this problem got solved now atleast temporarily by increasing the php’s memory limit. ini_set(‘memory_limit’,‘64M’);

Pretty crazy though… 128 MB was the fist one which was enough. After I added this line:
$objReader->setReadDataOnly(TRUE);
…64M was enough. That helped quite a bit I guess(?)

I also read that from this another post there is no (atleast not yet) support for macros etc advanced stuff (don’t remember what was the other one). Anyway… the question is if there is macros used in some document and you want to read/write it with PHPExcel is it possible? In other words… does macros etc. break the whole process somehow or is it still possible to do normal operations with the document?

btw. filesize is 396KB is that big for a excel document?

No, macros aren’t supported. When the workbook is read in, any macros are discarded. VB code cannot be run from within PHPExcel (I’m just beginning to look at the feasibility of that now)… and rewriting the calculation engine has taken me 4 months, so I wouldn’t hold your breath waiting for macro support. If the workbook is then written, any macros will be lost

It is for many people, for others it’s quite small. The test workbook I’m currently using for my development is a single worksheet, about 2000 rows, 15 columns, and 103k file size. In the work environment, I’m dealing with 30-45k rows and 250 columns, and have my PHP memory set to 512MB.

Mark, do you know how the zipping is involved in .xlsx files? Is it just a zipped .xml file or is it more complicated (probably it is more complicated)? But if it would be just a zipped xml file wouldnt it be possible for me to unzip it first and then read with with like normal XML file in theory?

For simplicity(and processing)'s sake, I’d convince the users to save it as a csv by now :wink:

That or write the site in ASP.Net, which has office parsing utilities built in. Another option is to write a Java/Flash XLSX parser which outputs the file as a CSV before uploading, which PHP then reads.

PHP is a great text parser. But PHP really isn’t strong in terms of Office, due to MS’ typical ‘who needs a standard?’ thinking.

Unzip the xlsx manually, and you’ll find a whole host of XML files (e.g app.xml, calcChain.xml, core.xml, sharedStrings.xml, styles.xml, workbook.xml, sheet1.xml, etc)
Those files is readable in a text editor, as XML data, and can be parsed as such. You do need to watch out for shared styles, and formulae though, because you’re having to reference XML data that’s split across several files

You may find some encoded data within the XML, but that’s primarily images

I don’t think I’d really blame MS for that.

<rant mode>
The Java bods (and Gnumeric developers) basically reverse engineered Office to identify how it worked, how the files were structured, etc… then developed their own tools and applications based on that knowledge. It’s better now that MS have released all the details of their proprietary formats, although most people still aren’t aware that the documentation is now publicly available; but the fact remains that it took a few people a lot of time and effort to develop those tools.
It seems to me that most most PHP developers are more reluctant (there are exceptions) to even consider reverse engineering something like Excel to write their own parsers, or develop tools to work with those complex file formats. Perhaps they think it’ll take too long (it will), or that the format is too complicated (it certainly isn’t easy); but that’s no reason not to try.

I think it’s more like a lack of initiative: and if there isn’t already a library that someone else has created, then they simply pretend that it isn’t needed by working round the problem with CSV instead, or writing html files with an xls extension. They’re focused on the quick and easy result, and not interested in something that may take months (or even years) to write.
PHP is so easy for generating simple web pages, and manipulating a bit of text, that every man and his dog is a PHP developer these days - witness some of the most basic questions here on sitepoint or other PHP developer forums; but only a tiny percentage are capable of building more complex or large-scale libraries or applications. And most are just focused on what they need for the immediate resolution of their current problem, and unwilling to even consider the development of something that will take more than a few weeks.
</rant mode>

Ill just continue in this thread. I tried the example in the docs to iterate through all cells and output them into html table. Now this document we discussed earlier the same one, I can’t get toprint. It just takes forever and nothing is output. After I converted the .xlsx to .xls it can parse it in pretty reasonable time.

#1 Question was that I have tried changing the value of this setting to true/false:

$cellIterator->setIterateOnlyExistingCells(false)

but everytime I try “true” I get the following error, is it normal or?

Fatal error: Call to a member function getValue() on a non-object in C:\wamp\www\ est\ est.php on line 30

#2 Question about searching. How would I perform a search based on cell value and then return the whole row where it matched ?

#3 If I need to search a lot of rows by cell values would it be better just to parse the file as a whole once and then use DOM or something to find the needed content from the parsed document ?

Can you show us the code you’re using?

I haven’t used any yet… for the search part I mean. I am reading the documents at the moment and trying to find something to search with. But this is the code what with I read the sheet and output it into browser (it is just a test).


// Expand memory limit.
ini_set('memory_limit','512M');
// Expand execution time.
set_time_limit(1000);

/** PHPExcel_IOFactory */
include 'PHPExcel/IOFactory.php';

// Data sheet.
// This one is .xls because I couldnt get .xlsx parsed, took forever and apache got stuck.
$excelFileName = 'MSI_data_sheets.xls';

// Create reader.
$objReader = PHPExcel_IOFactory::createReader('Excel5');
// Set read only (much faster).
$objReader->setReadDataOnly(true);

// Load data sheet.
$objPHPExcel = $objReader->load($excelFileName);
$objWorksheet = $objPHPExcel->getActiveSheet();

// Iterate through the data sheet and print all cells into table.
echo '<table border="1">' . "\
";
foreach ($objWorksheet->getRowIterator() as $row) {
  echo '<tr>' . "\
";
		
  $cellIterator = $row->getCellIterator();
  $cellIterator->setIterateOnlyExistingCells(false); // This loops all cells,
                                                     // even if it is not set.
                                                     // By default, only cells
                                                     // that are set will be
                                                     // iterated.
  foreach ($cellIterator as $cell) {
    echo '<td>' . $cell->getValue() . '</td>' . "\
";
  }
  
  echo '</tr>' . "\
";
}
echo '</table>' . "\
";

Browsed through the documentation and didn’t really find anything related to searching. Closest one I think is PHPExcel_Reader_IReadFilter, but I think you can’t use this for searching, only filtering the amount of rows being read and defining the exact rows that you want to read.

And now I just noticed another thing which is huge in this case for me. The document has 5 tabs in total and every tab includes different hardware informations. The parser only parsed the first one which opens as default when you open the document in openoffice. Any idea if it is possible to read the data from all existing tabs?

Edit:
Sort of found a way to parse the all 5 tabs, but requires saving the file again 5 times. Setting the tab active which you want to parse and then save the file.