  1. #1
    SitePoint Evangelist
    Join Date
    Dec 2003
    Location
    Arizona
    Posts
    411

    Efficient Method of Searching Multiple XML documents without using a database

    I am interested in building a CMS that uses XML for data storage and XSLT for transforming the XML into XHTML. Originally I was going to use MySQL, but I would like to learn more about the effectiveness of XML for CMS-type applications. The main deficiency in using XML for data storage is searching.

    I figured out a way to do a search with XPath/XSLT: it takes XML input (the search terms and the documents to search) and uses XSLT and XPath (the document() function) to iterate through the documents and look for the search terms. The result of the transform is another XML document containing the search results, which can then be sorted, paged, or otherwise processed with XSLT to produce the final XHTML output. This search method seems to work fine for a simple search over only a few documents. I could make it more efficient by searching only a keywords field in each target XML document, populated by PHP with the words that occur most often in the content, but it still has to go through all the documents, so it doesn't help much.

    I could index the documents using MySQL, but I am trying to do this without an RDBMS. So I was thinking of creating an XML-based inverted index (an index that lists terms and the documents that contain them) with one file for each letter of the alphabet (a.xml, b.xml, c.xml, etc.). When a user submits an article, PHP indexes it by scanning the content and determining word frequency, etc. PHP then makes entries in the appropriate files (a.xml, b.xml, c.xml, etc.), probably using DOMXML. Then, given search terms (in the form of XML), an XSLT file would look through the index and gather results. Boolean search operations (AND/OR) could be accommodated by using set operations on the resulting node sets in XSLT. The output of the XSLT operation on the search terms and the index documents would be XML data containing the search results, which would then be transformed into XHTML.

    I know this seems like a lot, but I am really trying to exercise my skills in writing efficient XSLT/XPath to help me understand more about when to use XML. Can anyone here think of a more intuitive/efficient way of indexing/searching multiple XML documents without using a database?
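    For readers following along, here is a minimal sketch of the per-letter inverted index and the AND/OR set operations described above. It is written in Python rather than PHP/XSLT purely to keep it short, and it is an illustration under assumptions, not the poster's implementation: the function names, the element and attribute names (index, term, doc, name, id, freq) and the in-memory `buckets` dictionary standing in for a.xml, b.xml, etc. are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def index_document(buckets, doc_id, text):
    """Add one document's terms to the per-letter XML buckets.

    `buckets` maps a letter to the root element of that letter's index
    (an in-memory stand-in for a.xml, b.xml, ...).
    """
    for term, count in Counter(tokenize(text)).items():
        root = buckets.setdefault(term[0], ET.Element("index"))
        node = root.find(f"term[@name='{term}']")
        if node is None:
            node = ET.SubElement(root, "term", name=term)
        # Record which document contains the term and how often.
        ET.SubElement(node, "doc", id=doc_id, freq=str(count))


def lookup(buckets, term):
    """Return the set of document ids that contain `term`."""
    term = term.lower()
    root = buckets.get(term[:1])
    if root is None:
        return set()
    return {d.get("id") for d in root.findall(f"term[@name='{term}']/doc")}


def search_all(buckets, terms):
    """AND: documents containing every term (set intersection)."""
    sets = [lookup(buckets, t) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_any(buckets, terms):
    """OR: documents containing at least one term (set union)."""
    return set().union(*(lookup(buckets, t) for t in terms))


buckets = {}
index_document(buckets, "post-1", "XSLT transforms XML into XHTML")
index_document(buckets, "post-2", "An inverted index maps terms to XML documents")

and_hits = search_all(buckets, ["xml", "index"])
or_hits = search_any(buckets, ["xslt", "index"])
print(sorted(and_hits), sorted(or_hits))
```

    Only the bucket for a term's first letter is opened per lookup, which is the point of splitting the index by letter. Writing each bucket to disk (for example with ET.ElementTree(root).write(...)) is left out of the sketch.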

    Thanks,

    JT

  2. #2
    SitePoint Guru
    Join Date
    Oct 2001
    Posts
    656
    I believe you can say that there is no intuitive and efficient way of searching XML documents without using a database, and definitely not with the technology currently available for use with PHP.

    For the problems you describe you should really use an SQL database. XML is a good technology for communication between applications on different servers, but in my opinion it has no place in data storage & retrieval.

    Not a lot of people know it, and I didn't either until I read it, but in the 1960s people were using database systems that had a hierarchical data model, much like XML. So many complexities arose because of that data model that they decided to quit using it. That was when the relational data model was invented, something with strong foundations in mathematics. For more information on this see Database Debunkings. The guy is not against XML as a whole, but against using it for what you should use a relational database for (even though SQL databases are not based on the relational model, but that's another story to read on that site).

    Quote: "I am interested in building a CMS that uses XML for data storage and XSLT for transforming the XML into XHTML"
    I don't mean any disrespect, but often when people write "I want to write a [something-system] and I want to use [this and that technology]", I wonder why. You should look at the technological needs of your application and then choose the technology that best fulfills those needs, not the other way around.

    All of the problems you describe that are difficult to implement using XML as data storage are already solved for you when you use a SQL database. Therefore it seems to me that by using XML you are only making things more difficult for yourself.

    Well, just my 3 cents

  3. #3
    SitePoint Evangelist
    Join Date
    Dec 2003
    Location
    Arizona
    Posts
    411
    Replying to Captain Proton (post #2):
    I realize that XML may not be the appropriate technology. However, in an effort to learn efficient XML/XSLT, I am attempting to build this system because I look at it as a challenge. I have already figured out some efficient methods for searching XML using XSLT, which I would like to extend by implementing some sort of XML-based inverted index. Search engines typically use these types of indexes. If I can index my files and use XML as the format for the index, I am then relying on the performance of the XML parser/XPath implementation. Typically, a database implementation will use B+ trees as one of its primary data structures, which are more efficient for lookups than N-ary trees (the DOM data structure). However, there are efficient algorithms for searching N-ary trees. Although it will NEVER be as efficient as a database, I believe I can still get some performance out of it, and that is the challenge I have set out for myself. Basically, it is one big academic exercise.
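    To make the lookup-cost point concrete, here is a small sketch contrasting a linear walk over DOM children with a binary search over sorted keys. The binary search is only a stand-in for the ordered lookup a B+ tree provides, not a B+ tree implementation, and the word list, element names and function names are invented for illustration.

```python
import bisect
import xml.etree.ElementTree as ET

# Hypothetical index document: one <term> element per indexed word.
words = ["apple", "banana", "cherry", "damson", "elder", "fig", "grape"]
root = ET.Element("index")
for w in words:
    ET.SubElement(root, "term", name=w)


def dom_lookup(root, name):
    """Walk the children in document order; returns (found, comparisons)."""
    comparisons = 0
    for node in root:
        comparisons += 1
        if node.get("name") == name:
            return True, comparisons
    return False, comparisons


def sorted_lookup(keys, name):
    """Binary search over sorted keys (stand-in for an ordered index)."""
    pos = bisect.bisect_left(keys, name)
    return pos < len(keys) and keys[pos] == name


keys = sorted(words)
found, steps = dom_lookup(root, "grape")
print(found, steps, sorted_lookup(keys, "grape"))
```

    The DOM walk touches every sibling before reaching the last term, while the sorted lookup needs a number of comparisons that grows only logarithmically with the number of keys.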

    Thanks,

    JT

  4. #4
    SitePoint Enthusiast
    Join Date
    Jan 2004
    Location
    Manchester
    Posts
    32
    I've been playing with XML/XSL/XPath for about 2 years. Also, I've built 5 CMS apps over the years.

    I've (personally) realised that there are two types of web application/site: document-centric and data-centric.

    Data-centric apps/sites can be illustrated with Amazon.com. Essentially, they store data on items/objects (products in Amazon's case), and use this information along with tools (shopping carts, bookmarking, etc) to provide a service (e-commerce/e-shop).

    Document-centric apps/sites can be illustrated by thinking of SitePoint. They store documents (articles/tutorials in the case of SitePoint).

    Now, (IMHO) there's a fundamental difference in the way documents and data should be stored. Documents can be broken down and referenced better using XML - think about the way HTML stores the information in a web page. <title /> holds title information, <blockquote /> holds quotes. You can have more than one in a document. They might have specific attributes, such as <blockquote cite="" /> which holds the link to the original writing.
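    As a small illustration of the point about document markup, the sketch below parses a made-up article and pulls out its title and every blockquote with its cite attribute. The article content and URLs are invented for the example; only the title and blockquote/cite names come from the post.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup, loosely modelled on the HTML elements above.
article = ET.fromstring(
    "<article>"
    "<title>Storing documents as XML</title>"
    "<p>Some commentary.</p>"
    "<blockquote cite='http://example.com/a'>First quote.</blockquote>"
    "<p>More commentary.</p>"
    "<blockquote cite='http://example.com/b'>Second quote.</blockquote>"
    "</article>"
)

# One title, but any number of blockquotes, each with its own attribute.
title = article.findtext("title")
quotes = [(q.get("cite"), q.text) for q in article.iter("blockquote")]
print(title)
print(quotes)
```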

    Data is more rigid. It has a set size; for example, currency information for the UK is stored as a float with two decimal places.

    With this in mind, my last CMS was built with document-centricity. I used an XMLDB to store documents, XSL to translate to XHTML, and PHP to do the brunt of the work. But my CMS was built to store documents - web pages. There were no shopping cart add-ons, no "business login". It just allowed my client to produce web pages easily, quickly, and in a certain format, which was what they wanted. They have further plans for their system, which will be easy to implement given that all their content is stored in XML. Lucky sods...

    Now, for indexing, XMLDBs such as Apache's Xindice and DBXML might be worth looking at. They also support XUpdate, XML-RPC, and (I think) XQuery, which are all very useful.

  5. #5
    firepages
    Join Date
    Jul 2000
    Location
    Perth Australia
    Posts
    1,717
    I am with Captain Proton & Neobuddah on this, and whilst I appreciate you wanting to take on the challenge for its own sake, it's hard for anyone to offer help when it's counterproductive, e.g. you may ask how best to get a square wheel to roll down a hill.

    Probably a steep enough hill would do the trick... but the answer is to use a round wheel.

    By the time you build enough indexes and query logic you will have in fact built a database!

  6. #6
    SitePoint Evangelist
    Join Date
    Dec 2003
    Location
    Arizona
    Posts
    411
    Replying to firepages (post #5):
    I have actually figured out a pretty decent method of querying documents using XSLT/XPath to process them. I am scaling the project down a little, from a complete CMS to a weblog-type application in which the primary datastore will be XML, transformed into XHTML via XSLT. The static XHTML pages will be "cached" and only updated when a change is made to an entry. As far as searching is concerned, I am going to do some metadata searching and will avoid full-text searching for the time being. Eventually, I anticipate using a native XML database for efficient information retrieval. Thank you everyone for your input.
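    One way the "cached and only updated when an entry changes" idea could look is sketched below. It is an assumption-laden illustration, not the poster's code: render() is a placeholder for the real XSLT transform, the file naming is made up, and a content hash is used to detect changes (comparing file modification times would be another option).

```python
import hashlib
import tempfile
from pathlib import Path


def render(entry_xml):
    """Placeholder for the XSLT step; here it only wraps the text."""
    return f"<html><body>{entry_xml}</body></html>"


def cached_page(entry_path, cache_dir):
    """Return (page, regenerated). Re-render only if the entry changed."""
    source = entry_path.read_text(encoding="utf-8")
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    page_path = cache_dir / f"{entry_path.stem}.html"
    stamp_path = cache_dir / f"{entry_path.stem}.sha256"
    if page_path.exists() and stamp_path.exists():
        if stamp_path.read_text(encoding="utf-8") == digest:
            # Entry unchanged since the last render: serve the cached page.
            return page_path.read_text(encoding="utf-8"), False
    page = render(source)
    page_path.write_text(page, encoding="utf-8")
    stamp_path.write_text(digest, encoding="utf-8")
    return page, True


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    entry = tmp / "entry-1.xml"
    entry.write_text("<entry>Hello</entry>", encoding="utf-8")
    first = cached_page(entry, tmp)
    second = cached_page(entry, tmp)
    entry.write_text("<entry>Hello again</entry>", encoding="utf-8")
    third = cached_page(entry, tmp)

print(first[1], second[1], third[1])
```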

    JT

