  1. #1
    SitePoint Zealot
    Join Date
    May 2004
    Posts
    142
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    MS Word to XML to HTML?

    Hey everyone,

    If all goes well I will have a job developing a php/mysql website for a small local newspaper. Designing the site isn't going to be a problem, but I have one issue.

    The client will have his staff email or upload articles to the server each day for entry into the database. These articles will be in MS Word format and may include pictures, and I must write a script that converts them to HTML for display on screen.

    My initial thought was to convert the Word doc to XML and store that in the database, then convert to HTML etc for display when the data is pulled from the database.

    The thing is, I have no idea what kind of schema MS Word uses (or what version of Word the client is using; I understand the latest version uses a different schema?), nor have I had any experience with this. For example, if there's an image embedded in the Word doc, how is that stored in XML?

    Can anyone help me? Would it be easier to ask the client to send documents in another format, e.g. RTF, and send the images separately?

    Any help at all is much appreciated, thanks in advance!

  2. #2
    SitePoint Zealot
    Join Date
    Jun 2007
    Location
    Regina, SK, Canada
    Posts
    129
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Personally, I would implement the Moxiecode TinyMCE WYSIWYG editor (http://tinymce.moxiecode.com/). Set it up so the person can paste their Word doc into that textarea and upload images through it. This is all automatically converted to HTML. Then they just hit save and it is stored in the database, and you pull it from the database to display on the page.

  3. #3
    SitePoint Zealot
    Join Date
    May 2004
    Posts
    142
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    That looks interesting indeed - however these users are not too savvy with other programs and I believe this might be 'too much work for them'. I will look into it as a solution though, thank you!

    Anyone else?

    To clarify: the users simply want to write their Word doc and either send it to me or send it to the website, and do no more work than that. I will obviously make a form on the website for them to upload, say, a zip file of articles. The processing script will then unzip the archive, take each article, prepare it for the database, and insert it.
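
    The upload-and-unpack step described above could be sketched roughly as follows. This is only an illustration in Python (the real site would be PHP/MySQL); the `articles` table, its columns, and the `import_archive` name are made up for the example, SQLite stands in for MySQL, and the actual Word-to-HTML conversion is deliberately left out as a separate step.

```python
import io
import sqlite3
import zipfile


def import_archive(zip_bytes: bytes, db: sqlite3.Connection) -> int:
    """Unpack an uploaded archive and store one row per Word file found.

    Hypothetical schema: articles(id, filename, body). The body is the raw
    file; converting it to HTML would happen in a later step.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, filename TEXT, body BLOB)"
    )
    stored = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            name = info.filename.lower()
            # Skip directories and anything that is not a Word document.
            if info.is_dir() or not name.endswith((".doc", ".docx")):
                continue
            db.execute(
                "INSERT INTO articles (filename, body) VALUES (?, ?)",
                (info.filename, archive.read(info)),
            )
            stored += 1
    db.commit()
    return stored
```

    Reading the archive from memory means nothing is extracted to disk, which sidesteps problems with odd or hostile file paths inside the zip.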

  4. #4
    Programming Since 1978 silver trophybronze trophy felgall's Avatar
    Join Date
    Sep 2005
    Location
    Sydney, NSW, Australia
    Posts
    16,875
    Mentioned
    25 Post(s)
    Tagged
    1 Thread(s)
    Word is the WORST possible format to start from for web pages, as it is the hardest format to convert into HTML.
    Stephen J Chapman

    javascriptexample.net, Book Reviews, follow me on Twitter
    HTML Help, CSS Help, JavaScript Help, PHP/mySQL Help, blog
    <input name="html5" type="text" required pattern="^$">

  5. #5
    SitePoint Wizard Hammer65's Avatar
    Join Date
    Nov 2004
    Location
    Lincoln Nebraska
    Posts
    1,161
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    If it's a binary Word format, you are pretty much out of luck. If it's the XML version, theoretically you could use XSLT to do the conversion to XHTML, but the images would not work that way.
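
    For the XML case, here is a rough sketch of what pulling text out might look like. It assumes the newer .docx packaging (a zip archive whose main text lives in `word/document.xml`), not the binary .doc format, and it uses plain XML parsing in Python rather than XSLT. The `docx_to_html` name is made up for the example. It only emits one `<p>` per non-empty paragraph and ignores all formatting; Word typically keeps embedded pictures as separate files elsewhere in the package (under `word/media/`), and those are not handled here.

```python
import html
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_html(docx_bytes: bytes) -> str:
    """Return one <p> element per non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is split across runs; join every <w:t> inside it.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text.strip():
            paragraphs.append("<p>" + html.escape(text) + "</p>")
    return "\n".join(paragraphs)
```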

    An in-page editor like TinyMCE is really the best route. Most CMSs use these for content. In fact, this forum uses just such an editor for its posts.

    I would strongly recommend using one and in fact I would recommend checking into a CMS.

    Systems like XOOPS, Drupal, and Joomla are well suited for media sites. Sometimes it is much easier to fine-tune and modify one of these systems than to code your own.

  6. #6
    SitePoint Wizard bronze trophy Kailash Badu's Avatar
    Join Date
    Nov 2005
    Posts
    2,560
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    If you can put up with having to upload the entries yourself:

    Get the authors to send the documents to you. Save the Word document as a web page from MS Word. Now run the HTML 'Clean-up' tool that comes with editors like Dreamweaver, which will eliminate all the unnecessary HTML tags imposed by Word. Now paste the result into the WYSIWYG editor on your website.

    Quite a lot of work, huh? Not really. It takes 1.5 minutes at most.

  7. #7
    SitePoint Wizard Hammer65's Avatar
    Join Date
    Nov 2004
    Location
    Lincoln Nebraska
    Posts
    1,161
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    This would work unless of course the idea is to develop the site and then release it to them, to maintain. That is normally the arrangement.

    Incidentally, aside from the fact that no tool MS comes up with writes HTML worth a damn, the reason Word HTML has to be cleaned up is that it produces what MS calls "roundtrip HTML". The idea is that, by including all the junk they put in the document, you can convert it back to Word format with all of the formatting intact, without losing anything. Why you would ever want to do that, I don't know, but you will see some crazy stuff in those docs that you had better hope Dreamweaver can deal with.

    I would still go with the CMS option. You charge them for your time in installing, setup, modifications and training, and they get proven software that they can use to maintain their own site.
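
    To give a feel for what that clean-up involves, here is a minimal whitelist-style sketch in Python using only the standard library. It is not what Dreamweaver does; the allowed-tag list and the names `WordHtmlCleaner` and `clean_word_html` are assumptions chosen for illustration. It keeps a handful of structural tags, drops every attribute, discards the `<head>`, `<style>` and `<script>` blocks, and lets unknown tags such as `<span>` or `<o:p>` fall away while keeping their text.

```python
import html
from html.parser import HTMLParser

# Assumed whitelist: tags worth keeping in a plain article body.
ALLOWED = {"p", "br", "b", "strong", "i", "em", "ul", "ol", "li", "h1", "h2", "h3"}
VOID = {"br"}                          # tags that never get a closing tag
SKIP = {"head", "style", "script"}     # blocks whose content is discarded


class WordHtmlCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in ALLOWED and not self.skip_depth:
            self.out.append(f"<{tag}>")   # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED and tag not in VOID and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(html.escape(data, quote=False))


def clean_word_html(markup: str) -> str:
    cleaner = WordHtmlCleaner()
    cleaner.feed(markup)
    cleaner.close()
    return "".join(cleaner.out).strip()
```

    A whitelist is the safer direction here: anything Word invents that is not on the list simply disappears, rather than having to be anticipated and blacklisted.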

  8. #8
    SitePoint Wizard bronze trophy Kailash Badu's Avatar
    Join Date
    Nov 2005
    Posts
    2,560
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Hammer65
    I would still go with the CMS option. You charge them for your time in installing, setup, modifications and training, and they get proven software that they can use to maintain their own site.
    Of course, we'll need a CMS either way. But the question is how articles would be fed to the CMS. It's always preferable to hand the software over and let the client take care of uploading and so on. However, if your clients are not in a position to enter articles through the built-in WYSIWYG editor, you also have a few less preferable options. One of them is uploading the articles on your client's behalf after cleaning up the Word documents, or training them to do it themselves. You know what works best for you.

  9. #9
    Theoretical Physics Student bronze trophy Jake Arkinstall's Avatar
    Join Date
    May 2006
    Location
    Lancaster University, UK
    Posts
    7,062
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    Hmmm....

    Do you know C# application programming?
    I had a problem with a client uploading their product Excel file weekly. This gave me two problems. One was that the file was over 2 MB, so PHP couldn't upload it, and the second was that they didn't want to have to convert the XLS file to CSV, which PHP can read.

    So, I made a program in C# where the user would browse the file.
    If it is XLS then the program converts it into CSV, comma delimited. Then, I sent it to a PHP page, POSTing the data rather than the file. The PHP page uploaded the data into the database. Problem solved.
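
    The "send the data, not the file" part might look something like this. It is a Python sketch rather than the C# program described above; the rows are assumed to have already been read from the spreadsheet, and the `csv` form field name, the `rows_to_csv` and `build_post` helpers, and the URL in the usage note are all made up for illustration. The request is only built here, not sent.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import Request


def rows_to_csv(rows) -> str:
    """Serialise rows (lists of cell values) as comma-delimited text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def build_post(url: str, rows) -> Request:
    """Build a form-encoded POST carrying the CSV text in a 'csv' field."""
    body = urlencode({"csv": rows_to_csv(rows)}).encode("utf-8")
    return Request(url, data=body, method="POST")
```

    Passing the result of `build_post("http://example.invalid/import.php", rows)` to `urllib.request.urlopen` would send it; on the PHP side the text would then arrive as an ordinary form field instead of a file upload.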

    If you can convert that DOC file into something more readable (C# comes with Office interoperability, so you can convert Office files into compatible formats) and then upload it to a PHP page, you won't need to worry.

    Also, I'm sure that MS Word allows a user to export to HTML, so if they could upload that and you strip out the nonsense tags, that would also work.
    Jake Arkinstall
    "Sometimes you don't need to reinvent the wheel;
    Sometimes its enough to make that wheel more rounded"-Molona

  10. #10
    SitePoint Zealot
    Join Date
    May 2004
    Posts
    142
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Some great ideas guys, thanks very much!

    I live in England whereas the client is in Cyprus (different time zone), and they need their content published at the equivalent of 6am here each day, so manually uploading the articles myself will be a pain, and sometimes simply impractical/impossible. Therefore I need an automated solution.

    Since their machines are also out of practical reach, I can't install any specific software on them (nor would I trust the people there to do it properly) - I liked arkinstall's idea but it seems impractical for this reason.

    The CMS idea seems to be the best solution therefore. Thanks to all!

  11. #11
    SitePoint Wizard Hammer65's Avatar
    Join Date
    Nov 2004
    Location
    Lincoln Nebraska
    Posts
    1,161
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    CMSs have web-based interfaces, so unless the person doesn't have internet access (which is pretty unlikely) there is no barrier to doing this. It's no more complicated than making a post to this board. The company I work for has done numerous installations of XOOPS for TV stations, and the clients are quite happy with them. Training is not that difficult, even for non-technical users.

