a view into the sordid life i lead

Wednesday, October 03, 2007

sharepoint to mediawiki

I recently had to convert a bunch of SharePoint pages to MediaWiki. The reason is primarily that SharePoint is not really a wiki - it's a document repository system, much like Alfresco and Confluence, but in my opinion a poorer solution in many respects. The main limitation for me is the way SharePoint treats documents: it handles them as just HTML pages. Not that there's anything wrong with that, but it does bring up some issues (which I believe wikis do a great job of solving).

Foremost is that enterprise documentation needs to be structured in a way that keeps things consistent. Markup for text embellishment (font color, shape, size, location) is to be avoided in order to keep documentation clear and simple. This is definitely possible, but with a WYSIWYG editor at hand it's often tempting to slap on different font sizes to make a point.

The problem this introduces in SharePoint is that SharePoint does not have a centralized stylesheet manager, so each page might contain its own specific styles. For the most part this is not a problem - after all, people just want to view the text, and when they edit it, it's via a WYSIWYG editor. The real issue comes when you're trying to index the text for search, or export it to another format.

In my case the target format is MediaWiki, which I believe is the better documentation management system (note: document vs. documentation).

There are a few different migration tools I've come across:
* Word2MediaWiki
* Word2MediaWikiPlus
* HTML::WikiConverter (Perl module)
and the most recent addition:
* OpenOffice 2.3 MediaWiki exporter

When I first looked at converting a SharePoint page to MediaWiki, I figured I could just read the HTML and do an HTML->Wiki conversion. Since Microsoft adds a profusion of style-related junk, this is not so straightforward. My opinion is that font effects should be tied to styles, so the task now is to remove all extraneous styling and keep just the basic document information and hierarchy.

The conversion tools unfortunately assume that everything in the source HTML is relevant, so they carry it all over. This is a problem because the wiki page will contain all this style junk, which is going to interfere with convenient editing (the whole point of a wiki).
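
A minimal sketch of that cleanup step, written in Python rather than Perl purely for illustration and using only the standard library. The sets of tags and attributes treated as "style junk" are assumptions, not a definitive list of what SharePoint emits, and `strip_styles` is a hypothetical helper name. The idea is to drop presentational wrappers and attributes while keeping headings, paragraphs, and links intact:

```python
from html import escape
from html.parser import HTMLParser

# Assumed, illustrative lists: tune these against real exported pages.
PRESENTATIONAL_TAGS = {"font", "span", "center", "u"}   # tag dropped, text kept
STYLE_ATTRS = {"style", "class", "align", "bgcolor", "color", "face", "size", "lang"}
DROP_WITH_CONTENT = {"style", "script"}                 # tag and contents dropped


class StyleStripper(HTMLParser):
    """Re-emits HTML without presentational tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <style> or <script>

    def _emit_open(self, tag, attrs, self_closing=False):
        if self.skip_depth or tag in PRESENTATIONAL_TAGS:
            return
        kept = [(k, v) for k, v in attrs if k not in STYLE_ATTRS]
        text = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.out.append(f"<{tag}{text}{' /' if self_closing else ''}>")

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        self._emit_open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if tag not in DROP_WITH_CONTENT:
            self._emit_open(tag, attrs, self_closing=True)

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in PRESENTATIONAL_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def strip_styles(html_text):
    """Return html_text with style-only markup removed (comments are dropped too)."""
    parser = StyleStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Feeding it something like `<p class="MsoNormal"><font size="2">Hello</font></p>` should leave just `<p>Hello</p>`, which is a much friendlier input for any HTML-to-wiki converter.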

The latest tool I've run into is OpenOffice 2.3, which introduces MediaWiki export functionality. I have been thrilled by it, since it does the sensible thing - it strips out all the extraneous style crap and just gives me plain wiki text. The problem is how to automatically read all the pages, keep their relationships, and convert them into MediaWiki pages and links.

My current thinking is to use Perl, UNO (the OpenOffice automation bindings), and the Win32::IEAutomation module. IEAutomation is for reading each page and storing the source HTML as a separate file. UNO should then allow me to open each HTML file and export just the MediaWiki marked-up output.

The remaining issue is to maintain the document hierarchy and relationships. There are two possible approaches:
* Read the SharePoint database and see how it stores document relations
* Walk the document link tree using IEAutomation, and create the MediaWiki documents from there

Both are possible, and neither is terribly difficult. But they'll both take time.
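
The link-walking approach can be sketched as a breadth-first crawl. This is a rough Python illustration, not the Perl/IEAutomation version described above. The `fetch` and `in_scope` callables, the example URLs, and the `wiki_title` naming rule (file name without extension) are all assumptions made for the sake of the example:

```python
import posixpath
from collections import deque
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def walk_link_tree(start_url, fetch, in_scope):
    """Breadth-first walk; returns {page_url: [linked in-scope page_urls]}.

    fetch(url) returns the page's HTML; in_scope(url) decides whether a
    link belongs to the site being migrated. Both are supplied by the caller.
    """
    graph = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in graph:
            continue  # already visited
        collector = LinkCollector()
        collector.feed(fetch(url))
        targets = []
        for href in collector.links:
            # Resolve relative links and ignore #fragment anchors.
            absolute = urldefrag(urljoin(url, href)).url
            if in_scope(absolute) and absolute not in targets:
                targets.append(absolute)
        graph[url] = targets
        queue.extend(t for t in targets if t not in graph)
    return graph


def wiki_title(url):
    """Hypothetical naming rule: file name without extension, URL-decoded."""
    name = posixpath.basename(urlsplit(url).path)
    return unquote(posixpath.splitext(name)[0])
```

The resulting graph records which page links to which, so each source link can later be rewritten as a MediaWiki internal link such as `[[Setup]]` using `wiki_title` (or whatever naming rule actually suits the site).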

As I find out more about this I'll be documenting it on this blog. It's something I suspect others will want to do.

At this point I would recommend against using SharePoint for enterprise document management, though obviously that's based purely on my own experience. If I had to do it over again I'd probably go with Alfresco or Confluence (both of which are far easier to integrate with MediaWiki, and also have "real" wiki solutions built in).
