[Discuss] My first contribution to MediaWiki
Tom Metro
tmetro+blu at gmail.com
Sat Jan 17 22:10:28 EST 2015
Greg Rundlett (freephile) wrote:
> The project page: http://www.mediawiki.org/wiki/Extension:Html2Wiki
>
> It's an extension to MediaWiki that lets you "import a website or web page
> into your wiki".
"It does this by first "normalizing" the content with HTMLTidy, and
then "sanitizing" it with Purify and Regular Expressions. Then the
content is "converted" from HTML to WikiText using Regular Expressions
and a Parsoid service."
Amazing that such a conversion is even possible, given how problematic
most HTML is. In some ways this job is harder than what browsers do when
parsing HTML, as you aren't just rendering the result, but trying to
extract structure - or semantic meaning - from it.
Does HTMLTidy do a lot of the heavy lifting for you? Do you still end up
with a lot of situations where you have multiple HTML constructs that
map to a single wiki markup construct?
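To make the many-to-one question concrete, here is a minimal sketch using Python's standard html.parser. The TAG_TO_WIKI table and the html_to_wiki helper are hypothetical names for illustration only; this is not how Html2Wiki is implemented, just one way several HTML tags could collapse onto a single wiki markup token.

```python
from html.parser import HTMLParser

# Hypothetical mapping: several HTML tags collapse to one wiki markup token.
TAG_TO_WIKI = {
    "b": "'''", "strong": "'''",   # both render as wiki bold
    "i": "''", "em": "''",         # both render as wiki italic
}


class InlineConverter(HTMLParser):
    """Rewrites bold/italic tags as wiki quotes; any other tag is dropped."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(TAG_TO_WIKI.get(tag, ""))

    def handle_endtag(self, tag):
        self.out.append(TAG_TO_WIKI.get(tag, ""))

    def handle_data(self, data):
        self.out.append(data)


def html_to_wiki(html):
    """Convert a small inline HTML fragment to wiki markup (illustrative only)."""
    parser = InlineConverter()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

Note that the mapping is lossy: once converted, there is no way to tell whether the source used <b> or <strong>.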
How does it handle HTML generated or loaded by JS, as is quite common
now? (You might be able to work around that with one of the projects
that use an embedded and programmatically controlled web rendering
engine, like WebKit.)
What are the advantages of implementing this as an extension rather than
a separate command-line tool (which could then also support other markup
formats, like Markdown)?
If you couldn't find an existing HTML to wiki markup converter, did you
look for something similar, like a converter to Markdown? A search for
this turns up hits, such as:
http://johnmacfarlane.net/pandoc/README.html
with an example:
pandoc -f html -t markdown http://www.fsf.org
which presumably retrieves content from http://www.fsf.org, treats it as
HTML, and outputs Markdown. (Pandoc also supports MediaWiki as an output
format, via -t mediawiki.)
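If the conversion were scripted rather than typed at a shell, a thin wrapper around the same pandoc invocation might look like the sketch below. The function names pandoc_command and convert are hypothetical; the only pandoc options used are -f and -t, as in the example above, and the code assumes a pandoc binary is on the PATH.

```python
import shutil
import subprocess


def pandoc_command(source, to_format="mediawiki"):
    """Build the argument list for converting an HTML source with pandoc."""
    return ["pandoc", "-f", "html", "-t", to_format, source]


def convert(source, to_format="mediawiki"):
    """Run pandoc and return its output, or None if pandoc is not installed."""
    if shutil.which("pandoc") is None:
        return None
    result = subprocess.run(
        pandoc_command(source, to_format),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling convert("http://www.fsf.org") would ask pandoc to fetch the page and emit MediaWiki markup; passing to_format="markdown" reproduces the command shown above.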
If using a tool that doesn't support MediaWiki directly, once in
Markdown, I imagine the conversion to MediaWiki is relatively easy.
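As a rough illustration of why that second step seems tractable, here is a minimal sketch covering only headings, inline links, bold, and italics. The function name markdown_to_mediawiki is hypothetical, and lists, tables, code blocks, and nested emphasis are deliberately left out, so this is nowhere near a complete converter.

```python
import re


def markdown_to_mediawiki(text):
    """Convert a small subset of Markdown to MediaWiki markup (illustrative)."""

    # ATX headings: "## Title" becomes "== Title ==".
    def heading(match):
        bars = "=" * len(match.group(1))
        return f"{bars} {match.group(2).strip()} {bars}"

    text = re.sub(r"^(#{1,6})\s+(.*)$", heading, text, flags=re.MULTILINE)

    # Inline links: [label](url) becomes [url label].
    text = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)", r"[\2 \1]", text)

    # Bold is handled before italic so "**" is not read as two "*".
    text = re.sub(r"\*\*(.+?)\*\*", r"'''\1'''", text)
    text = re.sub(r"\*(.+?)\*", r"''\1''", text)
    return text
```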
-Tom
--
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/