Cleaning HTML while migrating - Remove Microsoft HTML
Published on April 18, 2012
Recently I had to move a bunch of tables from an old system to a Drupal site. The table data was heavily infested with the crappy HTML that Microsoft Word inserts.
The MS HTML was 1) redundant, bloating the markup to almost 5 times the size it needed to be, and 2) at times breaking the page HTML on the new system.
To clean it up, I used the htmLawed library from http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/in…
Once included, it is as simple as:
<?php
$cleaned_html = htmLawed($dirty_html, $htmlawedsettings);
?>
$htmlawedsettings can carry a multitude of settings, as explained in http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/ht…
However, at a minimum, you can have:
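<?php
$htmlawedsettings = array(
  // 2 = replace the special characters Word introduces, and also turn
  // curly single and double quotes into ordinary ones.
  'clean_ms_char' => 2,
);
?>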
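If the Word markup is still too noisy with the minimal settings, a slightly fuller configuration might look like the sketch below. It was not part of my original migration; the option names come from the htmLawed documentation, but the particular values are only an assumption about what suits Word-generated tables, so check each one against the documentation before relying on it.
<?php
// Illustrative settings only, not the configuration used in the migration.
$htmlawedsettings = array(
  // Replace Word's special characters, including curly quotes.
  'clean_ms_char' => 2,
  // Remove HTML comments, which is where Word puts its conditional markup.
  'comment' => 1,
  // Drop inline styles and class names such as MsoNormal.
  'deny_attribute' => 'style, class',
  // Compact the output by removing unneeded whitespace.
  'tidy' => -1,
);
?>
Denying style and class outright is a blunt choice: it also removes any styling you wanted to keep, so trim that list if the new site depends on those attributes.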
Last but not least, since the migration script was a Drupal module, include the htmLawed.php file (placed in the same folder as the .module file) like this:
<?php
module_load_include('php', 'your_module_name', 'htmLawed');
?>
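Putting the pieces together, the include and the cleaning call could be wrapped in one small helper inside the module. This is a minimal sketch, assuming Drupal 7 and the same placeholder module name as above; the function name your_module_name_clean_html is hypothetical and not something Drupal or htmLawed provides.
<?php
/**
 * Hypothetical helper: returns $dirty_html cleaned by htmLawed.
 */
function your_module_name_clean_html($dirty_html) {
  // Loads htmLawed.php from the module's own folder.
  module_load_include('php', 'your_module_name', 'htmLawed');
  $htmlawedsettings = array(
    'clean_ms_char' => 2,
  );
  return htmLawed($dirty_html, $htmlawedsettings);
}
?>
The migration code can then call your_module_name_clean_html() on each table's HTML just before saving it to the new site.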
There you go: sparkling clean HTML that is close to being W3C compliant!