The problem is deleting the MS Word 2007 style tags. I don't know whether Word 2000 tags and Word 2007 tags are the same; I believe they aren't. So I cannot delete the following types of text automatically using TidyHTML:
<w:WrapTextWithPunct/>
<w:UseAsianBreakRules/>
<w:DontGrowAutofit/>
<w:SplitPgBreakAndParaMark/>
<w:DontVertAlignCellWithSp/>
<w:DontBreakConstrainedForcedTables/>
<w:DontVertAlignInTxbx/>
<w:Word11KerningPairs/>
<w:CachedColBalance/>
</w:Compatibility>
...
which continues seemingly forever, then:
<style>
<!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;}
.MsoChpDefault
{font-size:10.0pt;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;}
div.Section1
{page:Section1;}
-->
</style>
and this:
<p class=MsoNormal style='margin-left:1.0in;text-indent:-1.0in'>
and this:
<span class=SpellE><span class=GramE>
This is the junk I've been talking about, and TidyHTML will not get rid of it. I tried the TidyHTML web site, the Tidy command-line tool, and other Tidy tools that come with a GUI. None of them strips out this junk. And when I try to do it in RoboHelp, it crashes. I don't know what else I can write to better explain what I mean.
Thanks,
Chris
Chrissy_1234 wrote:
The problem is deleting the MS Word 2007 style tags. I don't know if Word 2000 tags and 2007 tags are the same, I believe they aren't. So I cannot delete the following types of text automatically using TidyHTML:
[Examples snipped]
I took the liberty of looking a little deeper into this for you. As others here have suggested, saving the files from M$Word as "filtered" HTML might help you get past your problem. I know you don't have access to the original M$Word documents, but that probably isn't a problem. You see, the reason M$Word embeds all that junk in the HTML file is that it thinks you're going to want to re-open the file in Word without losing any of the Word-specific information. If you have access to M$Word, you could simply open the saved HTML in Word, then resave it as "filtered". Of course, even Micro$oft's "filtered" output still contains more junk than you want, and for reasons that I once knew but have since forgotten, Tidy does a better job cleaning the unfiltered output than it does the filtered output.
Examining the M$Word output, however, I noticed that most of the junk was embedded in comment tags (<!-- -->). Tidy has an option for removing comment blocks: "--hide-comments yes". So I took a simple 7-page M$Word document and saved it as HTML from Word 2007. I then added these two lines to my "tidy.cfg" file:
hide-comments: y
word-2000: y
After I ran Tidy on my test document, the result was very, very clean and had been reduced in size by about 2/3. All of the elements from the "w:", "m:", and "o:" namespaces were removed, as was all of M$Word's excessive use of the "style" attribute. My build of Tidy is from December 2008.
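To make it concrete what kind of material disappeared, here is a rough Python sketch that strips the same three categories of junk with regular expressions. This is only an illustration and is not how Tidy works internally; the function name strip_word_junk is made up for this example, and regular expressions are a fragile way to process HTML, so treat it as a stopgap at best.

```python
import re

def strip_word_junk(html):
    """Rough illustration only: remove the three kinds of Word junk described above."""
    # Drop HTML comments, which is where Word hides most of its extra markup
    # (including the commented-out body of the <style> block).
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    # Drop opening, closing and self-closing tags in the w:, m: and o: namespaces.
    # Any text between such tags is left in place.
    html = re.sub(r"</?(?:w|m|o):[^>]*>", "", html)
    # Drop inline style attributes, quoted with either ' or ".
    html = re.sub(r"\sstyle=(['\"]).*?\1", "", html, flags=re.DOTALL)
    return html

sample = ("<p class=MsoNormal style='margin-left:1.0in;text-indent:-1.0in'>"
          "Hi<w:DontGrowAutofit/></p><!-- junk -->")
print(strip_word_junk(sample))
```

Note that this leaves the class=MsoNormal attributes and the SpellE/GramE spans alone; it only covers the comment, namespace, and style cases mentioned above.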
I would suggest you try Tidy again on your original M$Word .htm files using the foregoing configuration options. This might get you over the hump; if not, let me know and I'll see if I can make some changes to Tidy that will get rid of what is still bothering you.
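If you have a whole folder of files to clean, something like the following Python sketch could run Tidy over each one with that "tidy.cfg". It assumes the tidy executable is on your PATH and that the config file sits in the current directory; the function names and the "cleaned" output folder are made up for this example. The -config and -o switches are standard Tidy command-line options.

```python
import subprocess
from pathlib import Path

def tidy_command(src, dst, config="tidy.cfg"):
    """Build the Tidy command line for one input file (names here are illustrative)."""
    return ["tidy", "-config", str(config), "-o", str(dst), str(src)]

def clean_folder(folder, out_dir="cleaned", config="tidy.cfg"):
    """Run Tidy on every .htm file in `folder`, writing results to `out_dir`."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for src in sorted(Path(folder).glob("*.htm")):
        dst = out / src.name
        # Tidy exits with 1 when it only has warnings and 2 when it found errors,
        # so a non-zero exit code is not automatically a failure.
        result = subprocess.run(tidy_command(src, dst, config),
                                capture_output=True, text=True)
        if result.returncode >= 2:
            print(f"{src}: Tidy reported errors")
            print(result.stderr)
```

Calling clean_folder("word_export") would leave the originals untouched and put the cleaned copies in "cleaned", so you can compare before importing anything into RoboHelp.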
(If, after importing clean HTML, RoboHelp is still crashing on you, I can't offer any further help; at that point it's clearly a RoboHelp problem, not an HTML problem.)