Good Word doc -> plain text conversion
Gordon Marx
gcmarx-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Sun Sep 19 20:01:01 EDT 2010
You know who's totally psyched about this email? Susan Cutright and
Rebecca Sniderman...
On Sun, Sep 19, 2010 at 8:01 PM, <jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org> wrote:
> Dan Ritter wrote:
> | antiword is the usual candidate. Every one of Google's first ten
> | results for that is relevant.
>
> Yeah, I thought of that, too, but I was hoping there might be something that
> does a better job. In one of my current sample .doc files, for example,
> antiword produces the curious table entry:
>
> | CUTRIGHT, Susan |11 Arlington Road | (781)209-9877 |
> | |Waltham, MA 02453 |susan.cutright at ASPENTECH|
> | | |.com |
>
> Note the "wrapping" of the email address, with the ".com" on a separate line.
> When Word displays this on a Windows screen, this wrapping doesn't happen.
> The 3rd column strings are actually centered, and the email address is
> whole.
>
> After a bit of exploring, I found that the -w option works to get a wider
> "page" size, and this entry actually works, but others in the file don't.
> When I tried things like "antiword -w 200 <file>", it reduced the width
> to 138, which seems to be the widest "page" that it believes possible. So
> later in the same file, I get the following 138-char-wide chunk:
>
> |SNIDERMAN, Rebecca |MB 1794 Brandeis University P O Box | rsnider-1FONPbNgvBv2fBVCVOL8/A at public.gmane.org |
> | |549110 | |
> | |Brandeis University | |
> | |Waltham, MA 02454-9110 | |
>
> Note the bizarre 4-line address, with just "549110" on the second line. Of
> course, the sensible thing would be to remove the first "Brandeis University"
> from the address, but that's what's in the file, and there are other entries
> with quite long addresses. I tried to write a perl parser that would handle
> all the entries in this file and a couple of others, and after an afternoon
> of hacking at it, I still haven't quite succeeded. Such spurious line
> wrapping, including things like splitting ".net" into ".n" and "et" in one
> case, can be one of the trickier kinds of damage to fix.
>
> I wonder if there's a clean fix to this sort of problem?
>
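One rough way to attack the continuation-row problem, sketched in Python rather than perl. This is only a heuristic under stated assumptions, not a known-good fix: it assumes every table line is pipe-delimited, that a row with an empty first cell continues the previous entry, and that a fragment following a space-free piece containing "@" is the tail of a wrapped email address. The function names and the sample entry are made up for illustration. Address cells are left as separate fragments, because the text output alone doesn't say whether a break was a real line break in the cell or a spurious wrap.

```python
import re

def split_row(line):
    """Split one '|a|b|c|' table line into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]

def merge_rows(lines):
    """Group table lines into records, one list of fragments per column.

    Assumption: a row whose first cell is empty continues the previous record.
    """
    records = []
    for line in lines:
        if not line.strip():
            continue
        cells = split_row(line)
        if cells[0] or not records:
            records.append([[c] if c else [] for c in cells])
        else:
            for fragments, cell in zip(records[-1], cells):
                if cell:
                    fragments.append(cell)
    return records

def looks_like_email_tail(prev, frag):
    """Heuristic: prev is a space-free piece with '@', frag is a bare token."""
    return ("@" in prev and " " not in prev
            and re.fullmatch(r"\.?[A-Za-z0-9.-]+", frag) is not None)

def join_fragments(fragments):
    """Glue wrapped email pieces back together; leave other fragments alone."""
    out = []
    for frag in fragments:
        if out and looks_like_email_tail(out[-1], frag):
            out[-1] += frag
        else:
            out.append(frag)
    return out

# Hypothetical sample entry, shaped like the wrapped output quoted above.
SAMPLE = [
    "| DOE, Jane |1 Example Road | (555)555-0100 |",
    "| |Anytown, MA 00000 |jane.doe@EXAMPLE|",
    "| | |.com |",
]

if __name__ == "__main__":
    for record in merge_rows(SAMPLE):
        print([join_fragments(column) for column in record])
```

The email check is deliberately narrow and will still misfire in some cases, for instance a bare token such as an extension number that happens to follow a complete address in the same cell. Treat it as a starting point to tune against the real files.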
> (And why a max of 138 chars? That's a rather bizarre number.)
>
>
> --
> _'
> O
> <:#/> John Chambers
> + <jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org>
> /#\ <jc1742-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org>
> | |
> _______________________________________________
> Discuss mailing list
> Discuss-mNDKBlG2WHs at public.gmane.org
> http://lists.blu.org/mailman/listinfo/discuss
>