[Discuss] Converting "rich" (MIME) email to plain text
Chuck Anderson
cra at WPI.EDU
Wed Feb 17 13:18:35 EST 2016
On Wed, Feb 17, 2016 at 11:39:22AM -0500, Michael Tiernan wrote:
> I'm sure that I'm not the first who tried to find an easy way to
> filter a piece of email so that only the plain text comes out.
>
> I can find lots of things about going plain to HTML but I've not
> seen anything that allows you to just extract the "Content-Type:
> text/plain" section of an email.
>
> Any pointers available? I don't want to try and reinvent the
> reinvented wheel.
Here is what I use with Mutt to get lightly formatted text and
unobfuscated links. It isn't perfect, but it works acceptably about 90%
of the time, and it avoids downloading anything from remote links, which
was my primary goal.
>grep mailcap .muttrc
set mailcap_path = ~/.muttmailcap
set mailcap_sanitize
>cat .muttmailcap
text/html; /home/cra/bin/striphtml.pl; copiousoutput
text/calendar; /home/cra/bin/vcalendar-filter; copiousoutput
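
For these mailcap entries to be used when a message is displayed in the
pager, Mutt generally also needs auto_view lines for the same types, and
alternative_order decides which part of a multipart/alternative message
is shown first. The grep output above shows neither, so the lines below
are an assumption about what the rest of .muttrc might contain, not a
copy of it:

```muttrc
# Assumption: these lines are not part of the .muttrc excerpt shown above.
# Prefer the text/plain part when a message offers several alternatives.
alternative_order text/plain text/enriched text/html
# Render these types inline through their copiousoutput mailcap entries.
auto_view text/html text/calendar
```

With text/plain listed first, the HTML filter should only run for
messages that carry no plain-text alternative.
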
>cat ~/bin/striphtml.pl
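#!/usr/bin/perl
# Mailcap filter for Mutt: read an HTML part and print lightly formatted
# plain text, followed by a list of the links it contains.
use strict;
use warnings;

use HTML::Strip;
use HTML::LinkExtor;
use HTML::Entities qw/decode_entities/;
use URI::Escape qw/uri_unescape/;
use Encode qw/from_to/;

# Slurp the whole input (first file argument, or STDIN) in one read.
undef $/;
my $html_text = <ARGV>;
$html_text = '' unless defined $html_text;

# Look for a charset in a Content-Type header line, if the input has one.
my $charset = 'UTF-8';
if ($html_text =~ /\ncontent-type:\s+text\/html;\s+charset="?([^\s";]+)/i) {
    $charset = $1;
} else {
    print "no char set\n";
    #print $html_text;
}

# Turn line and paragraph breaks into newlines before the tags are stripped.
# This also covers <br/>, <br /> and <p> tags that carry attributes.
$html_text =~ s/<br\s*\/?>/\n/gi;
$html_text =~ s/<p\b[^>]*>/\n/gi;

my $hs = HTML::Strip->new();
my $stripped_text = $hs->parse($html_text);
$hs->eof;

my $decoded_text = decode_entities($stripped_text);
$decoded_text =~ s/\n\s*\n/\n\n/g;    # blank out whitespace-only lines
$decoded_text =~ s/\n\n+/\n\n/g;      # collapse runs of blank lines
$decoded_text =~ s/\240/ /g;          # non-breaking space -> plain space
$decoded_text =~ s/\r//g;             # drop carriage returns
#$decoded_text = decode($charset, $decoded_text);
###from_to($decoded_text, $charset, 'UTF-8');

# Collect the links from the HTML; eof flushes the parser.
my $hl = HTML::LinkExtor->new();
$hl->parse($html_text);
$hl->eof;
my @links = $hl->links;

print "Charset: $charset\n";
print "Message:\n\n";
print $decoded_text;
print "\nLinks:\n\n";
# Each entry is [tag, attribute, URL, ...]; print the first attribute/URL pair.
foreach my $link (@links) {
    printf "%-7s %-15s %s\n", $link->[0], $link->[1],
        uri_unescape($link->[2]);
}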
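
The filter above converts the HTML part rather than picking out the
text/plain part that the original question asked about. For a saved
message, that extraction could be sketched in Perl roughly as follows.
This assumes the Email::MIME module from CPAN is installed. The sub name
plain_text_parts and the script name in the usage line are made up for
illustration, and text/plain attachments are printed along with the body.

```perl
#!/usr/bin/perl
# Sketch: print only the text/plain parts of a MIME message.
# Assumes Email::MIME (CPAN) is available; plain_text_parts is a
# hypothetical helper name, not part of any library.
use strict;
use warnings;
use Email::MIME;

# Walk the MIME tree and return the decoded body of every text/plain leaf.
sub plain_text_parts {
    my ($part) = @_;
    my @subparts = $part->subparts;
    return map { plain_text_parts($_) } @subparts if @subparts;

    # A part without a Content-Type header defaults to text/plain.
    my $type = $part->content_type;
    $type = 'text/plain' unless defined $type;
    return $type =~ m{^\s*text/plain\b}i ? ($part->body_str) : ();
}

# Run only when called as a script with at least one argument;
# a single "-" argument makes <> read from standard input.
if (!caller && @ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    my $raw = do { local $/; <> };
    print plain_text_parts(Email::MIME->new($raw));
}
```

Usage would be something like "perl plaintext.pl message.eml", or
"perl plaintext.pl -" at the end of a pipe.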