[Discuss] Please help with a sed script (Bill Horne)
Dale R. Worley
worley at alum.mit.edu
Tue May 25 21:08:36 EDT 2021
> From: Bill Horne <malassimilation at gmail.com>
> Here's the table-of-contents from a typical day:
>
> * 1 - [telecom] Can robocalls be tracked? - "bob prohaska"
> <bp at remove-this.www.zefox.net>
> * 2 - Re: [telecom] Can robocalls be tracked? - Bill Horne
> ? <malQRMassimilation at gmail.com>
> * 3 - [telecom] Verizon Media debuts ad-targeting solution without
> identifiers
> ? - Moderator <telecomdigestsubmissions at remove-this.telecom-digest.org>
First off, you're not specifying how the line breaks work. Are the line
breaks we see here actually in the ToC text, or are they just an
artifact of how you inserted it into this e-mail message? I ask because
line-breaking is one of the harder things to get sed to change, so we
should be clear about it.
> And here's what I'd like to change it to, using (if possible) sed:
>
> (tr)(td)Can robocalls be tracked?(/td)(/tr)
> (tr)(td)Re: Can robocalls be tracked?(/td)(/tr)
> (tr)(td)Verizon Media debuts ad-targeting solution without
> identifiers(/td)(/tr)
>
> ("less-than" and "greater-than" symbols have been changed to
> parens here for obvious reasons.)
It's not quite clear why: < and > pass through plain ASCII e-mail
unchanged, except in the first column, where > marks quoted text.
> Things to note:
>
> 1. The Subjects lines vary in length, and may contain hyphens.
> 2. The name and email of the contributor is also published with the
> actual post, further on in each digest, so it doesn't have to appear
> in the Table of Contents.
> 3. The "m" option of sed, which the manual says will do a multi-line
> "s" command, doesn't appear to work on the OS I'm using, which is
> Ubuntu 16 LTS.
You should do "sed --version" and report what it says.
The above example suggests that the title is separated from the
contributor by " - ", but you don't say that. The contributor appears
to be optional. And you don't specify whether the sequence " - " may
also appear as part of the title, which means parsing the two apart is
ambiguous.
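
To make the ambiguity concrete, here is a small sketch in Python (not
sed). The subject line in the usage note is invented for illustration and
is not taken from the digest. It only shows that "split at the first
' - '" and "split at the last ' - '" disagree as soon as a title contains
that sequence, and that neither rule is safe when the contributor is
missing.

```python
SEP = " - "

def split_first(entry):
    # Treat everything after the FIRST " - " as the contributor.
    return entry.split(SEP, 1)

def split_last(entry):
    # Treat everything after the LAST " - " as the contributor.
    return entry.rsplit(SEP, 1)

if __name__ == "__main__":
    # Hypothetical entries, made up for this example.
    with_name = "Caller ID - can it be trusted? - Bill Horne"
    without_name = "Caller ID - can it be trusted?"
    print(split_first(with_name))
    print(split_last(with_name))
    print(split_last(without_name))
```

With the invented entry that has a contributor, split_first cuts the
title short at "Caller ID", while split_last recovers the whole title.
With the entry that has no contributor, split_last mistakes the second
half of the title for a name. So the rule has to come from the format
definition, not from guessing.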
The final "Moderator" line is distinguished how? It appears that item
lines start with "\* [1-9][0-9]* - ". Does the Moderator line start
with " \? - "? That is, how do we distinguish it from a continuation of
the preceding title?
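
A rough sketch of one possible answer, in Python rather than sed. It
assumes (and these are only guesses from the sample) that the line breaks
are really in the ToC, that item lines match the pattern above, and that
a line starting with "?" or "<" belongs to the contributor. The function
name classify is made up for this example.

```python
import re

# Item lines as described above: "* ", a number, then " - ".
ITEM_START = re.compile(r"\* [1-9][0-9]* - ")

def classify(line):
    """Label one ToC line under the assumptions stated in the lead-in."""
    if ITEM_START.match(line):
        return "item"
    # ASSUMPTION: "?" or "<" in the first column marks contributor text.
    if line.startswith("?") or line.startswith("<"):
        return "contributor"
    # Anything else is taken to continue the preceding title.
    return "continuation"

if __name__ == "__main__":
    sample = [
        "* 3 - [telecom] Verizon Media debuts ad-targeting solution without",
        "identifiers",
        "? - Moderator <telecomdigestsubmissions at remove-this.telecom-digest.org>",
    ]
    for line in sample:
        print(classify(line), "|", line)
```

If a wrapped title can itself begin with "?" or "<", this rule
misclassifies it, which is exactly why the format needs to be pinned
down first.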
As others have noted, it's likely easier to generate the HTML form you
want from an earlier stage of processing, one where there's a data
structure that rigidly differentiates each title, and ideally, separates
the titles from the contributors. But if you can't do that, the first
step is to really nail down how you parse this text apart conceptually.
After that, it's much easier to implement the transformation.
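
For illustration only, here is what the transformation could look like
once the rules are fixed, sketched in Python rather than sed. Every rule
in it is an assumption read off the sample, not a confirmed description
of the digest format: item lines start with "* N - ", a leading "?" on a
following line is filler, the contributor is whatever follows the last
" - " in the joined entry, and the "[telecom] " tag is dropped. The name
toc_to_rows is hypothetical.

```python
import html
import re

ITEM_START = re.compile(r"\* [1-9][0-9]* - ")

def toc_to_rows(toc):
    """Turn ToC text into a list of <tr><td>title</td></tr> strings."""
    records = []
    for line in toc.splitlines():
        text = line.strip()
        if not text:
            continue
        m = ITEM_START.match(text)
        if m:
            # New entry: keep everything after the "* N - " prefix.
            records.append([text[m.end():]])
        elif records:
            # ASSUMPTION: a leading "?" is filler, not part of the text.
            if text.startswith("?"):
                text = text[1:].strip()
            if text:
                records[-1].append(text)
    rows = []
    for parts in records:
        record = " ".join(parts)
        # ASSUMPTION: the contributor follows the LAST " - ".
        head, sep, _ = record.rpartition(" - ")
        title = head if sep else record
        title = title.replace("[telecom] ", "", 1)
        rows.append("<tr><td>" + html.escape(title, quote=False) + "</td></tr>")
    return rows

if __name__ == "__main__":
    sample = """\
* 1 - [telecom] Can robocalls be tracked? - "bob prohaska"
<bp at remove-this.www.zefox.net>
* 2 - Re: [telecom] Can robocalls be tracked? - Bill Horne
? <malQRMassimilation at gmail.com>
* 3 - [telecom] Verizon Media debuts ad-targeting solution without
identifiers
? - Moderator <telecomdigestsubmissions at remove-this.telecom-digest.org>
"""
    print("\n".join(toc_to_rows(sample)))
```

Unlike the requested output, this joins a wrapped title onto one line,
and it will cut a title short if an entry has no contributor but its
title contains " - ".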
Dale