tag stripping


tag stripping

Malcolm Brown
I think Erik Naggum's caution about the use of regular expressions
is well-founded. He also makes additional important points. I
think the gist of it is: it may not be so easy to automatically
remove SGML tags from a file.  Hence it would seem that
the job of removing tags in order to produce a human-readable
version may not be nearly as easy as some have suggested.
Caution is warranted.

Unix nitpick:  the question has been raised as to what kind
of regular expression might match SGML tags the best. Certainly,
as Erik has pointed out, the initial suggestion
is indeed far too destructive.

Isn't a more elegant solution the regexp:

      <[^<>]*>

This regexp uses the angle brackets as delimiters.  It matches
the delimiters and also matches any combination of 7-bit
characters inside the delimiters *except* the delimiters themselves.

Implemented in awk, this might look like:

      # Delete every <...> span on the line; [^<>] excludes both delimiters.
      { gsub(/<[^<>]*>/, "") }
      { print $0 }

This, of course, doesn't handle tags that span more than one
line, as Erik has pointed out.
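
One possible way around the multi-line limitation, offered as a minimal sketch and not as anything proposed in this thread: collect the whole input into one string before substituting, so that a tag broken across lines is still a single match. The function name `strip_tags` and the variable `buf` are made up for illustration. The sketch assumes the input fits in memory and that the awk in use lets a negated bracket expression match a newline.

```shell
# strip_tags: hypothetical helper, not part of the original post.
# Reads standard input, writes the text with <...> spans removed.
strip_tags() {
  awk '
    # Accumulate every input line (with its newline) into one buffer.
    { buf = buf $0 "\n" }

    # Substitute once over the whole buffer, so a tag that was split
    # across lines is matched as a unit.
    END {
      gsub(/<[^<>]*>/, "", buf)
      printf "%s", buf
    }
  '
}
```

It would be used as a filter, for example `strip_tags < marked-up.sgml > plain.txt` (both file names are placeholders). It is still only a pattern match: a `>` inside a quoted attribute value, for instance, ends the match too early, which is the kind of case the caution above is about.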

Malcolm Brown


Re: tag stripping

Michael S. Hart-2
re Malcolm Brown's comments from Stanford

How difficult would it be to produce a thesaurus of various markups to
be searched and deleted?  According to mbb, it would be very difficult
if not impossible, given the varying ways he outlined, and the manners
in which various persons might find different but equivalent ways that
accomplish the same ends.

Wouldn't it solve these problems if the text was made available in its
original or "pure ASCII" form, as well as in marked up format?

Users would not have to import one edition and change it into another,
thus wasting time as well as disk space as has been suggested would be
the problem with having two editions in the first place.  Requirements
could be decided by the user when getting the file, much as a customer
in a record store can decide on tape, record or CD.

Michael S. Hart

(ref to posting by "naggum"?? in reply to SGML stripping discussion)


Re: tag stripping

In reply to this post by Malcolm Brown
As a newcomer to this list, I lack a certain amount of background material,
but Malcolm Brown's comment about the difficulty of removing SGML markup---and
the discussion about removing it by others---raises a question for me.  Isn't
the purpose of SGML (ultimately) to produce a text that can be formatted,
searched, italicized where indicated, and the like by any number of systems?
Grover Zinn


Re: tag stripping

In reply to this post by Malcolm Brown
On the topic of markup-stripping:

Various means of stripping markup by sed or awk or text editors have
been discussed.

Wouldn't it be better to take a publicly available SGML parser (the
NIST parser, say), and modify it so that it only outputs text, and not

Anders Thulin       [hidden email]   {uunet,mcsun}!sunic!prosys!ath
Telesoft Europe AB, Teknikringen 2B, S-583 30 Linkoping, Sweden
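
To illustrate the difference between pattern matching and scanning with state, here is a rough sketch of a character-level scanner in awk. It is emphatically not the NIST parser, nor an SGML parser of any kind: it knows nothing about comments, marked sections, entity references or tag minimization, and it treats every `<` as the start of a tag. All it adds over the regexp approach is that it remembers whether it is inside a quoted attribute value, and it keeps that state across line boundaries. The function name `strip_tags_scan` and the state names are invented for this example.

```shell
# strip_tags_scan: hypothetical helper, a toy scanner and not a parser.
strip_tags_scan() {
  awk '
    BEGIN { state = "text"; quote = "" }
    {
      line = $0 "\n"
      out = ""
      n = length(line)
      for (i = 1; i <= n; i++) {
        c = substr(line, i, 1)
        if (state == "text") {
          # Outside markup: copy characters until a tag opens.
          if (c == "<") state = "tag"
          else out = out c
        } else if (state == "tag") {
          # Inside a tag: drop everything, but note an opening quote.
          # "\047" is the octal escape for the apostrophe character.
          if (c == ">") state = "text"
          else if (c == "\"" || c == "\047") { state = "quote"; quote = c }
        } else if (c == quote) {
          # Inside a quoted value: only the matching quote ends it.
          state = "tag"
        }
      }
      printf "%s", out
    }
  '
}
```

With this, input such as `<a href="x>y">link</a>` comes out as `link`, because the `>` inside the quotes no longer closes the tag. A modified real parser would be needed for anything the DTD controls.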