Editors for removing SGML coding

COLLIEAJ
Christopher Currie asks what editors are available for MS-DOS
machines to strip out SGML coding.
One solution is GNU sed, which is freely available.
This is a PC-based version of the Unix Stream EDitor and will easily
remove strings matching a specified pattern, eg <xxxxxx>.
You can run this program over any input file containing SGML
coding as follows:
        sed 's/<.*>//g' file.sgm > file.txt
This will match any string of the form <...> and remove it. The
output is then redirected into a new file.
This is a prime candidate for inclusion in a .BAT file, which could
also replace the input file with the output if the original were no
longer required. The original must be deleted first, because ren will
not overwrite an existing file, eg:
        sed 's/<.*>//g' %1 > TEMPFILE
        del %1
        ren TEMPFILE %1

If you require more specific pattern matching, a script of
commands can be supplied to sed, such as:
        s/<apple>//g
        s/<banana>//g
        s/<cherry>//g
which would replace each of the specified strings with the
empty string.
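
As a rough sketch of how such a script might be used: the commands can
be saved in a file and passed to sed with the -f option. The file name
strip.sed, the function name strip_named, and the sample line are made
up for illustration.

```shell
# Write the three substitution commands to a script file.
# (strip.sed is a hypothetical name chosen for this example.)
cat > strip.sed <<'EOF'
s/<apple>//g
s/<banana>//g
s/<cherry>//g
EOF

# Apply the script to standard input. Only the three named tags are
# removed; any other tag, such as <date>, is left in place.
strip_named() {
    sed -f strip.sed
}

printf '%s\n' '<apple>red<banana>yellow<date>brown' | strip_named
# prints: redyellow<date>brown
```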

If you have access to JANET you may obtain a copy of GNU sed from
lancs.pdsoft using ftp or kermit.  The program resides in a directory
called micros/ibmpc/f12 and is called f12sed.boo.
You will also need the `deboo' facility which is available
from the same site as a BASIC program. Contact lancs.pdsoft
for further details.

Alex Collier
University of Birmingham


Re: Editors for removing SGML coding

Erik Naggum-2
Ahem, I've been listening in to the discussions on this list for a
while, and I think this current debate on un-sgml-ifying (!) text
is quite beside the point.  SGML adds structure to and information
about a text.  This cannot just be "stripped" away, without major
loss of usefulness of the text.  I'm opposed to the very idea of
stripping off information just like that!

SGML texts are not always easily readable by humans, often thanks to
a very confused usage of SGML tags.  Lest we forget, we are dealing
with _Generalized_ Markup, not layout and sundry presentational details.
One of the things which attracted _me_ to SGML was the utter absence of
such litter in the information flow.  Some people, however, want to code
presentational devices, such as boldface, italics, bullets, etc, into
SGML, without thought for the general idea which caused these devices to
be used by some typographical genius way back in time when all they had
was Specific Markup, although nobody called it that.

The data content is information to the reader in the conventional way,
and SGML is a way to approach the information otherwise embedded in the
layout and presentation in a more abstract or general way.  Imagine
getting a newspaper with all the information spilled down in 8 columns
with no font changes for headings, no indents or line breaks for
paragraphs, ASCII encoded pictures and drawings, no typographical
distinction between ads and editorial matter, no company logos in the
ads, etc, etc.  That's what you're doing to a text by ripping out the
generalized markup that SGML encoding represents.  Just don't do it.

We should strive to get simple, yet powerful DTDs without cluttering the
text with all sorts of minutiae.  This will most probably take an almost
herculean effort, but such is life when working revolutions.

----------------

My second point is that all the examples of how to rip out the skeleton
of SGML text, to be left with an unstructured heap of data contents, are
way, way too powerful.

Consider the line of text:

        <author>Erik Naggum</author>

and the regular expression

        <.*>

The regular expression will match the entire line, _including_the_data_
_content_!  Probably not what you wanted.  Try this:

        <[^>]*>

but this doesn't match a start-tag accidentally left at the end of a
line, with attributes on the next line.  It doesn't match the null end
tag (net) enabled with the short tag feature, either:

        <p>This paragraph has an
        embedded <q/quotation/ in it.</p>

If you use marked sections, what do you do with

        <[ IGNORE [
            The next paragraph is not sufficiently clear.
            Do you have any suggestions for improvements, Tom?
        ]]>
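
The behaviour of the two patterns can be checked with a few lines of
shell. This is only a sketch: the function names are invented, the
third function uses a common sed idiom for joining all input lines
before substituting, and it assumes a sed (such as GNU sed) in which
[^>] also matches a newline. None of it handles marked sections.

```shell
# Greedy pattern: .* runs to the LAST '>' on the line, so the data
# content is deleted along with the tags.
strip_greedy() { sed 's/<.*>//g'; }

# Negated class: each match stops at the first '>', so the content
# between a start-tag and its end-tag survives.
strip_tags() { sed 's/<[^>]*>//g'; }

# Same pattern, but every input line is first appended to the pattern
# space, so a tag broken across two lines can still match.
strip_tags_multiline() {
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g'
}

printf '%s\n' '<author>Erik Naggum</author>' | strip_greedy  # empty line
printf '%s\n' '<author>Erik Naggum</author>' | strip_tags    # Erik Naggum
printf '%s\n' '<p' 'id="x">text</p>' | strip_tags_multiline  # text

# The null end tag still defeats it: the match runs from <q to the '>'
# of </p>, so the quotation and the rest of the sentence are lost.
printf '%s\n' 'embedded <q/quotation/ in it.</p>' | strip_tags
```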

The list goes on and on.  I think I have succeeded in showing why simple
sed scripts will be disastrous, and instead propose:

        We need to define a specific layout for each SGML tag,
        suitable for character based display units, and build
        this into a widely distributed piece of software.

This is not very different from a parser which produces, say, PostScript
code, but we still need to make it a complete parser.  Anything less
won't do the job.

I would also like to see more powerful tools, such as an SGML sensitive
"grep" utility, which can take a set of tags and a regular expression,
and find matches to the regexp only in the data content of the specified
tags.  I may be dreaming, though.
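
To make the idea concrete, here is a toy approximation in shell. The
name sgml_grep and its interface are invented for illustration, and it
is emphatically not a parser: it assumes one element per line, with
start-tag and end-tag on the same line, no attributes, and no nested
markup, so it fails on exactly the cases listed above.

```shell
# sgml_grep TAG REGEXP
# Reads SGML-like text on standard input and prints the data content
# of each <TAG>...</TAG> element whose content matches REGEXP.
# Assumes: one element per line, both tags on that line, no
# attributes, and no markup nested inside the element.
sgml_grep() {
    # sed keeps only the content of matching elements; grep then
    # searches that content and nothing else.
    sed -n "s/.*<$1>\([^<]*\)<\/$1>.*/\1/p" | grep -e "$2"
}

printf '%s\n' \
    '<title>Notes on Naggum</title>' \
    '<author>Erik Naggum</author>' \
    '<author>Alex Collier</author>' | sgml_grep author 'Naggum'
# prints: Erik Naggum
```

The title line also contains "Naggum" but is not reported, because the
search is confined to the content of the author elements.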

[Erik Naggum]
Naggum Software, Oslo, Norway

Re: Editors for removing SGML coding

Michael S. Hart-2
In reply to this post by COLLIEAJ
re the comments by Erik Naggum <[hidden email]>

"This cannot just be "stripped" away, without major
loss of usefulness of the text.  I'm opposed to the very idea of
stripping off information just like that!"

The text has no use whatsoever to anyone who does not have
the facilities to work with it, except of course for human
eye reading; HOWEVER . . . IF the text were made available
in BOTH SGML AND "STRIPPED FORMATS". . . THEN the text can
be used by anyone on any kind of computer.  The problem is
that search programs do not ignore this kind of markup, or
in some cases, even the c/r, l/f or even "soft c/r"s which
are used by many word processors.  It should be universal!
At least as universal as possible, so people can import an
article in SGML into WordPerfect, Word, or GREP, etc, etc,
and be able to use it, whether on a mainframe, an Apple I,
an Atari, a Commodore, etc, etc, etc.

What is it about this point that is so difficult to accept
or to understand?  Doesn't anyone want readers to be able,
in as many cases as possible, to take home an article with
them, load it onto their own PC, search for appropriate or
quotable portions, and then write their own response, with
the inclusion of the relevant passages?  Do you think I am
including the quotation or name and address of this author
of this note without using such features?

Thank you for your interest,

Michael S. Hart, Director, Project Gutenberg
National Clearinghouse for Machine Readable Texts

THESE NOTES ARE USUALLY WRITTEN AT A LIVE TERMINAL, AND THE
CHOICE OF WORDS IS OFTEN MEANT TO BE SUCH AS TO PROVOKE THE
GREATEST POSSIBLE RESPONSE SHORT OF BEING OFFENSIVE.  TRUTH
IN THESE NOTES IS OF GREAT CONCERN, THE FORM IS SECONDARY -
OTHER THAN THE TOKEN EFFORT OF JUSTIFIED RIGHT MARGINATION.

BITNET:  HART@UIUCVMD      INTERNET:  [hidden email]
(*ADDRESS CHANGE FROM *VME* TO *VMD* AS OF DECEMBER 18!!**)
(THE GUTNBERG SERVER IS LOCATED AT [hidden email])

NEITHER THE ABOVE NAMED INDIVIDUALS NOR ORGANIZATIONS ARE
AN OFFICIAL REPRESENTATIVE OF ANY OTHER INSTITUTION NOR ARE
THE ABOVE COMMENTS MEANT TO IMPLY THE POLICIES OF ANY OTHER
PERSONS OR INSTITUTIONS, THOUGH OF COURSE WE WISH THEY DID.


The greatest use of electronic text is just that, that any
reference can be searched and quoted (accurately, I should
add) with minimal effort.  If all material in your library
were like that, think how easy it would be to do research.
Think of the millions of 3 x 5 cards, and their mistakes a
library of this kind could replace!  Think of the time any
researcher, from grade school to emeritus, could save!

Can this point be too something for all to understand?
