Why `stripping out the SGML' isn't possible

Why `stripping out the SGML' isn't possible

Robert A Amsler-2
Something has been bothering me about the discussion
and I finally figured out what it was. The assumption
has been that the SGML information in a text is supplemental
to the ordinary conventions of writing such as line breaks,
paragraph boundary marking and other formatting conventions
achieved through the use of white space. That isn't true.

Pure SGML marked-up text doesn't even require ANY line breaks,
or more than a single blank between words. Thus, `stripping'
the SGML out need not produce anything you'd want, as it could
run the whole text together.

The Oxford English Dictionary, which is 1/2 gigabyte in size
and completely done in SGML markup, contains NO carriage-returns.

This isn't a fault of SGML, since `stripping out the markup'
would produce similarly damaged text for any text formatting
system I know of under many circumstances. E.g. `stripping' out the troff
commands of a UNIX document (i.e. deleting them all where they
appear) would trash the text. What you would and should do
to eliminate such markup is format the document for an output
device having no highlighting or interline spacing and, depending
upon your preference, either as an 80-column width non-page segmented
file or into 80-column width pages of some fixed length.
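
As a rough illustration of the two output shapes just described (one
continuous 80-column file, or 80-column pages of a fixed length), here is
a minimal Python sketch. The function names and the page length are
assumptions made for the example, and the input is a stand-in for text
that arrives with no line breaks at all.

```python
import textwrap

def to_lines(text, width=80):
    """Re-flow one long run of text into lines no wider than `width`."""
    return textwrap.wrap(text, width=width)

def paginate(lines, page_length=60):
    """Cut the lines into pages of exactly `page_length` lines.

    The last page is padded with blank lines so every page has the
    same length. The default of 60 is an arbitrary choice.
    """
    pages = []
    for start in range(0, len(lines), page_length):
        page = lines[start:start + page_length]
        page += [""] * (page_length - len(page))
        pages.append(page)
    return pages

# Stand-in for a text that contains no carriage-returns at all.
text = "word " * 400

lines = to_lines(text)                   # the non-page-segmented form
pages = paginate(lines, page_length=10)  # the fixed-length page form
```

Joining `lines` with newlines gives the non-page segmented file; joining
each page and separating the pages with a form feed would give the paged
variant.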

This can be done for a few commercial text formatting systems,
but not very many, as it requires making a lot of decisions
about what to do with text conventions that have no real
counterparts in the 80-column, no-half-line, no-underlining,
no-page-break world.

<P>Einstein<foot>Albert Einstein</foot> said <quote>E=mc<sup>2</sup></quote>
and the world changed.</P>

EinsteinAlbert Einstein said E=mc2
and the world changed.

vs.

   Einstein* said, `E=mc**2' and the world changed.

* Albert Einstein
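
The contrast in the example above (the SGML source, the result of
deleting its tags, and a formatted rendering) can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
it knows just the four tags used in the example, it marks every footnote
with the same `*`, and it does not insert the comma after "said" that
appears in the hand-formatted version.

```python
import re

SGML = ("<P>Einstein<foot>Albert Einstein</foot> said "
        "<quote>E=mc<sup>2</sup></quote>\nand the world changed.</P>")

def strip_tags(text):
    """Delete every tag where it stands, keeping all other characters."""
    return re.sub(r"<[^>]+>", "", text)

def format_plain(text):
    """Render the example's tags for a plain 80-column ASCII device."""
    footnotes = []

    def note(match):
        # Move the footnote text out of line and leave a marker behind.
        footnotes.append(match.group(1))
        return "*"

    text = re.sub(r"<foot>(.*?)</foot>", note, text, flags=re.S)
    text = re.sub(r"<sup>(.*?)</sup>", r"**\1", text, flags=re.S)
    text = re.sub(r"<quote>(.*?)</quote>", r"`\1'", text, flags=re.S)
    text = re.sub(r"</?P>", "", text)
    # Collapse all white space and indent the paragraph by three blanks.
    body = "   " + " ".join(text.split())
    notes = "".join("\n\n* " + n for n in footnotes)
    return body + notes

print(strip_tags(SGML))
print()
print(format_plain(SGML))
```

The first function reproduces the run-together "EinsteinAlbert Einstein"
text; the second has to decide what each tag means on the output device,
which is the work that simple deletion skips.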

Re: Why `stripping out the SGML' isn't possible

Cynthia Livingston
  >Something has been bothering me about the discussion
  >and I finally figured out what it was. The assumption
  >has been that the SGML information in a text is supplemental
  >to the ordinary conventions of writing such as line breaks,
  >paragraph boundary marking and other formatting conventions
  >achieved through the use of white space. That isn't true.

This has been bugging me too. WYSIWYG users aren't aware of what they
`don't see': there is formatting information hidden from the user
in a WYSIWYG system.

  >... What you would and should do
  >to eliminate such markup is format the document for an output
  >device having no highlighting or interline spacing and dependent
  >upon your preference either as an 80-column width non-page segmented
  >file or into 80-column width pages of some fixed-length.


I agree and think this is a worst case situation. Yes, it is
necessary to be able to do what you describe, but what we need are
lots of filters.  They can be lex (or GNU's flex), awk or nawk,
icon... Any one of these string manipulators will work.  I think
people should be less concerned about `stripping' and more concerned
about translating and converting text to and from SGML.  Any proper
filter can be made with a user-accessible file of ASCII descriptions.
The idea should be to be able to convert from SGML into something
workable, such as TeX.  One could easily convert to
WYSIWYG (not sure one could convert it back again without loss of
info without some sort of labeling system). The point is this can
be done now by just about anyone, without waiting for ultimate
software to appear.  Most reasonable text processing tools have some
scheme already in place to produce a file for a dumb output device.
People will want to edit that `stripped text' and convert
it back to SGML. Why strip it?
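
To make the idea of a filter driven by a user-accessible file of ASCII
descriptions a little more concrete, here is a small sketch in Python
rather than lex or awk. The rule format ("tag | opening text | closing
text"), the tag names and the LaTeX-style output are all assumptions
invented for this illustration; they are not taken from any existing
tool.

```python
import re

# Hypothetical description file: one rule per line,
# "tag | opening text | closing text". Users would edit this, not the code.
RULES_TEXT = r"""
P     |            | \par
quote | ``         | ''
sup   | $^{        | }$
foot  | \footnote{ | }
"""

def load_rules(text):
    """Parse the rule descriptions into {tag: (opening, closing)}."""
    rules = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tag, opening, closing = (field.strip() for field in line.split("|"))
        rules[tag.lower()] = (opening, closing)
    return rules

def convert(sgml, rules):
    """Replace each known start or end tag with its described output."""
    def replace(match):
        is_end_tag, name = match.group(1), match.group(2).lower()
        if name not in rules:
            return match.group(0)  # leave unknown tags visible
        return rules[name][1 if is_end_tag else 0]
    return re.sub(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>", replace, sgml)

rules = load_rules(RULES_TEXT)
sample = ("<P>Einstein<foot>Albert Einstein</foot> said "
          "<quote>E=mc<sup>2</sup></quote>\nand the world changed.</P>")
print(convert(sample, rules))
```

Because the tags are translated rather than thrown away, the footnote and
the superscript survive in the output, and a second rule file could target
troff or a dumb terminal without touching the code.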

When the new manual pages are released for BSD UNIX, there will be
hooks in the lex filters for this purpose. The manual pages will
be convertible from TeX to troff and vice versa (new content-based
macro packages).  While BSD will not include SGML as part of 4.4,
the hooks will be there, and it won't be hard for someone to do.

Hopefully too, by planning flexibility now, if a better standard (i.e.
one endorsed by the public) becomes available, it won't be hard to
change.

Cynthia

Re: Why `stripping out the SGML' isn't possible

koontz
In reply to this post by Robert A Amsler-2
Perhaps the stripping question should be reformulated as follows, "(How) can
one throw together a simple filter that will format a document marked up in
SGML for viewing at an ASCII terminal?"  I believe that the answer has
already been sketched several times in several different ways.  Naturally
there will be things that can't be represented on an ASCII terminal.