Something has been bothering me about the discussion
and I finally figured out what it was. The assumption has been that the SGML information in a text is supplemental to the ordinary conventions of writing, such as line breaks, paragraph boundary marking and other formatting conventions achieved through the use of white space. That isn't true. Pure SGML marked-up text doesn't even require that there be ANY line breaks, or anything more than a single blank between words. Thus, `stripping' the SGML out need not produce anything you'd want, as it could run the whole text together. The Oxford English Dictionary, which is 1/2 gigabyte in size and completely done in SGML markup, contains NO carriage-returns.

This isn't a fault of SGML, since `stripping out the markup' would produce similarly damaged text, under many circumstances, for any text-formatting system I know of. E.g. `stripping' out the troff commands of a UNIX document (i.e. deleting them all where they appear) would trash the text.

What you would and should do to eliminate such markup is format the document for an output device having no highlighting or interline spacing, either as an 80-column-wide file with no page segmentation or as 80-column-wide pages of some fixed length, depending on your preference. This can be done for a few commercial text formatting systems, but not very many, as it requires making a lot of decisions about what to do with text conventions that have no real counterparts in the 80-column, no-half-line, no-underlining, no-page-break world.

    <P>Einstein<foot>Albert Einstein</foot> said <quote>E=mc<sup>2</sup></quote> and the world changed.</P>

    EinsteinAlbert Einstein said E=mc2 and the world changed.

vs.

    Einstein* said, `E=mc**2' and the world changed.

    * Albert Einstein
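The run-together result is easy to reproduce. Here is a minimal Python sketch of naive stripping, assuming a tag is simply anything between `<` and `>` (the name `strip_tags` is made up for illustration, and real SGML with declarations, entities or minimized tags would need more than this):

```python
import re

SAMPLE = ("<P>Einstein<foot>Albert Einstein</foot> said "
          "<quote>E=mc<sup>2</sup></quote> and the world changed.</P>")

def strip_tags(text: str) -> str:
    """Delete every <...> tag where it appears and keep everything else."""
    return re.sub(r"<[^>]*>", "", text)

# The footnote lands in the middle of the sentence and the exponent
# becomes an ordinary digit.
print(strip_tags(SAMPLE))
# EinsteinAlbert Einstein said E=mc2 and the world changed.
```

Nothing in the output records that "Albert Einstein" was a footnote or that the "2" was raised, which is the information a formatter would have used.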
>Something has been bothering me about the discussion
>and I finally figured out what it was. The assumption
>has been that the SGML information in a text is supplemental
>to the ordinary conventions of writing such as line breaks,
>paragraph boundary marking and other formatting conventions
>achieved through the use of white space. That isn't true.

This has been bugging me too. WYSIWYG users aren't aware of what they `don't see': there is formatting information hidden from the user in a WYSIWYG system.

>... What you would and should do
>to eliminate such markup is format the document for an output
>device having no highlighting or interline spacing and dependent
>upon your preference either as an 80-column width non-page segmented
>file or into 80-column width pages of some fixed-length.

I agree, and think this is a worst case situation. Yes, it is necessary to be able to do what you describe, but what we need are lots of filters. They can be written in lex (or GNU's flex), awk or nawk, Icon... Any one of these string manipulators will work. I think people should be less concerned about `stripping' and more concerned about translating and converting text to and from SGML. Any proper filter can be driven by a user-accessible file of ASCII descriptions.

The idea should be to be able to convert from SGML into something workable, such as TeX. One could easily convert to WYSIWYG (though I'm not sure one could convert it back again without loss of information, short of some sort of labeling system). The point is that this can be done now by just about anyone, without waiting for the ultimate software to appear. Most reasonable text processing tools already have some scheme in place to produce a file for a dumb output device. People will want to edit that `stripped text' and convert it back to SGML. Why strip it?

When the new manual pages are released for BSD UNIX, there will be hooks in the lex filters for this purpose. The manual pages will be convertible from TeX to troff and vice versa (new content based macro packages).
While BSD will not include SGML as part of 4.4, the hooks will be there, and it won't be hard for someone to do. Hopefully, too, by planning for flexibility now, if a better standard (i.e. one endorsed by the public) becomes available, it won't be hard to change.

Cynthia
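To make the "file of ASCII descriptions" idea concrete, here is a rough Python sketch of a table-driven filter from SGML to TeX. It is not the BSD lex filter, which is not shown in this thread. The rule-table format (`tag | opening text | closing text`), the names `load_rules` and `convert`, and the particular TeX renderings are all assumptions made for illustration; the tag names come from the Einstein example.

```python
import re

SAMPLE = ("<P>Einstein<foot>Albert Einstein</foot> said "
          "<quote>E=mc<sup>2</sup></quote> and the world changed.</P>")

# Hypothetical description table a user could edit without touching the code.
TEX_RULES = r"""
# tag | opening text | closing text
p     |              | \par
foot  | \footnote{   | }
quote | ``           | ''
sup   | $^{          | }$
"""

def load_rules(description: str) -> dict:
    """Parse the description table into {tag: (opening, closing)}."""
    rules = {}
    for line in description.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        tag, opening, closing = (field.strip() for field in line.split("|"))
        rules[tag.lower()] = (opening, closing)
    return rules

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def convert(sgml: str, rules: dict) -> str:
    """Replace each tag by its rule text; tags with no rule are dropped."""
    def replace(match):
        is_close, name = match.group(1), match.group(2).lower()
        opening, closing = rules.get(name, ("", ""))
        return closing if is_close else opening
    return TAG.sub(replace, sgml)

print(convert(SAMPLE, load_rules(TEX_RULES)))
# Einstein\footnote{Albert Einstein} said ``E=mc$^{2}$'' and the world changed.\par
```

Swapping in a different table would retarget the same filter at troff or another format. The sketch does not escape characters that are special to the target (`%`, `&`, `$` in TeX), which a usable filter would have to do.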
In reply to this post by Robert A Amsler-2
Perhaps the stripping question should be reformulated as follows, "(How) can
one throw together a simple filter that will format a document marked up in SGML for viewing at an ASCII terminal?" I believe that the answer has already been sketched several times in several different ways. Naturally there will be things that can't be represented on an ASCII terminal.
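One possible shape for such a filter, as a Python sketch rather than a finished tool. The tag names are taken from the Einstein example earlier in the thread. The function name `format_for_terminal` and the chosen renderings (asterisk footnote marks, `**` for superscripts, quotes as a backquote and apostrophe) are assumptions, and the input is assumed to have explicit start and end tags with no declarations, comments or entities.

```python
import re
import textwrap

SAMPLE = ("<P>Einstein<foot>Albert Einstein</foot> said "
          "<quote>E=mc<sup>2</sup></quote> and the world changed.</P>")

# Each match is either a tag (closing slash, name) or a run of plain text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")

def format_for_terminal(sgml: str, width: int = 80) -> str:
    body = []        # pieces of the paragraph being built
    paragraphs = []  # finished paragraphs, already wrapped to `width`
    footnotes = []   # footnote texts, in order of appearance
    note = None      # pieces of the footnote being read, else None

    def flush():
        # Collapse white space, since the source need not contain line breaks.
        text = " ".join("".join(body).split())
        if text:
            paragraphs.append(textwrap.fill(text, width))
        body.clear()

    for is_close, name, text in TOKEN.findall(sgml):
        name = name.lower()
        out = body if note is None else note
        if text:
            out.append(text)
        elif name == "p":
            flush()                      # both <p> and </p> end a paragraph
        elif name == "foot" and not is_close:
            # One asterisk per footnote so far; a real filter would want
            # numbered marks, since "**" is also used for superscripts here.
            body.append("*" * (len(footnotes) + 1))
            note = []
        elif name == "foot" and note is not None:
            footnotes.append(" ".join("".join(note).split()))
            note = None
        elif name == "quote":
            out.append("'" if is_close else "`")
        elif name == "sup" and not is_close:
            out.append("**")
        # any other tag has no ASCII counterpart here and is dropped
    flush()

    lines = ["\n\n".join(paragraphs)]
    if footnotes:
        lines.append("")
        lines += [f"{'*' * n} {text}" for n, text in enumerate(footnotes, 1)]
    return "\n".join(lines)

print(format_for_terminal(SAMPLE))
```

On the sample this prints the sentence with `Einstein*` and `E=mc**2`, followed by a blank line and `* Albert Einstein`. Every tag the loop has no branch for is one of the things that can't be represented on an ASCII terminal, and each needs an explicit decision.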