ODD by example utility

Sebastian Rahtz
I have entertained myself recently by writing a utility
which attempts to work out the minimal TEI customization
needed to validate a collection of files.

What I have done is create an XSLT (version 2) stylesheet which
traverses a nominated directory tree looking for
*.xml files which have <TEI> or <teiCorpus> root
elements. It analyzes the collection of elements
and attributes in the resulting corpus, and compares
that to the whole of TEI P5. An ODD file is generated
which

  * loads the required modules
  * deletes any elements which are not used
  * deletes any attributes (including class attributes)
    which are not used by each element
  * for every attribute which has a TEI "data.enumerated" datatype,
    constructs a closed <valList> enumerating the values actually used.

From this you can construct a target schema, obviously.
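
The collection step described above can be pictured roughly as follows. This is a minimal Python sketch of the idea, not the XSLT in oddbyexample.xsl: the function name `survey`, the returned data structure, and the choice to skip elements outside the TEI namespace are all assumptions made for illustration.

```python
import os
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ROOTS = {"{%s}TEI" % TEI_NS, "{%s}teiCorpus" % TEI_NS}


def short_name(name):
    """Turn '{namespace}local' into 'local' (or 'xml:local' for xml:* attributes)."""
    if name.startswith("{" + XML_NS + "}"):
        return "xml:" + name.rsplit("}", 1)[-1]
    return name.rsplit("}", 1)[-1]


def survey(corpus_dir):
    """Return {element: {attribute: set(values)}} for the TEI files under corpus_dir."""
    usage = defaultdict(lambda: defaultdict(set))
    for folder, _subdirs, files in os.walk(corpus_dir):
        for fname in files:
            if not fname.endswith(".xml"):
                continue
            try:
                root = ET.parse(os.path.join(folder, fname)).getroot()
            except ET.ParseError:
                continue  # skip files that are not well-formed
            if root.tag not in ROOTS:
                continue  # only <TEI> and <teiCorpus> documents count
            for el in root.iter():
                if not el.tag.startswith("{" + TEI_NS + "}"):
                    continue  # assumption: ignore non-TEI namespaces
                attrs = usage[short_name(el.tag)]  # registers the element itself
                for name, value in el.attrib.items():
                    attrs[short_name(name)].add(value)
    return usage
```

Comparing the keys of the result against the full list of TEI P5 elements would then show which elements and attributes are candidates for deletion.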

Is this of interest to anyone apart from me? If so,
I could do with some testing and feedback.[1]

Memory capacity is an issue, obviously. My test set
is the XML files in the TEI P5 Guidelines "Test" directory,
and it can run over all the Shakespeare plays in a few seconds,
but it's not going to read a giant corpus unless you have
a big load of memory to assign to Java. Caveat emptor.[2]

Want to try? Grab getfiles.xsl and oddbyexample.xsl from Sourceforge
(http://tei.svn.sourceforge.net/viewvc/tei/trunk/Stylesheets2/tools2/)
and run it something like this:

saxon -o my.odd oddbyexample.xsl oddbyexample.xsl corpus=/wherever/you/have/yourfiles/

The script assumes you have the TEI package which has a file
called "/usr/share/xml/tei/odd/p5subset.xml". If you don't
have that, grab http://www.tei-c.org/release/xml/tei/odd/p5subset.xml,
put the file somewhere, and add a "tei" parameter to point
at it.
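
For anyone scripting the run, the command line above, with the optional "tei" parameter, could be assembled like this. It is only a sketch: the helper name `saxon_command` is invented here, and it assumes a `saxon` wrapper on the PATH that accepts the same arguments as the command shown in this post.

```python
import subprocess


def saxon_command(corpus, output="my.odd", p5subset=None):
    """Build the argument list for the run described in the post."""
    cmd = [
        "saxon", "-o", output,
        # the stylesheet is named twice, exactly as in the post
        "oddbyexample.xsl", "oddbyexample.xsl",
        "corpus=" + corpus,
    ]
    if p5subset is not None:
        # only needed when p5subset.xml is not at the default location
        cmd.append("tei=" + p5subset)
    return cmd


def run(corpus, **kwargs):
    """Run the command; raises CalledProcessError if saxon exits with an error."""
    subprocess.run(saxon_command(corpus, **kwargs), check=True)
```

For example, `saxon_command("/data/corpus/", p5subset="/home/me/p5subset.xml")` yields a list ending in `corpus=/data/corpus/` and `tei=/home/me/p5subset.xml` (both paths are placeholders).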


[1] Warning: I don't think I can face
adding the code to handle any or all of

    * deriving simplified content models (beyond what Roma already does)
    * adding new elements and deriving a content model
    * dealing with non-TEI namespaces
    * generating attribute datatypes with complex regexps
    * working out Schematron constraints etc

but of course you are welcome to try yourself :-}

[2] no, not literally! it's open source, free etc

--
Sebastian Rahtz
Information Manager, Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431

Sólo le pido a Dios
que el futuro no me sea indiferente

Re: ODD by example utility

Syd Bauman
> Is this of interest to anyone apart from me?

You're kidding, right?  YES!


>     * deriving simplified content models (beyond what Roma already
> does)

IIRC, there have been several papers written about proof-of-concept
projects that do this kind of stuff.

Re: ODD by example utility

Sebastian Rahtz
In reply to this post by Sebastian Rahtz
Syd Bauman wrote:
>> Is this of interest to anyone apart from me?
>
> You're kidding, right?  YES!

I shall be interested to hear whether it flies for you.
As I hope I indicated, I have not really paid
much attention to memory usage. The thing is relatively
easy if you have all the docs in memory at once, but
doing it in a scalable way to allow for multi-gigabyte corpora
would require a lot more care.
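
One way to avoid holding every document in memory is to stream each file and keep only running tallies. The sketch below illustrates that idea with Python's `xml.etree.ElementTree.iterparse`; it is not how oddbyexample.xsl works, and the function name `tally_stream` is made up for the example.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tally_stream(source, elements=None, attributes=None):
    """Count element and (element, attribute) occurrences while streaming one file.

    Pass the returned counters back in to accumulate over many files.
    """
    elements = Counter() if elements is None else elements
    attributes = Counter() if attributes is None else attributes
    for _event, el in ET.iterparse(source, events=("end",)):
        elements[el.tag] += 1
        for name in el.attrib:
            attributes[(el.tag, name)] += 1
        # Drop the element's text, attributes and children once it is counted.
        # The emptied element is still referenced by its parent, so memory use
        # grows slowly with document size rather than staying strictly constant.
        el.clear()
    return elements, attributes


elements, attributes = tally_stream(io.BytesIO(b'<a><b x="1"/><b/></a>'))
```

Because only the counters survive from one file to the next, the corpus size affects running time much more than peak memory.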

>>     * deriving simplified content models (beyond what Roma already
>> does)
>
> IIRC, there have been several papers written about proof-of-concept
> projects that do this kind of stuff.

My inclination is to improve what Roma does in this
area, rather than implement it in this utility, if
there is a need. But I guess that's a job for another
day :-}

Re: ODD by example utility

Martin Holmes
In reply to this post by Sebastian Rahtz
Hi Sebastian,

This is a wonderful idea. I'll give it a good workout next week -- I
have several projects that can really make use of it, and one in
particular has several thousand TEI files, so it'll be a serious stress
test. I can throw 6 or 7 GB of memory at Java if necessary.

Cheers,
Martin

Re: ODD by example utility

Paul Caton-4
In reply to this post by Sebastian Rahtz
Sebastian,

I think this would be an incredibly useful tool. I personally find it much easier to arrive at a desirable set of constraints on the encoding of a particular set of files by actually doing some encoding first rather than sitting with Roma open and thinking "now what set of values should I allow for this attribute?" If I could work on a small sample - say ten or so files (if they're short) - and then, when I was happy with them, generate an ODD that I can then use on the next couple of hundred files, why then I would be a happy man.
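
To make the "what values should I allow" step concrete: given the values actually seen for one attribute in the sample files, a closed TEI <valList> can be built as below. This is a hedged Python sketch, not the utility's own code; the function name `closed_val_list` is hypothetical, and how oddbyexample wraps the list in the rest of the ODD is not shown.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def closed_val_list(values):
    """Build <valList type="closed"> with one <valItem> per distinct value."""
    val_list = ET.Element("{%s}valList" % TEI_NS, {"type": "closed"})
    for value in sorted(set(values)):
        ET.SubElement(val_list, "{%s}valItem" % TEI_NS, {"ident": value})
    return val_list


# values seen for some attribute in a small hand-encoded sample (invented data)
val_list = closed_val_list(["scene", "act", "scene"])
```

Serializing `val_list` with `ET.tostring` gives markup that could be pasted into an attribute definition by hand.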

Paul.

--
Dr. Paul Caton
Research Fellow, TEXTE Project
Moore Institute
National University of Ireland, Galway

Re: ODD by example utility

Sebastian Rahtz
In reply to this post by Sebastian Rahtz
Paul Caton wrote:

> If I could work on a small sample -
> say ten or so files (if they're short) - and then when I was happy with
> them generate my ODD that I can then use on the next couple of hundred
> files, why then I would be a happy man.

Then I think you are but a download away from achieving
that much-desired state :-}


Re: ODD by example utility

Stefan Majewski
In reply to this post by Sebastian Rahtz
Sebastian Rahtz wrote:
> Is this of interest to anyone apart from me? If so,
> I could do with some testing and feedback.[1]

Hey, this is just great. Just tried it with our corpus and it works like
a charm. The resulting schema fits our corpus format much more neatly
than the previous hand-crafted ODD. The only thing one
really has to take care of is to check whether it restricts things that
should be possible but do not yet occur.
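
A rough way to run that check is to compare what the sample corpus used against what a further batch of files uses. The Python sketch below assumes both are available as {element: {attribute: set(values)}} dictionaries; that structure and the name `not_covered` are assumptions for illustration. It treats every attribute as if it had a closed value list, which is stricter than the generated ODD, where only "data.enumerated" attributes get one.

```python
def not_covered(sample, new):
    """List what `new` uses that `sample` never did.

    Each gap is a tuple: (element,), (element, attribute),
    or (element, attribute, value).
    """
    gaps = []
    for element, attrs in sorted(new.items()):
        if element not in sample:
            gaps.append((element,))
            continue
        for attr, values in sorted(attrs.items()):
            if attr not in sample[element]:
                gaps.append((element, attr))
                continue
            for value in sorted(values - sample[element][attr]):
                gaps.append((element, attr, value))
    return gaps


# invented data: the sample never used <l>, @n on <div>, or type="scene"
sample = {"div": {"type": {"act"}}, "p": {}}
new = {"div": {"type": {"act", "scene"}, "n": {"1"}}, "l": {}}
gaps = not_covered(sample, new)
```

Each reported gap is a place where the generated customization might need loosening by hand.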

> [1] Warning: I don't think I can face
> adding the code to handle any or all of
>
>    * deriving simplified content models (beyond what Roma already does)
>    * dealing with non-TEI namespaces

Just a thought: we use some attributes and elements from a custom
namespace. Previously I defined them in my own customisation, now I just
dropped them into the output of oddbyexample. Would it be possible to
use my previous ODD as a starting point for oddbyexample?



Well, thank you for this great and useful tool!

best,

Stefan

Re: ODD by example utility

Sebastian Rahtz
In reply to this post by Sebastian Rahtz
Stefan Majewski wrote:
>
> Just a thought: we use some attributes and elements from a custom
> namespace. Previously I defined them in my own customisation, now I
> just dropped them into the output of oddbyexample. Would it be
> possible to use my previous odd as starting point for oddbyexample?

Hmm. I am not sure how this might work. Can you send me your ODD so that
I can have a try?

