Identifying attributes that should be teidata.pointer

Martin Holmes
Hi all,

As part of a project to develop some generic code to run on any
collection of TEI documents and check idref integrity, we've been faced
with finding a relatively simple way to iterate through all the
attributes in a document that have the teidata.pointer type.

There are 109 attDefs with the teidata.pointer type in the P5 source,
some defined in classes and some directly on elements.

There are a further 29 attributes which share a name with one of the
109, but which are not pointers. For instance:

leaf/@value is teidata.pointer, but binary/@value is not.
keywords/@scheme is teidata.pointer, but tag/@scheme is not.

The situation is further complicated by the fact that some attributes
are derived from others by redefining with @mode="change", and may or
may not redefine their datatype (I believe there's only one case of
this affecting teidata.pointer attributes, alt/@target, but more may
crop up in time).

What would you say is the simplest, cleanest way to generate an XPath
which selects all and only the teidata.pointer attributes in a document
which validates against tei_all?

Cheers,
Martin
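
A minimal sketch of the first step described above: listing which attDefs in an ODD source carry the teidata.pointer datatype, and whether each is defined on a class or on an element. It assumes the datatype is declared as a dataRef with @key inside the attDef, and it uses a tiny hand-written sample rather than the real P5 source. The function name and the sample are illustrative only. It does not resolve class membership or @mode="change" redefinitions, which is exactly where the difficulty lies.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hand-written sample for illustration; not copied from the P5 source.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <classSpec ident="att.pointing" type="atts">
    <attList>
      <attDef ident="target">
        <datatype maxOccurs="unbounded"><dataRef key="teidata.pointer"/></datatype>
      </attDef>
    </attList>
  </classSpec>
  <elementSpec ident="leaf">
    <attList>
      <attDef ident="value">
        <datatype><dataRef key="teidata.pointer"/></datatype>
      </attDef>
    </attList>
  </elementSpec>
  <elementSpec ident="binary">
    <attList>
      <attDef ident="value">
        <datatype><dataRef key="teidata.truthValue"/></datatype>
      </attDef>
    </attList>
  </elementSpec>
</div>"""


def pointer_att_defs(odd_text):
    """Return (owner ident, attribute ident, is_pointer) for every attDef."""
    root = ET.fromstring(odd_text)
    found = []
    for spec in root.iter():
        # Only look at attributes declared on elements or on classes.
        if spec.tag not in (TEI_NS + "elementSpec", TEI_NS + "classSpec"):
            continue
        for att in spec.iter(TEI_NS + "attDef"):
            keys = {ref.get("key") for ref in att.iter(TEI_NS + "dataRef")}
            found.append(
                (spec.get("ident"), att.get("ident"), "teidata.pointer" in keys)
            )
    return found


for owner, att, is_pointer in pointer_att_defs(SAMPLE):
    print(owner, att, is_pointer)
```

Keying the result on the owner as well as the attribute name is what keeps leaf/@value and binary/@value apart.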

Re: Identifying attributes that should be teidata.pointer

Tomaž Erjavec
Martin Holmes wrote on 02/03/2017 at 18:16:
> What would you say is the simplest, cleanest way to generate an XPath
> which selects all and only the teidata.pointer attributes in a
> document which validates against tei_all?
Not the cleanest, but definitely simple: treat as a pointer any attribute
value that looks like a pointer, e.g. matches /^#/, /^http(s)?/ or
/\..{3,4}$/.
That's what we did, and it works pretty well: most errors in pointers
come from a typo or from the target returning a 404. That is, of course,
if link checking is what you are after.
Best,
Tomaž
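
The heuristic above can be sketched in a few lines of Python. The three patterns are taken unchanged from the message; the function name is hypothetical, and this is not the code of the script mentioned later in the thread.

```python
import re

# The three patterns quoted in the message, unchanged.
POINTER_PATTERNS = [
    re.compile(r"^#"),          # local fragment reference
    re.compile(r"^http(s)?"),   # web address
    re.compile(r"\..{3,4}$"),   # ends in a 3- or 4-character extension
]


def looks_like_pointer(value):
    """Return True if the value matches any of the pointer-like patterns."""
    return any(p.search(value) for p in POINTER_PATTERNS)


for candidate in ["#p1", "http://example.org/", "doc.xml", "document"]:
    print(candidate, looks_like_pointer(candidate))
```

The extension pattern is deliberately loose, so a value such as a decimal number with three or four digits after the point would also be treated as a pointer.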

Re: Identifying attributes that should be teidata.pointer

Martin Holmes
Hi Tomaž,

That's a really good idea. We'd have to allow for private URI schemes
too -- we want to dereference and check those -- and we're not going to
be checking external links, just project-internal ones, so we can
discard anything with a public protocol.

There are edge-cases, of course; you could have this:

<ref type="document">

versus this:

<ref target="document">

where the second is linking to a file with no extension. We might
stipulate that links to documents without extensions won't be checked. I
think we'll ignore XPointers, on the basis that this is a tool intended
for encoders who can't write their own diagnostic tools; any project
that uses a lot of XPointers probably has someone who can write XSLT.

So we could simply tokenize all attribute values on whitespace, check
each token to see if it looks like a pointer, and check it if it does.

Cheers,
Martin
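
A rough sketch of the tokenize-and-check approach described above, limited to local "#id" pointers within a single document. The list of public protocols, the function name and the sample document are assumptions made for illustration; checking links to other documents and private URI schemes is left out.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Assumed list of public protocols to discard; adjust to the project.
PUBLIC_PROTOCOL = re.compile(r"^(https?|ftp|mailto):", re.IGNORECASE)

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <p xml:id="p1">One</p>
  <ref target="#p1 #p2">two targets</ref>
  <ref target="http://example.org/#frag">external</ref>
  <ref type="document">not a pointer</ref>
</TEI>"""


def broken_local_pointers(xml_text):
    """Return (element, attribute, token) for each '#id' token with no target."""
    root = ET.fromstring(xml_text)
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None}
    broken = []
    for el in root.iter():
        local_name = el.tag.rsplit("}", 1)[-1]  # drop the namespace part
        for name, value in el.attrib.items():
            for token in value.split():  # tokenize on whitespace
                if PUBLIC_PROTOCOL.match(token):
                    continue  # external link: not checked
                if token.startswith("#") and token[1:] not in ids:
                    broken.append((local_name, name, token))
    return broken


print(broken_local_pointers(SAMPLE))
```

Because every attribute is inspected, type="document" is visited too, but it is skipped because it does not look like a pointer.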


Re: Identifying attributes that should be teidata.pointer

Tomaž Erjavec
Martin Holmes wrote on 02/03/2017 at 19:37:
> That's a really good idea. We'd have to allow for private URI schemes
> too -- we want to dereference and check those -- and we're not going
> to be checking external links, just project-internal ones, so we can
> discard anything with a public protocol.
>
Yes, of course, you can have private schemes, but hopefully you can
catch them with some regular expression.

> There are edge-cases, of course; you could have this:
>
> <ref type="document">
>
> versus this:
>
> <ref target="document">
>
> where the second is linking to a file with no extension. We might
> stipulate that links to documents without extensions won't be checked.
I'd think this is fair enough - it's a strange file that doesn't have an
extension.
> I think we'll ignore XPointers, on the basis that this is a tool
> intended for encoders who can't write their own diagnostic tools; any
> project that uses a lot of XPointers probably has someone who can
> write XSLT.
>
> So we could simply tokenize all attribute values on whitespace, check
> each token to see if it looks like a pointer, and check it if it does.
>
Exactly, and glad that you like the idea. In case it would help, you can
find our script - not very elegant and might not cover all cases - at
http://nl.ijs.si/tei/tools/check-links.xsl
Best,
Tomaž


Re: Identifying attributes that should be teidata.pointer

Martin Holmes
Hi Tomaž,

On 2017-03-03 01:10 AM, Tomaž Erjavec wrote:
> Martin Holmes wrote on 02/03/2017 at 19:37:
>> That's a really good idea. We'd have to allow for private URI schemes
>> too -- we want to dereference and check those -- and we're not going
>> to be checking external links, just project-internal ones, so we can
>> discard anything with a public protocol.
>>
> Yes, of course, you can have private schemes, but hopefully you can
> catch them with some regular expression.

They're presumably documented with <prefixDef>s, and I've already
implemented the handling for this; it works nicely on the data we've
tested with so far. Checking all attributes is inevitably slow, but I
think you're right that it's the safest thing to do.
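
To make the <prefixDef> idea concrete, here is a minimal sketch of dereferencing a private URI scheme. It is not the code from the repository linked below. It assumes each prefixDef carries @ident, @matchPattern and @replacementPattern, that group references are written as $1, and that the match patterns are simple enough to behave the same in Python's re module. The function names and the sample are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Illustrative sample; the prefix and file name are made up.
SAMPLE = """<listPrefixDef xmlns="http://www.tei-c.org/ns/1.0">
  <prefixDef ident="psn" matchPattern="([A-Z]+)"
             replacementPattern="personography.xml#$1"/>
</listPrefixDef>"""


def load_prefix_defs(xml_text):
    """Collect (prefix, compiled match pattern, replacement template)."""
    root = ET.fromstring(xml_text)
    defs = []
    for pd in root.iter(TEI_NS + "prefixDef"):
        # Escape literal backslashes, then turn $1 into Python's \g<1>.
        template = pd.get("replacementPattern").replace("\\", "\\\\")
        template = re.sub(r"\$(\d+)", r"\\g<\1>", template)
        defs.append((pd.get("ident"), re.compile(pd.get("matchPattern")), template))
    return defs


def dereference(token, defs):
    """Expand a 'prefix:value' token; return it unchanged if nothing applies."""
    prefix, sep, rest = token.partition(":")
    if not sep:
        return token
    for ident, pattern, template in defs:
        if ident == prefix:
            match = pattern.fullmatch(rest)
            if match:
                return match.expand(template)
    return token


defs = load_prefix_defs(SAMPLE)
print(dereference("psn:MDH", defs))
```

Once a token has been expanded in this way, it can be checked like any other document-plus-fragment link.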

>> There are edge-cases, of course; you could have this:
>>
>> <ref type="document">
>>
>> versus this:
>>
>> <ref target="document">
>>
>> where the second is linking to a file with no extension. We might
>> stipulate that links to documents without extensions won't be checked.
> I'd think this is fair enough - it's a strange file that doesn't have an
> extension.

Yes, and even stranger that you would link to it from an XML document.

>> I think we'll ignore XPointers, on the basis that this is a tool
>> intended for encoders who can't write their own diagnostic tools; any
>> project that uses a lot of XPointers probably has someone who can
>> write XSLT.
>>
>> So we could simply tokenize all attribute values on whitespace, check
>> each token to see if it looks like a pointer, and check it if it does.
>>
> Exactly, and glad that you like the idea. In case it would help, you can
> find our script - not very elegant and might not cover all cases - at
> http://nl.ijs.si/tei/tools/check-links.xsl

Looks great. Ours is a bit longer, but it's trying to do a bit more
(check links to documents as well as ids in documents, and dereference
private URI schemes). It's here:

<https://github.com/projectEndings/diagnostics>

It's intended to work both inside and outside Oxygen, and it's based on
an Ant task.

Cheers,
Martin
