Associate all the <p> paragraphs with their page numbers


Kamel HOUCHAT

Hello,

I am using OxGarage and the Stylesheets to convert .docx files to TEI P5, and then I parse all the <p> paragraphs of the TEI file with Python.

For this analysis, I need to know exactly which page of the document each paragraph is on.

Is there a way to associate each paragraph with its page (maybe with the <pb> tag)?

Thank you for your answer

 


Re: Associate all the <p> paragraphs with their page numbers

Elisa Beshero-Bondar-2
Hi Kamel,
Do you have <pb/> elements in your output from OxGarage? This is a job for XPath, but I imagine you will have <pb/> elements appearing inside some <p> elements and in between others. So some <p> elements will be split over two pages, as is the way of things. And that means you will likely face some complications with your XPath.

It might be a better idea to number your paragraphs and work with them in your Python script on the basis of their sequential order. This is something you could do with XSLT, or probably also in your Python script, if you are using an XML library like etree. Referring to your paragraphs by their number in relation to one another, as in count(preceding::p) + 1, will probably be easier than working out what to do when a <pb> is inside some <p> elements and in between others. But you can still come up with an algorithm to determine the appropriate page if, for example, you decide that you always want to associate a <p> with the page on which it *begins*: you need to find the first <pb/> along the XPath preceding:: axis.
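Both ideas (sequential numbering and "the page on which the paragraph begins") can be sketched in Python with only the standard library. This is a minimal sketch, not a tested pipeline: it assumes the TEI output is in the usual TEI namespace and that each <pb/> carries its page number in @n. The function name and the sample document are made up for illustration. Instead of the preceding:: axis, it walks the tree once in document order and remembers the last <pb/> seen.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # assumed namespace of the converted file

def paragraphs_with_start_page(root):
    """Return [(paragraph_number, start_page, element), ...] in document order."""
    rows = []
    current_page = None  # stays None until the first <pb/> is seen
    for el in root.iter():  # iter() visits elements in document order
        if el.tag == TEI + "pb":
            current_page = el.get("n")
        elif el.tag == TEI + "p":
            # A <p> is visited before any <pb/> it contains, so
            # current_page is the page on which the paragraph begins.
            rows.append((len(rows) + 1, current_page, el))
    return rows

# Hypothetical sample: the second paragraph is split over pages 1 and 2.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="1"/><p>First.</p>
<p>Starts on 1 <pb n="2"/> ends on 2.</p>
<p>Entirely on 2.</p>
<pb n="3"/><p>Starts on 3.</p>
</body></text></TEI>"""

rows = paragraphs_with_start_page(ET.fromstring(sample))
for number, page, _ in rows:
    print(number, page)
# prints:
# 1 1
# 2 1
# 3 2
# 4 3
```

A paragraph that comes before the first <pb/> gets None as its page, which you may want to treat as page 1.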

A lot depends on what you want to do with your Python parsing. Often I want to find out how much of a certain phenomenon turns up inside a paragraph, and that would lead me to want to refer to paragraphs by their sequential number, so that’s why I suggested numbering your paragraphs. 

One last question: is Python the right tool for the task you want to accomplish? Python XML libraries like etree are limited to the old 1.0 version of XPath (or a subset of it), and XSLT or XQuery will probably be a less verbose and more up-to-date way to “dig into” your XML nodes based on their positions in the tree hierarchy. But let us know more about what you’re trying to do!

Cheers,
Elisa

Elisa Beshero-Bondar, PhD
Program Chair of Digital Media, Arts, and Technology | Professor of Digital Humanities |  Director of the Digital Humanities Lab at Penn State Erie, The Behrend College 

Typeset by hand on my iPhone

(In reply to Kamel HOUCHAT's message of Nov 26, 2020, 4:30 AM.)

Re: Associate all the <p> paragraphs with their page numbers

Peter Stadler
In reply to this post by Kamel HOUCHAT
Hi Kamel,

The default transformation from docx to TEI does not preserve automatic (soft) page breaks, only those that have been added manually. I see that there’s a parameter "preserveSoftPageBreaks" in the transformation script docxtotei.xsl [1], which might be what you need to get all(?) the page breaks into the resulting TEI file.

Yet (to my knowledge) it’s not possible to pass this parameter through the standard OxGarage GUI, and so far I haven’t found a way to properly construct a POST request to the OxGarage REST interface. So maybe you’ll have to skip OxGarage and use the TEI Stylesheets directly in your pipeline?

Hope that helps 
Peter 

[1] https://github.com/TEIC/Stylesheets/blob/f410b156d26dc3b6e845f14faf1681e176e7a6b1/docx/from/docxtotei.xsl#L49

(In reply to Kamel HOUCHAT's message of 26.11.2020, 10:16.)

Re: Associate all the <p> paragraphs with their page numbers

Imsieke, Gerrit, le-tex
The page breaks that result from automatic pagination are not encoded in
the docx file. Therefore it is impossible to convert them to pb elements.

The only remedy that I can think of is to write a VBA script that
inserts bookmarks at each page break or at least at the beginning or end
of each paragraph that is rendered on a new page. This information
should be available in a running instance of MS Word:
https://docs.microsoft.com/en-us/office/vba/api/word.wdinformation
I’m not a VBA expert, but I guess that you need to iterate over the
paragraphs and then you can add the page number for example encoded as a
bookmark ID or in a hidden character range. But then you need to
instruct the tools how to transform this information into pb elements,
or just to convert the VBA-generated bookmarks into anchor elements like
<anchor xml:id="page30para1"/>. The tools might do this already. But
then you still need to do the heavy lifting and write some VBA…
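
On the Python side, reading such anchors back could look roughly like the sketch below. It is only an illustration under several assumptions: that the bookmarks really do arrive as <anchor> elements inside the paragraph they mark, that their xml:id follows the pageNparaM pattern of the example above, and that the file uses the TEI namespace. The function name and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"                # assumed TEI namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id
ANCHOR_ID = re.compile(r"page(\d+)para(\d+)")        # hypothetical id pattern

def paragraph_pages(root):
    """Return one page number (or None) per <p>, in document order."""
    pages = []
    for p in root.iter(TEI + "p"):
        page = None
        for anchor in p.iter(TEI + "anchor"):
            match = ANCHOR_ID.fullmatch(anchor.get(XML_ID, ""))
            if match:
                page = int(match.group(1))  # first matching anchor wins
                break
        pages.append(page)
    return pages

# Hypothetical sample: the third paragraph carries no bookmark.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><anchor xml:id="page30para1"/>First paragraph on page 30.</p>
<p><anchor xml:id="page30para2"/>Second paragraph on page 30.</p>
<p>No bookmark here.</p>
<p><anchor xml:id="page31para1"/>First paragraph on page 31.</p>
</body></text></TEI>"""

pages = paragraph_pages(ET.fromstring(sample))
print(pages)  # [30, 30, None, 31]
```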

Gerrit

(In reply to Peter Stadler's message of 26.11.2020, 16:02.)

Re: Associate all the <p> paragraphs with their page numbers

Bauman, Syd
In reply to this post by Elisa Beshero-Bondar-2
One minor disagreement with my learned colleague from Pennsylvania …

Using the “normal” or “correct” (my words) method of encoding <pb>s, i.e., truthfully placing them where they occur in the original — a page that starts with a new paragraph has its <pb> encoded between the paragraph that ended on the previous page and the paragraph that starts on this page — the XPath for asking “on what page does this paragraph start” is trivial, even if <pb> elements occur inside <p> elements sometimes. It is preceding::pb[1]/@n (presuming you have encoded the page number on the @n attribute).

Even if you are interested in the page number on which the paragraph ends, the XPath is not particularly complicated:
( preceding::pb | descendant::pb )[last()]/@n
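
For anyone staying in Python, the same two questions (start page and end page) can be answered without the preceding:: axis, which the standard library's ElementTree does not offer. The following is a minimal sketch that assumes the TEI namespace and page numbers on @n; the function name and sample are made up. The end page is the last <pb/> inside the paragraph if there is one, otherwise the start page, which mirrors the union-then-last() logic of the XPath above.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # assumed namespace of the converted file

def page_ranges(root):
    """Return [(start_page, end_page), ...] for each <p>, in document order."""
    ranges = []
    current_page = None  # @n of the last <pb/> seen so far
    for el in root.iter():  # document order
        if el.tag == TEI + "pb":
            current_page = el.get("n")
        elif el.tag == TEI + "p":
            # Page breaks inside this paragraph, in document order.
            inner = [pb.get("n") for pb in el.iter(TEI + "pb")]
            end_page = inner[-1] if inner else current_page
            ranges.append((current_page, end_page))
    return ranges

# Hypothetical sample: the second paragraph runs from page 1 to page 2.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<pb n="1"/><p>First.</p>
<p>Starts on 1 <pb n="2"/> ends on 2.</p>
<p>Entirely on 2.</p>
<pb n="3"/><p>Starts on 3.</p>
</body></text></TEI>"""

ranges = page_ranges(ET.fromstring(sample))
print(ranges)  # [('1', '1'), ('1', '2'), ('2', '2'), ('3', '3')]
```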

All that said, I think Peter is right: unless you have manually inserted MS Word page breaks into your document, it will be difficult (or, as Gerrit suggests, very, very difficult) to get <pb> elements into the converted-to-TEI output.

P.S. Although the XPaths above are not particularly complex, they could be particularly expensive, computationally speaking. Each asks the processor to collect a whole lot of nodes, and then pick only 1 of them. If your XPath processor can optimize that, and only go get the 1 node of interest without bothering with the others, it should be very fast. But if it cannot, and really goes to get all those nodes, it could be quite slow.[1] You can (taking a hint from David Birnbaum’s thrilling 2009 adventure into XPath Quicksand) make that 2nd XPath more complex but potentially easier to optimize by performing node thinning earlier.[2]

Notes
[1] Saxon, as I expected, seems to be able to optimize the first but not the second XPath above. (Using Saxon 10 to run a tiny XSLT program that does almost nothing except evaluate the 1st XPath took an average of ~18 ms on our largest file, which has over 3,000 <p> elements; running it with the 2nd XPath took an average of ~780 ms.) The same seems to be true of `xmlstarlet`. (Running the 1st XPath over our entire set of 427 files took 2.1 CPU seconds; running the 2nd XPath took 27.1 CPU seconds.) Since XPath optimizations differ significantly from one processor to the next, you may find it is brilliantly fast on yours. Or you may find them both sluggish.

[2] The XPath
( preceding::pb[1] | descendant::pb[last()] )[last()]/@n
ran almost exactly as fast as the 1st XPath for me in both Saxon 10 and `xmlstarlet`, and produces precisely the same results.


