text to bibliographic markup up-conversion


James Cummings-4
Hi all,

This isn't necessarily related to TEI per se (though there is no
reason I couldn't use TEI as an intermediate format).  I'm
looking into various ways to get printed & OCR'd or Word-document
lists of bibliographic items into a more structured form.  For
example, a plain-text list of items into BibTeX format to then
import into a system like Zotero or EndNote. The bibliographies
in question have _not_ been created with any reference-manager
software.

There are a number of tools suggested at
https://www.zotero.org/support/kb/importing_formatted_bibliographies 
but some of these are only web-based. (Though a couple like
FreeCite at Brown have an API.)  I definitely would prefer
something that is scriptable, and even better if there is already
some nice command-line script for me to call per citation or per
long list of newline-separated citations.

I don't mind if it goes direct to some popular bibliographic
interchange format like BibTeX, or if it has another structured
format (XML, TEI, csv, etc.) as an intermediary.

I figured some of you have probably done this kind of thing
and/or have recommendations of existing tools that do this well.

-James

--
Dr James Cummings, Academic IT Services, University of Oxford,
TEI Consultations: [hidden email]

Re: text to bibliographic markup up-conversion

Paul Schaffner
I don't have an answer, James, but would like one too.
Years ago, I ran a pilot attempt to extract lists of faculty
publications (taken from CVs) into a usable consistent format.
What I discovered was that (1) there are as many formats
as there are faculty members, more or less; (2) that much
information in citations is so compact and abbreviated or assumed
as to be very difficult even to interpret (by someone outside the
field), much less to extract automatically; and (3) that what
many humanists think of as citations, i.e. print publications in
the form of books, chapters, or articles, represent the tip of
an iceberg of different kinds of communications that get cited
in lists of publications in fields like electrical engineering,
genetics, and medicine (patents, video games, commentary, code,
etc.).  I gave up on automatic extraction, concentrated on manual
extraction, and even then, after five iterations of increasingly
undemanding specs, despaired of being able to convert thousands
of such citation lists accurately and inexpensively.

Here is version 5, which may serve at least to suggest some
of the different formats and problems likely to be encountered:
http://www-personal.umich.edu/~pfs/cvs/spec_05.html

pfs

P.S. I should say that I did not, in most cases, have access to
a digital text of the source; many were image files wrapped
in PDFs. So simple data capture was also an issue.

--
Paul Schaffner  Digital Library Production Service
[hidden email] | http://www.umich.edu/~pfs/

Re: text to bibliographic markup up-conversion

Lavin, Matthew J
James (and listserv),

I haven’t implemented this kind of approach, but my understanding is that it could be done in Python: use Google Scholar to parse the bibliographic entry, then use the Zotero API to convert the Google data to a Zotero citation object, which can then be exported in almost any standard citation format. If you want to open a Word or PDF document, you’ll need additional libraries such as pdfminer and python-docx. Of course, the document will also need to be written in such a way that a common delimiter or regular expression can be used to separate one citation from the next. Something as simple as a newline marker should work.
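The splitting step mentioned above could be sketched roughly as follows. This is only a minimal illustration using the standard library; the function name `split_citations` and the rule "blank lines separate entries, otherwise one entry per line" are assumptions for the example, not something taken from an existing tool.

```python
import re
from typing import List


def split_citations(text: str) -> List[str]:
    """Split a plain-text bibliography into one string per citation.

    Assumption: entries are separated by blank lines; if the text has no
    blank lines at all, every non-empty line is treated as one entry.
    """
    text = text.replace("\r\n", "\n").strip()
    if not text:
        return []
    if re.search(r"\n\s*\n", text):
        chunks = re.split(r"\n\s*\n", text)
    else:
        chunks = text.split("\n")
    # Re-join hard-wrapped lines and collapse runs of whitespace.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]
```

Real bibliographies with hanging indents or numbered entries would need a different rule, so the separator logic is the part most likely to need adjusting per source.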

Here is an implementation that ends with BibTeX instead of Zotero:
http://blog.macuyiko.com/post/2016/replacing-bibtex-references-with-dblp-entries-updated.html
 
I don’t know if this is helpful. It sounds like Paul encountered lots of additional complications, although auto-detecting entries on a CV is also much more complicated than taking a list of only bibliographic entries and parsing it. The hardest part seems to be taking a plain-text bibliography entry and parsing it into fields. The Google Scholar approach potentially has the benefit of returning a ranked list for each query, so you could do that part in one step, hand-correct, and then send to Zotero, depending on the size of the bibliography.
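To illustrate why parsing into fields is the hard part, here is a deliberately naive sketch that handles exactly one made-up "Author (Year). Title. Venue." layout and sets everything else aside for hand correction. The pattern and the names `APA_LIKE`, `parse_citation`, and `triage` are hypothetical, chosen for the example only.

```python
import re

# One hypothetical pattern for entries shaped like
# "Author (Year). Title. Venue." and nothing else.
APA_LIKE = re.compile(
    r"^(?P<author>.+?)\s+\((?P<year>\d{4})\)\.\s+"
    r"(?P<title>.+?)\.\s+(?P<venue>.+?)\.?$"
)


def parse_citation(citation):
    """Return a dict of fields, or None if the entry does not fit the pattern."""
    match = APA_LIKE.match(citation.strip())
    return match.groupdict() if match else None


def triage(citations):
    """Separate entries the pattern can parse from those needing hand correction."""
    parsed, needs_review = [], []
    for citation in citations:
        fields = parse_citation(citation)
        if fields is None:
            needs_review.append(citation)
        else:
            parsed.append(fields)
    return parsed, needs_review
```

Every additional citation style needs its own pattern, which is consistent with Paul's observation that there are about as many formats as authors.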

Cheers and good luck,

Matthew Lavin
Clinical Assistant Professor of English and Director of Digital Media Lab
University of Pittsburgh



Re: text to bibliographic markup up-conversion

Serge Heiden-2
In reply to this post by James Cummings-4
Hi James,

What about GROBID: https://grobid.readthedocs.io/en/latest/Introduction ?
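For scripting, a GROBID server can be called over HTTP. The sketch below only builds the request for one raw citation string. It assumes a locally running server, and the host, port, `/api/processCitation` path, and `citations` form field reflect my reading of the GROBID service documentation; please check them against the installed version. The function names are made up for the example.

```python
import urllib.parse
import urllib.request

# Assumption: a GROBID server running locally; adjust host, port, and path
# to match your own installation.
GROBID_URL = "http://localhost:8070/api/processCitation"


def build_citation_request(citation, base_url=GROBID_URL):
    """Build (but do not send) a POST request for one raw citation string."""
    body = urllib.parse.urlencode({"citations": citation}).encode("utf-8")
    return urllib.request.Request(
        base_url,
        data=body,
        headers={"Accept": "application/xml"},
        method="POST",
    )


def parse_citation_remote(citation, base_url=GROBID_URL):
    """Send the request and return the response body as text (needs a server)."""
    request = build_citation_request(citation, base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `parse_citation_remote` once per line of a newline-separated list would give the per-citation command-line workflow James asked about.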

-Serge


--
Dr. Serge Heiden, [hidden email], http://textometrie.ens-lyon.fr
ENS de Lyon - IHRIM UMR5317
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883

Re: text to bibliographic markup up-conversion

Laurent Romary
I was about to say the same. And the output is definitely TEI (biblStruct, if you ask). There is an online demo which you can test at http://cloud.science-miner.com/grobid/ 
The baseline scenario is the extraction of information from scholarly papers. When you test it on other material it may or may not synchronise correctly.
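Going from a TEI biblStruct to BibTeX can be sketched with the standard library alone. The sample element in the usage note is hand-written for illustration and is not actual GROBID output; the function name `bibl_struct_to_bibtex`, the fixed `@article` type, and the default key are simplifying assumptions.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def bibl_struct_to_bibtex(xml_text, key="ref1"):
    """Convert one TEI biblStruct element to a BibTeX entry string.

    Simplification: always emits @article and reads only author, title,
    journal, and year; a real converter would inspect the title levels.
    """
    root = ET.fromstring(xml_text.strip())
    authors = []
    for pers in root.findall(".//tei:author/tei:persName", TEI):
        surname = pers.findtext("tei:surname", default="", namespaces=TEI)
        forename = pers.findtext("tei:forename", default="", namespaces=TEI)
        authors.append(", ".join(part for part in (surname, forename) if part))
    date = root.find(".//tei:imprint/tei:date", TEI)
    fields = {
        "author": " and ".join(authors),
        "title": root.findtext("tei:analytic/tei:title", default="", namespaces=TEI),
        "journal": root.findtext("tei:monogr/tei:title", default="", namespaces=TEI),
        "year": (date.get("when", "") if date is not None else "")[:4],
    }
    # Skip empty fields so missing data does not produce "field = {}".
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items() if value)
    return f"@article{{{key},\n{body}\n}}"


SAMPLE = """
<biblStruct xmlns="http://www.tei-c.org/ns/1.0">
  <analytic>
    <title level="a">A title</title>
    <author><persName><forename>Jane</forename><surname>Smith</surname></persName></author>
  </analytic>
  <monogr>
    <title level="j">Journal of Things</title>
    <imprint><date when="2001"/></imprint>
  </monogr>
</biblStruct>
"""
```

With this sample, `print(bibl_struct_to_bibtex(SAMPLE))` yields an `@article{ref1, ...}` entry with author, title, journal, and year fields, which Zotero can import as BibTeX.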
Laurent


Laurent Romary
Inria, team Alpage