Big (and Open) Data for Scholarship of All Sizes: A New Release of the HTRC Extracted Features Dataset
*Apologies for cross-listing*
HathiTrust today announces the release of a significantly expanded open
dataset, the HathiTrust Research Center (HTRC) Extracted Features (EF)
Dataset <https://analytics.hathitrust.org/datasets>, Version 1.0. This
dataset provides researchers with open access to data extracted from the
full text of the HathiTrust Digital Library <https://www.hathitrust.org/>
(HTDL) at an unprecedented scale.
The Extracted Features Dataset opens the complete HathiTrust collection
for investigations into historical and cultural trends, the rise and fall
of topics within the corpus, and the evolution of words and writing
structures in publications dating from the 16th to the late 20th century.
The EF Dataset provides quantitative information about word and line
counts, parts of speech, and other details within each page of every
volume in the HTDL. In addition to these larger-scale investigations, the
EF Dataset also allows researchers to closely analyze the contents of a
given volume or subset of volumes.
The data is extracted from 13.7 million volumes found in the HTDL,
representing over 5 billion pages consisting of over 2 trillion tokens
(words). A preliminary release of the EF Dataset, drawn from a much
smaller subset comprising only HathiTrust's public domain collection, has
already enabled novel research from scholars in economics, history,
linguistics, literary studies and sociology, among other fields.