Non-Consumptive TDM with The HathiTrust Research Center

HathiTrust: A Shared Digital Repository. Peter Organisciak, Post-Doctoral Research Associate. HTRC Exec Management: J. Stephen Downie (PI), Beth Plale (PI), Beth Namachchivaya, Robert H. McDonald, Mike Furlough

Mission and Purpose To contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge. A trusted digital preservation service enabling the broadest possible access worldwide. An organization with over 100 research libraries partnering to develop its programs. A range of transformative programs enabled by working at a very large scale.

HathiTrust Members: Allegheny College; American University of Beirut; Arizona State University; Auburn University; Baylor University; Boston College; Boston University; Brandeis University; Brown University; Bryn Mawr College; Carnegie Mellon University; Case Western Reserve; Colby College; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Getty Research Institute; Georgetown University; Georgia Tech; Harvard University Library; Haverford College; Indiana University; Iowa State University; Johns Hopkins University; Kansas State University; Lafayette College; Library of Congress; Massachusetts Institute of Technology; McGill University; Michigan State University; Montana State University; Mount Holyoke College; New Mexico State University; New York Public Library; New York University; North Carolina Central University; North Carolina State University; Northeastern University; Northwestern University; Oklahoma State University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Rutgers University; Smith College; Stanford University; State University System of Florida; Swarthmore College; Syracuse University; Temple University; Texas A&M University; Texas Christian University; Texas Tech University; Tufts University; Universidad Complutense de Madrid; University of Alabama; University of Alberta; University of Arizona; University of British Columbia; University of Calgary; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); California Digital Library; The University of Chicago; University of Connecticut; University of Delaware; University of Houston; University of Illinois; University of Illinois at Chicago; The University of Iowa; University of Kansas; University of Maine; University of Maryland; University of Massachusetts, Amherst; University of Miami; University of Michigan; University of Minnesota; University of Missouri; University of Nebraska-Lincoln; University of Nevada-Las Vegas; University of New Mexico; The University of North Carolina at Chapel Hill; University of Notre Dame; University of Oklahoma; University of Pennsylvania; University of Pittsburgh; University of Queensland; University of Rochester; University of Tennessee, Knoxville; University of Texas; University of Utah; University of Vermont; University of Virginia; University of Washington; University of Wisconsin-Madison; University of Wyoming; Utah State University; Vanderbilt University; Virginia Tech; Wake Forest University; Washington University; Washington State University; Yale University Library

Cooperative Work: We draw upon distributed expertise. Michigan, Indiana, Illinois, California. Administration; Preservation & Access Repository; Research Center; Metadata Management (Zephir)

Scale of the HathiTrust Collection 13,893,608 total volumes 6,920,679 book titles 367,828 serial titles 4,862,762,800 pages ~625 terabytes 5,434,351 open volumes (~39% of total) The collection includes (mostly) published materials in bound form, digitized from library collections.

Contributions by Library, Nov 2015 (the slide overlays two versions of this table; both are listed)

Listing 1
Institution: Volumes
University of Michigan: 4,696,618
University of California: 3,707,214
Harvard University: 838,344
Cornell University: 584,875
University of Wisconsin - Madison: 561,700
Indiana University: 530,588
University of Minnesota: 438,134
University of Illinois at Urbana-Champaign: 437,288
Pennsylvania State University: 390,087
New York Public Library: 310,737
Princeton University: 252,885
The Ohio State University: 118,513
Universidad Complutense de Madrid: 117,508
Library of Congress: 108,892
University of Chicago: 99,181
Keio University: 90,126
University of Alberta: 76,114
Columbia University: 74,514
Northwestern University: 57,142
University of Virginia: 51,220
Purdue University: 47,490
University of Iowa: 40,622
Technical Report Archive & Image Library: 35,923

Listing 2
Institution: Volumes
University of Michigan: 4,722,050
University of California: 3,639,937
Harvard University: 838,122
University of Wisconsin: 561,534
Indiana University: 529,798
Cornell University: 515,753
Penn State: 389,247
University of Illinois: 348,946
University of Minnesota: 334,249
New York Public Library: 304,610
Princeton University: 252,841
Universidad Complutense: 117,322
Library of Congress: 108,892
Keio University: 90,122
University of Alberta: 76,106
Ohio State: 74,525
Columbia University: 73,396
Northwestern University: 57,000
University of Chicago: 56,981
University of Virginia: 51,207

The top 10 languages make up ~87% of all content

HathiTrust Titles by Copyright/View Status

Call Number Distribution

HathiTrust Titles by Date and Viewing Status

Non-Consumptive Research From the rejected Google Book Settlement: Non-Consumptive Research means research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book. (a) Image Analysis and Text Extraction (b) Textual Analysis and Information Extraction (c) Linguistic Analysis (d) Automated Translation (e) Indexing and Search

Introducing the HathiTrust Research Center

Mission of the HT Research Center Research arm of HathiTrust Established: July, 2011 Collaborative center: Indiana University & University of Illinois Mission: Enable researchers world-wide to accomplish text data-mining and analysis on texts in public domain and under copyright Enable large-scale analysis on texts (corpus > 1M volumes) Create and support tools and semantic structuring for analyzing texts Develop translational tools and data to enhance HathiTrust Digital Library services to users

HTRC Governance Reports to the HathiTrust Board of Governors HTRC Executive Committee J. Stephen Downie (Co-director), Professor and Associate Dean for Research, Univ of Illinois GSLIS Robert H. McDonald, Associate Dean of Libraries/Deputy Director Data to Insight Center at Indiana Univ Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the Univ of Illinois Beth Plale (Co-director and Chair), Director Data To Insight Center and Professor in School of Informatics and Computing at Indiana Univ John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis Univ

Non-consumptive research through HTRC secure commons

Non-Consumptive Research Paradigm: No action or set of actions on the part of users, either acting alone or in cooperation with other users, over the duration of one or multiple sessions, can result in sufficient information being gathered from a collection of copyrighted works to reassemble pages from the collection. The definition disallows collusion between users and accumulation of material over time. It differentiates a human researcher from a proxy, which is not a user. Users are human beings.

Challenge of Secure Commons: ~60% of digitized texts are under copyright. Text content is large: linear walk, 1M volumes : 1000 cores : 1 day. Data can't leave. Researchers want to use their own tools, even for small analyses, and want intimate interaction with texts. So: data can't leave, yet software coming to the data can be suspicious. Researchers can't consume copyrighted content, yet still need intimate interaction.
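The "1M volumes : 1000 cores : 1 day" figure implies a per-volume time budget. The arithmetic can be sketched as follows, assuming (as an idealization not stated on the slide) that the work divides evenly across cores.

```python
# Back-of-the-envelope check of "1M volumes : 1000 cores : 1 day".
# Assumption (not from the slide): work divides evenly across cores.
VOLUMES = 1_000_000
CORES = 1_000
SECONDS_PER_DAY = 24 * 60 * 60

# Number of volumes each core has to walk in one day.
volumes_per_core = VOLUMES // CORES

# Wall-clock time a single core can spend on one volume.
seconds_per_volume = SECONDS_PER_DAY / volumes_per_core

print(volumes_per_core)    # 1000
print(seconds_per_volume)  # 86.4
```

Under that assumption each core has a bit under a minute and a half per volume, which gives a feel for why even a simple linear pass is a cluster-scale job.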

Secure Commons Trust Ring: Logical ring within which exist trusted services and computers that protect and provide access to the sensitive (copyright) data. Computation moves to the data, not vice versa. Computation is carried out in the trust ring (IU, UIUC).

Researcher Interaction Interaction with HTRC is through one of three options: 1. Services and tools for data extraction, data cleaning, data analysis and results visualization. Self service, browser-based. 2. Check out a Data Capsule VM. Researcher checks out and configures for their use (currently for the technology savvy) 3. Direct engagement with HTRC staff HTRC Portal: https://sharc.hathitrust.org/

HTRC Portal

Searching Robinson Crusoe in the Workset Builder Basic search interface for building a workset

Results of Robinson Crusoe search using the Solr API Search and workset-creation option for the more technical user
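For the more technical user, a Solr search is ultimately an HTTP request with query parameters. The sketch below only builds such a URL. The host, core name, and the `title` field are hypothetical placeholders, not HTRC's actual endpoint or schema; `q`, `rows`, and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only (not the HTRC Solr URL).
SOLR_SELECT_URL = "https://solr.example.org/solr/hypothetical-core/select"


def build_solr_query(title_phrase: str, rows: int = 10) -> str:
    """Return a Solr select URL that phrase-searches an assumed `title` field."""
    params = {
        "q": f'title:"{title_phrase}"',  # phrase query on the assumed field
        "rows": rows,                    # number of results to return
        "wt": "json",                    # ask Solr for a JSON response
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_solr_query("Robinson Crusoe")
print(url)
```

The volume identifiers returned by such a query would then be the raw material for a workset.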

Custom Robinson Crusoe workset request generated from MODS database Most robust search done on behalf of users who will request a custom dataset from HT

Self service portal for services and tools

We are seeing numerous cases where analysis is a pipeline, simplified into 4 stages: Data Extraction, Data Cleaning, Data Analysis, Visualization. Plugging in at each stage is a tool (e.g., open source, user designed, community based). Input: HathiTrust texts and input parameters (JSON). Task output: JSON. Tasks can be programs written in any language, such as Python, R, Java, C#. Overall result: graphs, raw data, structured data, etc. Result: stored to workset.
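The JSON-in, JSON-out contract for a pipeline task can be sketched as below. This is a toy "Data Analysis" stage, not an HTRC tool; the parameter names `top_n` and `lowercase` and the output key `top_terms` are invented for illustration.

```python
import json
from collections import Counter


def run_task(params_json: str, text: str) -> str:
    """Toy analysis task: JSON parameters in, JSON result out."""
    params = json.loads(params_json)
    top_n = params.get("top_n", 3)             # hypothetical parameter
    lowercase = params.get("lowercase", True)  # hypothetical parameter

    # Whitespace tokenization keeps the sketch dependency-free.
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]

    # Most frequent tokens, serialized as [token, count] pairs.
    counts = Counter(tokens).most_common(top_n)
    return json.dumps({"top_terms": counts})


result = run_task('{"top_n": 2}', "The cat saw the dog and the cat")
print(result)  # {"top_terms": [["the", 3], ["cat", 2]]}
```

Because each stage only exchanges JSON, a stage written in R or Java could replace this one without the neighbouring stages changing.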

HTRC Data Capsule

HTRC Data Capsule concept Researcher checks out a virtual machine (VM) VM runs in the Trust Ring Researcher owns their VM through weeks/months of analysis Getting stuff into VM is easy, but there is a controlled and audited process for getting results out of the VM

HTRC Data Capsule. HTRC Data Capsule@IU Team: Beth Plale (PI), Jiaan Zeng, Guangchen Ruan. Special thanks to Samitha Liyanage, Milinda Pathirage, Zong Peng, Earlence Fernandes, Ajit Aluri. HTRC Data Capsule@Michigan Team: Atul Prakash (PI), Alexander Crowell. Jiaan Zeng, Guangchen Ruan, Alexander Crowell, Atul Prakash, and Beth Plale. 2014. Cloud computing data capsules for nonconsumptive use of texts. In Proceedings of the 5th ACM workshop on Scientific cloud computing (ScienceCloud '14). ACM, New York, NY, USA, 9-16. DOI=10.1145/2608029.2608031 http://doi.acm.org/10.1145/2608029.2608031

Data Capsule with IPython installed

Mode switch protection: maintenance mode. Arbitrary network download allowed; arbitrary network upload allowed; user traffic from desktop allowed. (Diagram: HTRC raw data sources, Data Capsule.) During maintenance mode, researcher installs new software and loads data into capsule.

Mode switch protection: secure mode. Arbitrary network download not allowed; arbitrary network upload not allowed; user traffic from desktop allowed. (Diagram: HTRC raw data sources, Data Capsule.) Researcher switches to secure mode when ready to run her tools. Results: researcher tools must write results to a special directory; these are reviewed before release.
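The two modes amount to a small policy table. The sketch below models only what the two mode slides state (network download, network upload, desktop traffic); access to HTRC raw data sources is left out because the slides do not say in which mode it is permitted. The class and function names are invented for illustration and do not describe the actual Data Capsule implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"


@dataclass(frozen=True)
class Policy:
    network_download: bool  # arbitrary downloads from the network
    network_upload: bool    # arbitrary uploads to the network
    desktop_traffic: bool   # user traffic from the researcher's desktop


# Values transcribed from the maintenance-mode and secure-mode slides.
POLICIES = {
    Mode.MAINTENANCE: Policy(network_download=True, network_upload=True,
                             desktop_traffic=True),
    Mode.SECURE: Policy(network_download=False, network_upload=False,
                        desktop_traffic=True),
}


def is_allowed(mode: Mode, channel: str) -> bool:
    """Look up whether a channel (a Policy field name) is open in a mode."""
    return getattr(POLICIES[mode], channel)


print(is_allowed(Mode.MAINTENANCE, "network_download"))  # True
print(is_allowed(Mode.SECURE, "network_upload"))         # False
```

Desktop traffic is the only channel open in both modes, which is what lets the researcher keep working interactively after the network is closed off.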

Priorities: Expand data capsule to support more users and run on a larger number of cores. Dockerize the software services. Develop out the workset (user's context while working in secure commons). Switch from OAuth authentication to Shibboleth/InCommon. Support a broader range of canned analysis algorithms.

Extracted Features Dataset (HTRC EF)

Features are a translation of text from language that humans understand to language that machines understand. Raw text Translation into features (you are here) Algorithmic use

Hard to make one size fit all. Extracted features dataset assists with: more obscure questions; functionality not in HTRC; sensitivity to what is happening to data.

Data (https://sharc.hathitrust.org/features): 1,825 million pages, in 4.8 million volumes. Currently restricted to public domain scanned works.

Per section of each page (header/footer/body): token count, line count, empty line count, sentence count; counts of characters occurring at the beginnings and ends of lines; POS-tagged token frequencies (case-sensitive), e.g., Rose (verb), rose (noun), and rose are counted separately.
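Several of these per-section features can be sketched as below (sentence count is omitted). The dictionary keys are illustrative names, not the dataset's actual schema, and the part-of-speech tags are supplied by hand because tagging itself is out of scope for this sketch.

```python
from collections import Counter


def extract_section_features(lines, tagged_tokens):
    """Compute a few features for one page section.

    lines: the section's text lines, in order.
    tagged_tokens: (token, pos_tag) pairs from some POS tagger.
    """
    non_empty = [ln.strip() for ln in lines if ln.strip()]
    return {
        "token_count": len(tagged_tokens),
        "line_count": len(lines),
        "empty_line_count": len(lines) - len(non_empty),
        # First and last character of every non-empty line.
        "begin_line_chars": Counter(ln[0] for ln in non_empty),
        "end_line_chars": Counter(ln[-1] for ln in non_empty),
        # Case-sensitive counts keyed by (token, tag).
        "token_pos_counts": Counter(tagged_tokens),
    }


lines = ["The rose rose.", "", "Rose smiled."]
tagged = [("The", "DT"), ("rose", "NN"), ("rose", "VBD"), (".", "."),
          ("Rose", "NNP"), ("smiled", "VBD"), (".", ".")]
features = extract_section_features(lines, tagged)

print(features["token_pos_counts"][("rose", "NN")])   # 1
print(features["token_pos_counts"][("rose", "VBD")])  # 1
print(features["empty_line_count"])                   # 1
```

Keying the counter on the (token, tag) pair is what keeps the noun, the verb, and the capitalized name apart.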

Possibilities: Compare term counts, word clouds. Within-book comparison of themes. Classification against metadata (e.g., build a genre detector!). Identify part of book (via character information). Identify chapter headings, frontispieces (via line count information). Topic modeling.
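The first possibility, comparing term counts, needs nothing beyond page-level token counts. A minimal sketch follows, with page features simplified to plain token-to-count dictionaries and with made-up toy pages rather than real HathiTrust data.

```python
from collections import Counter


def volume_term_counts(pages):
    """Sum per-page token counts (token -> count dicts) into one Counter."""
    total = Counter()
    for page in pages:
        total.update(page)
    return total


def relative_frequency(counts, term):
    """Share of all tokens in `counts` that are `term` (0.0 if empty)."""
    n = sum(counts.values())
    return counts[term] / n if n else 0.0


# Toy pages, invented for illustration.
book_a = [{"island": 3, "ship": 1}, {"island": 2, "goat": 2}]
book_b = [{"ship": 4, "sea": 4}]

a = volume_term_counts(book_a)
b = volume_term_counts(book_b)

print(relative_frequency(a, "island"))  # 0.625
print(relative_frequency(b, "ship"))    # 0.5
```

Using relative rather than raw frequency keeps volumes of different lengths comparable.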

Co-occurrence Tables, David Mimno

Next steps. Public domain track: bigrams, trigrams, entity extraction. Non-PD: copyrighted and unknown status, 8 million more!

Extracted Features

ACS: Research Projects

ACS Research Project #1

ACS Research Projects #2

Tracking Technology Diffusion Through Time in the HathiTrust Corpus. Michelle Alexopoulos, University of Toronto. Dr. Alexopoulos, an economist, is using the vast historical record contained in the HathiTrust to study the diffusion of various technologies over time. By tracking word usage trends of 1,214 technology-related terms identified by Alexopoulos, such as the steam engine, her research based on HathiTrust book content has the potential to overturn accepted theories about the economic and societal impacts of a technology.

ACS Research Projects #3

Linkages to Steam Engines implied by the Library of Congress Classification vs. from HT text: selected subject terms linked to the "Steam engine" n-gram by 1910.

1,012,633 volumes analyzed. Over 22 hours of processing using a 32-node cluster on Indiana University's high-performance supercomputer, Big Red II. Each node had 32 cores and 64 GB of RAM.

HTRC Use Case: Collaboration between Scholars and the HTRC

Use Case HT+Bookworm

HathiTrust + Bookworm A tool for visualizing and analyzing word usage trends in the HathiTrust Digital Library http://bookworm.htrc.illinois.edu

Bookworm Components: BookwormAPI (quantitative queries over collection); BookwormGUI (the time series visualization)

Regularization of Verbs: A Bookworm Example. "burned" (blue graph) and "burnt" (orange graph)
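One way to put a number on a trend like this is the share of the regular form among both variants in each year. The sketch below does that on made-up toy counts. It is not the Bookworm API, and the figures are not HathiTrust data.

```python
def variant_share(counts_by_year, variant_a, variant_b):
    """Return {year: share of variant_a among both variants}.

    Years in which neither variant occurs are skipped.
    """
    shares = {}
    for year, counts in sorted(counts_by_year.items()):
        a = counts.get(variant_a, 0)
        b = counts.get(variant_b, 0)
        if a + b:
            shares[year] = a / (a + b)
    return shares


# Toy yearly counts, invented purely for illustration.
toy_counts = {
    1850: {"burned": 30, "burnt": 70},
    1900: {},
    1950: {"burned": 80, "burnt": 20},
}

shares = variant_share(toy_counts, "burned", "burnt")
print(shares)  # {1850: 0.3, 1950: 0.8}
```

A rising share over time would correspond to the regular form overtaking the irregular one.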

Upcoming Work: New Mellon Grant, WCSA+DC. Funded by the Andrew W. Mellon Foundation: $1,117,000. Two years of intensive work rebuilding Workset Builder and Data Capsules. Need to scale up. Need finer access and control. Improve security. Partners at Brandeis, Oxford, Waikato, Illinois.

Thank You!