Non-Consumptive TDM with The HathiTrust Research Center

HATHITRUST: A Shared Digital Repository
Non-Consumptive TDM with The HathiTrust Research Center
Peter Organisciak, Post-Doctoral Research Associate
HTRC Exec Management: J. Stephen Downie (PI), Beth Plale (PI), Beth Namachchivaya, Robert H. McDonald, Mike Furlough

Mission and Purpose
To contribute to research, scholarship, and the common good by collaboratively collecting, organizing, preserving, communicating, and sharing the record of human knowledge.
A trusted digital preservation service enabling the broadest possible access worldwide.
An organization with over 100 research libraries partnering to develop its programs.
A range of transformative programs enabled by working at a very large scale.

HathiTrust Members
Allegheny College; American University of Beirut; Arizona State University; Auburn University; Baylor University; Boston College; Boston University; Brandeis University; Brown University; Bryn Mawr College; Carnegie Mellon University; Case Western Reserve; Colby College; Columbia University; Cornell University; Dartmouth College; Duke University; Emory University; Getty Research Institute; Georgetown University; Georgia Tech; Harvard University Library; Haverford College; Indiana University; Iowa State University; Johns Hopkins University; Kansas State University; Lafayette College; Library of Congress; Massachusetts Institute of Technology; McGill University; Michigan State University; Montana State University; Mount Holyoke College; New Mexico State University; New York Public Library; New York University; North Carolina Central University; North Carolina State University; Northeastern University; Northwestern University; Oklahoma State University; The Ohio State University; The Pennsylvania State University; Princeton University; Purdue University; Rutgers University; Smith College; Stanford University; State University System of Florida; Swarthmore College; Syracuse University; Temple University; Texas A&M University; Texas Christian University; Texas Tech University; Tufts University; Universidad Complutense de Madrid; University of Alabama; University of Alberta; University of Arizona; University of British Columbia; University of Calgary; University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz); California Digital Library; The University of Chicago; University of Connecticut; University of Delaware; University of Houston; University of Illinois; University of Illinois at Chicago; The University of Iowa; University of Kansas; University of Maine; University of Maryland; University of Massachusetts, Amherst; University of Miami; University of Michigan; University of Minnesota; University of Missouri; University of Nebraska-Lincoln; University of Nevada-Las Vegas; University of New Mexico; The University of North Carolina at Chapel Hill; University of Notre Dame; University of Oklahoma; University of Pennsylvania; University of Pittsburgh; University of Queensland; University of Rochester; University of Tennessee, Knoxville; University of Texas; University of Utah; University of Vermont; University of Virginia; University of Washington; University of Wisconsin-Madison; University of Wyoming; Utah State University; Vanderbilt University; Virginia Tech; Wake Forest University; Washington University; Washington State University; Yale University Library

Cooperative Work
We draw upon distributed expertise
Michigan, Indiana, Illinois, California
Administration; Preservation & Access Repository; Research Center; Metadata Management (Zephir)

Scale of the HathiTrust Collection
13,893,608 total volumes
6,920,679 book titles
367,828 serial titles
4,862,762,800 pages
~625 terabytes
5,434,351 open volumes (~39% of total)
The collection includes (mostly) published materials in bound form, digitized from library collections.
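As a quick check on the collection figures above, the ratios can be recomputed directly. This is only a sketch: the constants are copied from the slide, and the decimal definition of a terabyte is an assumption, since the slide does not say which unit it uses.

```python
# Figures copied from the "Scale of the HathiTrust Collection" slide.
TOTAL_VOLUMES = 13_893_608
OPEN_VOLUMES = 5_434_351
TOTAL_PAGES = 4_862_762_800
STORAGE_TB = 625  # approximate, per the slide

# Share of volumes that are open (the slide rounds this to ~39%).
open_share = OPEN_VOLUMES / TOTAL_VOLUMES

# Average pages per volume.
pages_per_volume = TOTAL_PAGES / TOTAL_VOLUMES

# Assumption: 1 TB = 10**12 bytes.
bytes_per_page = STORAGE_TB * 10**12 / TOTAL_PAGES

print(f"open share: {open_share:.1%}")
print(f"pages per volume: {pages_per_volume:.1f}")
print(f"storage per page: {bytes_per_page / 1000:.1f} kB")
```

The page total divides into exactly 350 pages per volume, which may indicate that it is an estimate rather than a direct count.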

Contributions by Library, Nov 2015

Institution                                   Volumes
University of Michigan                        4,696,618
University of California                      3,707,214
Harvard University                            838,344
Cornell University                            584,875
University of Wisconsin - Madison             561,700
Indiana University                            530,588
University of Minnesota                       438,134
University of Illinois at Urbana-Champaign    437,288
Pennsylvania State University                 390,087
New York Public Library                       310,737
Princeton University                          252,885
The Ohio State University                     118,513
Universidad Complutense de Madrid             117,508
Library of Congress                           108,892
University of Chicago                         99,181
Keio University                               90,126
University of Alberta                         76,114
Columbia University                           74,514
Northwestern University                       57,142
University of Virginia                        51,220
Purdue University                             47,490
University of Iowa                            40,622
Technical Report Archive & Image Library      35,923

Institution                                   Volumes
University of Michigan                        4,722,050
University of California                      3,639,937
Harvard University                            838,122
University of Wisconsin                       561,534
Indiana University                            529,798
Cornell University                            515,753
Penn State                                    389,247
University of Illinois                        348,946
University of Minnesota                       334,249
New York Public Library                       304,610
Princeton University                          252,841
Universidad Complutense                       117,322
Library of Congress                           108,892
Keio University                               90,122
University of Alberta                         76,106
Ohio State                                    74,525
Columbia University                           73,396
Northwestern University                       57,000
University of Chicago                         56,981
University of Virginia                        51,207

The top 10 languages make up ~87% of all content

HathiTrust Titles by Copyright/View Status

Call Number Distribution

HathiTrust Titles by Date and Viewing Status

Non-Consumptive Research
From the rejected Google Book Settlement:
"Non-Consumptive Research means research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book."
(a) Image Analysis and Text Extraction
(b) Textual Analysis and Information Extraction
(c) Linguistic Analysis
(d) Automated Translation
(e) Indexing and Search

Introducing the HathiTrust Research Center

Mission of the HT Research Center
Research arm of HathiTrust
Established: July 2011
Collaborative center: Indiana University & University of Illinois
Mission: Enable researchers worldwide to accomplish text data-mining and analysis on texts in the public domain and under copyright
Enable large-scale analysis on texts (corpus > 1M volumes)
Create and support tools and semantic structuring for analyzing texts
Develop translational tools and data to enhance HathiTrust Digital Library services to users

HTRC Governance
Reports to the HathiTrust Board of Governors
HTRC Executive Committee:
J. Stephen Downie (Co-director), Professor and Associate Dean for Research, Univ of Illinois GSLIS
Robert H. McDonald, Associate Dean of Libraries/Deputy Director, Data to Insight Center at Indiana Univ
Beth Sandore Namachchivaya, Associate University Librarian for Information Technology Planning & Policy at the Univ of Illinois
Beth Plale (Co-director and Chair), Director, Data to Insight Center, and Professor in School of Informatics and Computing at Indiana Univ
John Unsworth, Vice Provost for Library & Technology Services and Chief Information Officer at Brandeis Univ

Non-consumptive research through HTRC secure commons

Non-Consumptive Research Paradigm
No action or set of actions on the part of users, either acting alone or in cooperation with other users, over the duration of one or multiple sessions, can result in sufficient information being gathered from a collection of copyrighted works to reassemble pages from the collection.
The definition disallows collusion between users, or accumulation of material over time.
It differentiates a human researcher from a proxy, which is not a user. Users are human beings.

Challenge of Secure Commons
~60% of digitized texts are under copyright.
Text content is large: a linear walk over 1M volumes takes roughly 1 day on 1,000 cores.
Data can't leave.
Researchers want:
To use their own tools, even for small analyses
Intimate interaction with texts
So: data can't leave, yet software coming to the data can be suspicious. Researchers can't consume copyrighted content, yet still need intimate interaction.

Secure Commons Trust Ring
A logical ring within which exist trusted services and computers that protect and provide access to the sensitive (copyrighted) data
Computation moves to the data, not vice versa
Computation is carried out in the trust ring
IU, UIUC

Researcher Interaction
Interaction with HTRC is through one of three options:
1. Services and tools for data extraction, data cleaning, data analysis and results visualization. Self-service, browser-based.
2. Check out a Data Capsule VM. Researcher checks out and configures it for their use (currently for the technology savvy).
3. Direct engagement with HTRC staff.
HTRC Portal: https://sharc.hathitrust.org/

HTRC Portal

Searching Robinson Crusoe in the Workset Builder Basic search interface for building a workset

Results of Robinson Crusoe search using the Solr API Search and workset-creation option for the more technical user

Custom Robinson Crusoe workset request generated from MODS database Most robust search done on behalf of users who will request a custom dataset from HT

Self service portal for services and tools

We are seeing numerous cases where analysis is a pipeline, simplified into 4 stages below. Plugging in at each stage is a tool (e.g., open source, user designed, community based).
Stages: Data Extraction, Data Cleaning, Data Analysis, Visualization
Input: HathiTrust texts, Input Parameters (JSON)
Output: Task output (JSON)
Tasks can be programs written in any language: Python, R, Java, C#, ...
Overall result: graphs, raw data, structured data, etc.
Result: stored to workset
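The four-stage pipeline with JSON in and JSON out can be sketched roughly as follows. This is an illustration only, not HTRC's implementation: the function names, the parameter keys (`volume_ids`, `lowercase`, `top_n`) and the toy corpus are all hypothetical.

```python
import json
from collections import Counter

# Hypothetical contract: each task takes (JSON data, JSON parameters) and
# returns JSON, loosely mirroring "Input Parameters (JSON)" and
# "Task output (JSON)" on the slide.

def extract(data_json, params_json):
    """Data extraction: keep only the requested volumes."""
    corpus = json.loads(data_json)
    wanted = json.loads(params_json)["volume_ids"]
    return json.dumps({vid: corpus[vid] for vid in wanted})

def clean(data_json, params_json):
    """Data cleaning: optional lowercasing, then whitespace tokenization."""
    texts = json.loads(data_json)
    lowercase = json.loads(params_json).get("lowercase", True)
    return json.dumps(
        {vid: (t.lower() if lowercase else t).split() for vid, t in texts.items()}
    )

def analyze(data_json, params_json):
    """Data analysis: most frequent tokens per volume."""
    tokens = json.loads(data_json)
    top_n = json.loads(params_json).get("top_n", 3)
    return json.dumps(
        {vid: Counter(toks).most_common(top_n) for vid, toks in tokens.items()}
    )

def visualize(data_json, params_json):
    """Visualization: a plain-text bar chart, one line per (volume, word)."""
    counts = json.loads(data_json)
    lines = [
        f"{vid} {word:<10} {'#' * n}"
        for vid, pairs in counts.items()
        for word, n in pairs
    ]
    return json.dumps({"chart": lines})

def run_pipeline(data_json, stages):
    """Feed each stage's JSON output into the next stage."""
    for task, params in stages:
        data_json = task(data_json, json.dumps(params))
    return data_json

# Toy corpus, invented for this example.
corpus = json.dumps({"vol1": "The sea the ship the Sea", "vol2": "unused text"})
stages = [
    (extract, {"volume_ids": ["vol1"]}),
    (clean, {"lowercase": True}),
    (analyze, {"top_n": 2}),
    (visualize, {}),
]
result = json.loads(run_pipeline(corpus, stages))
print("\n".join(result["chart"]))
```

Because every stage only sees and emits JSON, any stage could in principle be swapped for a program written in another language, which is the point the slide makes.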

HTRC Data Capsule

HTRC Data Capsule concept
Researcher checks out a virtual machine (VM)
VM runs in the Trust Ring
Researcher owns their VM through weeks/months of analysis
Getting material into the VM is easy, but there is a controlled and audited process for getting results out of the VM

HTRC Data Capsule
HTRC Data Capsule@IU Team: Beth Plale (PI), Jiaan Zeng, Guangchen Ruan
Special thanks to: Samitha Liyanage, Milinda Pathirage, Zong Peng, Earlence Fernandes, Ajit Aluri
HTRC Data Capsule@Michigan Team: Atul Prakash (PI), Alexander Crowell
Jiaan Zeng, Guangchen Ruan, Alexander Crowell, Atul Prakash, and Beth Plale. 2014. Cloud computing data capsules for nonconsumptive use of texts. In Proceedings of the 5th ACM Workshop on Scientific Cloud Computing (ScienceCloud '14). ACM, New York, NY, USA, 9-16. DOI=10.1145/2608029.2608031 http://doi.acm.org/10.1145/2608029.2608031

Data Capsule with IPython installed

Mode switch protection: maintenance mode
Arbitrary network download allowed
Arbitrary network upload allowed
User traffic from desktop allowed
Diagram labels: HTRC raw data sources, Data Capsule
During maintenance mode, researcher installs new software and loads data into capsule

Mode switch protection: secure mode
Arbitrary network download not allowed
Arbitrary network upload not allowed
User traffic from desktop allowed
Diagram labels: HTRC raw data sources, Data Capsule
Researcher switches to secure mode when ready to run her tools
Results: researcher tools must write results to a special directory; these are reviewed before release
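The two capsule modes can be modelled as a small policy table. The sketch below is a toy model, not the Data Capsule implementation: the class and flow names are hypothetical, and the rule that HTRC raw data is reachable only in secure mode is an assumption, since the slides list the data source in both diagrams without stating when it is accessible.

```python
from dataclasses import dataclass, field
from enum import Enum

class Mode(Enum):
    MAINTENANCE = "maintenance"
    SECURE = "secure"

class Flow(Enum):
    NETWORK_DOWNLOAD = "arbitrary network download"
    NETWORK_UPLOAD = "arbitrary network upload"
    DESKTOP_TRAFFIC = "user traffic from desktop"
    HTRC_DATA = "HTRC raw data sources"

# Network and desktop rules follow the slides. Placing HTRC_DATA only in
# SECURE is an assumption made for this sketch.
ALLOWED = {
    Mode.MAINTENANCE: {Flow.NETWORK_DOWNLOAD, Flow.NETWORK_UPLOAD, Flow.DESKTOP_TRAFFIC},
    Mode.SECURE: {Flow.DESKTOP_TRAFFIC, Flow.HTRC_DATA},
}

@dataclass
class Capsule:
    mode: Mode = Mode.MAINTENANCE
    pending_review: list = field(default_factory=list)

    def allows(self, flow):
        """True if the flow is permitted in the current mode."""
        return flow in ALLOWED[self.mode]

    def switch(self, mode):
        self.mode = mode

    def write_result(self, result):
        """Results are queued for review instead of being handed out directly."""
        self.pending_review.append(result)

    def release(self, approve):
        """Release only results the reviewer approves; keep the rest queued."""
        released, held = [], []
        for result in self.pending_review:
            (released if approve(result) else held).append(result)
        self.pending_review = held
        return released

capsule = Capsule()
print(capsule.allows(Flow.NETWORK_DOWNLOAD))  # maintenance mode
capsule.switch(Mode.SECURE)
print(capsule.allows(Flow.NETWORK_DOWNLOAD))  # secure mode
```

The `approve` callback stands in for the human review step; how results are actually reviewed is not described on the slides.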

Priorities
Expand data capsule to support more users and run on a larger number of cores
Dockerize the software services
Develop out the workset (user's context while working in secure commons)
Switch from OAuth authentication to Shibboleth/InCommon
Support a broader range of canned analysis algorithms

Extracted Features Dataset (HTRC EF)

Features are a translation of text from language that humans understand to language that machines understand.
Raw text -> Translation into features (you are here) -> Algorithmic use

Hard to make one size fit all
Extracted features dataset assists in:
More obscure questions
Functionality not in HTRC
Sensitivity to what is happening to data

Data
https://sharc.hathitrust.org/features
1,825 million pages, in 4.8 million volumes
Currently restricted to public domain scanned works

Per section of each page (header/footer/body):
Token count, line count, empty line count, sentence count
Counts of characters occurring at the beginnings of lines and ends of lines
POS-tagged token frequencies (case-sensitive), e.g., rose (verb), rose (noun), and Rose are counted separately
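A rough sketch of how page-level features of this kind could be derived from raw page text follows. It is not the HTRC extraction code: the function name and output keys are hypothetical, the regex tokenizer is a stand-in, and sentence counts, POS tags and the header/footer/body split are left out because they need additional tooling.

```python
import re
from collections import Counter

def page_features(text):
    """Compute a few page-level counts from raw page text (illustrative only)."""
    lines = text.split("\n")
    non_empty = [ln.strip() for ln in lines if ln.strip()]
    tokens = re.findall(r"\w+", text)  # simplistic tokenizer, an assumption
    return {
        "line_count": len(lines),
        "empty_line_count": len(lines) - len(non_empty),
        "token_count": len(tokens),
        # First and last character of each non-empty line.
        "begin_line_chars": dict(Counter(ln[0] for ln in non_empty)),
        "end_line_chars": dict(Counter(ln[-1] for ln in non_empty)),
        # Case-sensitive counts, so "rose" and "Rose" stay separate.
        "token_counts": dict(Counter(tokens)),
    }

# Toy page, invented for this example.
page = "CHAPTER I\n\nThe rose rose.\nRose saw it."
features = page_features(page)
print(features["token_counts"])
```

Once a page is reduced to counts like these, the original word order is no longer present, which is what makes such features shareable for text that cannot be redistributed.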

Possibilities
Compare term counts, word clouds
Within-book comparison of themes
Classification against metadata (e.g., build a genre detector!)
Identify part of book (via character information)
Identify chapter headings, frontispieces (via line count information)
Topic modeling

CO-OCCURRENCE TABLES
DAVID MIMNO

Next steps
Public domain track: bigrams, trigrams, entity extraction
Non-PD track: copyrighted and unknown status, 8 million more!

Extracted Features

ACS: Research Projects

ACS Research Project #1

ACS Research Projects #2

Tracking Technology Diffusion Through Time in the HathiTrust Corpus
Michelle Alexopoulos, University of Toronto
Dr. Alexopoulos, an economist, is using the vast historical record contained in the HathiTrust to study the diffusion of various technologies over time. By tracking word usage trends of 1,214 technology-related terms identified by Alexopoulos, such as the steam engine, her research based on HathiTrust book content has the potential to overturn accepted theories about the economic and societal impacts of a technology.
ACS Research Projects #3
Linkages to Steam Engines implied by the Library of Congress Classification vs. from HT text: selected subject terms linked to the "steam engine" n-gram by 1910
1,012,633 volumes analyzed. Over 22 hours of processing using a 32-node cluster on Indiana University's high-performance supercomputer, Big Red II. Each node had 32 cores and 64 GB of RAM.
HTRC Use Case: Collaboration between Scholars and the HTRC
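The processing figures for this project can be turned into a per-volume cost and set beside the earlier rule of thumb of 1M volumes, 1,000 cores, 1 day. This is a back-of-the-envelope sketch: it assumes all 32 cores on every node were busy for the full run, and since the slide says "over 22 hours" the result is a lower bound.

```python
# Figures copied from the slide.
VOLUMES = 1_012_633
HOURS = 22            # "over 22 hours", so treat as a lower bound
NODES = 32
CORES_PER_NODE = 32

cores = NODES * CORES_PER_NODE
core_seconds = HOURS * 3600 * cores
core_seconds_per_volume = core_seconds / VOLUMES
volumes_per_hour = VOLUMES / HOURS

# Rule of thumb from the Secure Commons slide: 1M volumes, 1,000 cores, 1 day.
rule_of_thumb = 1000 * 24 * 3600 / 1_000_000

print(f"cores: {cores}")
print(f"core-seconds per volume: {core_seconds_per_volume:.1f}")
print(f"volumes per hour: {volumes_per_hour:,.0f}")
print(f"rule of thumb, core-seconds per volume: {rule_of_thumb:.1f}")
```

Under these assumptions the two figures land in the same range, roughly 80 to 86 core-seconds per volume.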

Use Case HT+Bookworm

HathiTrust + Bookworm A tool for visualizing and analyzing word usage trends in the HathiTrust Digital Library http://bookworm.htrc.illinois.edu

Bookworm Components
Bookworm API: quantitative queries over the collection
Bookworm GUI: the time series visualization

Regularization of Verbs: A Bookworm Example
burned (blue graph) and burnt (orange graph)
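One way to summarize a regularization trend like burned versus burnt is the share of uses per year that take the regular form. The sketch below is not Bookworm's API or its normalization; the function name is hypothetical and the input is assumed to be plain per-year counts for each form.

```python
def regular_share(regular, irregular):
    """Per year, the fraction of uses taking the regular form.

    Both arguments map year -> raw count. Years in which neither form
    occurs are skipped to avoid dividing by zero.
    """
    shares = {}
    for year in sorted(set(regular) | set(irregular)):
        r = regular.get(year, 0)
        i = irregular.get(year, 0)
        if r + i:
            shares[year] = r / (r + i)
    return shares
```

A share rising toward 1.0 over time would correspond to the regular form overtaking the irregular one in the time series.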

Upcoming Work: New Mellon Grant WCSA+DC
Funded by Andrew W. Mellon Foundation: $1,117,000
Two years of intensive work rebuilding Workset Builder and Data Capsules
Need to scale up
Need finer access and control
Improve security
Partners at Brandeis, Oxford, Waikato, Illinois

Thank You!