CJKV Unified Ideographs Extension C


22nd International Unicode Conference (IUC22)
Unicode and the Web: Evolution or Revolution?
September 9-13, 2002, San Jose, California
http://www.unicode.org/iuc/iuc22/

Richard S. COOK
Linguistics Department
University of California, Berkeley
rscook@socrates.berkeley.edu
http://stedt.berkeley.edu/
2002-09-18-10:31

INTRODUCTION

This presentation introduces some of the issues surrounding the work of the Ideographic Rapporteur Group (ISO/IEC JTC1/SC2/WG2/IRG) on CJK Unified Ideographs Extension C (Ext C), including the following:

(1) The IRG methodology constraining glyph submissions for Ext C1 (why more Han characters, and which?)
(2) The method of preparing glyph submissions for the Unicode Technical Committee (UTC)
(3) IRG member submissions for Ext C1, introducing some of the submitted glyphs and the print sources for the glyph submissions
(4) The IRG process of submission evaluation
(5) The impact of submitted glyphs on the Han Variant problem (see Cook, IUC-19)
(6) Plans for Ext C2 UTC submissions

BACKGROUND

As many people already know, The Unicode Standard 3.2 is the best thing ever to happen to the digitization of Chinese texts. The immense work done to produce the CJKV (Chinese, Japanese, Korean, Vietnamese) part of this standard, undertaken by the Ideographic Rapporteur Group (IRG, <http://www.cse.cuhk.edu.hk/~irg/>), has pushed CJKV computing to higher levels than many had ever thought possible. With the IRG's creation of Extension B, 42,711 new characters were added to The Unicode Standard, so that it now encodes a total of 70,207 unique ideographs. (The term "ideograph" is a technical usage defined in the glossary of the Unicode Standard, a compromise term equivalent to "CJKV character".)

The count is somewhat complicated by things such as compatibility characters which are not actually compatibility characters. The last totals available to me (provided by Mr. John Jenkins with the advent of Unicode 3.1) are as follows:

Figure 1: Unicode 3.1: Total Unique CJKV Ideographs

  27,484   CJKUI, CJKUIA (p. 258 of The Unicode Standard 3.0)
  27,496   CJKUI, CJKUIA, including 12 compatibility ideographs that are not compatibility ideographs
  42,711   CJKUIB (Extension B)
  70,207   Total number of unique ideographs in Unicode 3.1

Following completion of Ext. B, the IRG began work to prepare yet more unencoded characters for encoding. This was originally termed CJK Unified Ideographs Extension C. Preliminary reports from an IRG meeting in Hong Kong indicated that the IRG Rapporteur anticipated submission of some 67,000 candidate ideographs, as these figures (provided to me by Mr. Hideki HIURA) indicate:

Figure 2: Preliminary Ext. C1 Submission Totals

  ROK       23000+~20000
  TCA       18000
  PRC       4570
  Japan     ~200
  Macau     ~200
  Vietnam   1049
  HKSAR     9
  DPRK      94

On the basis of these preliminary figures, it was decided to divide submissions into two parts, for Extensions C1 and C2. Extension C1 submissions should be those unencoded characters with most immediate relevance to modern usage, while glyphs of less clear status should be reserved for Extension C2 submission.

EXT C1 SUBMISSIONS

At the most recent IRG meeting (IRG-19, held in Macau at the end of April 2002), a total of 26,079 glyphs were submitted by 9 IRG members for inclusion in Ext C1. The breakdown of submissions per member is as follows (sorted by descending number of submissions):

Figure 3: Final Ext. C1 Submission Totals

  TCA         10659   (Taiwan, ROC)
  China       07650   (Mainland PRC)
  ROK         04073   (South Korea)
  Vietnam     02286
  Japan       00970
  UTC         00271   (Unicode/US)
  DPRK        00094   (North Korea)
  HK          00029   (Hong Kong)
  Singapore   00025
  Macau       00022
  Total       26079

The primary constraint on CJKV submissions is ISO 10646-1 Annex S, which lays out the basic rules determining what the Character Glyph Model means for CJKV. The specific format for glyph submissions required (1) a bitmapped representation of the proposed character; (2) certain tabulated information on each submitted character, including the following:

Figure 4: IRG Submission Format

  class:    Kang Xi                        Residual Strokes       Source       Variant
  field:    Virtual Index   Rad. + flag    Count      1st Type    Info & ID    USV
  format:   XXXX.YYZ        XXXXY          N          1..5        SSSNNNNN     U1(,U2)
  bytes:    1-8             9-13           14-15      16          17-24        25-35
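The fixed-width layout in Figure 4 can be read mechanically. The following is a minimal sketch in Python, not the IRG's own tooling: the byte ranges come from Figure 4, while the field names, the assumption that unused bytes are space-padded, and the sample record are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubmissionRecord:
    """One tabulated record; field names are illustrative, not official."""
    kangxi_index: str        # bytes 1-8,   format XXXX.YYZ
    radical_flag: str        # bytes 9-13,  format XXXXY
    residual_strokes: int    # bytes 14-15, stroke count
    first_stroke_type: int   # byte 16,     value 1..5
    source_id: str           # bytes 17-24, format SSSNNNNN
    variant_usv: List[str]   # bytes 25-35, format U1(,U2)


def parse_record(line: str) -> SubmissionRecord:
    """Split a 35-byte record using the 1-based, inclusive ranges of Figure 4."""
    padded = line.ljust(35)  # assumption: short lines are padded with spaces

    def cut(first: int, last: int) -> str:
        # Convert a 1-based inclusive byte range to a Python slice.
        return padded[first - 1:last].strip()

    first_stroke = int(cut(16, 16))
    if not 1 <= first_stroke <= 5:
        raise ValueError(f"first stroke type out of range 1..5: {first_stroke}")

    return SubmissionRecord(
        kangxi_index=cut(1, 8),
        radical_flag=cut(9, 13),
        residual_strokes=int(cut(14, 15)),
        first_stroke_type=first_stroke,
        source_id=cut(17, 24),
        variant_usv=[v for v in cut(25, 35).split(",") if v],
    )


# Hypothetical sample record, invented for this sketch.
sample = "0123.04100850 73UTC0001204E0D"
record = parse_record(sample)
print(record)
```

Rejecting a first-stroke type outside 1..5 at parse time is one small way to catch data-entry slips before the manual proofing passes described later.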

THE UTC SUBMISSIONS

UTC submissions for Ext. C1 were prepared by myself, Mr. Jenkins (Apple Computer), Tom Bishop (Wenlin Software) and Cora Chang (Apple Computer). The process of glyph collection began several years ago with Mr. Bishop's work on the ABC Dictionary (University of Hawaii), in which he identified several unencoded simplified characters. Added to this initial batch of candidates for submission was a collection received by Mr. Jenkins from the LDS church in HK. Finally, a number of candidate characters were drawn from my own work proofing the Unihan.txt data and digitizing two large ancient Chinese character lexicons, Shuowen Jiezi (c. 121 AD; see my IUC-18 paper) and Guangyun (c. 1000 AD). Several other simplified candidates for encoding came to our knowledge in emails from Unicode users.

Once the initial candidates for submission had been collected, the hard work began. This included the following:

Figure 5: Steps in Preparing UTC Submissions

(1) creation of a prototype glyph
(2) creation of a new record for that glyph in our central Unihan Additions database
(3) entry of relevant data, including glyph prototype (see Figure 4 above)
(4) checking our candidates against the Unicode CJKV character set

Prototyping of the candidates for encoding was done using undocumented features of a new version of Wenlin software (scheduled for public release in the summer of 2002). An image of each candidate glyph was created using a component-based method, producing images such as the following:

Figure 6: Six Example Prototypes of UTC Ext. C1 Submissions

Once the glyph had been prototyped, it was assigned a UTC number in a record in the Unihan Additions database (FileMaker Pro 5). The prototype glyph was placed in a container field of that record, and the accompanying information for that glyph was entered. Altogether, 312 glyphs were prototyped, though in the checking process (step 4 above) 41 candidates were eliminated as having already been encoded, bringing the final total of UTC submissions to 271.

SUBMISSION REVIEW

The glyphs submitted by the IRG members were pooled and sorted by the IRG Rapporteur and his team, and 4 large PDFs were created, listing the 26078 raw glyph submissions. (This number is one shy of the final total of 26079 glyphs, as 1 additional glyph was voted in after the initial PDFs had been prepared.) Several sessions of the Macau meeting were devoted to preliminary evaluation of submissions. The work was divided among the ~40 delegates, and the submission data (see Figure 4 above) underwent the first verification pass.

Proofing of the submission data is still going on at present, and it is unclear exactly how much work remains to be done. This will become clearer with the IRG-20 meeting, scheduled for Hanoi in November of 2002.
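Checking candidates against the encoded set is ultimately a matter of human judgment, but the search can be narrowed mechanically. The sketch below is one possible approach and not the procedure the IRG or UTC actually used: it groups encoded characters by a key built from the Figure 4 data (radical, residual stroke count, first stroke type) and returns, for each candidate, the encoded characters sharing that key for manual comparison. The key design, the function names, and all sample data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical lookup key: (radical number, residual stroke count, first stroke type)
Key = Tuple[int, int, int]


def build_index(encoded: Dict[str, Key]) -> Dict[Key, List[str]]:
    """Group encoded characters by their lookup key."""
    index: Dict[Key, List[str]] = defaultdict(list)
    for char_id, key in encoded.items():
        index[key].append(char_id)
    return dict(index)


def shortlist(candidates: Dict[str, Key],
              index: Dict[Key, List[str]]) -> Dict[str, List[str]]:
    """For each candidate, list encoded characters with the same key.

    An empty list means no encoded character shares the key; a non-empty
    list is only a set of glyphs for a reviewer to compare by eye.
    """
    return {cid: sorted(index.get(key, [])) for cid, key in candidates.items()}


# Invented placeholder data; the identifiers do not refer to real characters.
encoded = {
    "ENC-A": (1, 1, 1),
    "ENC-B": (1, 1, 1),
    "ENC-C": (1, 2, 1),
}
candidates = {
    "UTC-0001": (1, 1, 1),   # shares a key with ENC-A and ENC-B
    "UTC-0002": (9, 4, 3),   # no encoded character shares this key
}

for cid, matches in shortlist(candidates, build_index(encoded)).items():
    print(cid, matches)
```

A shortlist of this kind only reduces the number of glyphs a reviewer must compare; whether a candidate is unifiable with an encoded character still has to be decided under Annex S.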

EXT C SUBMISSIONS AND THE VARIANT QUESTION

In reviewing the glyph submissions for Ext. C1, it appears that the character vs. glyph distinction for CJKV ideographs is still a lively topic of debate. As mentioned in my IUC-19 paper, the distinctions made in ISO/IEC 10646 Annex S do not seem quite up to the task of dealing with the many variant CJKV ideographs. An adequate standard method for quantifying CJKV glyph variation does not yet exist, though it seems likely that one will in fact be devised on the basis of the Ext B and C work. In my presentation slides I discuss examples of the member glyph submissions, their relation to encoded glyphs, and their relation to the variant question.

UTC SUBMISSIONS FOR EXT. C2

In addition to proofing the Ext C1 submissions, IRG members are now busy collecting and refining submission candidates for Ext C2. The initial mapping work described in my IUC-18 presentation has now progressed to an advanced state, such that I have collected several thousand glyphs which, by the criteria set forth in Annex S, are valid candidates for encoding. Many of these glyphs are, however, identified in my mapping tables as variants of encoded glyphs, and should for this reason be treated with a variant selector mechanism rather than being separately encoded. Lacking a mechanism for dealing with such variants, it seems likely that many more variant glyphs will be submitted for encoding in Ext C2.

Until the 26079 Ext C1 submissions have been fully digested, it's hard to even begin thinking about Ext C2 submissions. The following is an example of a glyph which might end up in a UTC Ext C2 submission. Whether or not it actually ends up being submitted depends on whether it is somewhere in the Ext. C1 submissions. At the moment of writing, I just don't know for sure. I'll go check, and you do too.

Figure 7: Candidate UTC Ext. C2 Submission

SUMMARY

In summary, it may be said that the IRG's task of evaluating the Extension C1 glyph submissions is an enormous one. The largest problems at present relate to cross-checking the C1 submissions against the enormous encoded character set. As the encoded character set grows, such problems only grow with it. The IRG submission and evaluation procedures require much manual human intervention and subjectivity, leaving room for error. As the encoding work continues, guidelines such as those in Annex S must be refined, and standards relating to such things as stroke count, stroke type, and component type must be codified.

ACKNOWLEDGMENTS

The writing of this paper was supported in part by grants from: The National Science Foundation (NSF), Division of Behavioral & Cognitive Sciences, Linguistics, Grant No. BCS-9904950; The National Endowment for the Humanities (NEH), Preservation and Access, Grant No. PA-23353-99.

Thanks to John JENKINS and Hideki HIURA for their kind help and suggestions. Thanks also to Tom BISHOP: his work on Wenlin <http://www.wenlin.com/> is now as always an inspiration.