Discussion: Permanent extension of the "Medley Character Set Standard" #2040

Our current internal character encoding standard (MCCS) is basically XCCS with a few fiddles (dollar sign and left-arrow, which we haven't really decided on but which would be handled by an explicit XCCS-to-MCCS mapping). Importantly, MCCS codes correspond to the layout of glyphs in our fonts. This is an initial proposal for how to extend MCCS with MCCS-Unicode mappings for a large number of additional Unicode characters, so that we can use Unicode fonts to eliminate all of our black boxes.

First some arithmetic:

MCCS/Unicode mappings are based on the mappings for 10730 XCCS codes in 105 character sets.  Let's assume that the XCCS standard does not define any other XCCS codes that would be relevant to MCCS, i.e. we would never encounter XCCS-coded files that have standard codes outside of the mappings that we currently have.

Out of 65535 smallp codes, that leaves 65535 - 10730 = 54805 smallp's that can be assigned to other Unicode characters.

But XCCS/MCCS does not allow code 255 in any character set, which removes one code from each of the 256 character sets:
   54805 - 256 = 54549  (actually, we don't need to preserve this constraint in internal MCCS; that is a separate issue)

Codes in the MCCS META and FUNCTION character sets are reserved.
   54549 - 2*255 = 54039  (2*255 rather than 2*256, since we have already taken out the 255 codes)
   (We should map them to MCCS codes that are undefined in both XCCS and Unicode so that we don't have to deal with them again if we ever go to Unicode internally.)

We want to reserve some number of MCCS codes for local faking of unmapped codes, say 4 character sets = 4*255 codes:
   54039 - 4*255 = 53019

Thus we have 53019 smallp MCCS codes that can be assigned to otherwise unassigned Unicode characters.
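
As a quick check of the arithmetic above, here is a minimal sketch that recomputes the budget step by step. The constant and function names are made up for illustration; the figures are the ones quoted in this post.

```python
# Figures quoted in the discussion; all names here are illustrative only.
SMALLP_CODES = 65535        # size of the smallp code space as counted above
MAPPED_XCCS_CODES = 10730   # XCCS codes that already have Unicode mappings
CHARSETS = 256              # each character set loses its code 255
USABLE_PER_CHARSET = 255    # codes left in a character set without code 255
RESERVED_CHARSETS = 2       # META and FUNCTION
FAKING_CHARSETS = 4         # held back for local faking of unmapped codes


def available_mccs_codes():
    """Return the number of smallp MCCS codes left for new Unicode mappings."""
    remaining = SMALLP_CODES - MAPPED_XCCS_CODES         # 54805
    remaining -= CHARSETS                                # 54549
    remaining -= RESERVED_CHARSETS * USABLE_PER_CHARSET  # 54039
    remaining -= FAKING_CHARSETS * USABLE_PER_CHARSET    # 53019
    return remaining


print(available_mccs_codes())  # 53019
```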

(Side note: the maximum size of an Interlisp hash table is 32749. So we have to divide the codes among at least 2 separate hash tables.)

The maximum number of Unicode plane 0 codes is 65535 - 6400 (reserved) = 59135 (maybe some others are not defined in the Unicode standard; I didn't check).

So we can't have smallp MCCS codes for 59135 - 53019 = 6116 smallp Unicodes (and for more still if we give some codes to characters from the higher planes, such as emojis).
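
The shortfall and the hash table side note can be checked the same way. This is only a sketch using the numbers quoted above; the variable names are hypothetical.

```python
import math

# Figures quoted in the discussion; the names are illustrative only.
AVAILABLE_MCCS = 53019   # free smallp MCCS codes from the tally above
PLANE0_CODES = 65535     # Unicode plane 0 count as used in the text
PLANE0_RESERVED = 6400   # reserved codes excluded in the text
MAX_HASH_TABLE = 32749   # maximum Interlisp hash table size cited above

plane0_candidates = PLANE0_CODES - PLANE0_RESERVED   # 59135
shortfall = plane0_candidates - AVAILABLE_MCCS       # 6116
tables_needed = math.ceil(AVAILABLE_MCCS / MAX_HASH_TABLE)  # 2

print(plane0_candidates, shortfall, tables_needed)
```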

A simple strategy for permanent extension to MCCS:

	Make a list of character sets (or specific characters) in Unicode that we don't care about and another list of available MCCS codes (as calculated above). 

	Then for all defined Unicode characters U from 0 to 65535 (plus others beyond that which we care about):

		Unless (UTOXCODE? U) or U is on the don't-care list, assign the next available MCCS code to U.

	For all the assignments thus constructed, write out MCCS-to-Unicode mapping files for the new/changed character sets.
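
The simple strategy could be sketched roughly as follows. This is not the Medley implementation: `already_mapped` is a hypothetical stand-in for the `(UTOXCODE? U)` test, `ignored` stands for the don't-care list, and the packing of a code as `charset * 256 + low byte` is an assumption about how MCCS codes are laid out.

```python
def codes_in_charsets(charsets):
    """Yield every usable code in the given character sets (low byte 255 excluded).

    Assumes a code is packed as (charset << 8) | low_byte.
    """
    for cs in charsets:
        for low in range(255):
            yield (cs << 8) | low


def assign_codes(unicode_chars, already_mapped, ignored, free_mccs):
    """Give each unmapped, wanted Unicode code the next free MCCS code."""
    free = iter(free_mccs)
    assignments = {}
    for u in unicode_chars:
        if u in already_mapped or u in ignored:
            continue  # already has an XCCS/MCCS code, or we don't care about it
        code = next(free, None)
        if code is None:
            break  # the MCCS budget is exhausted
        assignments[u] = code
    return assignments


def by_charset(assignments):
    """Group (mccs, unicode) pairs by MCCS character set, one group per mapping file."""
    groups = {}
    for u, m in assignments.items():
        groups.setdefault(m >> 8, []).append((m, u))
    return {cs: sorted(pairs) for cs, pairs in groups.items()}


# Toy example: 0x41 is already mapped, 0x2201 is on the don't-care list.
result = assign_codes(
    [0x41, 0x2200, 0x2201, 0x2202],
    already_mapped={0x41},
    ignored={0x2201},
    free_mccs=codes_in_charsets([0x50]),
)
print(by_charset(result))
```

Each key returned by `by_charset` corresponds to one new or changed character set whose mapping file would need to be written out.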

A more sophisticated strategy, as Matt has suggested, would be to try to assign all the Unicode characters in a given character set to MCCS codes that also lie in a single character set, as long as a completely free MCCS character set is available. Unicode character sets would get dispersed only after the completely free character sets have been exhausted.
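
A rough sketch of that grouping idea, under several assumptions that are not settled in this discussion: a Unicode "character set" is taken to be the 256 codes sharing a high byte, an MCCS code is packed as `charset * 256 + low byte`, and characters are placed in consecutive slots rather than keeping their original low byte. The function and parameter names are hypothetical.

```python
USABLE_PER_CHARSET = 255  # low byte 255 is not allowed in a character set


def assign_by_charset(unicode_chars, free_charsets, spare_codes):
    """Keep each Unicode character set together in one free MCCS character set.

    Character sets that no longer fit (no free set left, or more than 255
    characters) are dispersed over the spare codes.
    """
    rows = {}
    for u in sorted(unicode_chars):
        rows.setdefault(u >> 8, []).append(u)

    free_sets = list(free_charsets)
    assignments, dispersed = {}, []
    for row in sorted(rows):
        chars = rows[row]
        if free_sets:
            cs = free_sets.pop(0)
            for low, u in enumerate(chars[:USABLE_PER_CHARSET]):
                assignments[u] = (cs << 8) | low
            dispersed.extend(chars[USABLE_PER_CHARSET:])  # at most one overflow
        else:
            dispersed.extend(chars)

    spare = iter(spare_codes)
    for u in dispersed:
        code = next(spare, None)
        if code is None:
            break  # nothing left; these characters stay unmapped
        assignments[u] = code
    return assignments


# Toy example: two free character sets, three Unicode character sets.
demo = assign_by_charset(
    [0x2200, 0x2201, 0x2500, 0x3000],
    free_charsets=[0x50, 0x51],
    spare_codes=[0x6010],
)
print({hex(u): hex(m) for u, m in demo.items()})
```

In the toy example the first two Unicode character sets each land in their own MCCS character set, and the third is dispersed into the spare code because no completely free set remains.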

We can then use this permanent extension to MCCS to map the glyphs from Unicode fonts into our internal MCCS-ordered font character sets, a la the work that Matt has been doing. 

