Dealing with surrogates in JavaScript

So there are these great, but painful to program with, things called “surrogate pairs”. If you’re reading this, you’re probably having trouble with them, but hopefully those troubles will be gone by the time you’re done.

#What? Why?

“Surrogate pairs”, as they’re called, let you store characters with very high code points. JavaScript strings are stored as “UTF-16”, which means a string is a sequence of 16-bit units, and each unit holds a number that identifies a character. Now this may seem all fine and dandy, but then our good friend Unicode comes to cause problems. Unicode is the standard defining what number (a “code point”) is associated with what character. If you speak English or some language with a reasonably sized alphabet, 16 bits may seem like plenty, but unfortunately there’s more. English’s entire alphabet (upper and lower case) can be represented in just 52 numbers. Now compare this to Chinese, with around 50,000 characters. There are around 7,000 languages, many with their own writing system. Additionally, symbols and numerals all need their own numbers. So how many characters can a single 16-bit unit represent? Well, take sixteen 1s and convert to decimal:

1111111111111111 = 2^{16} – 1 = 65535

Counting zero, that’s 65,536 possible values, which seems like a lot until you remember there are emoji on top of all those alphabets, numbers, and symbols. This adds up quick, and it wasn’t long before the encoding guys realized that 16 bits wasn’t enough. UTF-32 doesn’t have this problem, but keep in mind it uses 32 bits per character. The letter a would be: 00000000000000000000000001100001, which is ridiculously wasteful. The entire English alphabet can be represented in 6 bits. So they came up with surrogates, a way to represent very large code points without being wasteful: anything above 65535 is stored as two 16-bit units taken from a reserved range. Let’s take a look at how the grinning emoji 😀 (U+1F600) is represented:

'\uD83D\uDE00'

This may seem like an odd way to encode characters, but if you look at the binary representation it might become clearer:

1101100000111101 1101111000000000
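If you squint at those bits, the first unit starts with 110110 and the second with 110111. Those prefixes mark the two halves of a pair, and the remaining 10 bits of each unit carry the actual data. Here’s a rough sketch of that arithmetic; the helper name toSurrogatePair is made up for this illustration and is not a built-in.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into its two surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  var offset = codePoint - 0x10000;     // what is left fits in 20 bits
  var high = 0xD800 + (offset >> 10);   // top 10 bits go into the first unit
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go into the second unit
  return [high, low];
}

var pair = toSurrogatePair(0x1F600); // 😀
var hex = pair.map(function (unit) { return unit.toString(16); });
console.log(hex); // -> ["d83d", "de00"]
```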

#So how to deal with them?

ES6 (also known as ES2015) provides several features for handling these high code-point characters.

#Making them

Before, you’d have to figure out what the surrogates are in order to write a character outside the 16-bit range:

var emoji = '\uD83D\uDE00';

Now, with the \u{...} syntax, you can simply enter the Unicode code point of the character without having to figure out the surrogates:

let emoji = '\u{1F600}';

I hope you got that ._.
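As a quick sanity check, here’s a small sketch showing that the surrogate spelling and the \u{...} spelling of 😀 (U+1F600) give you the same string. It also uses two ES6 built-ins, String.fromCodePoint and codePointAt, for going between numbers and characters. The variable names are just for illustration.

```javascript
const viaSurrogates = '\uD83D\uDE00';
const viaCodePoint = '\u{1F600}';
console.log(viaSurrogates === viaCodePoint); // -> true

// Build the character from its number instead of an escape.
const fromNumber = String.fromCodePoint(0x1F600);
console.log(fromNumber === viaCodePoint); // -> true

// codePointAt reads the whole pair, charCodeAt only the first 16-bit unit.
console.log(viaCodePoint.codePointAt(0).toString(16)); // -> "1f600"
console.log(viaCodePoint.charCodeAt(0).toString(16));  // -> "d83d"
```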

#Iterating / Splitting them

Very often when manipulating strings, you need to reference a specific character, or loop through the string. Problem is, if you were to do:

"😀"[0]

you’d only get the first-half of the surrogate. Additionally if I were to split it ordinarily:

c // -> [", ", ", "]

The output is weirdly printed because the individual surrogates are being split and this results in a very improper UTF-16 string.
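To see what’s actually going on, here’s a small sketch that pokes at the individual 16-bit units. The string is written with escapes so the example doesn’t depend on how your file is encoded, and the variable names are only for illustration.

```javascript
var smile = "\uD83D\uDE00"; // 😀

// One visible character, but two 16-bit units.
console.log(smile.length); // -> 2
console.log(smile.charCodeAt(0).toString(16)); // -> "d83d"
console.log(smile.charCodeAt(1).toString(16)); // -> "de00"

// Splitting on "" cuts between the two units.
var halves = smile.split("");
console.log(halves.length); // -> 2

// Nothing is lost though: gluing the halves back together restores the emoji.
console.log(halves.join("") === smile); // -> true
```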

The solution is ES6’s “spread” syntax. This expands an iterable (e.g. a string or an array) into the surrounding array literal or argument list (confusing, I know), and it’s very similar to the splat in Ruby and Python. The syntax is ...iterable, and it has to sit inside something it can spread into, such as an array literal. An example usage is: [1, ...[2, 3]] -> [1, 2, 3]. The reason it helps here is that strings iterate by code point rather than by 16-bit unit.

So how does this apply? Just use it on a string and ta-da:

[..."🐶🐱🐹"] // -> ["🐶", "🐱", "🐹"]

Additionally, if you’d like to loop through the string:

for (let char of "🐶🐱🐹") {
    console.log(char); // "🐶"
                       // "🐱"
                       // "🐹"
}

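Warning: this relies on the ES6 string iterator. Native implementations have it, but a transpiler like Babel may compile spread and for...of into plain index-based loops (for example in loose mode, or without the iterator polyfills), and in that case this will not work.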
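Building on the spread and for...of behavior above, here’s a rough sketch of a few code-point-aware helpers. The function names are invented for this example, and it assumes a native ES6 string iterator.

```javascript
// Hypothetical helpers; none of these are built-ins.
function codePointLength(str) {
  return [...str].length; // counts code points, not 16-bit units
}

function charAtCodePoint(str, index) {
  return [...str][index]; // undefined when index is out of range
}

function reverseByCodePoint(str) {
  return [...str].reverse().join(""); // keeps each surrogate pair intact
}

const pets = "\u{1F436}\u{1F431}\u{1F439}"; // 🐶🐱🐹

console.log(pets.length);              // -> 6
console.log(codePointLength(pets));    // -> 3
console.log(charAtCodePoint(pets, 1)); // -> "🐱"
console.log(reverseByCodePoint(pets)); // -> "🐹🐱🐶"
```

Spreading the whole string on every call is wasteful for long strings, so for anything hot you’d probably split once and reuse the array.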

Don’t have ES6 (or can’t support it)? Well then you can use this magic bit of code:

"🐶🐱🐹".match(/([\uD800-\uDBFF][\uDC00-\uDFFF])|[\s\S]/g) // -> ["🐶", "🐱", "🐹"]

Surrogates have their own reserved ranges for the first and second half of a pair. Because of this, we can match one of two things at each position in the string:

Match a start-surrogate, then an ending-surrogate ([\uD800-\uDBFF][\uDC00-\uDFFF])
Match any other single 16-bit unit, newlines included ([\s\S])

The match function then takes these matches and puts them in an array.
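Here’s a rough sketch of that regex wrapped in a reusable, ES5-friendly function. The name splitCodePoints is made up, and the fallback branch is written as [\s\S] so that any remaining unit (letters, newlines, whatever) is kept.

```javascript
// Hypothetical ES5-compatible helper; not a built-in.
function splitCodePoints(str) {
  // Either a full surrogate pair, or else any single 16-bit unit.
  var pattern = /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g;
  return str.match(pattern) || []; // match() returns null for an empty string
}

var parts = splitCodePoints("a\uD83D\uDC36\nb"); // "a", 🐶, newline, "b"
console.log(parts.length); // -> 4
console.log(parts[1] === "\uD83D\uDC36"); // -> true

console.log(splitCodePoints("").length); // -> 0
```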

So, yay? Not quite. You know those emoji with different skin tones on your iPhone? The thing that gives them their specific color is a separate character of its own, a skin-tone modifier (U+1F3FB through U+1F3FF), so such an emoji is represented as <emoji_surrogate_pair><skin_tone_modifier>. You’ll need to handle that separately:

"👶🏽".match(/([\uD800-\uDBFF][\uDC00-\uDFFF])(\uD83C[\uDFFB-\uDFFF])?|[\s\S]/g)
  // -> ["👶🏽"]

Compared with the previous pattern, this one adds an optional (\uD83C[\uDFFB-\uDFFF])?, which matches a skin-tone modifier character.
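To make the difference concrete, here’s a small sketch that runs a pair-only pattern and a modifier-aware pattern over the same baby emoji. The emoji is spelled out with escapes (U+1F476 followed by the U+1F3FD modifier), and the variable names are only for illustration.

```javascript
var baby = "\uD83D\uDC76\uD83C\uDFFD"; // 👶 followed by a medium skin-tone modifier

// Pair-only pattern: the modifier comes out as its own element.
var pairsOnly = baby.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g);
console.log(pairsOnly.length); // -> 2

// Modifier-aware pattern: the emoji and its modifier stay together.
var withModifier = baby.match(
  /[\uD800-\uDBFF][\uDC00-\uDFFF](?:\uD83C[\uDFFB-\uDFFF])?|[\s\S]/g
);
console.log(withModifier.length);      // -> 1
console.log(withModifier[0] === baby); // -> true
```

Keep in mind this only covers skin-tone modifiers. Other multi-part emoji (flags, families joined together, and so on) would need more work than this sketch does.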

#Filtering

You might want to filter out these high code-point characters, to avoid spam, save space, or for other reasons. If you really want to do this, you might just limit the string to printable ASCII:

str.replace(/[^\x20-\x7E]/g, "");
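If printable ASCII is too strict and you only want to drop the characters that need surrogate pairs, something like the following sketch might do. The function names are made up. The first version assumes ES6 support for the u flag, and the second sticks to ES5.

```javascript
// Hypothetical helpers; neither is a built-in.

// ES6: with the u flag, the regex works on whole code points.
function stripAstral(str) {
  return str.replace(/[\u{10000}-\u{10FFFF}]/gu, "");
}

// ES5: match the surrogate pairs directly.
function stripAstralES5(str) {
  return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, "");
}

var input = "caf\u00E9\uD83D\uDE00!"; // "café😀!"
console.log(stripAstral(input));    // -> "café!"
console.log(stripAstralES5(input)); // -> "café!"
```

Unlike the ASCII filter, this keeps accented letters and other characters that fit in a single 16-bit unit.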

But if you only want to remove unprintable characters, you’ll have to use a monster regex…

const REGEX_UNPRINTABLES = /[\0-\x1F\x7F-\x9F\xAD\u0378\u0379\u037F-\u0383\u038B\u038D\u03A2\u0528-\u0530\u0557\u0558\u0560\u0588\u058B-\u058E\u0590\u05C8-\u05CF\u05EB-\u05EF\u05F5-\u0605\u061C\u061D\u06DD\u070E\u070F\u074B\u074C\u07B2-\u07BF\u07FB-\u07FF\u082E\u082F\u083F\u085C\u085D\u085F-\u089F\u08A1\u08AD-\u08E3\u08FF\u0978\u0980\u0984\u098D\u098E\u0991\u0992\u09A9\u09B1\u09B3-\u09B5\u09BA\u09BB\u09C5\u09C6\u09C9\u09CA\u09CF-\u09D6\u09D8-\u09DB\u09DE\u09E4\u09E5\u09FC-\u0A00\u0A04\u0A0B-\u0A0E\u0A11\u0A12\u0A29\u0A31\u0A34\u0A37\u0A3A\u0A3B\u0A3D\u0A43-\u0A46\u0A49\u0A4A\u0A4E-\u0A50\u0A52-\u0A58\u0A5D\u0A5F-\u0A65\u0A76-\u0A80\u0A84\u0A8E\u0A92\u0AA9\u0AB1\u0AB4\u0ABA\u0ABB\u0AC6\u0ACA\u0ACE\u0ACF\u0AD1-\u0ADF\u0AE4\u0AE5\u0AF2-\u0B00\u0B04\u0B0D\u0B0E\u0B11\u0B12\u0B29\u0B31\u0B34\u0B3A\u0B3B\u0B45\u0B46\u0B49\u0B4A\u0B4E-\u0B55\u0B58-\u0B5B\u0B5E\u0B64\u0B65\u0B78-\u0B81\u0B84\u0B8B-\u0B8D\u0B91\u0B96-\u0B98\u0B9B\u0B9D\u0BA0-\u0BA2\u0BA5-\u0BA7\u0BAB-\u0BAD\u0BBA-\u0BBD\u0BC3-\u0BC5\u0BC9\u0BCE\u0BCF\u0BD1-\u0BD6\u0BD8-\u0BE5\u0BFB-\u0C00\u0C04\u0C0D\u0C11\u0C29\u0C34\u0C3A-\u0C3C\u0C45\u0C49\u0C4E-\u0C54\u0C57\u0C5A-\u0C5F\u0C64\u0C65\u0C70-\u0C77\u0C80\u0C81\u0C84\u0C8D\u0C91\u0CA9\u0CB4\u0CBA\u0CBB\u0CC5\u0CC9\u0CCE-\u0CD4\u0CD7-\u0CDD\u0CDF\u0CE4\u0CE5\u0CF0\u0CF3-\u0D01\u0D04\u0D0D\u0D11\u0D3B\u0D3C\u0D45\u0D49\u0D4F-\u0D56\u0D58-\u0D5F\u0D64\u0D65\u0D76-\u0D78\u0D80\u0D81\u0D84\u0D97-\u0D99\u0DB2\u0DBC\u0DBE\u0DBF\u0DC7-\u0DC9\u0DCB-\u0DCE\u0DD5\u0DD7\u0DE0-\u0DF1\u0DF5-\u0E00\u0E3B-\u0E3E\u0E5C-\u0E80\u0E83\u0E85\u0E86\u0E89\u0E8B\u0E8C\u0E8E-\u0E93\u0E98\u0EA0\u0EA4\u0EA6\u0EA8\u0EA9\u0EAC\u0EBA\u0EBE\u0EBF\u0EC5\u0EC7\u0ECE\u0ECF\u0EDA\u0EDB\u0EE0-\u0EFF\u0F48\u0F6D-\u0F70\u0F98\u0FBD\u0FCD\u0FDB-\u0FFF\u10C6\u10C8-\u10CC\u10CE\u10CF\u1249\u124E\u124F\u1257\u1259\u125E\u125F\u1289\u128E\u128F\u12B1\u12B6\u12B7\u12BF\u12C1\u12C6\u12C7\u12D7\u1311\u1316\u1317\u135B\u135C\u137D-\u137F\u139A-\u139F\u13F5-\u13FF\u169D-\u169F\u16F1-\u16FF\u170D\u1715-\u171F\u1737-\u173F\u1754-\u175F\u176D\u1771\u1774-\u177F\u17DE\u17DF\u17EA-\u17EF\u17FA-\u17FF\u180F\u181A-\u181F\u1878-\u187F\u18AB-\u18AF\u18F6-\u18FF\u191D-\u191F\u192C-\u192F\u193C-\u193F\u1941-\u1943\u196E\u196F\u1975-\u197F\u19AC-\u19AF\u19CA-\u19CF\u19DB-\u19DD\u1A1C\u1A1D\u1A5F\u1A7D\u1A7E\u1A8A-\u1A8F\u1A9A-\u1A9F\u1AAE-\u1AFF\u1B4C-\u1B4F\u1B7D-\u1B7F\u1BF4-\u1BFB\u1C38-\u1C3A\u1C4A-\u1C4C\u1C80-\u1CBF\u1CC8-\u1CCF\u1CF7-\u1CFF\u1DE7-\u1DFB\u1F16\u1F17\u1F1E\u1F1F\u1F46\u1F47\u1F4E\u1F4F\u1F58\u1F5A\u1F5C\u1F5E\u1F7E\u1F7F\u1FB5\u1FC5\u1FD4\u1FD5\u1FDC\u1FF0\u1FF1\u1FF5\u1FFF\u200B-\u200F\u202A-\u202E\u2060-\u206F\u2072\u2073\u208F\u209D-\u209F\u20BB-\u20CF\u20F1-\u20FF\u218A-\u218F\u23F4-\u23FF\u2427-\u243F\u244B-\u245F\u2700\u2B4D-\u2B4F\u2B5A-\u2BFF\u2C2F\u2C5F\u2CF4-\u2CF8\u2D26\u2D28-\u2D2C\u2D2E\u2D2F\u2D68-\u2D6E\u2D71-\u2D7E\u2D97-\u2D9F\u2DA7\u2DAF\u2DB7\u2DBF\u2DC7\u2DCF\u2DD7\u2DDF\u2E3C-\u2E7F\u2E9A\u2EF4-\u2EFF\u2FD6-\u2FEF\u2FFC-\u2FFF\u3040\u3097\u3098\u3100-\u3104\u312E-\u3130\u318F\u31BB-\u31BF\u31E4-\u31EF\u321F\u32FF\u4DB6-\u4DBF\u9FCD-\u9FFF\uA48D-\uA48F\uA4C7-\uA4CF\uA62C-\uA63F\uA698-\uA69E\uA6F8-\uA6FF\uA78F\uA794-\uA79F\uA7AB-\uA7F7\uA82C-\uA82F\uA83A-\uA83F\uA878-\uA87F\uA8C5-\uA8CD\uA8DA-\uA8DF\uA8FC-\uA8FF\uA954-\uA95E\uA97D-\uA97F\uA9CE\uA9DA-\uA9DD\uA9E0-\uA9FF\uAA37-\uAA3F\uAA4E\uAA4F\uAA5A\uAA5B\uAA7C-\uAA7F\uAAC3-\uAADA\uAAF7-\uAB00\uAB07\uAB08\uAB0F\uAB10\uAB17-\uAB1F\uAB27\uAB2F-\uABBF\uABEE\uABEF\uABFA-\uABFF\uD7A4-\uD7AF\uD7C7-\uD7CA\uD7FC-\uF8FF\uFA6E\uFA6F\uFADA-\uFAFF\uFB07-\uFB12\uFB18-\uFB1C\uFB37\uFB3D\uFB3F\uFB42\uFB45\uFBC2-\uFBD2\uFD40-\uFD4F\uFD90\uFD91\uFDC8-\uFDEF\uFDFE\uFDFF\uFE1A-\uFE1F\uFE27-\uFE2F\uFE53\uFE67\uFE6C-\uFE6F\uFE75\uFEFD-\uFF00\uFFBF-\uFFC1\uFFC8\uFFC9\uFFD0\uFFD1\uFFD8\uFFD9\uFFDD-\uFFDF\uFFE7\uFFEF-\uFFFB\uFFFE\uFFFF]/g;

str.replace(REGEX_UNPRINTABLES, "");

Yeah… you should probably do this server-side

Have any questions? Leave a comment!