
JavaScript strings outside of the BMP

  •  37
  • Delan Azabani  ·  14 years ago

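    BMP stands for Basic Multilingual Plane.

    According to JavaScript: The Good Parts:

    JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide.

    Further investigation confirms this: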

    > String.fromCharCode(0x20001);
    "\u0001"

    The fromCharCode method seems to use only the lowest 16 bits when returning the Unicode character. Trying to get U+20001 (CJK unified ideograph 20001) returns U+0001 instead.
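
    The truncation can be checked with plain arithmetic: masking 0x20001 to its low 16 bits leaves 0x0001. A minimal sketch, with variable names chosen only for illustration:

```javascript
// String.fromCharCode converts each argument to a 16-bit unsigned integer,
// so only the low 16 bits of 0x20001 survive.
const requested = 0x20001;
const truncated = requested & 0xFFFF;           // keep the low 16 bits
const result = String.fromCharCode(requested);

console.log(truncated.toString(16));            // "1"
console.log(result === "\u0001");               // true
console.log(result.length);                     // 1
```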

    Question: is it at all possible to handle post-BMP characters in JavaScript?


    The Good, the Bad, and the (Mostly) Ugly

    5 answers  |  up to 10 years ago
        1
  •  35
  •   bobince    10 years ago

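    It depends on what you mean by "support". You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.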

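    But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, slice, etc.) deal with code units, not characters, so they will quite happily split surrogate pairs or hold invalid surrogate sequences.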
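
    A short sketch of what "code units, not characters" means in practice. The sample string is an assumption made for illustration: U+20001 written as its surrogate pair, between two ASCII letters.

```javascript
// U+20001 written as its surrogate pair, between two ASCII letters.
const s = "a\uD840\uDC01b";

console.log(s.length);                          // 4 code units for 3 characters
console.log(s.charCodeAt(1).toString(16));      // "d840", the high surrogate alone

// slice cuts between the two halves of the pair, leaving a lone surrogate.
const broken = s.slice(0, 2);
console.log(broken.length);                     // 2
console.log(broken.charCodeAt(1) === 0xD840);   // true

// split("") also separates the pair into two one-unit strings.
console.log(s.split("").length);                // 4
```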

    If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:

    String.prototype.getCodePointLength= function() {
        return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
    };
    
    String.fromCodePoint= function() {
        var chars= Array.prototype.slice.call(arguments);
        for (var i= chars.length; i-->0;) {
            var n = chars[i]-0x10000;
            if (n>=0)
                chars.splice(i, 1, 0xD800+(n>>10), 0xDC00+(n&0x3FF));
        }
        return String.fromCharCode.apply(null, chars);
    };
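
    The surrogate arithmetic used above can be worked through for U+20001. This is only a sketch; toSurrogatePair is a hypothetical helper name and it does no input validation.

```javascript
// Hypothetical helper: returns the two UTF-16 code units for a code point
// above U+FFFF. It does not validate its input.
function toSurrogatePair(codePoint) {
    const n = codePoint - 0x10000;       // 20-bit offset into the supplementary planes
    const high = 0xD800 + (n >> 10);     // top 10 bits go into the high surrogate
    const low = 0xDC00 + (n & 0x3FF);    // bottom 10 bits go into the low surrogate
    return [high, low];
}

// 0x20001 - 0x10000 = 0x10001; 0x10001 >> 10 = 0x40; 0x10001 & 0x3FF = 0x1
const pair = toSurrogatePair(0x20001);
console.log(pair.map(u => u.toString(16)).join(" "));   // "d840 dc01"

const built = String.fromCharCode(pair[0], pair[1]);
console.log(built.length);                               // 2
```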
    
        2
  •  2
  •   ecellingsworth    11 years ago

    I've re-implemented the following methods to treat each Unicode code point as a single character: .length, .charCodeAt, .fromCharCode, .charAt, .indexOf, .lastIndexOf, .slice, and .split.

    http://jsfiddle.net/Y89Du/

    if (!String.prototype.ucLength) {
        String.prototype.ucLength = function() {
            // this solution was taken from 
            // http://stackoverflow.com/questions/3744721/javascript-strings-outside-of-the-bmp
            return this.length - this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length + 1;
        };
    }
    
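    // Note: unlike the standard String.prototype.codePointAt, which takes a
    // code unit index, this version takes a code point index. Engines that
    // already provide the native method skip this block.
    if (!String.prototype.codePointAt) {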
        String.prototype.codePointAt = function (ucPos) {
            if (isNaN(ucPos)){
                ucPos = 0;
            }
            var str = String(this);
            var codePoint = null;
            var pairFound = false;
            var ucIndex = -1;
            var i = 0;  
            while (i < str.length){
                ucIndex += 1;
                var code = str.charCodeAt(i);
                var next = str.charCodeAt(i + 1);
                pairFound = (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF);
                if (ucIndex == ucPos){
                    codePoint = pairFound ? ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000 : code;
                    break;
                } else{
                    i += pairFound ? 2 : 1;
                }
            }
            return codePoint;
        };
    }
    
    if (!String.fromCodePoint) {
        String.fromCodePoint = function () {
            var strChars = [], codePoint, offset, codeValues, i;
            for (i = 0; i < arguments.length; ++i) {
                codePoint = arguments[i];
                offset = codePoint - 0x10000;
                if (codePoint > 0xFFFF){
                    codeValues = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
                } else{
                    codeValues = [codePoint];
                }
                strChars.push(String.fromCharCode.apply(null, codeValues));
            }
            return strChars.join("");
        };
    }
    
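    if (!String.prototype.ucCharAt) {
        String.prototype.ucCharAt = function (ucIndex) {
            // Walk the code units directly: a native codePointAt takes a code
            // unit index, so it cannot be called with a code point index here.
            var str = String(this);
            if (isNaN(ucIndex)){ ucIndex = 0; }
            for (var i = 0, ucPos = 0; i < str.length; ucPos++) {
                var code = str.charCodeAt(i);
                var next = str.charCodeAt(i + 1);
                var width = (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF) ? 2 : 1;
                if (ucPos == ucIndex){ return str.slice(i, i + width); }
                i += width;
            }
            return ""; // index out of range, as with charAt
        };
    }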
    
    if (!String.prototype.ucIndexOf) {
        String.prototype.ucIndexOf = function (searchStr, ucStart) {
            if (isNaN(ucStart)){
                ucStart = 0;
            }
            if (ucStart < 0){
                ucStart = 0;
            }
            var str = String(this);
            var strUCLength = str.ucLength();
            searchStr = String(searchStr);
            var ucSearchLength = searchStr.ucLength();
            var i = ucStart;
            while (i < strUCLength){
                var ucSlice = str.ucSlice(i,i+ucSearchLength);
                if (ucSlice == searchStr){
                    return i;
                }
                i++;
            }
            return -1;
        };
    }
    
    if (!String.prototype.ucLastIndexOf) {
        String.prototype.ucLastIndexOf = function (searchStr, ucStart) {
            var str = String(this);
            var strUCLength = str.ucLength();
            if (isNaN(ucStart)){
                ucStart = strUCLength - 1;
            }
            if (ucStart >= strUCLength){
                ucStart = strUCLength - 1;
            }
            searchStr = String(searchStr);
            var ucSearchLength = searchStr.ucLength();
            var i = ucStart;
            while (i >= 0){
                var ucSlice = str.ucSlice(i,i+ucSearchLength);
                if (ucSlice == searchStr){
                    return i;
                }
                i--;
            }
            return -1;
        };
    }
    
    if (!String.prototype.ucSlice) {
        String.prototype.ucSlice = function (ucStart, ucStop) {
            var str = String(this);
            var strUCLength = str.ucLength();
            if (isNaN(ucStart)){
                ucStart = 0;
            }
            if (ucStart < 0){
                ucStart = strUCLength + ucStart;
                if (ucStart < 0){ ucStart = 0;}
            }
            if (typeof(ucStop) == 'undefined' || ucStop > strUCLength){
                // The end index is exclusive, so default (and clamp) to the full length.
                ucStop = strUCLength;
            }
            if (ucStop < 0){
                ucStop = strUCLength + ucStop;
                if (ucStop < 0){ ucStop = 0;}
            }
            var ucChars = [];
            var i = ucStart;
            while (i < ucStop){
                ucChars.push(str.ucCharAt(i));
                i++;
            }
            return ucChars.join("");
        };
    }
    
    if (!String.prototype.ucSplit) {
        String.prototype.ucSplit = function (delimiter, limit) {
            var str = String(this);
            var strUCLength = str.ucLength();
            var ucChars = [];
            if (delimiter == ''){
                for (var i = 0; i < strUCLength; i++){
                    ucChars.push(str.ucCharAt(i));
                }
                // Only truncate when a limit was actually passed: 0 + undefined
                // is NaN, and slice(0, NaN) would return an empty array.
                if (typeof(limit) != 'undefined'){
                    ucChars = ucChars.slice(0, limit);
                }
            } else{
                ucChars = str.split(delimiter, limit);
            }
            return ucChars;
        };
    }
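
    In engines that support ES2015 string iteration, code-point-indexed operations like the ones above can be sketched much more briefly with Array.from. The helper names below (cpLength, cpSlice, cpCharAt) are hypothetical and are not part of the answer's code; each call builds a temporary array, so this is a sketch rather than a tuned implementation.

```javascript
// Hypothetical helpers; they assume an ES2015 engine, where iterating a
// string yields whole code points rather than code units.
function cpLength(str) {
    return Array.from(str).length;
}

function cpSlice(str, start, end) {
    // Array.prototype.slice already handles negative and missing indices.
    return Array.from(str).slice(start, end).join("");
}

function cpCharAt(str, index) {
    const chars = Array.from(str);
    return index >= 0 && index < chars.length ? chars[index] : "";
}

const sample = "a\uD840\uDC01b";
console.log(cpLength(sample));                           // 3
console.log(cpSlice(sample, 1, 2) === "\uD840\uDC01");   // true
console.log(cpCharAt(sample, 2));                        // "b"
console.log(cpSlice(sample, -1));                        // "b"
```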
    
        3
  •  1
  •   Michael Allan    7 years ago

    More recent JavaScript engines have String.fromCodePoint.

    const ideograph = String.fromCodePoint( 0x20001 ); // outside the BMP
    

    They also have a code-point iterator, which gets you the code point length.

    function countCodePoints( str )
    {
        const i = str[Symbol.iterator]();
        let count = 0;
        while( !i.next().done ) ++count;
        return count;
    }
    
    console.log( ideograph.length ); // gives '2'
    console.log( countCodePoints(ideograph) ); // '1'
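
    Spread syntax and Array.from rely on the same string iterator, so the count can also be sketched without an explicit loop. The sample string and variable names here are assumptions made for illustration.

```javascript
// "x", U+20001, "y": three code points stored in four code units.
const text = "x" + String.fromCodePoint(0x20001) + "y";

console.log(text.length);        // 4 code units
console.log([...text].length);   // 3 code points

// Array.from with a mapping function: the code point of each element, in hex.
const hex = Array.from(text, ch => ch.codePointAt(0).toString(16));
console.log(hex.join(" "));      // "78 20001 79"
```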
    
        4
  •  0
  •   Jukka K. Korpela    12 years ago

    Full Unicode Input utility.

    Using suitable tools and settings, you can write var foo = '𠀁'

    Non-BMP characters will be represented internally as surrogate pairs, so each non-BMP character counts as 2 in the string length.
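
    As a sketch of that equivalence, the literal character can be compared with two escaped spellings of U+20001. The \u{...} form assumes an ES2015 engine, and the literal assumes the file is saved and read as UTF-8.

```javascript
// Three spellings of U+20001. The first requires the source file to be
// saved and read in an encoding such as UTF-8.
const literal = "𠀁";
const surrogateEscape = "\uD840\uDC01";
const codePointEscape = "\u{20001}";    // ES2015 code point escape

console.log(surrogateEscape === codePointEscape);   // true
console.log(literal === codePointEscape);           // true
console.log(literal.length);                        // 2
```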

        5
  •  0
  •   Simon Hi    6 years ago

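    Using the for (c of this) statement, you can do various computations on a string that contains non-BMP characters. For example, to compute the string length and to get the nth character of the string: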

    String.prototype.magicLength = function()
    {
        var c, k;
        k = 0;
        for (c of this) // iterate each char of this
        {
            k++;
        }
        return k;
    }
    
    String.prototype.magicCharAt = function(n)
    {
        var c, k;
        k = 0;
        for (c of this) // iterate each char of this
        {
            if (k == n) return c + "";
            k++;
        }
        return "";
    }