
Spaces in the .NET string returned by string.Format do not match spaces declared in source code: multiple representations?

string .net c#

Marek · Tech Community · 15 years ago

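The string returned by string.Format seems to use some strange encoding. Spaces produced by the formatted output are represented by different byte values than the spaces in string literals declared in source code.

The following test case demonstrates the problem: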

[Test]
public void FormatSize_Regression() 
{
  string size1023 = FileHelper.FormatSize(1023);
  Assert.AreEqual("1 023 Bytes", size1023);
}

It fails with:

    String lengths are both 11. Strings differ at index 1.
    Expected: "1 023 Bytes"
    But was:  "1 023 Bytes"
    ------------^

The FormatSize method:

public static string FormatSize(long size) 
{
  if (size < 1024)
     return string.Format("{0:N0} Bytes", size);
  else if (size < 1024 * 1024)
     return string.Format("{0:N2} KB", (double)((double)size / 1024));
  else
     return string.Format("{0:N2} MB", (double)((double)size / (1024 * 1024)));
}

With a breakpoint set on the assertion line, from the VS Immediate window:

size1023
"1 023 Bytes"

System.Text.Encoding.UTF8.GetBytes(size1023)
{byte[12]}
    [0]: 49
    [1]: 194 <--------- space is 194/160 here? Unicode bytes indicate that space should be the 160. What is the 194 then?
    [2]: 160
    [3]: 48
    [4]: 50
    [5]: 51
    [6]: 32
    [7]: 66
    [8]: 121
    [9]: 116
    [10]: 101
    [11]: 115
System.Text.Encoding.UTF8.GetBytes("1 023 Bytes")
{byte[11]}
    [0]: 49
    [1]: 32  <--------- space is 32 here
    [2]: 48
    [3]: 50
    [4]: 51
    [5]: 32
    [6]: 66
    [7]: 121
    [8]: 116
    [9]: 101
    [10]: 115

System.Text.Encoding.Unicode.GetBytes(size1023)
{byte[22]}
    [0]: 49
    [1]: 0
    [2]: 160 <----------- 160,0 here
    [3]: 0
    [4]: 48
    [5]: 0
    [6]: 50
    [7]: 0
    [8]: 51
    [9]: 0
    [10]: 32
    [11]: 0
    [12]: 66
    [13]: 0
    [14]: 121
    [15]: 0
    [16]: 116
    [17]: 0
    [18]: 101
    [19]: 0
    [20]: 115
    [21]: 0
System.Text.Encoding.Unicode.GetBytes("1 023 Bytes")
{byte[22]}
    [0]: 49
    [1]: 0
    [2]: 32 <----------- 32,0 here
    [3]: 0
    [4]: 48
    [5]: 0
    [6]: 50
    [7]: 0
    [8]: 51
    [9]: 0
    [10]: 32
    [11]: 0
    [12]: 66
    [13]: 0
    [14]: 121
    [15]: 0
    [16]: 116
    [17]: 0
    [18]: 101
    [19]: 0
    [20]: 115
    [21]: 0

Question: how is this possible?

6 answers | up to 15 years ago

Jon Skeet 15 years ago

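I suspect your current culture uses an interesting "thousands" separator: U+00A0, which is the non-breaking space character. To be honest, that is not an entirely unreasonable thousands separator. It means you should not end up with text displayed like this: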

The size of the file is 1
023 bytes.

Instead, you would get

The size of the file is
1 023 bytes.

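On my box, I get "1,023" instead. Do you want your FormatSize method to use the current culture, or a specific culture? If it is the current culture, you should probably make your unit test specify the culture. I have a couple of wrapper methods for this: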

internal static void WithInvariantCulture(Action action)
{
    WithCulture(CultureInfo.InvariantCulture, action);
}

internal static void WithCulture(CultureInfo culture, Action action)
{
    CultureInfo original = Thread.CurrentThread.CurrentCulture;
    try
    {
        Thread.CurrentThread.CurrentCulture = culture;
        action();
    }
    finally
    {
        Thread.CurrentThread.CurrentCulture = original;
    }            
}

So I can run:

WithInvariantCulture(() =>
{
    // Body of test
});

and so on.

If you want to test the exact string you get back, you can use:

Assert.AreEqual("1\u00A0023 Bytes", size1023);

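Ruben 15 years ago

Unicode code point 160 is not represented in UTF-8 by the single byte 160, but by two bytes. Without checking, I would bet they are 194 followed by 160.

In fact, any Unicode code point above 127 is represented by more than one byte in UTF-8.

My guess is that your CultureInfo uses a non-breaking space (160) as the thousands group separator, rather than the plain space (32) you typed yourself.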
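To make the byte values above concrete, here is a minimal sketch that encodes a few single characters to UTF-8. The class and method names (`NbspEncodingDemo`, `Utf8Bytes`) are made up for illustration; only `Encoding.UTF8.GetBytes` and `string.Join` come from the framework.

```csharp
using System;
using System.Text;

public static class NbspEncodingDemo
{
    // Hypothetical helper: returns the UTF-8 bytes of the given text.
    public static byte[] Utf8Bytes(string text) => Encoding.UTF8.GetBytes(text);

    public static void Main()
    {
        // Code points up to U+007F fit in one byte.
        Console.WriteLine(string.Join(",", Utf8Bytes(" ")));      // 32
        Console.WriteLine(string.Join(",", Utf8Bytes("\u007F"))); // 127

        // From U+0080 upwards, UTF-8 needs at least two bytes.
        Console.WriteLine(string.Join(",", Utf8Bytes("\u0080"))); // 194,128
        Console.WriteLine(string.Join(",", Utf8Bytes("\u00A0"))); // 194,160
    }
}
```

So the 194 in the question is not a second character, just the lead byte of the two-byte sequence for U+00A0.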

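Eamon Nerbonne 15 years ago

194, 160 is the UTF-8 encoding of code point 160: the non-breaking space, &nbsp; in HTML.

That makes sense: you do not want a number to be treated as several words.

In short, your test has exposed a flawed assumption, which is great! As a unit test, however, it has a problem: whenever you convert to or from a string, you should always pass a CultureInfo object. Otherwise the unit test may fail depending on the regional settings of the logged-in user. You need a specific form of string formatting, so state explicitly which culture info you expect.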
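As a rough sketch of what passing the culture explicitly could look like, the variant below takes an `IFormatProvider` argument. The overload, the class name `FileSizeFormatter` and the helper `NbspGrouping` are assumptions for illustration and are not part of the question's `FileHelper`.

```csharp
using System;
using System.Globalization;

public static class FileSizeFormatter
{
    // Hypothetical overload: same thresholds as the question's FormatSize,
    // but the caller decides which format provider is used.
    public static string FormatSize(long size, IFormatProvider provider)
    {
        if (size < 1024)
            return string.Format(provider, "{0:N0} Bytes", size);
        if (size < 1024 * 1024)
            return string.Format(provider, "{0:N2} KB", size / 1024.0);
        return string.Format(provider, "{0:N2} MB", size / (1024.0 * 1024.0));
    }

    // Hypothetical helper: invariant number format, but grouped with U+00A0,
    // to reproduce the question's output without relying on OS culture data.
    public static NumberFormatInfo NbspGrouping()
    {
        var format = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        format.NumberGroupSeparator = "\u00A0";
        return format;
    }

    public static void Main()
    {
        Console.WriteLine(FormatSize(1023, CultureInfo.InvariantCulture));          // 1,023 Bytes
        Console.WriteLine(FormatSize(1023, NbspGrouping()) == "1\u00A0023 Bytes"); // True
    }
}
```

With the provider fixed by the caller, a test can assert on an exact string regardless of the machine's regional settings.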

Konamiman 15 years ago

Perhaps you could use CultureInfo.CurrentCulture.NumberFormat.NumberGroupSeparator in the Assert.AreEqual call instead of a literal space character?
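A minimal sketch of that idea follows. The names `ExpectedStringDemo` and `Expected1023` are hypothetical, and the sketch assumes the culture groups digits in threes, which `NumberGroupSeparator` alone does not guarantee.

```csharp
using System;
using System.Globalization;

public static class ExpectedStringDemo
{
    // Hypothetical helper: builds the expected text for 1023 from whatever
    // group separator the given culture defines (assumes groups of three digits).
    public static string Expected1023(CultureInfo culture)
    {
        string separator = culture.NumberFormat.NumberGroupSeparator;
        return "1" + separator + "023 Bytes";
    }

    public static void Main()
    {
        CultureInfo culture = CultureInfo.CurrentCulture;
        string actual = string.Format(culture, "{0:N0} Bytes", 1023);

        // Compares the formatted value with the expectation built from the same culture.
        Console.WriteLine(actual == Expected1023(culture));
    }
}
```

This keeps the test passing on any machine, at the price of no longer pinning down which separator the user actually sees.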

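J. Steen 15 years ago

160 is a non-breaking space, which makes sense, because you do not want your number split across two lines. But 194... oh, right. A UTF-8 two-byte sequence.

Jonathan van de Veen 15 years ago

First, all strings in .NET are Unicode, so getting the UTF-8 bytes is pointless. Second, you should specify culture information when comparing strings, and pass an IFormatProvider when using string.Format. That way you control which characters those functions use.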
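One way to look at the characters directly, without going through an encoding, is to print the UTF-16 code units of the string. This is only a sketch; `CodeUnitDemo` and `CodeUnits` are made-up names.

```csharp
using System;
using System.Linq;

public static class CodeUnitDemo
{
    // Hypothetical helper: lists each UTF-16 code unit of the string as "U+XXXX".
    public static string[] CodeUnits(string text) =>
        text.Select(c => "U+" + ((int)c).ToString("X4")).ToArray();

    public static void Main()
    {
        // The second code unit is U+00A0, not U+0020.
        Console.WriteLine(string.Join(" ", CodeUnits("1\u00A0023")));
        // U+0031 U+00A0 U+0030 U+0032 U+0033

        // An ordinal comparison treats the two kinds of space as different characters.
        Console.WriteLine(string.Equals("1\u00A0023", "1 023", StringComparison.Ordinal)); // False
    }
}
```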