
Parsing a UTF8 string from ReadOnlySequence<byte>

.net c#

trampster · 5 years ago

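How can I parse a UTF8 string from a ReadOnlySequence<byte>?

A ReadOnlySequence is made up of segments, and because UTF8 characters are variable length, a break between segments can fall in the middle of a character. So simply calling Encoding.UTF8.GetString() on each segment and combining the results in a StringBuilder will not work.

Is it possible to parse a UTF8 string from a ReadOnlySequence without first combining the segments into one array? I would rather avoid the memory allocation here.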
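
To make the failure concrete, here is a minimal sketch of the naive per-chunk approach. It is an illustration only: the class and method names (NaiveDecodingDemo, DecodeNaively) are hypothetical, and plain byte arrays stand in for the segments of a ReadOnlySequence<byte> to keep the example short.

```csharp
using System.Text;

public static class NaiveDecodingDemo
{
    // Decodes every chunk on its own and concatenates the results, the way a
    // per-segment Encoding.UTF8.GetString() loop would. Each call starts with
    // no knowledge of the previous chunk, so a character whose bytes are split
    // across two chunks cannot be reassembled.
    public static string DecodeNaively(params byte[][] chunks)
    {
        var sb = new StringBuilder();
        foreach (var chunk in chunks)
        {
            sb.Append(Encoding.UTF8.GetString(chunk));
        }
        return sb.ToString();
    }
}
```

"é" (U+00E9) is encoded in UTF8 as the two bytes 0xC3 0xA9. Calling DecodeNaively(new byte[] { 0x61, 0xC3 }, new byte[] { 0xA9, 0x62 }) returns "a\uFFFD\uFFFDb" instead of "aéb": each half of the split character is replaced by U+FFFD.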


user1781290 · 5 years ago

You can use a Decoder for this. It should look something like this:

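// Requires: using System.Buffers; using System.Text;
// `bytes` is the ReadOnlySequence<byte> to decode.
var decoder = Encoding.UTF8.GetDecoder();
var sb = new StringBuilder();
var processed = 0L;
var total = bytes.Length;
foreach (var segment in bytes)
{
    processed += segment.Length;
    var isLast = processed == total; // flush the decoder on the final segment
    var span = segment.Span;
    var charCount = decoder.GetCharCount(span, isLast);
    // Rent instead of stackalloc: stack memory allocated in a loop is not
    // released until the method returns, so large inputs could overflow the stack.
    var buffer = ArrayPool<char>.Shared.Rent(charCount);
    try
    {
        var written = decoder.GetChars(span, buffer, isLast);
        sb.Append(buffer, 0, written);
    }
    finally
    {
        ArrayPool<char>.Shared.Return(buffer);
    }
}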

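From the docs:

The Decoder.GetChars method converts sequential blocks of bytes into sequential blocks of characters, in a manner similar to the GetChars method of this class. However, a Decoder maintains state information between calls so it correctly decodes byte sequences that span blocks. The Decoder also preserves trailing bytes at the end of data blocks and uses the trailing bytes in the next decoding operation. Therefore, GetDecoder and GetEncoder are useful for network transmission and file operations, because those operations often deal with blocks of data instead of a complete data stream.

Of course the StringBuilder introduces a new source of allocations, but if that is a problem you can replace it with some other kind of buffer.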

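Marc Gravell · 4 years ago

The first thing we should do here is test whether the sequence is actually a single span; if it is, we can hugely simplify and optimize.

Once we know that we have a multi-segment (discontiguous) buffer, there are two ways we can go:

linearize the segments into a contiguous buffer, probably leased from ArrayPool<byte>.Shared, and use UTF8.GetString on the correct portion of the leased buffer, or
use the GetDecoder() API on the encoding, and use it to populate a new string, which on older frameworks means overwriting a newly allocated string, or on newer frameworks means using the string.Create API

The first option is much simpler, but involves a few memory-copy operations (but no additional allocations other than the string):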

public static string GetString(in this ReadOnlySequence<byte> payload,
    Encoding encoding = null)
{
    encoding ??= Encoding.UTF8;
    return payload.IsSingleSegment ? encoding.GetString(payload.FirstSpan)
        : GetStringSlow(payload, encoding);

    static string GetStringSlow(in ReadOnlySequence<byte> payload, Encoding encoding)
    {
        // linearize
        int length = checked((int)payload.Length);
        var oversized = ArrayPool<byte>.Shared.Rent(length);
        try
        {
            payload.CopyTo(oversized);
            return encoding.GetString(oversized, 0, length);
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(oversized);
        }
    }
}
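
The second option is not shown above. What follows is a rough, untested sketch of how it might look on .NET Core 3.0 or later; it is not Marc Gravell's code. The names GetStringViaDecoder, FromChunks and Chunk are hypothetical, and FromChunks exists only to build a multi-segment sequence for experimenting. Because Decoder.GetCharCount does not update the decoder's state, the sketch counts characters in a first pass by decoding into a small scratch buffer with Decoder.Convert, then decodes a second time directly into the new string inside string.Create.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class SequenceDecoding
{
    public static string GetStringViaDecoder(ReadOnlySequence<byte> payload, Encoding encoding = null)
    {
        encoding ??= Encoding.UTF8;
        if (payload.IsSingleSegment) return encoding.GetString(payload.FirstSpan);

        // Pass 1: count the chars. Convert keeps partial characters in the
        // decoder between calls, so characters split across segments are
        // counted once. The decoded chars themselves are thrown away.
        var counter = encoding.GetDecoder();
        Span<char> scratch = stackalloc char[256];
        int charCount = 0;
        long remaining = payload.Length;
        foreach (var memory in payload)
        {
            var bytes = memory.Span;
            remaining -= bytes.Length;
            bool completed;
            do
            {
                counter.Convert(bytes, scratch, remaining == 0,
                    out int bytesUsed, out int charsUsed, out completed);
                charCount = checked(charCount + charsUsed);
                bytes = bytes.Slice(bytesUsed);
            } while (!completed);
        }

        // Pass 2: decode straight into the string's own memory.
        return string.Create(charCount, (payload, encoding), (chars, state) =>
        {
            var decoder = state.encoding.GetDecoder();
            long left = state.payload.Length;
            foreach (var memory in state.payload)
            {
                left -= memory.Length;
                int written = decoder.GetChars(memory.Span, chars, left == 0);
                chars = chars.Slice(written);
            }
        });
    }

    // Hypothetical demo helper: links byte arrays into a multi-segment sequence.
    public static ReadOnlySequence<byte> FromChunks(params byte[][] chunks)
    {
        if (chunks.Length == 0) return ReadOnlySequence<byte>.Empty;
        var first = new Chunk(chunks[0], 0);
        var last = first;
        for (int i = 1; i < chunks.Length; i++) last = last.Append(chunks[i]);
        return new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);
    }

    private sealed class Chunk : ReadOnlySequenceSegment<byte>
    {
        public Chunk(byte[] data, long runningIndex)
        {
            Memory = data;
            RunningIndex = runningIndex;
        }

        public Chunk Append(byte[] data)
        {
            var next = new Chunk(data, RunningIndex + Memory.Length);
            Next = next;
            return next;
        }
    }
}
```

This avoids both the intermediate byte copy and any buffer other than the string itself, at the price of decoding the data twice. Whether that trade is worth it compared with the ArrayPool version depends on the payload sizes involved, so measure before choosing.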

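Kind Contributor · 4 years ago

Warning: untested

I've made improvements on the accepted answer:

Packaged as an extension method
No StringBuilder needed, by pre-allocating an overestimated character array
No need for the extra GetCharCount step; a single large array is used, GetChars is called once per segment, and the destination span slice is moved along
Renamed some variables. preProcessedBytes matters to me in particular because, as I see it, the bytes are only processed once the characters have been derived from them.
Takes a stringLengthEstimate parameter, so it can be used with protocols that store the string length (as a character count) in a header before the UTF8 bytes

Here is the source: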

/// <summary>
/// Parses UTF8 characters in the ReadOnlySequence
/// </summary>
/// <param name="slice">Aligned slice of ReadOnlySequence that contains the UTF8 string bytes. Use slice before calling this function to ensure you have an aligned slice.</param>
/// <param name="stringLengthEstimate">The amount of characters in the final string. You should use a header before the string bytes for the best accuracy. If you are not sure -1 means that the most pessimistic estimate will be used: slice.Length</param>
/// <returns>a string parsed from the bytes in the ReadOnlySequence</returns>
public static string ParseAsUTF8String(this ReadOnlySequence<byte> slice, int stringLengthEstimate = -1)
{
    if (stringLengthEstimate == -1)
        stringLengthEstimate = checked((int)slice.Length); // overestimate: UTF8 decodes to at most one char per byte
    var decoder = Encoding.UTF8.GetDecoder();
    var preProcessedBytes = 0L;
    var processedCharacters = 0;
    // Rent the buffer (requires using System.Buffers) instead of using stackalloc,
    // which could overflow the stack for long strings.
    var characterBuffer = ArrayPool<char>.Shared.Rent(stringLengthEstimate);
    try
    {
        Span<char> characterSpan = characterBuffer;
        foreach (var memory in slice)
        {
            preProcessedBytes += memory.Length;
            var isLast = (preProcessedBytes == slice.Length); // flush on the final segment
            var emptyCharSlice = characterSpan.Slice(processedCharacters);
            var charCount = decoder.GetChars(memory.Span, emptyCharSlice, isLast);
            processedCharacters += charCount;
        }
        return new string(characterBuffer, 0, processedCharacters);
    }
    finally
    {
        ArrayPool<char>.Shared.Return(characterBuffer);
    }
}