
Run-Time Check Failure #2 only in MSVC debug builds when using utf8proc

  •  0
  • Atto Allas  · Tech Community  · 6 years ago

    For some UTF-8 related operations, I am using the C library utf8proc.

    Problem

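    When I build the Debug target with the latest MSVC 15 and run a test program that uses the code below (it does little more than print the result of this function), I get an error, namely: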

    [Some information about which exe file failed]

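    No other compiler (that I have tried) gives this error, and neither does the Release target; they all produce the correct output for anything I throw at them.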

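    First, the memory of codepoint sometimes seems to change at random once character has been encoded (hence the hacky mitigation of saving codepointCopy).

    Second, character, once encoded, sometimes has strange trailing characters. I assume this is because of uninitialized memory, but calling memset on character did not help. Is there something obvious I am missing? Hence the .substr(0, charSize), which has worked fine so far.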

    Code

    #include <string>
    
    #include "../include/utf8proc.h"
    
    std::string calculateUnicodeNormalization(const std::string &in, const std::string &mode) {
        auto pString = (const utf8proc_uint8_t*) in.c_str();
    
        utf8proc_uint8_t* pOutString;
        // These two functions are from c and use malloc to allocate memory, so I free with free()
        if (mode == "NFC") {
            pOutString = utf8proc_NFC(pString);
        } else {
            pOutString = utf8proc_NFD(pString);
        }
    
        // Converts to a string
        std::string retString = std::string((const char*) pOutString);
        // Frees what was allocated by malloc
        free(pOutString);
    
        return retString;
    }
    
    std::string removeAccents(const std::string &in) {
        std::string decomposedString = calculateUnicodeNormalization(in, "NFD");
        auto pDecomposedString = (const utf8proc_uint8_t*) decomposedString.c_str();
    
        size_t offset = 0;
        std::string rebuiltString;
        // Iterates through all of the characters, adding to the "offset" each time so the next character can be found
        while (true) {
            utf8proc_int32_t codepoint;
    
            // This function takes a pointer to a uint8_t array and writes the next unicode character's codepoint into codepoint.
            // The -1 means it reads up to 4 bytes (the max length of a utf-8 character).
            utf8proc_iterate(pDecomposedString + offset, -1, &codepoint);
    
            // Null terminator, end of string
            if (codepoint == 0) {
                break;
            }
    
            const utf8proc_int32_t codepointCopy = codepoint;
    
            utf8proc_uint8_t character;
            // This function takes a codepoint and puts the encoded utf-8 character into "character". It returns the bytes written.
            auto charSize = (size_t) utf8proc_encode_char(codepointCopy, &character);
    
            // I had been having some problems with trailing random characters (random unicode), but this seemed to fix it.
            // Could that have been related to the error?
            std::string realChar = std::string((const char*) &character).substr(0, charSize);
    
            // God knows why this is needed, but the above function call seems to somehow alter codepoint
            // Could be to do with the error?
            codepoint = codepointCopy;
    
            // Increments offset so the next character now would be read
            offset += charSize;
    
            // The actual useful part of the function: gets the category of the codepoint, and if it is Mark, Nonspacing (and not an iota subscript),
            // does not add it to the rebuilt string
            if ((utf8proc_category(codepoint) == UTF8PROC_CATEGORY_MN) && (codepoint != 0x0345)) {
                continue;
            }
    
            rebuiltString += realChar;
        }
    
        // Returns the composed form of the rebuilt string
        return calculateUnicodeNormalization(rebuiltString, "NFC");
    }
    

    #include <iostream>
    
    int main() {
        std::cout << removeAccents("ᾤκεον") << std::endl;
    }
    

    I expect a result to be printed.

    I am not quite sure what is going on, and I do not see any obvious memory errors (it otherwise seems to work fine), but given my inexperience I may well have missed something.

    1 answer  |  as of 4 years ago
        1
  •  2
  •   Alan Birtles    6 years ago
    utf8proc_uint8_t character;
    // This function takes a codepoint and puts the encoded utf-8 character into "character". It returns the bytes written.
    auto charSize = (size_t) utf8proc_encode_char(codepointCopy, &character);
    

    This writes up to 4 bytes into the single-byte variable character, corrupting your stack.
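As a minimal sketch of the fix, the destination can be a 4-byte buffer, which is large enough for the longest UTF-8 sequence. To keep the example self-contained, `encodeUtf8` below is a hypothetical stand-in for `utf8proc_encode_char` (it is not the library's implementation and does not reject surrogates), and `encodeOne` is likewise an illustrative helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for utf8proc_encode_char (not the library's code):
// writes the UTF-8 encoding of cp to dst and returns the byte count (1 to 4),
// or 0 if cp is out of range. dst must have room for 4 bytes.
std::size_t encodeUtf8(std::int32_t cp, std::uint8_t* dst) {
    if (cp < 0 || cp > 0x10FFFF) return 0;
    if (cp < 0x80) {
        dst[0] = static_cast<std::uint8_t>(cp);
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = static_cast<std::uint8_t>(0xC0 | (cp >> 6));
        dst[1] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = static_cast<std::uint8_t>(0xE0 | (cp >> 12));
        dst[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = static_cast<std::uint8_t>(0xF0 | (cp >> 18));
    dst[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    return 4;
}

// Encodes one code point into a buffer that can hold any UTF-8 sequence,
// then copies exactly the bytes that were written.
std::string encodeOne(std::int32_t cp) {
    std::uint8_t buffer[4];  // four bytes, not one
    const std::size_t size = encodeUtf8(cp, buffer);
    assert(size <= sizeof buffer);
    return std::string(reinterpret_cast<const char*>(buffer), size);
}
```

With utf8proc itself, the equivalent change would be to declare `utf8proc_uint8_t character[4];` and pass `character` (rather than `&character`) to `utf8proc_encode_char`.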

     std::string((const char*) &character).substr(0, charSize);
    

    It would be more efficient and less sloppy (&character is not guaranteed to be null-terminated) to write:

     std::string((const char*) &character, charSize);
    

    Or, even better:

     rebuiltString.append((const char*) &character, charSize);
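The length-taking overloads copy exactly the requested number of bytes and never look for a terminator, which is why they avoid the stray trailing characters described in the question. Here is a small sketch of that behaviour; `appendBytes` and `demo` are hypothetical names used only for illustration, and the buffer contents are made up.

```cpp
#include <cstddef>
#include <string>

// Appends exactly `size` bytes from `bytes` to `out`.
// No null terminator is required or searched for.
std::string& appendBytes(std::string& out, const unsigned char* bytes, std::size_t size) {
    return out.append(reinterpret_cast<const char*>(bytes), size);
}

std::string demo() {
    // Two bytes of UTF-8 (U+03C9) followed by leftover junk and no '\0'.
    const unsigned char buffer[4] = {0xCF, 0x89, 'X', 'Y'};
    std::string rebuilt = "a";
    appendBytes(rebuilt, buffer, 2);  // only the first two bytes are copied
    return rebuilt;
}
```

Here `demo()` returns a 3-byte string: the initial `a` plus the two encoded bytes. The junk after them is never read.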