
How to correctly apply tolower() to German uppercase letters (Ä, Ö, Ü, ẞ) in C++

character-encoding stl c++

BugShotGG · 6 years ago

I am a bit confused, so I opened a question, and I would like to be more specific here.

I have a lot of files containing German letters, most of them encoded as iso-8859-15 or UTF-8. To process them, all letters must be converted to lowercase.

For example, I have a file (encoded as iso-8859-15) that contains:

[German-language excerpt from the file, garbled in extraction. Recoverable fragments: "Das sogen."; "Baukunst" (S. 496); Palfrey's "History of New England"; 1670; "Vgl."; "Jahrbüchern der"; "königlichen Gesellschaft"; "nordische Altertumskunde"; Kopenhagen; 1887; 296.]


The text Ää Öö Üü ẞß Örebro should become: ää öö üü ßß örebro .

However, tolower() does not seem to work for uppercase letters such as Ä, Ö, Ü and ẞ, even though I tried forcing the locale as described in this SO post

Here is the code I posted in the other question:

std::vector<std::string> tokens;
std::string filename = "10223-8.txt";
//std::string filename = "test-UTF8.txt";
std::ifstream inFile;

//std::setlocale(LC_ALL, "en_US.iso88591");
//std::setlocale(LC_ALL, "de_DE.iso88591");
//std::setlocale(LC_ALL, "en_US.iso88591");
//std::locale::global(std::locale(""));

inFile.open(filename);
if (!inFile) { std::cerr << "Failed to open file" << std::endl; exit(1); }

std::string s = "";
std::string line;
while( (inFile.good()) && std::getline(inFile, line) ) {
    s.append(line + "\n");
}
inFile.close();

std::cout << s << std::endl;

//std::setlocale(LC_ALL, "de_DE.iso88591");
for (unsigned int i = 0; i < s.length(); ++i) {
    if (std::ispunct(s[i]) || std::isdigit(s[i]))
            s[i] = ' ';
    if (std::isupper(s[i]))
            s[i] = std::tolower(s[i]);
            //s[i] = std::tolower(s[i]);
            //s[i] = std::tolower(s[i], std::locale("de_DE.utf8"))
}

std::cout << s << std::endl;

//tokenize string
std::istringstream iss(s);
tokens.clear();
tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};

//PROCESS TOKENS...

It is really frustrating to work with <locale> .

So, apart from the main problem with my code, I have a few more questions:

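Do I also have to apply some kind of custom locale in the other functions ( isupper() , ispunct() , …)?
Do I need the de_DE locale enabled or installed in my Linux env to process the characters of the string correctly?
Is it safe to process text the same way, as a std::string , when it was read from files with different encodings (iso-8859-15 or UTF-8)?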

EDIT: Konrad Rudolph's answer only works for UTF-8 files. It does not work for iso-8859-15, which brings me back to the initial question posted here: How to apply functions on text files with different encoding in c++
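For the iso-8859-15 case mentioned in the edit, one option that does not depend on any installed locale is a small hand-written mapping over the raw bytes. The following is only a sketch under that assumption: the helper names latin9_lower_byte and latin9_lower are hypothetical and not part of the original post, and it covers iso-8859-15 only, so it must not be applied to UTF-8 data. Note the cast to unsigned char, which avoids passing negative values for bytes above 0x7F.

```cpp
#include <string>

// Hypothetical helper: lowercase a single ISO-8859-15 byte without any locale.
unsigned char latin9_lower_byte(unsigned char c) {
    if (c >= 'A' && c <= 'Z')                    // plain ASCII letters
        return static_cast<unsigned char>(c + ('a' - 'A'));
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)     // À..Þ, skipping the × sign
        return static_cast<unsigned char>(c + 0x20);
    switch (c) {
        case 0xA6: return 0xA8;                  // Š -> š
        case 0xB4: return 0xB8;                  // Ž -> ž
        case 0xBC: return 0xBD;                  // Œ -> œ
        case 0xBE: return 0xFF;                  // Ÿ -> ÿ
        default:   return c;                     // ß (0xDF) and everything else stay as-is
    }
}

// Hypothetical helper: lowercase a whole ISO-8859-15 encoded string.
std::string latin9_lower(std::string s) {
    for (char& ch : s)
        ch = static_cast<char>(latin9_lower_byte(static_cast<unsigned char>(ch)));
    return s;
}
```

With this, the bytes for Ä Ö Ü (0xC4, 0xD6, 0xDC) become ä ö ü (0xE4, 0xF6, 0xFC). UTF-8 files would still need a separate path, for example decoding to wide characters first.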

1 answer

Konrad Rudolph · 6 years ago

Use std::ctype::tolower , not std::tolower :

#include <iostream>
#include <locale>
#include <string>

int main() {
    std::locale::global(std::locale("de_DE.UTF-8"));
    std::wcout.imbue(std::locale());
    auto& f = std::use_facet<std::ctype<wchar_t>>(std::locale());
    std::wstring str = L"Ää Öö Üü ẞß Örebro";
    f.tolower(&str[0], &str[0] + str.size());
    std::wcout << "'" << str << "'\n";
}

You can also create a local locale (heh) instead of setting the global one:

std::locale loc("de_DE.UTF-8");
std::wcout.imbue(loc);
auto& f = std::use_facet<std::ctype<wchar_t>>(loc);

This compiles and works. On my system it correctly converts the umlauts, but it fails to handle the capital ẞ (which, to be honest, is not surprising).
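If the capital ẞ (U+1E9E) must also end up as ß (U+00DF), one possible workaround is to patch it up by hand after calling the facet. This is a minimal sketch, not something the standard library provides: the name lower_capital_sharp_s is hypothetical, and it assumes the text is already held in a std::wstring.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: replace capital sharp s (U+1E9E) with small sharp s (U+00DF).
// Intended to run after std::ctype<wchar_t>::tolower, which may leave U+1E9E untouched.
std::wstring lower_capital_sharp_s(std::wstring str) {
    std::replace(str.begin(), str.end(), L'\u1E9E', L'\u00DF');
    return str;
}
```

Because this is still a 1-to-1 replacement, it leaves the length of the string unchanged.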

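Also, note the limitation of this function: it can only perform 1-to-1 character conversions. In earlier versions of the Unicode standard, the correct uppercase conversion of ß was SS. std::ctype::toupper categorically does not support this.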
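To illustrate the 1-to-1 limitation, here is a rough sketch of an uppercase conversion that expands ß to SS by hand and delegates every other character to the facet. The function name to_upper_expanding is hypothetical. The default argument uses the classic "C" locale, which can only be relied on for ASCII letters, so a German locale such as the loc object above would have to be passed in to get the umlauts converted as well.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: uppercase a wide string, expanding ß (U+00DF) to "SS".
// All other characters go through the ctype facet of the given locale.
std::wstring to_upper_expanding(const std::wstring& in,
                                const std::locale& loc = std::locale::classic()) {
    const auto& f = std::use_facet<std::ctype<wchar_t>>(loc);
    std::wstring out;
    out.reserve(in.size());
    for (wchar_t c : in) {
        if (c == L'\u00DF')
            out += L"SS";          // 1-to-2 mapping the facet cannot express
        else
            out += f.toupper(c);   // ordinary 1-to-1 mapping
    }
    return out;
}
```

The result can be longer than the input, which is exactly why an in-place, per-character interface like std::ctype::toupper cannot express this mapping.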