
How to find the encoding of a file via a script on Unix

encoding unix shell file

232

Manglu · 15 years ago

I need to find the encoding of all files in a directory. Is there a way to find the encoding used?

The file command ...

The encoding I am interested in is ISO 8859-1.

15 answers | latest 3 years ago

503

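Peter Mortensen user1284631 3 years ago

It sounds like you are looking for enca. It can guess encodings and even convert between them. Just look at the man page.

Failing that, use file -i (Linux) or file -I (OS X). That outputs MIME-type information for the file, which also includes the character-set encoding. I found a man page for it, too :)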

madu 12 years ago

file -bi <file name>

find . | grep -Ev Eliminate | while IFS= read -r f; do echo "$f -- $(file -bi "$f")"; done

qwert2003 3 years ago

uchardet - an encoding detector library ported from Mozilla.

~> uchardet file.java
UTF-8

Various Linux distributions (Debian, Ubuntu, openSUSE, Pacman, etc.) provide binaries.

not2qubit 6 years ago

encguess:

$ encguess test.txt
test.txt  US-ASCII

Peter Mortensen user1284631 3 years ago

Here is an example script using file -I and iconv. It works on Mac OS X.

The script uses iconv and mv:

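#!/bin/bash
# 2016-02-08
# Check the encoding of each .java file and convert ISO 8859-1 files to UTF-8
for f in *.java
do
  # file -I prints "name: type; charset=value"; keep only the charset value
  encoding=$(file -I "$f" | cut -f 2 -d";" | cut -f 2 -d=)
  case $encoding in
    iso-8859-1)
      # Replace the original only if the conversion succeeded
      iconv -f iso-8859-1 -t utf-8 "$f" > "$f.utf8" && mv "$f.utf8" "$f"
      ;;
  esac
done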

Peter Mortensen user1284631 3 years ago

To convert the encoding from ISO 8859-1 to ASCII:

iconv -f ISO_8859-1 -t ASCII filename.txt

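Peter Mortensen user1284631 3 years ago

It is really hard to determine whether a file is ISO 8859-1. If your text contains only 7-bit characters, it could also be ISO 8859-1, but you cannot know. If it contains 8-bit characters, the upper-range characters also exist in other encodings. You would therefore have to use a dictionary to get a better guess at which word is meant, and determine from that which letter it must be. Finally, if you detect that the file might be UTF-8, then you can be sure it is not ISO 8859-1.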
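The reasoning above can be sketched in Python as a coarse three-way classification. This is only an illustration under assumptions: the function name classify_bytes and the label strings are made up here, and the "8-bit" result means "some single-byte encoding, possibly ISO 8859-1", not a positive identification.

```python
def classify_bytes(data: bytes) -> str:
    """Return a coarse guess: 'ascii', 'utf-8', or '8-bit'.

    'ascii'  : only 7-bit bytes, so the file is also valid ISO 8859-1,
               but nothing distinguishes it from other encodings.
    'utf-8'  : contains non-ASCII bytes that form valid UTF-8 sequences,
               which is very unlikely for real ISO 8859-1 text.
    '8-bit'  : contains high bytes that are not valid UTF-8; it could be
               ISO 8859-1 or any other single-byte encoding.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "8-bit"
```

Telling ISO 8859-1 apart from its 8-bit relatives still needs a language-level heuristic such as the dictionary approach described above.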

Peter Mortensen user1284631 3 years ago

With Python, you can use the chardet module.
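A minimal sketch of what that could look like. It assumes the third-party chardet package is installed (for example via pip); the wrapper name guess_encoding is hypothetical, and the result is a statistical guess with a confidence value, not a certainty.

```python
from typing import Optional, Tuple

try:
    import chardet  # third-party package; assumed to be installed
except ImportError:
    chardet = None


def guess_encoding(data: bytes) -> Optional[Tuple[Optional[str], float]]:
    """Return (encoding, confidence) as guessed by chardet.

    Returns None when chardet is not available. The encoding itself
    may be None if chardet cannot make a guess.
    """
    if chardet is None:
        return None
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


if __name__ == "__main__":
    with open(__file__, "rb") as fh:
        print(guess_encoding(fh.read()))
```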

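Peter Mortensen user1284631 3 years ago

This is not something you can do in a foolproof way. One possibility is to examine every character in the file to make sure it contains no characters in the ranges 0x00 - 0x1f or 0x7f - 0x9f. But, as I said, that may hold for any number of files, including files in at least one other variant of ISO 8859.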
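A rough Python sketch of that byte-range check. One deviation is an assumption of this sketch and not part of the answer: tab, line feed and carriage return are allowed, since ordinary text files contain them. The names are illustrative.

```python
# Assumption of this sketch: tab, LF and CR are normal in text files,
# so they are exempt from the 0x00-0x1f control range.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def has_suspicious_bytes(data: bytes) -> bool:
    """True if data contains bytes that printable ISO 8859-1 text should not."""
    for b in data:
        if b in ALLOWED_CONTROLS:
            continue
        if b <= 0x1F or 0x7F <= b <= 0x9F:
            return True
    return False
```

A False result only means the file is compatible with ISO 8859-1; it does not rule out other ISO 8859 variants.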

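Another possibility is to look for specific words in the file, in all of the supported languages, and see whether you can find them.

So, for example, find the equivalents of the English "and", "but", "to", "of" and so on in all the languages supported by ISO 8859-1, and see whether they occur in large numbers in the file.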

English   French
-------   ------
of        de, du
and       et
the       le, la, les

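A literal translation like the table above is possible, but that is not what I mean. I mean common words in the target language (for all I know, Icelandic has no word for "and"; you would probably have to use their word for "fish" [sorry, that is a little stereotypical; I mean no offense, just illustrating a point]).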
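The common-word idea could be sketched as follows. The word lists are just the handful of English and French words mentioned above, far too small for real use, and the function name and scoring (share of words that are common words) are assumptions made for illustration.

```python
import re

# Tiny illustrative word lists taken from the examples above.
COMMON_WORDS = {
    "english": {"of", "and", "the", "to", "but"},
    "french": {"de", "du", "et", "le", "la", "les"},
}


def score_languages(data: bytes, encoding: str = "iso-8859-1") -> dict:
    """Decode data with the candidate encoding and return, per language,
    the fraction of words that appear in that language's common-word list."""
    text = data.decode(encoding, errors="replace").lower()
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    total = len(words) or 1  # avoid division by zero on empty input
    return {
        lang: sum(word in vocab for word in words) / total
        for lang, vocab in COMMON_WORDS.items()
    }
```

If no language scores noticeably above zero for a candidate encoding, that is a hint (not proof) that the candidate is wrong.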

Peter Mortensen user1284631 3 years ago

Use this command:

for f in `find .`; do echo `file -i "$f"`; done

If there are spaces in the file names, use:

IFS=$'\n'
for f in `find .`; do echo `file -i "$f"`; done

Remember that this changes IFS for your current Bash session, so word splitting on spaces stays disabled until you restore it.

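wkschwartz 12 years ago

I know you are interested in a more general answer, but what works for ASCII usually works for other encodings too. Here is a Python one-liner to determine whether standard input is ASCII. (I'm pretty sure this works in Python 2, but I have only tested it on Python 3.)

python -c 'from sys import exit,stdin;exit()if 128>max(c for l in open(stdin.fileno(),"rb") for c in l) else exit("Not ASCII")' < myfile.txt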
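Since the question is about every file in a directory, the same ASCII test can be sketched as a small Python 3 script that walks a tree. The function names and the chunk size are illustrative choices, not part of the answer above.

```python
import os


def is_ascii_file(path: str, chunk_size: int = 65536) -> bool:
    """True if every byte in the file is below 128 (empty files count as ASCII)."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return True
            if max(chunk) >= 128:
                return False


def non_ascii_files(directory: str) -> list:
    """Return the sorted paths of all files under directory that are not pure ASCII."""
    result = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if not is_ascii_file(path):
                result.append(path)
    return sorted(result)


if __name__ == "__main__":
    for path in non_ascii_files("."):
        print(path)
```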

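Peter Mortensen user1284631 3 years ago

If the files are XML, the XML declaration inside them names the encoding:

<?xml version="1.0" encoding="ISO-8859-1" ?>

So you can use regular expressions (e.g., with Perl) to check every file for such a specification.

More information can be found here: How to Determine Text File Encoding.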
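The answer suggests Perl; for illustration, here is the same idea as a minimal Python sketch. The pattern and the function name are assumptions of this sketch. It only inspects the first 200 bytes, does not handle a byte-order mark, and reports what the file declares, which is not necessarily how it is actually encoded.

```python
import re
from typing import Optional

# Matches an XML declaration at the start of the data and captures
# the value of its encoding pseudo-attribute.
XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    match = XML_DECL.match(data[:200])
    return match.group(1).decode("ascii") if match else None
```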

Peter Mortensen user1284631 3 years ago

In PHP, you can check it as follows:

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

Trying every encoding returned by mb_list_encodings():

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

Note that the mb_* functions require php-mbstring:

apt-get install php-mbstring

Peter Mortensen user1284631 3 years ago

Find all files that match FILTER and are in SRC_ENCODING
Create a backup of them
Convert them to DST_ENCODING

#!/bin/bash -xe

SRC_ENCODING="iso-8859-1"
DST_ENCODING="utf-8"
FILTER="*.java"

echo "Find all files that match the encoding $SRC_ENCODING and filter $FILTER"
FOUND_FILES=$(find . -iname "$FILTER" -exec file -i {} \; | grep "$SRC_ENCODING" | grep -Eo '^.*\.java')

for FILE in $FOUND_FILES ; do
    ORIGINAL_FILE="$FILE.$SRC_ENCODING.bkp"
    echo "Backup original file to $ORIGINAL_FILE"
    mv "$FILE" "$ORIGINAL_FILE"

    echo "converting $FILE from $SRC_ENCODING to $DST_ENCODING"
    iconv -f "$SRC_ENCODING" -t "$DST_ENCODING" "$ORIGINAL_FILE" -o "$FILE"
done

echo "Deleting backups"
find . -iname "*.$SRC_ENCODING.bkp" -exec rm {} \;
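The backup-then-convert step of the script above can also be sketched in Python with only the standard library. This is an illustration under assumptions: the function name convert_file is made up, it copies the original to the backup name and overwrites the original in place, and it trusts the caller to have already established the source encoding (for example from file -i).

```python
import os
import shutil


def convert_file(path: str, src: str = "iso-8859-1", dst: str = "utf-8",
                 keep_backup: bool = True) -> str:
    """Back up path as '<path>.<src>.bkp', then rewrite path in the dst encoding.

    Returns the backup path. newline="" keeps the original line endings.
    """
    backup = f"{path}.{src}.bkp"
    shutil.copy2(path, backup)
    with open(backup, "r", encoding=src, newline="") as fin:
        text = fin.read()
    with open(path, "w", encoding=dst, newline="") as fout:
        fout.write(text)
    if not keep_backup:
        os.remove(backup)
    return backup
```

Keeping the backup by default mirrors the script, which deletes the backups only as a final, separate step.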

Daniel Faure 6 years ago

You can extract the encoding of a single file with the file command. I have a sample.html file:

$ file sample.html

sample.html: HTML document, UTF-8 Unicode text, with very long lines

$ file -b sample.html

HTML document, UTF-8 Unicode text, with very long lines

$ file -bi sample.html

$ file -bi sample.html  | awk -F'=' '{print $2 }'

utf-8
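The awk step splits on "=" and takes the second field. For scripts that post-process this output in Python, a slightly more defensive parser might look like the sketch below. The function name is hypothetical, and the strings in the usage note are illustrative inputs in the "type/subtype; charset=value" shape that file -i prints, not captured output.

```python
from typing import Optional


def charset_from_mime(line: str) -> Optional[str]:
    """Extract the charset parameter from a MIME string such as
    'type/subtype; charset=value'. Returns None if no charset is present."""
    _, _, params = line.partition(";")
    for param in params.split(";"):
        key, _, value = param.strip().partition("=")
        if key.lower() == "charset" and value:
            return value
    return None
```

For example, charset_from_mime("text/html; charset=utf-8") yields "utf-8", while a string without a charset parameter yields None.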

Peter Mortensen user1284631 3 years ago

In Cygwin:

find -type f -name "<FILENAME_GLOB>" | while read <VAR>; do (file -i "$<VAR>"); done

Example:

find -type f -name "*.txt" | while read file; do (file -i "$file"); done

You can pipe that to AWK and build an iconv command for each file.

Example:

find -type f -name "*.txt" | while read file; do (file -i "$file"); done | awk -F[:=] '{print "iconv -f "$3" -t utf8 \""$1"\" > \""$1"_utf8\""}' | bash

-3

manu_v 12 years ago

With Perl, use Encode::Detect.