
Column-wise string conversion based on unique values

2d numpy string arrays python

user8188120 · 6 years ago

Is there a way in Python to replace the string values in the columns of a 2D array with ordered numbers?

For example, suppose you have a 2D array:

a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
a
Out[57]: 
array([['A', '0', 'C'],
       ['A', '0.3', 'B'],
       ['D', '1', 'D']], dtype='<U3')

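If I want to replace the string values 'A', 'A', 'D' in the first column with the numbers 0, 0, 1, and 'C', 'B', 'D' with 0, 1, 2, is there an efficient way to do it?

It may help to know the following:

The replacement numbers in different columns are independent of each other. That is, each column in which strings are replaced with numbers starts at 0 and counts up through that column's unique values.
The above is a test case; the real data is much larger, with more string columns.

Here is an example approach to the problem that I came up with quickly: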

for j in range(a.shape[1]):
    b = list(set(a[:,j]))  # unique values in column j (set order is arbitrary)
    for i in range(len(b)):
        indices = np.where(a[:,j]==b[i])[0]
        print(indices)
        a[indices,j]=i

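However, this seems like an inefficient way to do it. It also cannot tell float values from string values in a column, and by default it replaces the values with numeric strings: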

a
Out[91]: 
array([['1.0', '0.0', '2.0'],
       ['1.0', '1.0', '0.0'],
       ['0.0', '2.0', '1.0']], dtype='<U3')
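
The string output above follows from how NumPy treats fixed-width string arrays: anything assigned into such an array is stored as a string. Below is a minimal sketch of that behavior; the array `demo` and the value 7 are made up for illustration.

```python
import numpy as np

# A small string array; NumPy infers a fixed-width unicode dtype.
demo = np.array([['A', 'C'], ['D', 'B']])

# Assigning an integer into a string array stores its string form.
demo[0, 0] = 7
print(repr(demo[0, 0]), demo.dtype.kind)

# Writing the codes into a separate integer array avoids the coercion.
codes = np.zeros(demo.shape, dtype=int)
codes[0, 0] = 7
print(codes.dtype.kind)
```

This is why the answers below either write into a new integer array or call astype(int) afterwards.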

Any help with this would be greatly appreciated!

2 replies | as of 6 years ago

dataista · 6 years ago

It looks like you want to do label encoding.

I can think of two options: pandas.factorize and sklearn.preprocessing.LabelEncoder.

Using `LabelEncoder`

import numpy as np
from sklearn.preprocessing import LabelEncoder

# np.int was removed in NumPy 1.24; the builtin int is equivalent here
b = np.zeros_like(a, int)
for column in range(a.shape[1]):
    b[:, column] = LabelEncoder().fit_transform(a[:, column])

Then b will be:

array([[0, 0, 1],
       [0, 1, 0],
       [1, 2, 2]])
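
If scikit-learn is not available, the same sorted numbering can probably be reproduced with NumPy alone. The following is a sketch using np.unique with return_inverse=True, which is a substitute for LabelEncoder and not part of the answer's own method. Note that the values are sorted as strings, so numeric strings are ordered as text.

```python
import numpy as np

a = np.array([['A', 0, 'C'], ['A', 0.3, 'B'], ['D', 1, 'D']])

b = np.zeros(a.shape, dtype=int)
for column in range(a.shape[1]):
    # inverse holds, for each element, its index in the sorted unique values
    _, inverse = np.unique(a[:, column], return_inverse=True)
    b[:, column] = inverse.ravel()

print(b)
```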

If you want to be able to get back to the original values, you need to keep the encoders. You can do it like this:

from sklearn.preprocessing import LabelEncoder

encoders = {}
b = np.zeros_like(a, int)  # np.int was removed in NumPy 1.24
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    b[:, column] = encoders[column].fit_transform(a[:, column])

Now encoders[0].classes_ will contain:

array(['A', 'D'], dtype='<U3')

This means that "A" is mapped to 0 and "D" to 1.
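
As a sketch of why keeping the encoders is useful, the snippet below decodes the integer matrix back to the original strings with LabelEncoder.inverse_transform. The name `restored` is only illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

a = np.array([['A', 0, 'C'], ['A', 0.3, 'B'], ['D', 1, 'D']])

# Encode each column and keep its encoder.
encoders = {}
b = np.zeros_like(a, dtype=int)
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    b[:, column] = encoders[column].fit_transform(a[:, column])

# Decode each column with the encoder that produced it.
restored = np.empty_like(a)
for column in range(a.shape[1]):
    restored[:, column] = encoders[column].inverse_transform(b[:, column])

print(restored)
```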

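Finally, if you do the encoding by overwriting a instead of using a new matrix b, you will get the integers as strings ("1" instead of 1). You can fix that with astype(int):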

encoders = {}
for column in range(a.shape[1]):
    encoders[column] = LabelEncoder()
    a[:, column] = encoders[column].fit_transform(a[:, column])

# At this point, a will have strings instead of ints because a had type str
# array([['0', '0', '1'],
#       ['0', '1', '0'],
#       ['1', '2', '2']], dtype='<U3')

a = a.astype(int)

# Now `a` is of type int
# array([[0, 0, 1],
#        [0, 1, 0],
#        [1, 2, 2]])

Using `pd.factorize`

factorize returns both the encoded column and the encoding mapping, so if you don't care about the mapping you can skip saving it:

import pandas as pd

for column in range(a.shape[1]):
    a[:, column], _ = pd.factorize(a[:, column]) # Drop mapping

a = a.astype(int) # same as above, it's of type str
# factorize numbers values in order of first appearance, so a is
# array([[0, 0, 0],
#        [0, 1, 1],
#        [1, 2, 2]])
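
One difference is worth checking: pd.factorize numbers values in order of first appearance by default, while LabelEncoder numbers them in sorted order. The sketch below compares the default with sort=True, which should reproduce the sorted numbering. The names `by_appearance` and `by_sorted` are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([['A', 0, 'C'], ['A', 0.3, 'B'], ['D', 1, 'D']])

by_appearance = np.zeros(a.shape, dtype=int)
by_sorted = np.zeros(a.shape, dtype=int)
for column in range(a.shape[1]):
    # Default: codes follow the order in which values first appear.
    by_appearance[:, column], _ = pd.factorize(a[:, column])
    # sort=True: codes follow the sorted order of the unique values.
    by_sorted[:, column], _ = pd.factorize(a[:, column], sort=True)

print(by_appearance)
print(by_sorted)
```

The first-appearance order matches what the question asks for ('C', 'B', 'D' becoming 0, 1, 2).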

If you want to keep the encoding mappings:

mappings = []
for column in range(a.shape[1]):
    a[:, column], mapping = pd.factorize(a[:, column])
    mappings.append(mapping)

a = a.astype(int)

Now mappings[0] will hold the following data:

array(['A', 'D'], dtype=object)

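It has the same semantics as encoders[0].classes_ in the sklearn LabelEncoder solution.

tel · 6 years ago

You can do what you want in an efficient way.

Basically, iterate over the values in each column of the input while keeping track of the letters seen so far in a set or dict. This is similar to what you already have, but slightly more efficient (for one thing, it avoids calling np.where).

Here is a function charToIx that does what you need: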

import numpy as np
from collections import defaultdict
from string import ascii_letters

class Ix:
    def __init__(self):
        self._val = 0

    def __call__(self):
        val = self._val
        self._val += 1
        return val

def charToIx(arr, dtype=None, out=None):
    if dtype is None:
        dtype = arr.dtype

    if out is None:
        out = np.zeros(arr.shape, dtype=dtype)

    for incol,outcol in zip(arr.T, out.T):
        ix = Ix()
        cixDict = defaultdict(lambda: ix())
        for i,x in enumerate(incol):
            if x in cixDict or x in ascii_letters:
                outcol[i] = cixDict[x]
            else:
                outcol[i] = x

    return out

You specify the type of the output array when calling the function. So the output of:

a = np.array([['A',0,'C'],['A',0.3,'B'],['D',1,'D']])
print(charToIx(a, dtype=float))

will be a float array:

array([[0. , 0. , 0. ],
       [0. , 0.3, 1. ],
       [1. , 1. , 2. ]])
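
The `x in ascii_letters` test above is a substring check, so it is mainly suited to single-letter labels. For longer labels, one possible variation is to try float() on each value and treat anything that fails to parse as a label. This is only a sketch: the helper name `encode_columns` and the sample labels are hypothetical, and strings such as 'nan' or 'inf' would be parsed as numbers.

```python
import numpy as np

def encode_columns(arr):
    """Hypothetical helper: number non-numeric strings per column by first appearance."""
    out = np.zeros(arr.shape, dtype=float)
    for incol, outcol in zip(arr.T, out.T):
        seen = {}  # label -> index, local to this column
        for i, x in enumerate(incol):
            try:
                outcol[i] = float(x)  # numeric strings are kept as numbers
            except ValueError:
                # setdefault assigns the next free index the first time a label is seen
                outcol[i] = seen.setdefault(x, len(seen))
    return out

a = np.array([['cat', 0, 'C'], ['cat', 0.3, 'B'], ['dog', 1, 'D']])
result = encode_columns(a)
print(result)
```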
