
NumPy implementation of the nearest neighbor classification algorithm classifies everything exactly the same way

  •  0
  • EJoshuaS - Stand with Ukraine  · Tech Community  · 6 years ago

    My training and testing data are both in CSV files that look like this:

    4.6,3.6,1.0,0.2,Iris-setosa
    5.1,3.3,1.7,0.5,Iris-setosa
    4.8,3.4,1.9,0.2,Iris-setosa
    7.0,3.2,4.7,1.4,Iris-versicolor
    6.4,3.2,4.5,1.5,Iris-versicolor
    6.9,3.1,4.9,1.5,Iris-versicolor
    5.5,2.3,4.0,1.3,Iris-versicolor
    

    I know how to do the basic algorithm. Here is the C# I wrote for it:

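    using System;
    using System.Collections.Generic;
    using Microsoft.VisualBasic.FileIO; // provides TextFieldParser

    namespace Project_3_Prototype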
    {
        public class FourD
        {
            public double f1, f2, f3, f4;
    
            public string name;
    
            public static double Distance(FourD a, FourD b)
            {
                double squared = Math.Pow(a.f1 - b.f1, 2) + Math.Pow(a.f2 - b.f2, 2) + Math.Pow(a.f3 - b.f3, 2) + Math.Pow(a.f4 - b.f4, 2);
    
                return Math.Sqrt(squared);
            }
        }
    
        class Program
        {
            static void Main(string[] args)
            {
                List<FourD> distances = new List<FourD>();
    
                using (var parser = new TextFieldParser("iris-training-data.csv"))
                {
                    parser.SetDelimiters(",");
    
                    while (!parser.EndOfData)
                    {
                        string[] fields = parser.ReadFields();
    
                        var curr = new FourD
                        {
                            f1 = double.Parse(fields[0]),
                            f2 = double.Parse(fields[1]),
                            f3 = double.Parse(fields[2]),
                            f4 = double.Parse(fields[3]),
                            name = fields[4]
                        };
    
                        distances.Add(curr);
                    }
                }
    
                double correct = 0, total = 0;
    
                using (var parser = new TextFieldParser("iris-testing-data.csv"))
                {
                    parser.SetDelimiters(",");
    
                    int i = 1;
    
                    while (!parser.EndOfData)
                    {
                        total++;
                        string[] fields = parser.ReadFields();
    
                        var curr = new FourD
                        {
                            f1 = double.Parse(fields[0]),
                            f2 = double.Parse(fields[1]),
                            f3 = double.Parse(fields[2]),
                            f4 = double.Parse(fields[3]),
                            name = fields[4]
                        };
    
                        FourD min = distances[0];
    
                        foreach (FourD comp in distances)
                        {
                            if (FourD.Distance(comp, curr) < FourD.Distance(min, curr))
                            {
                                min = comp;
                            }
                        }
    
                        if (min.name == curr.name)
                        {
                            correct++;
                        }
    
                        Console.WriteLine(string.Format("{0},{1},{2}", i, curr.name, min.name));
    
                        i++;
                    }
                }
    
                Console.WriteLine("Accuracy: " + correct / total);
    
                Console.ReadLine();
            }
        }
    }
    

    # The format is Number,Correct label,Predicted Label
    1,Iris-setosa,Iris-setosa
    2,Iris-setosa,Iris-setosa
    3,Iris-setosa,Iris-setosa
    4,Iris-setosa,Iris-setosa
    5,Iris-setosa,Iris-setosa
    6,Iris-setosa,Iris-setosa
    7,Iris-setosa,Iris-setosa
    8,Iris-setosa,Iris-setosa
    9,Iris-setosa,Iris-setosa
    10,Iris-setosa,Iris-setosa
    11,Iris-setosa,Iris-setosa
    12,Iris-setosa,Iris-setosa
    ...
    
    Accuracy: 0.946666666666667
    

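    I'm trying to do the same thing in NumPy. However, the assignment doesn't allow me to use for loops, only vectorized functions.

    So, basically, what I want to do is: for each row in the testing data, get the index of the row in the training data that is closest to it (i.e. has the minimum Euclidean distance).

    Here is what I tried in Python: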

    import numpy as np
    
    def main():    
        # Split each line of the CSV into a list of attributes and labels
        data = [x.split(',') for x in open("iris-training-data.csv")]
    
        # The last item is the label
        labels = np.array([x[-1].rstrip() for x in data])
    
        # Convert the first 3 items to a 2D array of floats
        floats = np.array([x[0:3] for x in data]).astype(float)
    
        classifyTrainingExamples(labels, floats)
    
    def classifyTrainingExamples(labels, floats):
        # We're basically doing the same thing to the testing data that we did to the training data
        testingData = [x.split(',') for x in open("iris-testing-data.csv")]
    
        testingLabels = np.array([x[-1].rstrip() for x in testingData])
    
        testingFloats = np.array([x[0:3] for x in testingData]).astype(float)
    
        res = np.apply_along_axis(lambda x: closest(floats, x), 1, testingFloats)
    
        correct = 0
    
        for number, index in enumerate(res):    
            if labels[index] == testingLabels[number]:
                correct += 1
    
            print("{},{},{}".format(number + 1, testingLabels[number], labels[index]))
    
            number += 1
    
        print(correct / len(list(res)))
    
    def closest(otherArray, item):
        res = np.apply_along_axis(lambda x: distance(x, item), 1, otherArray)
    
        i = np.argmin(res)
    
        return i
    
    # Get the Euclidean distance between two "flat" lists (i.e. one particular row each)
    def distance(a, b):
        # Subtract one from the other elementwise, then raise each one to the power of 2
        lst = (a - b) ** 2
    
        # Sum all of the elements together, and take the square root
        result = np.sqrt(lst.sum())
    
        return result
    
    main()
    

    Unfortunately, the output looks like this:

    1,Iris-setosa,Iris-setosa
    2,Iris-setosa,Iris-setosa
    3,Iris-setosa,Iris-setosa
    4,Iris-setosa,Iris-setosa
    ....
    74,Iris-setosa,Iris-setosa
    75,Iris-setosa,Iris-setosa
    0.93333333
    

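    Every line just shows Iris-setosa for the label, and the accuracy is 0.9333333.

    Every iteration is counted as correct by the if statement (but the accuracy still shows as 0.93333).

    So basically:

    • It's showing every result as "correct" (when it clearly isn't).
    • It's showing Iris-setosa for every label.
    • My percentage is showing as 93%. The correct value is actually about 94%, but I would expect it to show 100%, given that every result is supposedly "correct".

    Can anyone help me see what I'm missing?

    Before anyone asks: for the record, yes, I did try stepping through this with a debugger :) Also for the record, yes, this is homework.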

    1 answer  |  as of 6 years ago
  •  2
  •   Abby    6 years ago

    If you really want to do it in one line, here is what you can do (I loaded the dataset from scikit-learn):

    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    
    # Load dataset
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    # Split training and test set
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
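    # 1-nearest neighbour: for each test row, take the label of the closest training row
    ypred = np.array([ytrain[np.argmin(np.sum((x-Xtrain)**2,axis=1))] for x in Xtest])
    # Compute classification error (fraction of misclassified test rows)
    print(sum(ypred != ytest) / len(ytest))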
    

    # k-nearest neighbour: majority vote among the k closest training rows
    k = 3
    ypredk = np.array([np.argmax(np.bincount(ytrain[np.argsort(np.sum((x-Xtrain)**2,axis=1))[0:k]])) for x in Xtest])
    print(sum(ypredk != ytest) / len(ytest))
    

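    In other words, sort the distances and take the indices of the k smallest values (that's the np.argsort part), look up the corresponding labels, and then pick the most common label among those k (that's the np.argmax(np.bincount(x)) part).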
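
    To make the two steps concrete, here is a small sketch for a single test point. The distances and labels are made-up toy values (not taken from the Iris data), and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical toy data: squared distances from one test point to six
# training points, and the integer class label of each training point.
sq_dists = np.array([4.0, 0.5, 2.0, 0.1, 9.0, 0.3])
train_labels = np.array([0, 1, 2, 1, 0, 2])
k = 3

# np.argsort returns the indices that would sort the distances in
# ascending order; the first k are the k nearest neighbours.
nearest_idx = np.argsort(sq_dists)[:k]

# Labels of those k neighbours.
nearest_labels = train_labels[nearest_idx]

# np.bincount counts how often each non-negative integer label occurs;
# np.argmax picks the most frequent one (the lowest label wins ties).
votes = np.bincount(nearest_labels)
predicted = np.argmax(votes)

print(nearest_idx, nearest_labels, votes, predicted)
```

    Note that np.bincount only accepts non-negative integers, so this works with iris.target but would need the string labels from the CSV to be encoded as integers first.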

    Finally, if you want to be sure, you can compare against scikit-learn:

    # scikit-learn NN
    from sklearn import neighbors
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, algorithm='ball_tree')
    knn.fit(Xtrain,ytrain)
    ypred_sklearn = knn.predict(Xtest)
    print(sum(ypred_sklearn != ytest) / len(ytest))
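
    The list comprehensions above still iterate in Python, which may not satisfy the "no for loops" rule in the question. Below is a minimal sketch of a 1-nearest-neighbour step that relies only on broadcasting, assuming that is acceptable for the assignment. The function name nearest_neighbor_labels is hypothetical, and the arrays are a handful of rows copied from the CSV excerpt in the question rather than the full files.

```python
import numpy as np

def nearest_neighbor_labels(train_X, train_y, test_X):
    """Predict a label for each test row from its single nearest training row."""
    # Broadcasting: (n_test, 1, d) - (1, n_train, d) -> (n_test, n_train, d)
    diffs = test_X[:, np.newaxis, :] - train_X[np.newaxis, :, :]
    # Squared Euclidean distances, shape (n_test, n_train). The square root
    # is skipped because it does not change which distance is smallest.
    sq_dists = (diffs ** 2).sum(axis=2)
    # Index of the closest training row for every test row.
    nearest = sq_dists.argmin(axis=1)
    return train_y[nearest]

# Rows copied from the sample CSV in the question (all four features).
train_X = np.array([[4.6, 3.6, 1.0, 0.2],
                    [5.1, 3.3, 1.7, 0.5],
                    [7.0, 3.2, 4.7, 1.4],
                    [6.4, 3.2, 4.5, 1.5]])
train_y = np.array(["Iris-setosa", "Iris-setosa",
                    "Iris-versicolor", "Iris-versicolor"])
test_X = np.array([[4.8, 3.4, 1.9, 0.2],
                   [6.9, 3.1, 4.9, 1.5]])
test_y = np.array(["Iris-setosa", "Iris-versicolor"])

pred = nearest_neighbor_labels(train_X, train_y, test_X)
accuracy = np.mean(pred == test_y)
print(pred, accuracy)
```

    The intermediate array has n_test × n_train × d elements, which is tiny for Iris but can get large for bigger datasets. This sketch uses all four feature columns, as the C# version does, whereas the Python in the question slices x[0:3] and so keeps only three; that may be part of why the two accuracies differ.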