
How to predict multiple dependent columns from one independent column

  •  0
  • Juned Ansari  ·  3 years ago

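    Is it possible to predict multiple dependent columns from a single independent column?

    Problem statement: I have to predict 5 factors (cEXT, cNEU, cAGR, cCON, cOPN) based on the STATUS column, so the only input variable is STATUS and the target variables are cEXT, cNEU, cAGR, cCON, and cOPN.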


    In this data, STATUS is the independent column and cEXT, cNEU, cAGR, cCON, and cOPN are the dependent columns. How can I predict them?

    # independent and dependent variable split
    X = df[['STATUS']]
    y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
    

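    Right now I predict only one column at a time and repeat the same steps 5 times, so I end up creating 5 models for the 5 target variables.

    Code: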

    X = df[['STATUS']]
    y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
    
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)
    
    
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    
    ct = ColumnTransformer([
        ('step1', TfidfVectorizer(), 'STATUS')
    ],remainder='drop')
    
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
    from sklearn import metrics 
    from sklearn.pipeline import Pipeline
    
    # ########## 
    # RandomForest
    # ##########
    model = Pipeline([
            ('column_transformers', ct),
            # max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
            ('model', RandomForestClassifier(criterion = 'gini', n_estimators=100, n_jobs = -1, class_weight = 'balanced', max_features = 'sqrt')),
        ])
    
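    # creating 5 models, can I create 1 model?
    # Pipeline.fit returns the pipeline itself, so each target gets its own clone;
    # without clone() all five names would point to one object fitted on cOPN only.
    from sklearn.base import clone
    
    model_cEXT = clone(model).fit(X_train, y_train['cEXT'])
    model_cNEU = clone(model).fit(X_train, y_train['cNEU'])
    model_cAGR = clone(model).fit(X_train, y_train['cAGR'])
    model_cCON = clone(model).fit(X_train, y_train['cCON'])
    model_cOPN = clone(model).fit(X_train, y_train['cOPN'])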
    
    2 answers
        1
  •  3
  •   Zalak Bhalani    3 years ago

    You can use MultiOutputClassifier from scikit-learn.

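    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    
    # Reuse the question's ColumnTransformer (ct) so the raw STATUS text is vectorized first
    clf = Pipeline([
        ('column_transformers', ct),
        ('model', MultiOutputClassifier(RandomForestClassifier())),
    ]).fit(X_train, y_train)
    clf.predict(X_test)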
    

    See the official documentation of MultiOutputClassifier.
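    As a rough, self-contained sketch of the same idea, the snippet below trains one pipeline on two targets at once. The toy DataFrame, its two target columns, and the names `toy`, `multi_model`, and `per_target` are made up for illustration; only the STATUS column name and the overall pipeline shape come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the question's df.
toy = pd.DataFrame({
    "STATUS": [
        "love big parties",
        "worried about exams",
        "party with friends tonight",
        "anxious and tired",
        "great night out",
        "stressed about work",
    ],
    "cEXT": [1, 0, 1, 0, 1, 0],
    "cNEU": [0, 1, 0, 1, 0, 1],
})
targets = ["cEXT", "cNEU"]

# Vectorize the text column, then fit one forest per target behind a single estimator.
ct = ColumnTransformer([("tfidf", TfidfVectorizer(), "STATUS")], remainder="drop")
multi_model = Pipeline([
    ("column_transformers", ct),
    ("model", MultiOutputClassifier(RandomForestClassifier(n_estimators=20, random_state=0))),
])
multi_model.fit(toy[["STATUS"]], toy[targets])

# predict returns one column per target, in the order the targets were passed to fit.
pred = multi_model.predict(toy[["STATUS"]])

# The fitted per-target forests remain accessible if a single target needs inspecting.
per_target = dict(zip(targets, multi_model.named_steps["model"].estimators_))
```

    MultiOutputClassifier still fits one estimator per target internally, so this mainly saves bookkeeping: a single fit and predict call replaces five separately managed models.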

        2
  •  1
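  •   Abhishek Prajapat    3 years ago

    The scikit-multilearn library is very good for these tasks. There are several ways to do multi-label classification, such as PowerSet, ClassifierChain, and so on, and they are all well covered in this library.

    Here is an example of how it would replace your current code.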

    X = df[['STATUS']]
    y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]]
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)
    
    # Rest of your code (ct, Pipeline, RandomForestClassifier as in the question)
    # ==========================
    # The new code
    
    from skmultilearn.problem_transform import BinaryRelevance
    from scipy.sparse import csr_matrix
    
    
    
    classifier = BinaryRelevance(
        # max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers
        classifier = RandomForestClassifier(criterion = 'gini', n_estimators=100, n_jobs = -1, class_weight = 'balanced', max_features = 'sqrt'),
        require_dense = [False, True]
    )
    
    model = Pipeline([
            ('column_transformers', ct),
            ('classifier', classifier),
        ])
    
    model.fit(X_train, y_train.values)
    res = model.predict(X_test)
    res = csr_matrix(res)
    res.todense()
    

    You can explore other approaches here.
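    To make the difference between these problem-transformation approaches concrete, here is a small pure-Python sketch of how binary relevance and a label power set reshape the same label matrix before any classifier is trained. The label matrix and both helper functions are hypothetical illustrations, not scikit-multilearn's API.

```python
# Hypothetical label matrix: one row per sample, one column per trait.
Y = [
    (1, 0, 1),
    (0, 1, 1),
    (1, 0, 1),
    (0, 0, 0),
]

def binary_relevance_targets(Y):
    """Binary relevance: one binary target vector per label column."""
    return [list(column) for column in zip(*Y)]

def label_powerset_targets(Y):
    """Label power set: each distinct label combination becomes one class id."""
    class_ids = {}
    encoded = [class_ids.setdefault(tuple(row), len(class_ids)) for row in Y]
    return encoded, class_ids

# Three independent binary problems, one per column.
per_label = binary_relevance_targets(Y)

# One multi-class problem whose classes are the observed label combinations.
encoded, class_ids = label_powerset_targets(Y)
```

    Binary relevance ignores correlations between labels but scales linearly with their number. The power set captures correlations, but with 5 binary labels it can produce up to 32 classes, some of which may have very few training samples.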

    In TensorFlow, you can use a sigmoid activation on the output layer and a binary cross-entropy (binaryCE) loss over all output units, as follows:

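    import tensorflow as tf
    from sklearn.model_selection import train_test_split
    
    # TextVectorization lives at tf.keras.layers in TF 2.6+; older releases expose it
    # under tf.keras.layers.experimental.preprocessing with output_mode='tf-idf'
    tfidf_calculator = tf.keras.layers.TextVectorization(
                      standardize = 'lower_and_strip_punctuation',
                      split       = 'whitespace',
                      max_tokens  = 100,
                      output_mode = 'tf_idf',
                      pad_to_max_tokens=False)
    
    # The column is named 'STATUS' in the question's DataFrame
    tfidf_calculator.adapt(df['STATUS'].values)
    
    tfids = tfidf_calculator(df['STATUS'].values)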
    
    X = tfids.numpy()
    y = df[["cEXT","cNEU","cAGR","cCON","cOPN"]].values
    
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=5)
    
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(100,)),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(5, activation='sigmoid')
    ])
    
    model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())
    
    model.fit(X_train, y_train, epochs=20, batch_size=32)
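    To show what the sigmoid output layer and the binary cross-entropy loss compute for one sample, here is a plain-Python sketch. The logits and true labels are invented numbers, and the 0.5 decision threshold is a common default rather than something fixed by the model above.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the output units of one sample."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical raw outputs of the last layer for cEXT, cNEU, cAGR, cCON, cOPN.
logits = [2.0, -1.0, 0.0, 3.0, -2.0]
y_true = [1, 0, 1, 1, 0]

# Each unit gets its own independent probability; they need not sum to 1.
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(y_true, probs)

# Turn probabilities into 0/1 predictions, one decision per trait.
labels = [int(p >= 0.5) for p in probs]
```

    Because each unit is scored independently, any subset of the five traits can be predicted as positive for the same status, which is what distinguishes this setup from a softmax output.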
    

    One thing to note with TensorFlow is that you need a dense matrix as input. There may be a way to use sparse input, but I have not found one.
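    If the features come from a sparse source (for example a scikit-learn TF-IDF matrix), one possible workaround is to densify a batch at a time rather than the whole matrix. The sketch below assumes a SciPy CSR matrix; `dense_batches` is a hypothetical helper and the data is made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

def dense_batches(X_sparse, y, batch_size):
    """Yield (dense_features, labels) pairs, converting only one batch at a time."""
    for start in range(0, X_sparse.shape[0], batch_size):
        stop = start + batch_size
        yield X_sparse[start:stop].toarray(), y[start:stop]

# Hypothetical sparse features (5 samples, 3 features) and binary labels (2 targets).
X_dense = np.array([
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0],
    [0.0, 4.0, 0.0],
    [5.0, 0.0, 0.0],
])
X_sparse = csr_matrix(X_dense)
y = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]])

batches = list(dense_batches(X_sparse, y, batch_size=2))
```

    Each yielded pair could then be passed to a Keras model one batch at a time, for instance with `train_on_batch`, so the full matrix never has to be held in memory as a dense array.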