Skip to content

中文长文本分类、短句子分类、多标签分类、两句子相似度(Chinese Text Classification of Keras NLP, multi-label classify, or sentence classify, long or short),字词句向量嵌入层(embeddings)和网络层(graph)构建基类,FastText,TextCNN,CharCNN,TextRNN, RCNN, DCNN, DPCNN, VDCNN, CRNN, Bert, Xlnet, Albert, Attention, DeepMoji, HAN, 胶囊网络-CapsuleNet, Transformer-encode, Seq2seq, SWEM, LEAM, TextGCN

License

Notifications You must be signed in to change notification settings

yongzhuo/Keras-TextClassification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PyPI Build Status PyPI_downloads Stars Forks Join the chat at https://gitter.im/yongzhuo/Keras-TextClassification

Install(安装)

pip install Keras-TextClassification
step2: download and unzip the dir of 'data.rar', 地址: 链接https://pan.baidu.com/s/1pIDzGaGXCZ7cjng1XU_kPA   提取码w6ps   压缩包密码: 2022
       cover the dir of data to anaconda, like '/anaconda/3.5.1/envs/tensorflow13/Lib/site-packages/keras_textclassification/data'
step3: goto # Train&Usage(调用) and Predict&Usage(调用)

keras_textclassification(代码主体,未完待续...)

- Electra-fineture(todo)
- Albert-fineture
- Xlnet-fineture
- Bert-fineture
- FastText
- TextCNN
- charCNN
- TextRNN
- TextRCNN
- TextDCNN
- TextDPCNN
- TextVDCNN
- TextCRNN
- DeepMoji
- SelfAttention
- HAN
- CapsuleNet
- Transformer-encode
- SWEM
- LEAM
- TextGCN(todo)

run(运行, 以FastText为例)

- 1. 进入keras_textclassification/m01_FastText目录,
- 2. 训练: 运行 train.py,   例如: python train.py
- 3. 预测: 运行 predict.py, 例如: python predict.py
- 说明: 默认不带pre train的random embedding,训练和验证语料只有100条,完整语料移步下面data查看下载

run(多标签分类/Embedding/test/sample实例)

- bert,word2vec,random样例在test/目录下, 注意word2vec(char or word), random-word,  bert(chinese_L-12_H-768_A-12)未全部加载,需要下载
- multi_multi_class/目录下以text-cnn为例进行多标签分类实例,转化为multi-onehot标签类别,分类则取一定阀值的类
- sentence_similarity/目录下以bert为例进行两个句子文本相似度计算,数据格式如data/sim_webank/目录下所示
- predict_bert_text_cnn.py
- tet_char_bert_embedding.py
- tet_char_bert_embedding.py
- tet_char_xlnet_embedding.py
- tet_char_random_embedding.py
- tet_char_word2vec_embedding.py
- tet_word_random_embedding.py
- tet_word_word2vec_embedding.py

keras_textclassification/data

- 数据下载
  ** github项目中只是上传部分数据,需要的前往链接: https://pan.baidu.com/s/1I3vydhmFEQ9nuPG2fDou8Q 提取码: rket
- baidu_qa_2019(百度qa问答语料,只取title作为分类样本,17个类,有一个是空'',已经压缩上传)
   - baike_qa_train.csv
   - baike_qa_valid.csv
- byte_multi_news(今日头条2018新闻标题多标签语料,1070个标签,fate233爬取, 地址为: [byte_multi_news](https://github.com/fate233/toutiao-multilevel-text-classfication-dataset))
   -labels.csv
   -train.csv
   -valid.csv
- embeddings
   - chinese_L-12_H-768_A-12/(取谷歌预训练好点的模型,已经压缩上传,
                              keras-bert还可以加载百度版ernie(需转换,[https://github.com/ArthurRizar/tensorflow_ernie](https://github.com/ArthurRizar/tensorflow_ernie)),
                              哈工大版bert-wwm(tf框架,[https://github.com/ymcui/Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm))
   - albert_base_zh/(brightmart训练的albert, 地址为https://github.com/brightmart/albert_zh)
   - chinese_xlnet_base_L-12_H-768_A-12/(哈工大预训练的中文xlnet模型[https://github.com/ymcui/Chinese-PreTrained-XLNet],12层)
   - term_char.txt(已经上传, 项目中已全, wiki字典, 还可以用新华字典什么的)
   - term_word.txt(未上传, 项目中只有部分, 可参考词向量的)
   - w2v_model_merge_short.vec(未上传, 项目中只有部分, 词向量, 可以用自己的)
   - w2v_model_wiki_char.vec(已上传百度网盘, 项目中只有部分, 自己训练的维基百科字向量, 可以用自己的)
- model
   - fast_text/预训练模型存放地址

项目说明

    1. 构建了base基类(网络(graph)、向量嵌入(词、字、句子embedding)),后边的具体模型继承它们,代码简单
    1. keras_layers存放一些常用的layer, conf存放项目数据、模型的地址, data存放数据和语料, data_preprocess为数据预处理模块,

模型与论文paper题与地址

参考/感谢

训练简单调用:

from keras_textclassification import train
train(graph='TextCNN', # 必填, 算法名, 可选"ALBERT","BERT","XLNET","FASTTEXT","TEXTCNN","CHARCNN",
                       # "TEXTRNN","RCNN","DCNN","DPCNN","VDCNN","CRNN","DEEPMOJI",
                       # "SELFATTENTION", "HAN","CAPSULE","TRANSFORMER"
     label=17,         # 必填, 类别数, 训练集和测试集合必须一样
     path_train_data=None, # 必填, 训练数据文件, csv格式, 必须含'label,ques'头文件, 详见keras_textclassification/data
     path_dev_data=None, # 必填, 测试数据文件, csv格式, 必须含'label,ques'头文件, 详见keras_textclassification/data
     rate=1,             # 可填, 训练数据选取比例
     hyper_parameters=None) # 可填, json格式, 超参数, 默认embedding为'char','random'

Reference

For citing this work, you can refer to the present GitHub project. For example, with BibTeX:

@misc{Keras-TextClassification,
    howpublished = {\url{https://github.com/yongzhuo/Keras-TextClassification}},
    title = {Keras-TextClassification},
    author = {Yongzhuo Mo},
    publisher = {GitHub},
    year = {2019}
}

*希望对你有所帮助!

About

中文长文本分类、短句子分类、多标签分类、两句子相似度(Chinese Text Classification of Keras NLP, multi-label classify, or sentence classify, long or short),字词句向量嵌入层(embeddings)和网络层(graph)构建基类,FastText,TextCNN,CharCNN,TextRNN, RCNN, DCNN, DPCNN, VDCNN, CRNN, Bert, Xlnet, Albert, Attention, DeepMoji, HAN, 胶囊网络-CapsuleNet, Transformer-encode, Seq2seq, SWEM, LEAM, TextGCN

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy