前言

思路

因为数据是爬虫爬下来的，具体内容被写入到了excel表里，所以文本预处理分为2块。一个是从excel中获取数据，然后去掉文本中所有的html标签，最后整理成txt文档中一行一条评论的形式。另一个是对文本去停用词、分词，处理成一个词一个空格的形式，便于word2vec训练模型。

代码与解释

pre_process_format.py

import re
import os
from openpyxl import load_workbook


def read_from_xlsx(path):
    wb = load_workbook(path)
    ws = wb[wb.sheetnames[0]]
    rows = ws.max_row
    cols = ws.max_column
    for row in range(2, rows + 1):
        with open("书评\\format\\" + ws.cell(row, 1).value + ".txt", 'w',
                  encoding='utf-8') as book_file:
            # book_file.write(ws.cell(row, 1).value + "\n")
            contents = []
            for col in range(7, cols + 1):
                content = ws.cell(row, col).value
                if str(content) == 'None':
                    continue
                content = str(re.sub("<[^>]+>", " ", content))
                content = str(re.sub("\n", " ", content))
                # print(content)
                contents.append(content + '\n')
            book_file.writelines(contents)


if __name__ == '__main__':
    xlsxBase = "书评\\xlsx\\"
    xlsxs = os.listdir(xlsxBase)
    for xlsx in xlsxs:
        read_from_xlsx(xlsxBase+xlsx)

代码使用了openpyxl包，主要是读取数据，因为excel第一行是表头，第二行第7列开始才是评论主体，所以行号列号需要稍微规定一下。

读取一个cell的数据，使用正则表达式"<[^>]+>"替换所有html标签为空格，然后替换所有换行为空格，最后写入文本时，在末尾加上换行符即可。

pre_process_segment.py

import os
import time
import jieba.posseg as pseg


def seg_book(book_base, book_name, outfile_path):
    infile = open(book_base + book_name, 'r', encoding='utf-8')
    outfile = open(outfile_path + "seg_" + book_name, 'w', encoding='utf-8')
    for line in infile:
        line = line.strip()
        # print(line)
        words = pseg.cut(line)
        for word, flag in words:
            if flag.startswith('x'):
                continue
            if word in cn_stopwords_set | en_stopwords_set:
                continue
            outfile.write(word + ' ')
        outfile.write('\n')
    outfile.close()
    infile.close()


if __name__ == '__main__':
    cn_stopwords_file = open("util\\stopwords_csdn.txt", 'r', encoding='utf-8')
    en_stopwords_file = open("util\\stopwords_google.txt", 'r',
                             encoding='utf-8')
    cn_stopwords_set = set(cn_stopwords_file.read().splitlines())
    en_stopwords_set = set(en_stopwords_file.read().splitlines())
    start = time.time()
    infileBase = "书评\\format\\"
    books = os.listdir(infileBase)
    for book in books:
        print(book + " 分词中...")
        seg_book(infileBase, book, "书评\\seg\\")
    # seg_book(infileBase, "追风筝的人.txt", "书评\\seg\\")
    end = time.time()
    print("共计用时: %d seconds" % (end - start))

做的是原始数据的去停用词和分词处理，使用了jieba分词，去掉了标点，停用词使用的google英文停用词和csdn某博客提供的中文停用词。

后期会考虑使用tf-idf来动态去除停用词。

总结

其实后来训练了word2vec模型发现，很多结果不尽人意，比如“中”这个字没有去除掉，而这个字单独出现意味着它表示英文中的 in，应当放入停用词当中。

是的，停用词表不一定简单的使用别人列出来的，知乎上查到的比较合理的做法：去除其中很常见的停用词，然后使用tf-idf或者人工筛选去除另一部分。因为我们是每一本书一个模型来迭代获取书评标签，所以没办法为每一本书人工筛选，后期再使用tf-idf筛选一波吧。先做出一个快速原型才是重中之重。

陈小睿同学的技术博客和闲聊杂谈。

书蕴笔记-0-文本预处理

前言

思路

代码与解释

pre_process_format.py

pre_process_segment.py

总结

欢迎分享