Gensim Core Ideas

Gensim Core Concepts

Document

In Gensim, a document is an object of a text sequence type (like str in Python 3).

document = "Human machine interface for lab abc computer applications"

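Corpus

A corpus is a collection of document objects. Corpora serve two roles in Gensim:

  1. Training a model. During training, the model uses the training corpus to look for common themes and topics, and initializes its internal parameters from them.
  2. Organizing documents. After training, a topic model can extract topics from new documents (documents not seen in the training corpus). Such a corpus can be indexed for Similarity Queries, compared by semantic similarity, clustered, and so on.

An example corpus of 9 documents, where each document is a string consisting of a single sentence: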

text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

Code is hosted at: https://github.com/zhuozhuo233/Gensim-Core-Notes

Remove stop words and words that appear only once in the corpus, and tokenize the documents:

from collections import defaultdict

# Stop words, stored as a set
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split on whitespace, and drop stop words
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]

# Count word frequencies
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# Keep only words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
print(processed_corpus)

[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]

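Next, we want to associate each word in the corpus with a unique integer id.

from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

Because our example corpus is small, the dictionary holds only 12 distinct tokens. For larger corpora, dictionaries containing hundreds of thousands of tokens are quite common.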

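Vector

To infer the latent structure in our corpus, we need a representation of documents that can be manipulated mathematically. One approach is to represent each document as a feature vector.

Inspect the 12 unique ids in the dictionary:

print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}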
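To make the id assignment concrete, here is a minimal pure-Python sketch that reproduces the mapping printed above. It assumes ids are handed out document by document, with the unseen tokens of each document taken in alphabetical order; that assumption matches this output but is not taken from Gensim's source. The helper name build_token2id is hypothetical and not part of Gensim.

```python
def build_token2id(tokenized_docs):
    """Assign consecutive integer ids to tokens (hypothetical helper)."""
    token2id = {}
    for doc in tokenized_docs:
        # Assumption: unseen tokens of a document get ids in alphabetical order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


processed_corpus = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

token2id = build_token2id(processed_corpus)
print(token2id)
```

The point of the sketch is only that a dictionary is a plain token-to-integer mapping; the real class also collects statistics such as document frequencies.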

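Suppose we want to vectorize a phrase that was not in the original corpus. The dictionary's doc2bow method creates the bag-of-words representation for the phrase, returning a sparse representation of the word counts:

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)

[(0, 1), (1, 1)]
# (id in the dictionary, number of occurrences)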
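What doc2bow does can be approximated in a few lines of plain Python. The sketch below is a stand-in under stated assumptions, not Gensim's implementation: doc2bow_sketch is a hypothetical name, and the token2id mapping is a hand-copied subset of the dictionary shown earlier.

```python
from collections import Counter


def doc2bow_sketch(tokens, token2id):
    """Return a sparse bag-of-words: sorted (token_id, count) pairs.

    Tokens that are missing from token2id are silently ignored.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hand-copied subset of the token-to-id mapping (assumption for this sketch).
token2id = {'computer': 0, 'human': 1, 'interface': 2, 'system': 5, 'eps': 8}

new_vec = doc2bow_sketch("Human computer interaction".lower().split(), token2id)
doc3_vec = doc2bow_sketch(['system', 'human', 'system', 'eps'], token2id)
print(new_vec)
print(doc3_vec)
```

The word "interaction" is dropped because it has no id, and "system" gets a count of 2 in the second vector because it occurs twice.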

[Figure: one-hot vectors versus sparse vectors]
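The difference between a full-length (dense) vector and a sparse one can be sketched as a pair of conversions. The function names to_dense and to_sparse are illustrative only; the vector is the bag-of-words for "System and human system engineering testing of EPS" over the 12-word dictionary.

```python
def to_dense(sparse_vec, num_features):
    """Expand (id, value) pairs into a full-length list, zeros elsewhere."""
    dense = [0] * num_features
    for token_id, value in sparse_vec:
        dense[token_id] = value
    return dense


def to_sparse(dense_vec):
    """Keep only the non-zero positions as (id, value) pairs."""
    return [(i, v) for i, v in enumerate(dense_vec) if v != 0]


sparse = [(1, 1), (5, 2), (8, 1)]
dense = to_dense(sparse, num_features=12)
print(dense)
print(to_sparse(dense))
```

With only 12 features the saving is small, but with a dictionary of hundreds of thousands of tokens almost every position of the dense form would be zero.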

Convert the entire original corpus to a list of vectors:

bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
print(bow_corpus)

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]

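Model

Now that the corpus is vectorized, we can start transforming it with models. Once documents are vectors, a model can be thought of as a transformation from one vector space to another. The model learns the details of this transformation while it reads the training corpus.

A simple tf-idf example:

from gensim import models

# Train the model
tfidf = models.TfidfModel(bow_corpus)

# Transform the string "system minors"
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])

[(5, 0.5898341626740045), (11, 0.8075244024440723)]
# (token id, tf-idf weight)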
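The two weights above can be reproduced by hand. The sketch below assumes the weighting is term frequency times log2(N / document frequency), followed by scaling the vector to unit length; this scheme reproduces the printed numbers, but the helper names document_frequencies and tfidf_sketch are hypothetical and the code is not Gensim's implementation.

```python
import math

bow_corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def document_frequencies(corpus):
    """Count in how many documents each token id occurs."""
    dfs = {}
    for doc in corpus:
        for token_id, _count in doc:
            dfs[token_id] = dfs.get(token_id, 0) + 1
    return dfs


def tfidf_sketch(bow, dfs, num_docs):
    """Weight each count by log2(N / df), then normalize to unit length."""
    weighted = [(tid, tf * math.log2(num_docs / dfs[tid]))
                for tid, tf in bow if tid in dfs]
    norm = math.sqrt(sum(w * w for _tid, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []


dfs = document_frequencies(bow_corpus)
result = tfidf_sketch([(5, 1), (11, 1)], dfs, len(bow_corpus))
print(result)  # roughly [(5, 0.59), (11, 0.81)]
```

"system" (id 5) occurs in 3 of the 9 documents and "minors" (id 11) in only 2, so the rarer word ends up with the larger weight.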

Once the model is created, you can do all sorts of cool things with it. For example, transform the whole corpus via tf-idf and index it, in preparation for similarity queries:

from gensim import similarities

index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)

Compare a new query document against every document in the corpus:

query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))

[(0, 0.0), (1, 0.32448703), (2, 0.41707572), (3, 0.7184812), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]

A quick sort makes the similarity scores easier to read:

for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)

3 0.7184812
2 0.41707572
1 0.32448703
0 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0

Document 3 has the highest score. Compare the two directly:

Query document:
'system engineering'

Document 3:
"System and human system engineering testing of EPS"

Corpora and Vector Spaces

From Strings to Vectors

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

Process the original corpus: remove stop words and words that appear only once.

from pprint import pprint  # pretty-printer
from collections import defaultdict

# Remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# Remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

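There are many ways to convert documents to vectors; here we use the bag-of-words model. In this representation, each document is represented by one vector, and each vector element represents a question-answer pair of the following form:

  • Question: How many times does the word system appear in the document?
  • Answer: Once.

The questions are represented only by their integer ids. The mapping between the questions and the ids is called a dictionary.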

from gensim import corpora

dictionary = corpora.Dictionary(texts)
dictionary.save('./deerwester.dict')  # store the dictionary for later use
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

The gensim.corpora.dictionary.Dictionary class assigns a unique integer id to every word in the corpus. It does this by scanning the texts and collecting the relevant statistics. In the end, the processed corpus contains 12 distinct words, which means each document will be represented by 12 numbers (that is, a 12-dimensional vector).

Look at the mapping between words and ids:

print(dictionary.token2id)

{'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

Convert a tokenized document to a vector:

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored

[(0, 1), (1, 1)]
# (id in the dictionary, number of occurrences)

doc2bow() simply counts the occurrences of each distinct word, converts the word to its integer id, and returns the result as a sparse vector.

corpus = [dictionary.doc2bow(text) for text in texts]
corpora.MmCorpus.serialize('./deerwester.mm', corpus)  # store to disk, for later use
print(corpus)

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]

Compare the vectors with the tokenized texts, document by document:

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

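Corpus Formats

There are several file formats for serializing a vector space corpus to disk. Gensim's streaming corpus interface is lazy: it works on documents one at a time rather than on the whole corpus at once.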
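The idea of a lazy, streamed corpus can be illustrated without Gensim. The class below is a hypothetical sketch (StreamingCorpus is not a Gensim class): it reads a text file one line per document and yields a bag-of-words vector for each, so only one document is in memory at a time. The small token2id mapping and the temporary file are assumptions made for the demonstration.

```python
import os
import tempfile


class StreamingCorpus:
    """Yield one bag-of-words vector per line of a text file (hypothetical class)."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # The file is opened on every iteration, so the corpus can be re-read.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = {}
                for token in line.lower().split():
                    if token in self.token2id:
                        token_id = self.token2id[token]
                        counts[token_id] = counts.get(token_id, 0) + 1
                yield sorted(counts.items())


token2id = {'computer': 0, 'human': 1, 'interface': 2}

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write("Human machine interface for lab abc computer applications\n")
    tmp.write("Human computer interaction\n")

corpus = StreamingCorpus(tmp.name, token2id)
vectors = list(corpus)  # forces a full pass; normally you would just iterate
for vector in corpus:   # a second pass re-reads the file
    print(vector)
os.unlink(tmp.name)
```

Because the object only needs to support iteration, the same pattern works whether the documents come from a file, a database cursor, or a network stream.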

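The Matrix Market format is one of the more notable file formats. To save a corpus in this format:

# A tiny corpus of two documents; the second one is deliberately empty
corpus = [[(1, 0.5)], []]
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)

Open the .mm file to see what it contains:

%%MatrixMarket matrix coordinate real general
2 2 1
1 2 0.5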
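How the three lines of the file relate to the corpus can be sketched with a small writer. This is an illustration of the coordinate layout only, under the assumption that rows are documents and columns are term ids, both 1-based; to_matrix_market is a hypothetical helper, not the serializer Gensim uses.

```python
def to_matrix_market(corpus):
    """Render a bag-of-words corpus as Matrix Market coordinate text."""
    num_docs = len(corpus)
    # Number of columns = highest term id + 1 (0 if the corpus has no entries).
    num_terms = 1 + max((tid for doc in corpus for tid, _ in doc), default=-1)
    entries = [
        (doc_no + 1, tid + 1, value)  # the file format counts from 1
        for doc_no, doc in enumerate(corpus)
        for tid, value in doc
    ]
    lines = [
        "%%MatrixMarket matrix coordinate real general",
        f"{num_docs} {num_terms} {len(entries)}",
    ]
    lines += [f"{row} {col} {value}" for row, col, value in entries]
    return "\n".join(lines)


text = to_matrix_market([[(1, 0.5)], []])
print(text)
```

The second line reads "2 documents, 2 features, 1 non-zero entry", and the third line says that document 1 holds the value 0.5 for term 2, which is term id 1 in zero-based counting.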

Save in other formats:

corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)

Load a corpus iterator from a Matrix Market file:

corpus = corpora.MmCorpus('/tmp/corpus.mm')

Corpus objects are streams, so printing one directly usually shows only a summary:

print(corpus)

MmCorpus(2 documents, 2 features, 1 non-zero entries)

One way to print the contents is to load the whole corpus into memory:

print(list(corpus))  # list() converts any sequence to a plain Python list

[[(1, 0.5)], []]

Topics and Transformations

Creating the Corpus

from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

print(corpus)

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]

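Creating a Transformation

from gensim import models

tfidf = models.TfidfModel(corpus)  # step 1 -- initialize the model

Different transformations may require different initialization parameters. For tf-idf, training consists of computing the document frequencies of all features in the corpus. Other models, such as Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA), involve much more work and take more time to train.

Transforming Vectors

From now on, tfidf is treated as a read-only object that can convert any vector from the old representation (bag-of-words integer counts) to the new representation (tf-idf real-valued weights):

doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])  # step 2 -- use the model to transform vectors

[(0, 0.7071067811865476), (1, 0.7071067811865476)]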

Apply the transformation to the whole corpus:

corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)

[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.3244870206138555), (6, 0.44424552527467476), (7, 0.3244870206138555)]
[(2, 0.5710059809418182), (5, 0.4170757362022777), (7, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(3, 0.6282580468670046), (6, 0.6282580468670046), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(4, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]

Transformations can also be chained, one on top of another:

lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
corpus_lsi = lsi_model[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
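The "double wrapper" in the comment above can be pictured as two lazy wrappers stacked on the corpus: nothing is computed until the result is iterated. The sketch below is a simplified stand-in, not Gensim code. LazyTransformed is a hypothetical class, and the two toy functions merely play the roles of the tf-idf and LSI steps.

```python
class LazyTransformed:
    """Apply `transform` to each document only when iterated (hypothetical class)."""

    def __init__(self, corpus, transform):
        self.corpus = corpus
        self.transform = transform

    def __iter__(self):
        for doc in self.corpus:
            yield self.transform(doc)


calls = []  # records the order in which the toy transformations run


def double_counts(doc):
    """Toy first step: double every value."""
    calls.append("double")
    return [(tid, 2 * value) for tid, value in doc]


def keep_large(doc):
    """Toy second step: keep only values greater than 2."""
    calls.append("keep")
    return [(tid, value) for tid, value in doc if value > 2]


corpus = [[(0, 1), (1, 2)], [(2, 3)]]
chained = LazyTransformed(LazyTransformed(corpus, double_counts), keep_large)

calls_before = list(calls)  # still empty: building the chain ran nothing
result = list(chained)      # iterating pulls each document through both steps
print(calls_before)
print(result)
print(calls)
```

Each document passes through both steps before the next document is touched, which is what lets a chain of transformations run over a corpus that does not fit in memory.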

We have now transformed our tf-idf corpus via Latent Semantic Indexing into a latent 2-D space. Inspect the topics:

lsi_model.print_topics(2)

[(0,
  '0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + 0.061*"system" + 0.060*"time" + 0.060*"response" + 0.058*"user" + 0.049*"computer" + 0.035*"interface"'),
 (1,
  '-0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"time" + -0.320*"response" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"')]

According to LSI, the first five documents are more strongly related to the second topic, while the remaining four documents are more strongly related to the first topic:

for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

[(0, 0.06600783396090243), (1, -0.5200703306361859)] Human machine interface for lab abc computer applications
[(0, 0.19667592859142288), (1, -0.7609563167700061)] A survey of user opinion of computer system response time
[(0, 0.08992639972446298), (1, -0.724186062675251)] The EPS user interface management system
[(0, 0.07585847652178054), (1, -0.6320551586003424)] System and human system engineering testing of EPS
[(0, 0.1015029918497996), (1, -0.5737308483002966)] Relation of user perceived response time to error measurement
[(0, 0.7032108939378314), (1, 0.16115180214025712)] The generation of random binary unordered trees
[(0, 0.8774787673119835), (1, 0.16758906864659295)] The intersection graph of paths in trees
[(0, 0.9098624686818582), (1, 0.14086553628718873)] Graph minors IV Widths of trees and well quasi ordering
[(0, 0.6165825350569278), (1, -0.05392907566389511)] Graph minors A survey
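The claim about which topic each document leans toward can be checked mechanically. The sketch below hard-codes the coordinates printed above, rounded to four decimals, and picks the topic with the largest absolute weight for each document. Comparing absolute values is an assumption on my part about what "more strongly related" means here, since the sign of an LSI dimension is arbitrary; dominant_topic is a hypothetical helper.

```python
# LSI coordinates copied from the output above, rounded to four decimals.
corpus_lsi = [
    [(0, 0.0660), (1, -0.5201)],
    [(0, 0.1967), (1, -0.7610)],
    [(0, 0.0899), (1, -0.7242)],
    [(0, 0.0759), (1, -0.6321)],
    [(0, 0.1015), (1, -0.5737)],
    [(0, 0.7032), (1, 0.1612)],
    [(0, 0.8775), (1, 0.1676)],
    [(0, 0.9099), (1, 0.1409)],
    [(0, 0.6166), (1, -0.0539)],
]


def dominant_topic(doc):
    """Return the topic id whose weight has the largest magnitude."""
    return max(doc, key=lambda pair: abs(pair[1]))[0]


dominant = [dominant_topic(doc) for doc in corpus_lsi]
print(dominant)
```

Topic ids are zero-based, so topic 1 is the "second topic" of the text and topic 0 is the "first topic".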

Model Persistence

Models are persisted with save() and load():

import os
import tempfile

with tempfile.NamedTemporaryFile(prefix='model-', suffix='.lsi', delete=False) as tmp:
    lsi_model.save(tmp.name)

loaded_lsi_model = models.LsiModel.load(tmp.name)

os.unlink(tmp.name)

Available Transformations

Gensim implements several popular vector space model algorithms:

  • Term Frequency * Inverse Document Frequency (Tf-Idf)

    model = models.TfidfModel(corpus, normalize=True)

  • Latent Semantic Indexing (LSI, sometimes called LSA)

    model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

    LSI is unique in that training can continue at any point simply by providing more training documents. This is done through incremental updates to the underlying model, a process called online training. Because of this feature, the input document stream may even be infinite.

    model.add_documents(another_tfidf_corpus)  # LSI is now trained on tfidf_corpus + another_tfidf_corpus
    lsi_vec = model[tfidf_vec]  # convert a new document into the LSI space without affecting the model

    model.add_documents(more_documents)  # tfidf_corpus + another_tfidf_corpus + more_documents
    lsi_vec = model[tfidf_vec]

  • Random Projections (RP)

    model = models.RpModel(tfidf_corpus, num_topics=500)

  • Latent Dirichlet Allocation (LDA)

    model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

  • Hierarchical Dirichlet Process (HDP)

    model = models.HdpModel(corpus, id2word=dictionary)

Similarity Queries

Creating the Corpus

from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

print(corpus)

[[(0, 1), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(2, 1), (5, 1), (7, 1), (8, 1)], [(1, 1), (5, 2), (8, 1)], [(3, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(4, 1), (10, 1), (11, 1)]]

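Similarity Interface

from gensim import models

# The notes use lsi without defining it; a 2-topic LSI model built on the corpus is assumed here
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

[(0, 0.46182100453271546), (1, -0.0700276652789991)]

Cosine similarity is used to determine the similarity of two vectors. It is a standard measure in vector space modeling, but where the vectors represent probability distributions, a different similarity measure may be more appropriate.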
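For reference, cosine similarity over sparse (id, weight) vectors can be written in a few lines. This is a generic sketch of the measure itself, with a hypothetical function name and made-up example vectors; it is not how Gensim's index classes compute it internally.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse (id, weight) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(tid, 0.0) for tid, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


query = [(0, 1.0), (1, 1.0)]
same_direction = [(0, 2.0), (1, 2.0)]  # same direction, different length
orthogonal = [(2, 1.0)]                # shares no ids with the query

print(cosine(query, same_direction))
print(cosine(query, orthogonal))
```

Only the direction of the vectors matters: scaling a vector leaves the score unchanged, and vectors with no ids in common score zero.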

Initializing Query Structures

from gensim import similarities

index = similarities.MatrixSimilarity(lsi[corpus])  # transform the corpus to LSI space and index it

Index persistence is handled via the standard save() and load() functions:

index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

Performing Queries

Query the similarity of our new document against the nine indexed documents:

sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))

[(0, 0.998093), (1, 0.93748635), (2, 0.9984453), (3, 0.9865886), (4, 0.90755945), (5, -0.12416792), (6, -0.10639259), (7, -0.09879464), (8, 0.050041765)]
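Sorting by score and then enumerating gives (rank, (document index, score)) pairs:

print(list(enumerate(sorted(enumerate(sims), key=lambda item: -item[1]))))

[(0, (2, 0.9984453)), (1, (0, 0.998093)), (2, (3, 0.9865886)), (3, (1, 0.93748635)), (4, (4, 0.90755945)), (5, (8, 0.050041765)), (6, (7, -0.09879464)), (7, (6, -0.10639259)), (8, (5, -0.12416792))]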

With a bit of standard Python magic, sort the similarities in descending order to get the final results for the query "Human computer interaction":

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    # Look each document up by its original index, not by its rank
    print((doc_position, doc_score), documents[doc_position])

(2, 0.9984453) The EPS user interface management system
(0, 0.998093) Human machine interface for lab abc computer applications
(3, 0.9865886) System and human system engineering testing of EPS
(1, 0.93748635) A survey of user opinion of computer system response time
(4, 0.90755945) Relation of user perceived response time to error measurement
(8, 0.050041765) Graph minors A survey
(7, -0.09879464) Graph minors IV Widths of trees and well quasi ordering
(6, -0.10639259) The intersection graph of paths in trees
(5, -0.12416792) The generation of random binary unordered trees
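When pairing scores with document text, the document has to be looked up by its original index rather than by its rank in the sorted list. A small helper makes that explicit. The scores below are copied from the query output earlier in this section, and top_matches is a hypothetical name.

```python
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Similarity scores copied from the query output, in document order.
scores = [0.998093, 0.93748635, 0.9984453, 0.9865886, 0.90755945,
          -0.12416792, -0.10639259, -0.09879464, 0.050041765]


def top_matches(scores, docs, n=3):
    """Return the n best (score, document text) pairs, best first."""
    ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
    # doc_id is the position in the original corpus, so docs[doc_id] is correct.
    return [(score, docs[doc_id]) for doc_id, score in ranked[:n]]


best = top_matches(scores, documents)
for score, text in best:
    print(score, text)
```

The three best matches all come from the human-computer-interaction half of the corpus, even though the first of them shares no word with the query.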