

news 2026/4/22 8:28:37

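1. Introduction

NLP projects work with text, but machine learning algorithms cannot use text unless it is converted into a numerical representation. This representation is usually called a vector, and it can be applied to any reasonable unit of text: single tokens, n-grams, sentences, paragraphs, or even entire documents.

In statistical NLP over whole corpora, different vectorization techniques are applied, such as one-hot, count, or frequency encoding. In neural NLP, word vectors, also called word embeddings, dominate. Pre-trained vectors can be used, as well as vector representations learned inside complex neural networks.

This article explains and shows Python implementations of all the vectorization techniques mentioned: one-hot encoding, counter encoding (bag of words), term frequency, and finally word vectors.

The technical context of this article is Python v3.11 and several additional libraries: gensim v4.3.1, pandas v2.0.1, numpy v1.26.1, nltk v3.8.1, and scikit-learn v1.2.2. All examples should also work with newer library versions.

This article originally appeared on my blog admantium.com.

2. Requirements and Python Libraries

Be sure to read and run the requirements from my previous article so that you have a Jupyter Notebook in which to run all code examples.

For this article, the following libraries are needed:

- Collections: the Counter object, for counting the tokens in a document
- Gensim: the downloader object, which loads several pre-trained word vectors
- Pandas: the DataFrame object, for storing text, tokens, and vectors
- Numpy: several methods for creating and working with arrays
- NLTK: PlaintextCorpusReader, a traversable object that gives access to documents, provides tokenization methods, and computes statistics about all files; sent_tokenize and word_tokenize for generating tokens; the stopwords list for token reduction
- SciKitLearn: the Pipeline object to implement a chain of processing steps; BaseEstimator and TransformerMixin to build custom classes that represent pipeline steps

All examples require these imports and this base class:

```python
import numpy as np
import re
from copy import deepcopy
from collections import Counter
from gensim import downloader
from nltk.corpus import stopwords
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from time import time


class SciKitTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X=None, y=None):
        return self

    def transform(self, X=None):
        return self
```

3. Basic Example

As in the previous articles, the NLTK PlaintextCorpusReader is reused. This is an updated version of the WikipediaCorpus class with an additional filter() method, which reduces the vocabulary to plain text without stopwords.

```python
class WikipediaCorpus(PlaintextCorpusReader):
    def __init__(self, root_path):
        PlaintextCorpusReader.__init__(self, root_path, r'.*[0-9].txt')

    def filter(self, word):
        # strip punctuation so that only letters and numbers remain
        word = re.sub(r'[().,;:\-]', '', word)
        # remove whitespace
        word = re.sub(r'\s+', '', word)
        # note: the stopword check runs before lowercasing,
        # so capitalized stopwords such as "As" pass through
        if not word in stopwords.words("english"):
            return word.lower()
        return ''

    def vocab(self):
        # use self instead of a global corpus variable
        return sorted(set([self.filter(word) for word in self.words()]))

    def max_words(self):
        max_len = 0
        for doc in self.fileids():
            length = len(self.words(doc))
            max_len = length if length > max_len else max_len
        return max_len

    def describe(self, fileids=None, categories=None):
        started = time()
        return {
            'files': len(self.fileids()),
            'paras': len(self.paras()),
            'sents': len(self.sents()),
            'words': len(self.words()),
            'vocab': len(self.vocab()),
            'max_words': self.max_words(),
            'time': time() - started
        }
```

To keep the example vectors in this article short and easy to follow, the corpus consists of the opening sentences of the Wikipedia article about artificial intelligence.

```
_Source: [Wikipedia](https://en.wikipedia.org/wiki/Artificial_intelligence)_

Artificial intelligence (AI) is intelligence-perceiving, synthesizing, and inferring information-demonstrated by machines, as opposed to intelligence displayed by humans or by other animals. Example tasks in which this is done include speech recognition, computer vision, translation between (natural) languages, as well as other mappings of inputs. As machines become increasingly capable, tasks considered to require intelligence are often removed from the definition of AI, a phenomenon known as the AI effect. For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.
```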
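To make the effect of filter() and vocab() concrete without any corpus files, here is a minimal sketch. It assumes a tiny hand-picked stopword set (hypothetical, far smaller than NLTK's English list) and a plain whitespace split instead of PlaintextCorpusReader. Unlike the class above, it lowercases before the stopword check and drops the empty string, so its output is not identical to what the class would produce.

```python
import re

# Assumption: a tiny illustrative stopword set, not NLTK's full English list.
STOPWORDS = {"is", "by", "as", "to", "or", "and", "the", "of", "a", "in", "are"}


def filter_word(word):
    """Strip punctuation, lowercase, and blank out stopwords."""
    word = re.sub(r'[().,;:"\-]', '', word)
    word = re.sub(r'\s+', '', word)
    lowered = word.lower()
    return '' if lowered in STOPWORDS else lowered


def build_vocab(text):
    """Return the sorted set of filtered tokens, without the empty string."""
    tokens = text.split()
    return sorted({filter_word(token) for token in tokens} - {''})


sample = ("As machines become increasingly capable, tasks considered "
          "to require intelligence are often removed.")
vocab = build_vocab(sample)
print(vocab)
```

The sorted order matters: every encoder in this article maps a vector position to a vocabulary entry, so the vocabulary must be built once and kept stable.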
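Parsing these sentences with the corpus class gives the following statistics: a vocabulary of 40 entries and 91 words in total. That is small enough to keep the explanations below short.

```python
corpus = WikipediaCorpus('ai_sentences')

print(corpus.fileids())
# ['sent1.txt', 'sent2.txt', 'sent3.txt']

print(corpus.describe())
# {'files': 3, 'paras': 3, 'sents': 3, 'words': 91, 'vocab': 40, 'max_words': 32, 'time': 0.01642608642578125}

print(corpus.vocab())
# ['', 'ai', 'animals', 'artificial', 'as', 'become', 'capable', 'computer', 'considered', ..., 'well']
```

4. One-Hot Encoding

One-hot encoding records, relative to the total vocabulary of all documents, which words occur in a specific document. The implementation therefore needs these steps:

- Compute the total ordered vocabulary of all documents
- Iterate over each document and mark the words that occur

The following implementation builds a vocab_dict object filled with the default float value 0.0, and then sets the value to 1.0 for every token that occurs in the sentence.

```python
class OneHotEncoder(SciKitTransformer):
    def __init__(self, vocab):
        self.vocab_dict = dict.fromkeys(vocab, 0.0)

    def one_hot_vector(self, tokens):
        vec_dict = deepcopy(self.vocab_dict)
        for token in tokens:
            if token in self.vocab_dict:
                vec_dict[token] = 1.0
        vec = [v for v in vec_dict.values()]
        return np.array(vec)
```

Here are the one-hot vectors for the first two sentences:

```python
encoder = OneHotEncoder(corpus.vocab())

sent1 = [word for word in word_tokenize(corpus.raw('sent1.txt'))]
vec1 = encoder.one_hot_vector(sent1)
print(vec1)
# [0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0.
#  1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
print(vec1.shape)
# (40,)

sent2 = [word for word in word_tokenize(corpus.raw('sent2.txt'))]
vec2 = encoder.one_hot_vector(sent2)
print(vec2)
# [0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1.
#  0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 1.]
print(vec2.shape)
# (40,)
```

5. Counter Encoding

Counter encoding is an intermediate form of creating vectors. Based on the complete ordered vocabulary of all documents, it determines which words occur in a document and how often. This number is usually scaled, for example by the length of the document.

Here is the counter encoding implementation in Python. As before, it builds a vocab_dict object filled with the default float value 0.0, and for each document it sets the value number(word)/len(document).

```python
from collections import Counter


class CountEncoder(SciKitTransformer):
    def __init__(self, vocab):
        self.vocab = dict.fromkeys(vocab, 0.0)

    def count_vector(self, tokens):
        vec_dict = deepcopy(self.vocab)
        token_vec = Counter(tokens)
        doc_length = len(tokens)
        for token, count in token_vec.items():
            if token in self.vocab:
                # scale the raw count by the document length
                vec_dict[token] = count / doc_length
        vec = [v for v in vec_dict.values()]
        return np.array(vec)
```

Using the counter encoding produces the following results:

```python
encoder = CountEncoder(corpus.vocab())

sent1 = [word for word in word_tokenize(corpus.raw('sent1.txt'))]
vec1 = encoder.count_vector(sent1)
print(vec1)
# [0.         0.         0.03571429 0.         0.03571429 0.
#  0.         0.         0.         0.         0.         0.03571429
#  0.         0.         0.         0.03571429 0.         0.
#  0.03571429 0.         0.         0.07142857 0.         0.
#  0.03571429 0.         0.         0.         0.03571429 0.
#  0.         0.         0.         0.         0.         0.03571429
#  0.         0.         0.         0.        ]
print(vec1.shape)
# (40,)

sent2 = [word for word in word_tokenize(corpus.raw('sent2.txt'))]
vec2 = encoder.count_vector(sent2)
print(vec2)
# [0.         0.         0.         0.         0.06896552 0.
#  0.         0.03448276 0.         0.         0.         0.
#  0.03448276 0.         0.         0.         0.03448276 0.
#  0.         0.         0.03448276 0.         0.         0.03448276
#  0.         0.03448276 0.03448276 0.         0.         0.
#  0.         0.03448276 0.         0.         0.03448276 0.
#  0.03448276 0.03448276 0.03448276 0.03448276]
print(vec2.shape)
# (40,)
```

6. Term Frequency Encoding

The problem with the two previous encodings is that, when they are used with machine learning algorithms, very rare terms do not get enough weight to play a significant role. To address exactly this, the term frequency-inverse document frequency (TfIdf) metric balances rare terms in large document corpora. The detailed math can be studied in the TfIdf Wikipedia article. Here is the basic summary:

- TF: the term frequency is the number of times a term occurs in a document, divided by the total length of the document. In pseudocode: word_occurences_in_doc/doc_len
- IDF: the inverse document frequency is the logarithm of the total number of documents in the corpus divided by the number of documents that contain the word. In pseudocode: log(number_of_docs/number_of_docs_containing_word)

The implementation is more involved than the previous ones. The encoder is built on the following considerations:

- It receives the corpus vocabulary as a list, and a dictionary object of the form {document_name: [tokens]}. Otherwise, the implementation would be too tightly coupled to the corpus object
- During initialization, a word_frequency dictionary is created that holds, for each term, the number of documents in which it occurs
- The TfIdf method determines the total number of documents as number_of_docs and the document length as doc_len
- It then creates a Counter for all words in the document and computes the TfIdf value for each word contained in the vocabulary
- All values are converted to a Numpy array and returned

Here is the implementation:

```python
class TfIdfEncoder(SciKitTransformer):
    def __init__(self, doc_arr, vocab):
        self.doc_arr = doc_arr
        self.vocab = vocab
        self.word_frequency = self._word_frequency()

    def _word_frequency(self):
        # count in how many documents each vocabulary word occurs
        word_frequency = dict.fromkeys(self.vocab, 0.0)
        for doc_name in self.doc_arr:
            doc_words = Counter([word for word in self.doc_arr[doc_name]])
            for word, _ in doc_words.items():
                if word in self.vocab:
                    word_frequency[word] += 1.0
        return word_frequency

    def TfIdf_vector(self, doc_name):
        if not doc_name in self.doc_arr:
            print(f"Document '{doc_name}' not found.")
            return
        number_of_docs = len(self.doc_arr)
        doc_len = len(self.doc_arr[doc_name])
        doc_words = Counter([word for word in self.doc_arr[doc_name]])
        TfIdf_vec = dict.fromkeys(self.vocab, 0.0)
        for word, word_count in doc_words.items():
            if word in self.vocab:
                tf = word_count / doc_len
                idf = np.log(number_of_docs / self.word_frequency[word])
                # a word that occurs in every document has idf 0; use 1 instead
                idf = 1 if idf == 0 else idf
                TfIdf_vec[word] = tf * idf
        vec = [v for v in TfIdf_vec.values()]
        return np.array(vec)
```

For our example with only three sentences, the vectors are sufficient to represent the documents, but their full potential only shows on large corpora.

```python
doc_list = [doc for doc in corpus.fileids()]
words_list = [corpus.words(doc) for doc in doc_list]
doc_arr = dict(zip(doc_list, words_list))

encoder = TfIdfEncoder(doc_arr, corpus.vocab())

vec1 = encoder.TfIdf_vector('sent1.txt')
print(vec1)
# [0.         0.         0.03433163 0.         0.03125    0.
#  0.         0.         0.         0.         0.03433163 0.03433163
#  0.         0.         0.         0.03433163 0.         0.
#  0.03433163 0.03433163 0.         0.03801235 0.         0.
#  0.01267078 0.         0.         0.         0.03433163 0.03433163
#  0.         0.         0.         0.         0.         0.03433163
#  0.         0.         0.         0.        ]
print(vec1.shape)
# (40,)

vec2 = encoder.TfIdf_vector('sent2.txt')
print(vec2)
# [0.         0.         0.         0.         0.06896552 0.
#  0.         0.03788318 0.         0.         0.         0.
#  0.03788318 0.         0.         0.         0.03788318 0.
#  0.         0.         0.03788318 0.         0.         0.03788318
#  0.         0.03788318 0.03788318 0.         0.         0.
#  0.         0.03788318 0.         0.         0.03788318 0.
#  0.01398156 0.03788318 0.03788318 0.03788318]
print(vec2.shape)
# (40,)
```

7. Word Vectors

The final encoding type is word vectors. Essentially, each word is represented by an n-dimensional vector. This vector captures fine-grained relationships between words, and it enables vector arithmetic for comparing and combining vectors, for example vector algebra such as king + women = queen. Word vectors provide tremendous and surprising value for large-scale NLP tasks.

The three main word vector implementations are the original Word2Vec, FastText, and GloVe.

- Word2Vec was the first model. It was trained on news articles and uses different n-gram sizes to capture the meaning of a word from its surrounding context.
- FastText uses a similar continuous n-gram approach, but in addition to the actual context of words in the training data it also considers subword information (character n-grams). This improves the representation of sparse words and handles unknown words that were not present during training.
- GloVe considers the whole corpus, computes a word-to-word co-occurrence matrix from the training data, and builds a probabilistic model of how likely any word is to occur in the sampled data.

Word vectors represent the structure found in the training data. If that data is large enough and close to the text of your corpus, pre-trained vectors can be used. Otherwise, they need to be trained on the corpus.

In the following implementation, the Gensim library is used to load pre-trained Word2Vec vectors and apply them to the corpus.

To use one of the pre-trained models, you need to download it with the Gensim helper. Note that models can be very large. For example, the word2vec-google-news-300 model is 1.6GB and provides a 300-dimensional vector for each word.

```python
wv = downloader.load('word2vec-google-news-300')
# [-------------------------------------------] 15.5% 258.5/1662.8MB downloaded
```

The vectorizer uses the same structure as the other encoders. Its implementation is very simple: it processes a list of document tokens and outputs one vector that contains the numerical values of every word for which a vector representation exists.

```python
class Word2VecEncoder(SciKitTransformer):
    def __init__(self, vocab):
        self.vocab = vocab
        self.vector_lookup = downloader.load('word2vec-google-news-300')

    def word_vector(self, tokens):
        vec = np.array([])
        for token in tokens:
            if token in self.vocab:
                if token in self.vector_lookup:
                    print(f"Add {token}")
                    # np.append(new, vec) places the newest word vector in front
                    vec = np.append(self.vector_lookup[token], vec)
        return vec
```

Here is an example output.

```python
encoder = Word2VecEncoder(corpus.vocab())

sent1 = [word for word in word_tokenize(corpus.raw('sent1.txt'))]
vec1 = encoder.word_vector(sent1)
print(vec1)
# [ 0.01989746  0.24707031 -0.23632812 ... -0.24707031  0.05249023
#   0.19824219]
print(vec1.shape)
# (3000,)

sent2 = [word for word in word_tokenize(corpus.raw('sent2.txt'))]
vec2 = encoder.word_vector(sent2)
print(vec2)
# [-0.11767578 -0.13769531 -0.140625   ... -0.03295898 -0.01733398
#   0.13476562]
print(vec2.shape)
# (4500,)
```

As you can see, the vectors for the two sentences have 3000 and 4500 values respectively. The result is document-specific. Conceptually it is a matrix in which each column is the 300-dimensional vector of one document token, in order of appearance. The code flattens it into a single array, so its length is 300 times the number of tokens that have a vector: 10 tokens for the first sentence and 15 for the second.

8. Conclusion

This article showed how to implement text vectorization methods from scratch. It covered implementations of one-hot encoding, counter encoding, TfIdf frequency encoding, and Word2Vec word vectors. It also showed concrete examples of the resulting vectors when the encoders are applied to sentences from the Wikipedia article about artificial intelligence.

References

NLP: Text Vectorization Methods from Scratch | by Sebastian | Oct, 2023 | Medium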
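As a compact recap of the one-hot, counter, and TfIdf sections above, the following sketch applies the same formulas to a toy corpus small enough to check by hand. The two documents and the function names are hypothetical, and the code uses only the standard library rather than the article's encoder classes. The fallback of replacing an IDF of 0 with 1 mirrors what the TfIdfEncoder shown earlier does. It is not part of the textbook TfIdf definition.

```python
import math
from collections import Counter

# Assumption: a hypothetical toy corpus of already tokenized documents.
docs = {
    "d1": ["machines", "learn", "machines", "infer"],
    "d2": ["humans", "learn"],
}

# Ordered vocabulary over all documents: humans, infer, learn, machines
vocab = sorted({token for tokens in docs.values() for token in tokens})


def one_hot(tokens):
    """1.0 for every vocabulary word that occurs in the document."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocab]


def count_vector(tokens):
    """Occurrences of each vocabulary word, scaled by document length."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def tfidf_vector(tokens):
    """tf * idf with idf = log(number_of_docs / docs_containing_word)."""
    counts = Counter(tokens)
    vec = []
    for word in vocab:
        docs_with_word = sum(1 for other in docs.values() if word in other)
        idf = math.log(len(docs) / docs_with_word)
        idf = 1.0 if idf == 0 else idf  # same fallback as the article's encoder
        vec.append(counts[word] / len(tokens) * idf)
    return vec


print(one_hot(docs["d1"]))
print(count_vector(docs["d1"]))
print(tfidf_vector(docs["d1"]))
```

In the first document, "learn" occurs in both documents, so its raw IDF is log(2/2) = 0 and the fallback keeps its weight at the plain term frequency of 0.25. "machines" occurs only in d1, so its term frequency of 0.5 is multiplied by log(2).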


http://www.hkea.cn/news/14365730/