

news 2026/4/22 18:20:11

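To connect what the user searches for with the content of a web page, both have to be broken into small units, namely keywords. We then check whether each unit of the query appears in the document. If it does, the page is related to the query, and the more often the unit appears, the more relevant the page is. That degree of relevance is the weight. You define the weights yourself: for example, one occurrence in the title is worth 10 and one occurrence in the content is worth 5. Count how many times the search unit appears in the title and in the content, then compute:

total weight = (occurrences in title) * 10 + (occurrences in content) * 5

Adding up the total weights of all the user's search units gives the relevance of the document to the query. Sorting the documents by this weight lets us show the user the documents they most want.

How do we store the page documents? Each one has three parts: a title, the content and a URL, so we need a struct to hold them together. A linear container is a good choice for storing the structs, because the position of an element in the container can serve as the document ID.

The remaining problem is how to link search units (query keywords) to document units (document keywords). We do that with a forward index and an inverted index.

Building the indexes

What is a forward index?

A forward index is the mapping from a document ID to the document.

| Document ID | Document content |
| 0           | document 1       |
| 1           | document 2       |

Building the forward index means associating each document ID directly with the document content, as in the table above. How? Use a linear structure such as an array, whose subscripts correspond exactly to document IDs. We put the parsed data into a struct holding the title, the content and the url, then append the struct to the array. That is the forward index.

What is an inverted index?

Suppose the user searches for 菜鸡爱玩 ("the noob loves to play"). The word-segmentation tool splits it into 菜鸡 and 爱玩, and each of them is looked up among the documents' keywords. The IDs of the documents that contain a keyword are then associated with that keyword.

| Keyword (unique) | Inverted list (document ID, weight) |
| 菜鸡             | document 2, document 1              |
| 爱玩             | document 2                          |

First the processed data is segmented into keywords. inverted_index is a map container of the form map<keyword, inverted list>. For each keyword we record which documents it appears in and put those documents into the keyword's inverted list, which gives us the mapping from keyword to document IDs. As the table shows, the same document ID can appear in several inverted lists.

Building the index is slow and uses a lot of system resources, so if the page documents are too large and the server has few resources, the build can fail. That is why only part of the Boost library's files (the web page files) were downloaded earlier, not all of them. The build is slow, but it pays off: no searching happens while the index is being built, and once it is ready every query is just a lookup in the inverted_index map that returns the matching inverted list.

When a search keyword arrives, I look it up in inverted_index. If the keyword exists, I have found every document related to it. If it does not, no indexed document contains it. There may be more than one search keyword: a user types a whole phrase, and 菜鸡爱玩 may be segmented into 菜, 鸡, 菜鸡, 爱, 玩, 爱玩 and so on.

Forward index code

```cpp
DocInfo *BuildForwardIndex(const std::string &line)
{
    // 1. Parse the line: split it into 3 strings (title, content, url)
    std::vector<std::string> results;
    const std::string sep = "\3"; // separator between fields inside one line
    // Split comes from util.hpp (not shown); results is its output parameter
    ns_util::StringUtil::Split(line, &results, sep);
    if (results.size() != 3) {
        return nullptr;
    }
    // 2. Fill a DocInfo with the three strings
    DocInfo doc;
    doc.title = results[0];
    doc.content = results[1];
    doc.url = results[2];
    // Set the id before inserting: it equals the subscript this doc will get in the vector
    doc.doc_id = forward_index.size();
    // 3. Insert into the forward index vector
    forward_index.push_back(std::move(doc));
    return &forward_index.back();
}
```

Once the forward index entry is built, the struct is returned and handed to the inverted index builder, which needs the document ID, the title and the content for keyword segmentation and weight calculation. If this is not clear yet, read on: the section on building the whole index explains why it is done this way.

Getting a forward index entry

```cpp
// Look up the document by doc_id
DocInfo *GetForwardIndex(uint64_t doc_id)
{
    if (doc_id >= forward_index.size()) {
        std::cerr << "doc_id out range, error!" << std::endl;
        return nullptr;
    }
    return &forward_index[doc_id];
}
```

Because the forward index has already been built, the document ID is simply used as a subscript into the array of document structs.

What is a weight?

The weight decides whether a document is related to the user's query and how strongly. Documents on the same topic emphasize different things, so the size of the weight is what tells us which one the user most wants. The relation between a document and a search keyword is computed from the number of times the keyword appears in the title and in the content, multiplied by weights you choose. If you think a match in the title matters more, give the title a higher weight, and likewise for the content. The example earlier used 10 for the title and 5 for the content. The code below uses X = 10 for the title and Y = 1 for the content.

Inverted index code

```cpp
bool BuildInvertedIndex(const DocInfo &doc)
{
    // DocInfo{title, content, url, doc_id}
    // word -> inverted list
    struct word_cnt {
        int title_cnt;
        int content_cnt;
        word_cnt() : title_cnt(0), content_cnt(0) {}
    };
    std::unordered_map<std::string, word_cnt> word_map; // temporary word-frequency table

    // Segment the title (CutString comes from util.hpp, not shown)
    std::vector<std::string> title_words;
    ns_util::JiebaUtil::CutString(doc.title, &title_words);

    // Count word frequency in the title
    for (std::string s : title_words) {
        boost::to_lower(s);      // normalize to lowercase
        word_map[s].title_cnt++; // operator[] returns the entry, creating it if missing
    }

    // Segment the content
    std::vector<std::string> content_words;
    ns_util::JiebaUtil::CutString(doc.content, &content_words);

    // Count word frequency in the content
    for (std::string s : content_words) {
        boost::to_lower(s);
        word_map[s].content_cnt++;
    }

#define X 10
#define Y 1
    // Hello, hello and HELLO all count as the same word
    for (auto &word_pair : word_map) {
        InvertedElem item;
        item.doc_id = doc.doc_id;
        item.word = word_pair.first;
        item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt; // relevance
        InvertedList &inverted_list = inverted_index[word_pair.first];
        inverted_list.push_back(std::move(item));
    }
    return true;
}
```

Key lines explained

1: InvertedList &inverted_list = inverted_index[word_pair.first];
2: inverted_list.push_back(std::move(item));

inverted_index is a map<keyword, inverted list>. The first line fetches the inverted list for the keyword (creating an empty one if the keyword is new), and the second appends the new InvertedElem to that list. The two statements can be merged into one, at the cost of a slightly denser line.

After these steps the relation between keywords and document IDs is established. When I type a query, the segmentation tool splits it into keywords, and those keywords are looked up among the words produced by segmenting the document titles and contents. Because the same segmentation tool is used on both sides, a keyword that is in a document will not be missed by the search.

Getting an inverted list

```cpp
// Get the inverted list for a keyword
InvertedList *GetInvertedList(const std::string &word)
{
    auto iter = inverted_index.find(word);
    if (iter == inverted_index.end()) {
        std::cerr << word << " have no InvertedList" << std::endl;
        return nullptr;
    }
    return &(iter->second);
}
```

Once the inverted index is built, every inverted list lives in the inverted_index map. Supply a keyword, look it up, and return the inverted list that was found.

Building the index: combining the forward and inverted builds

```cpp
// Build the forward and inverted indexes from the tag-stripped, formatted documents
// data/raw_html/raw.txt
bool BuildIndex(const std::string &input) // receives the data produced by the parser
{
    std::ifstream in(input, std::ios::in | std::ios::binary);
    if (!in.is_open()) {
        std::cerr << "sorry, " << input << " open error" << std::endl;
        return false;
    }
    std::string line;
    int count = 0;
    while (std::getline(in, line)) {
        DocInfo *doc = BuildForwardIndex(line);
        if (nullptr == doc) {
            std::cerr << "build " << line << " error" << std::endl; // for debug
            continue;
        }
        BuildInvertedIndex(*doc);
        count++;
        // LOG comes from log.hpp (not shown)
        LOG(NORMAL, "Documents indexed so far: " + std::to_string(count));
    }
    return true;
}
```

The processed page file is read with std::ifstream. The parser separated the documents with \n, so a getline(in, line) loop reads them one at a time. For each line the forward index entry is built first and the inverted index second, because the inverted index is built from the forward index entry.

Singleton pattern

```cpp
private:
    Index() {} // must have a body; it cannot be deleted
    Index(const Index &) = delete;
    Index &operator=(const Index &) = delete;
    static Index *instance;
    static std::mutex mtx;

public:
    ~Index() {}

public:
    static Index *GetInstance()
    {
        if (nullptr == instance) {
            mtx.lock();
            if (nullptr == instance) { // check again while holding the lock
                instance = new Index();
            }
            mtx.unlock();
        }
        return instance;
    }
```

The singleton pattern disables the class's copy constructor and copy assignment so the object cannot be copied, and every caller shares the single instance pointer. In multithreaded mode many users search at the same time, so a lock is needed to protect the critical section where the instance is created. Note that the first check reads instance outside the lock, which the C++11 memory model treats as a data race on a plain pointer; a function-local static is one way to avoid that.

Full code of the index module, Index.hpp:

```cpp
#pragma once

#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <unordered_map>
#include <mutex>
#include <cstdint>
#include <boost/algorithm/string.hpp> // boost::to_lower
#include "util.hpp"
#include "log.hpp"

namespace ns_index {

    struct DocInfo {
        std::string title;   // document title
        std::string content; // document content with tags stripped
        std::string url;     // URL of the document on the official site
        uint64_t doc_id;     // document ID
    };

    struct InvertedElem {
        uint64_t doc_id;
        std::string word;
        int weight;
        InvertedElem() : weight(0) {}
    };

    // Inverted list
    typedef std::vector<InvertedElem> InvertedList;

    class Index {
    private:
        // Forward index: a vector whose subscript is naturally the document ID
        std::vector<DocInfo> forward_index;
        // Inverted index: one keyword maps to a list of InvertedElem
        std::unordered_map<std::string, InvertedList> inverted_index;

    private:
        Index() {} // must have a body; it cannot be deleted
        Index(const Index &) = delete;
        Index &operator=(const Index &) = delete;
        static Index *instance;
        static std::mutex mtx;

    public:
        ~Index() {}

    public:
        static Index *GetInstance()
        {
            if (nullptr == instance) {
                mtx.lock();
                if (nullptr == instance) {
                    instance = new Index();
                }
                mtx.unlock();
            }
            return instance;
        }

        // Look up the document by doc_id
        DocInfo *GetForwardIndex(uint64_t doc_id)
        {
            if (doc_id >= forward_index.size()) {
                std::cerr << "doc_id out range, error!" << std::endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }

        // Get the inverted list for a keyword
        InvertedList *GetInvertedList(const std::string &word)
        {
            auto iter = inverted_index.find(word);
            if (iter == inverted_index.end()) {
                std::cerr << word << " have no InvertedList" << std::endl;
                return nullptr;
            }
            return &(iter->second);
        }

        // Build the forward and inverted indexes from the tag-stripped, formatted documents
        // data/raw_html/raw.txt
        bool BuildIndex(const std::string &input) // receives the data produced by the parser
        {
            std::ifstream in(input, std::ios::in | std::ios::binary);
            if (!in.is_open()) {
                std::cerr << "sorry, " << input << " open error" << std::endl;
                return false;
            }
            std::string line;
            int count = 0;
            while (std::getline(in, line)) {
                DocInfo *doc = BuildForwardIndex(line);
                if (nullptr == doc) {
                    std::cerr << "build " << line << " error" << std::endl; // for debug
                    continue;
                }
                BuildInvertedIndex(*doc);
                count++;
                LOG(NORMAL, "Documents indexed so far: " + std::to_string(count));
            }
            return true;
        }

    private:
        DocInfo *BuildForwardIndex(const std::string &line)
        {
            // 1. Parse the line: split it into 3 strings (title, content, url)
            std::vector<std::string> results;
            const std::string sep = "\3"; // separator between fields inside one line
            ns_util::StringUtil::Split(line, &results, sep);
            if (results.size() != 3) {
                return nullptr;
            }
            // 2. Fill a DocInfo with the three strings
            DocInfo doc;
            doc.title = results[0];
            doc.content = results[1];
            doc.url = results[2];
            // Set the id before inserting: it equals the subscript this doc will get in the vector
            doc.doc_id = forward_index.size();
            // 3. Insert into the forward index vector
            forward_index.push_back(std::move(doc));
            return &forward_index.back();
        }

        bool BuildInvertedIndex(const DocInfo &doc)
        {
            // DocInfo{title, content, url, doc_id}
            // word -> inverted list
            struct word_cnt {
                int title_cnt;
                int content_cnt;
                word_cnt() : title_cnt(0), content_cnt(0) {}
            };
            std::unordered_map<std::string, word_cnt> word_map; // temporary word-frequency table

            // Segment the title and count word frequency
            std::vector<std::string> title_words;
            ns_util::JiebaUtil::CutString(doc.title, &title_words);
            for (std::string s : title_words) {
                boost::to_lower(s);      // normalize to lowercase
                word_map[s].title_cnt++; // operator[] returns the entry, creating it if missing
            }

            // Segment the content and count word frequency
            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.content, &content_words);
            for (std::string s : content_words) {
                boost::to_lower(s);
                word_map[s].content_cnt++;
            }

#define X 10
#define Y 1
            // Hello, hello and HELLO all count as the same word
            for (auto &word_pair : word_map) {
                InvertedElem item;
                item.doc_id = doc.doc_id;
                item.word = word_pair.first;
                item.weight = X * word_pair.second.title_cnt + Y * word_pair.second.content_cnt; // relevance
                InvertedList &inverted_list = inverted_index[word_pair.first];
                inverted_list.push_back(std::move(item));
            }
            return true;
        }
    };

    Index *Index::instance = nullptr;
    std::mutex Index::mtx;
}
```

The sort statement is a lambda expression; you could also write a functor and pass it to std::sort.

```cpp
// 4. [Build] Build the JSON string from the search results; jsoncpp handles serialization and deserialization
Json::Value root;
for (auto &item : inverted_list_all) {
    ns_index::DocInfo *doc = index->GetForwardIndex(item.doc_id);
    if (nullptr == doc) {
        continue;
    }
    Json::Value elem;
    elem["title"] = doc->title;
    // content is the whole tag-stripped document, but we only want a part of it. TODO
    elem["desc"] = GetDesc(doc->content, item.words[0]);
    elem["url"] = doc->url;
    // for debug, to be deleted
    elem["id"] = (int)item.doc_id;
    elem["weight"] = item.weight; // int -> string
    root.append(elem);
}
//Json::StyledWriter writer;
Json::FastWriter writer;
*json_string = writer.write(root);
```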
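The text mentions that the sort can be written either as a lambda or as a functor, but the sort call itself is not part of the excerpt. Here is a minimal sketch of both forms, assuming results should be ordered by descending weight. `Hit`, `ByWeightDesc`, `RankWithLambda` and `RankWithFunctor` are hypothetical names used only for illustration; they are not the searcher's real types.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for one merged search hit.
struct Hit {
    uint64_t doc_id;
    int weight;
};

// Functor form: an object whose operator() compares two hits.
struct ByWeightDesc {
    bool operator()(const Hit &a, const Hit &b) const
    {
        return a.weight > b.weight; // higher weight comes first
    }
};

// Lambda form: the comparison is written inline at the call site.
inline void RankWithLambda(std::vector<Hit> &hits)
{
    std::sort(hits.begin(), hits.end(),
              [](const Hit &a, const Hit &b) { return a.weight > b.weight; });
}

// Same ordering, passing the functor instead of a lambda.
inline void RankWithFunctor(std::vector<Hit> &hits)
{
    std::sort(hits.begin(), hits.end(), ByWeightDesc());
}
```

Both functions sort the vector in place, so the loop that builds the JSON can walk the hits from most to least relevant. The lambda keeps the comparison next to the call; the functor is handy when the same ordering is reused in several places.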

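To see the forward index, the inverted index and the weights working together without util.hpp or a segmentation library, the following is a small self-contained sketch. It is not the article's Index class: `MiniIndex`, `MiniDoc` and `MiniElem` are hypothetical names, whitespace splitting stands in for the real segmentation tool, and the weights 10 (title) and 1 (content) are borrowed from the X and Y values in the code above.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct MiniDoc {
    std::string title;
    std::string content;
};

struct MiniElem {
    uint64_t doc_id;
    int weight;
};

class MiniIndex {
public:
    // Adds a document and returns its ID, which is its position in the forward index.
    uint64_t Add(const std::string &title, const std::string &content)
    {
        uint64_t id = forward_.size();
        forward_.push_back(MiniDoc{title, content});

        // Accumulate a weight per word: 10 per title hit, 1 per content hit.
        std::unordered_map<std::string, int> weights;
        for (const std::string &w : Tokenize(title)) weights[w] += 10;
        for (const std::string &w : Tokenize(content)) weights[w] += 1;

        // Append one element per distinct word to that word's inverted list.
        for (const auto &kv : weights) {
            inverted_[kv.first].push_back(MiniElem{id, kv.second});
        }
        return id;
    }

    // Returns the inverted list for a word, or nullptr if no document contains it.
    const std::vector<MiniElem> *Find(const std::string &word) const
    {
        auto it = inverted_.find(word);
        return it == inverted_.end() ? nullptr : &it->second;
    }

private:
    // Whitespace tokenizer: an assumption that replaces real word segmentation.
    static std::vector<std::string> Tokenize(const std::string &text)
    {
        std::istringstream in(text);
        std::vector<std::string> words;
        std::string w;
        while (in >> w) words.push_back(w);
        return words;
    }

    std::vector<MiniDoc> forward_;                                    // forward index
    std::unordered_map<std::string, std::vector<MiniElem>> inverted_; // inverted index
};
```

For instance, after adding a document titled "boost filesystem" with content "filesystem path path", looking up "filesystem" yields one element for document 0 with weight 11: one title hit worth 10 plus one content hit worth 1.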
