

news 2026/4/24 10:27:33

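Contents

1. Introduction to PCA
2. Dataset Overview
3. Data Preprocessing Steps
4. Applying PCA
5. KMeans Clustering
6. PCA Component Analysis
7. Inverse Transformation
8. Centroid Analysis
9. Conclusion
10. Deep Dive
    10.1 Step 1: Determine the Optimal Number of PCA Components
    10.2 Step 2: Redo PCA with 9 Components
    10.3 Interpreting PCA Loadings and Feature Contributions
    10.4 Analysis and Interpretation of the 9 PCs
    10.5 How to Perform Thematic Analysis

1. Introduction to PCA

Principal Component Analysis (PCA) is a statistical technique that simplifies high-dimensional data while preserving its trends and patterns. It does this by transforming the data into a smaller number of dimensions, called principal components (PCs), that act as summaries of the features. The components are mutually orthogonal, so each one captures variance that is uncorrelated with the others.

2. Dataset Overview

In this case study we use an Airbnb listings dataset containing features such as location, room type, and price. The goal is to uncover latent patterns in the data that can help segment the listings into meaningful groups.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Load the dataframe from the CSV file
df = pd.read_csv('https://raw.githubusercontent.com/fenago/datasets/main/airbnb.csv')
```

3. Data Preprocessing Steps

Before applying PCA, the data has to be clean and in a suitable numeric format:

- Missing values: we fill missing values in numeric columns with the mean of the respective column, so no data points are dropped.
- Categorical encoding: we use label encoding to convert the categorical variables host_is_superhost, neighbourhood, property_type, and instant_bookable to numbers, while the city feature is one-hot encoded. This step is essential because PCA requires numeric input.
- Feature scaling: we use StandardScaler to scale the features. Scaling is critical for PCA because it is sensitive to the variances of the original variables.

```python
# Fill missing values in numeric columns with the mean of the column
# (numeric_only=True is required in pandas >= 2.0 when the frame has non-numeric columns)
df_filled = df.fillna(df.mean(numeric_only=True))

# Convert categorical columns to numeric using label encoding
# Initialize label encoder
label_encoder = LabelEncoder()

# Columns to label encode
label_encode_columns = ['host_is_superhost', 'neighbourhood', 'property_type', 'instant_bookable']

# Apply label encoding to each column
# (cast to str so any remaining missing values are encoded as their own category)
for column in label_encode_columns:
    df_filled[column] = label_encoder.fit_transform(df_filled[column].astype(str))

# Apply one-hot encoding to city using get_dummies
df_filled = pd.get_dummies(df_filled, columns=['city'], dtype=float)

# Redefine and refit the scaler to the current dataset
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_filled)
```

4. Applying PCA

We apply PCA to the scaled dataset and keep three principal components. This number is usually chosen from the explained variance, which indicates how much of the information in the data each component captures.

```python
# Apply PCA
pca = PCA(n_components=3)
pca_result = pca.fit_transform(scaled_features)
```

5. KMeans Clustering

With the data now in a three-dimensional PCA space, we apply KMeans clustering to identify four distinct clusters. The method groups data points so that points within a cluster are more similar to each other than to points in other clusters.

```python
# Apply KMeans clustering on the PCA result
# (n_init is set explicitly so the result does not depend on the scikit-learn version's default)
kmeans_pca = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans_pca.fit(pca_result)
```

6. PCA Component Analysis

Each principal component is a combination of the original features, but what exactly does each one capture?

```python
# Get the PCA components (loadings)
pca_components = pca.components_
```

Looking at the loadings of each component:

- PC1: appears to weight the geographic coordinates (latitude and longitude) heavily, suggesting that this component represents the geographic distribution of the listings.
- PC2: is negatively associated with host_since_datekey, suggesting that it captures some aspect of host experience or tenure.
- PC3: has high loadings on accommodates and listing_size_sqft, so it likely reflects the size and capacity of a listing.

7. Inverse Transformation

By inverse transforming the cluster centers from PCA space, we map the clusters back to the original feature space and can interpret the centroids in terms of the original features. This step translates the PCA results back into a language we can understand.

```python
# Inverse transform the cluster centers from PCA space back to the original feature space
original_space_centroids = scaler.inverse_transform(
    pca.inverse_transform(kmeans_pca.cluster_centers_)
)

# Create a new DataFrame for the inverse transformed cluster centers with column names
centroids_df = pd.DataFrame(original_space_centroids, columns=df_filled.columns)

# Calculate the mean of the original data for comparison
original_means = df_filled.mean(axis=0)

# Prepare the PCA loadings DataFrame
pca_loadings_df = pd.DataFrame(
    pca_components,
    columns=df_filled.columns,
    index=[f'PC{i+1}' for i in range(3)],
)
```

8. Centroid Analysis

Compared with the mean of the original data, the centroid of a cluster tells us the central tendency of that cluster. For example, if a centroid's price value is above the overall mean, the corresponding cluster probably represents more premium listings.

```python
# Append the mean of the original data to the centroids for comparison
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
centroids_comparison_df = pd.concat(
    [centroids_df, original_means.to_frame().T], ignore_index=True
)

# Store the PCA loadings and centroids comparison DataFrame for further analysis
pca_loadings_df.to_csv('/mnt/data/pca_loadings.csv', index=True)
centroids_comparison_df.to_csv('/mnt/data/centroids_comparison.csv', index=False)

# Display the PCA loadings and the first few rows of the centroids comparison DataFrame
pca_loadings_df, centroids_comparison_df.head()
```

9. Conclusion

PCA lets us reduce the dimensionality of the dataset and reveal underlying patterns that are not obvious at first. Combined with clustering, it lets us segment the listings into distinct groups, each representing a different aspect of the Airbnb market.

10. Deep Dive

10.1 Step 1: Determine the Optimal Number of PCA Components

When we perform PCA, we transform the original feature set into a new set of orthogonal features called principal components (PCs). Each principal component captures a certain percentage of the total variance in the dataset. The first component captures the most variance, and each subsequent component captures less.

The cumulative explained variance shows how much of the total variance is captured as more components are included. The cumulative explained variance plot shows the proportion of the dataset's total variance captured by including up to n principal components. The idea is to choose the smallest number of components that still captures a large share of the total variance. A common rule of thumb is to keep enough components to capture at least 95% of the total variance, which reduces dimensionality while retaining most of the information. We therefore look for the point where the cumulative explained variance crosses 95%.

```python
import matplotlib.pyplot as plt

# Fit PCA to the data without reducing dimensions and compute the explained variance ratio
pca_full = PCA()
pca_full.fit(scaled_features)

# Calculate the cumulative explained variance ratio
explained_variance_ratio = pca_full.explained_variance_ratio_
cumulative_explained_variance = explained_variance_ratio.cumsum()

# Plot the cumulative explained variance ratio to find the optimal number of components
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cumulative_explained_variance) + 1),
         cumulative_explained_variance, marker='o', linestyle='--')
plt.title('Cumulative Explained Variance by PCA Components')
plt.xlabel('Number of PCA Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.axhline(y=0.95, color='r', linestyle='-')  # 95% variance line for reference
plt.text(0.5, 0.85, '95% cut-off threshold', color='red', fontsize=16)

# Determine the number of components that explain at least 95% of the variance:
# count the components that fall short of the threshold, then add one
optimal_num_components = len(
    cumulative_explained_variance[cumulative_explained_variance < 0.95]
) + 1

# Highlight the optimal number of components on the plot
plt.axvline(x=optimal_num_components, color='g', linestyle='--')
plt.text(optimal_num_components + 1, 0.6,
         f'Optimal Components: {optimal_num_components}', color='green', fontsize=14)

plt.show()

# Returning the optimal number of components
optimal_num_components
```

The plot shows how the cumulative explained variance grows as the number of principal components increases. The green vertical line marks the point where the components together explain at least 95% of the total variance. In the plot, this threshold is reached at 9 principal components. Using 9 components therefore captures 95% of the variability in the data, which is generally considered sufficient for many applications, and it is a substantial reduction from the original number of features. In this analysis we can perform PCA and reduce the data to 9 principal components instead of using all the original features, giving a simpler but still informative representation of the dataset.

10.2 Step 2: Redo PCA with 9 Components

```python
# Redo PCA with 9 components
pca_9 = PCA(n_components=9)
pca_result_9 = pca_9.fit_transform(scaled_features)

# Get the PCA loadings for 9 components
pca_components_9 = pca_9.components_

# Analyze the PCA loadings to determine which features contribute most to each of the
# 9 principal components. We look at the absolute values of the loadings.
pca_loadings_analysis_9 = pd.DataFrame(
    pca_components_9,
    columns=df_filled.columns,
    index=[f'PC{i+1}' for i in range(9)],
).abs().T  # Transpose to have features as rows

# Sort the loadings for each component to see the top contributing features
top_features_per_pc_9 = pca_loadings_analysis_9.apply(
    lambda s: s.abs().nlargest(5).index.tolist(), axis=0
)

top_features_per_pc_9
```

To show the actual loading scores of these features, we can output the numeric values of the PCA loading matrix. The signed loading of each top feature in a given component shows the direction as well as the size of its contribution.

```python
# Display the actual loadings for the top 5 features for each of the 9 principal components.
# The top 5 features are chosen by absolute loading, but the signed loadings are shown.
import numpy as np

# Function to get top n features for each principal component with their loadings
def get_top_features_loadings(pca_loadings, n_features):
    top_features = {}
    for i in range(pca_loadings.shape[0]):
        # Get the indices of the n largest absolute values in the i-th principal component
        top_indices = np.argsort(np.abs(pca_loadings[i]))[-n_features:]
        # Create a dictionary of the top features and their loadings for the i-th component
        top_features[f'PC{i+1}'] = {
            df_filled.columns[j]: pca_loadings[i][j] for j in top_indices
        }
    return top_features

# Get the top 5 features and their loadings for each of the 9 principal components
top_features_loadings_9 = get_top_features_loadings(pca_components_9, 5)
top_features_loadings_9_df = pd.DataFrame(top_features_loadings_9).T

top_features_loadings_9_df
```

The resulting table shows the actual loadings of the top features for each of the nine principal components. A loading is a coefficient that expresses how much a feature contributes to a principal component. In summary:

- PC1: geographic features and city have the largest influence. The loadings take both positive and negative values, indicating opposite directions on the map.
- PC2: host-related features such as host_since_datekey and host_id have high negative loadings, meaning they are strongly inversely related to PC2.
- PC3: property-related features such as accommodates, listing_size_sqft, and bedrooms have strong positive loadings, meaning they directly drive PC3.
- PC4 to PC9: various other features related to city, property type, booking options, and review scores contribute to these components with positive and negative loadings of varying size.

To interpret these loadings:

- A positive loading means that as the feature value increases, the component score increases.
- A negative loading means that as the feature value increases, the component score decreases.
- The magnitude of the loading (its distance from zero) indicates the strength of the relationship between the feature and the component.

Inferring what each principal component means in detail requires domain knowledge of the dataset and an understanding of how each top feature relates to Airbnb listings. This involves considering what each feature represents (for example location, property size, or host experience) and how the features combine into the theme expressed by the component.

We have now performed PCA with 9 principal components and listed the top 5 features contributing most to each one. The next section describes how to interpret the loadings to determine feature contributions.

10.3 Interpreting PCA Loadings and Feature Contributions

The loadings of a PCA component reflect the association between the original variables and that component. Strictly speaking, pca.components_ holds unit-length eigenvector weights; for standardized data, the weights within one component are proportional to the correlations between each feature and that component. They can be read as follows:

- A high positive loading (close to 1) indicates a strong positive association between the feature and the component.
- A high negative loading (close to -1) indicates a strong negative association between the feature and the component.
- A loading close to 0 indicates a weak association between the feature and the component.

The top contributing features of each principal component are those with the highest absolute loadings, whether positive or negative. These features have the most influence on the variance captured by the component.

10.4 Analysis and Interpretation of the 9 PCs

We now interpret the theme of each of the 9 principal components based on its top contributing features:

- PC1: dominated by city-related features and geographic coordinates, suggesting a theme of geographic location.
- PC2: influenced by host identifiers and dates, indicating a theme of host experience or tenure.
- PC3: includes features related to listing size and capacity, pointing to a theme of property size and accommodation capacity.
- PC4: features city-related variables and acceptance rate, suggesting a theme of hosting preferences and location desirability.
- PC5: marked by city and price, possibly reflecting a theme of pricing strategy across locations.
- PC6: contains instant bookable and superhost status, suggesting a theme of booking convenience and host service quality.
- PC7: features response rate and review scores, pointing to a theme of host responsiveness and guest satisfaction.
- PC8: includes the host's total listings and review scores, indicating a theme of host portfolio and experience quality.
- PC9: captures neighbourhood and host listing counts, which may indicate neighbourhood popularity and host activity.

10.5 How to Perform Thematic Analysis

To perform a thematic analysis of the PCA components:

- Sort the PCA loadings: rank the features by their loading on each principal component.
- Identify the top features: determine the features with the highest absolute loadings.
- Understand feature importance: understand what these features mean in the context of the dataset.
- Look for patterns: look for patterns among the top features to infer a theme.
- Consider positive and negative contributions: features with high positive loadings and features with high negative loadings contribute to the theme in opposite directions.
- Validate the theme: use domain knowledge or additional data analysis to validate the inferred theme.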
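The thematic-analysis steps in section 10.5 (rank features by loading, keep the ones with the largest absolute value, and note their sign) can be sketched without any PCA library. This is a minimal illustration, not the article's own code: the helper name `top_loadings`, the feature names, and the loading values are all hypothetical and chosen only to make the ranking easy to follow.

```python
def top_loadings(loadings, feature_names, n=2):
    """For each component, return the n features with the largest |loading|.

    The sign of each loading is kept so that positive and negative
    contributions can still be told apart when naming a theme.
    """
    themes = {}
    for i, row in enumerate(loadings, start=1):
        # Rank (feature, loading) pairs by absolute loading, largest first
        ranked = sorted(zip(feature_names, row),
                        key=lambda pair: abs(pair[1]), reverse=True)
        themes[f"PC{i}"] = ranked[:n]
    return themes


# Hypothetical loadings for two components over four features (illustrative values only)
features = ["latitude", "longitude", "accommodates", "bedrooms"]
loadings = [
    [0.70, -0.68, 0.15, 0.10],  # dominated by the two coordinates, with opposite signs
    [0.05, 0.12, 0.72, 0.68],   # dominated by the two size/capacity features
]

themes = top_loadings(loadings, features)
for pc, top in themes.items():
    print(pc, top)
```

In this made-up example the first component would be read as a location theme and the second as a size and capacity theme; on real data the same ranking would be applied to each row of `pca.components_`, and the inferred theme would still need to be checked against domain knowledge.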
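The 95% rule from section 10.1 comes down to a running sum over the explained variance ratios. The sketch below shows that arithmetic in plain Python under stated assumptions: the function name `components_for_threshold` and the ratio values are hypothetical, and the ratios are assumed to be sorted from largest to smallest, as PCA produces them.

```python
from itertools import accumulate


def components_for_threshold(explained_variance_ratio, threshold=0.95):
    """Return the smallest number of leading components whose cumulative
    explained variance reaches the threshold.

    If the threshold is never reached, all components are returned.
    """
    cumulative = list(accumulate(explained_variance_ratio))
    for k, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return k
    return len(cumulative)


# Hypothetical explained variance ratios (illustrative values only)
ratios = [0.40, 0.25, 0.15, 0.10, 0.06, 0.04]
k = components_for_threshold(ratios)
print(k)
```

With these made-up ratios the running total first passes 0.95 at the fifth component. scikit-learn can apply the same rule directly: passing a float between 0 and 1 as `n_components` to `PCA` (with the full SVD solver) keeps just enough components to explain that fraction of the variance.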


http://www.hkea.cn/news/14393669/