

news 2026/4/8 20:23:12


Discretizing continuous variables

Standardizing continuous variables

Classification

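Discretizing continuous variables

We often deal with a continuous feature that is highly non-linear and hard to fit in a model with only one coefficient.
In such a situation, a single coefficient cannot adequately explain the relationship between the feature and the target. Sometimes it is useful to band the values into discrete buckets.

First, let's create some fake data with the following code: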

import numpy as np
x = np.arange(0, 100)
x = x / 100.0 * np.pi * 4
y = x * np.sin(x / 1.764) + 20.1234

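Now we can create a DataFrame with the following code:

import pyspark.sql.types as typ
schema = typ.StructType([
    typ.StructField('continuous_var', typ.DoubleType(), False)
])
# `spark` is assumed to be an existing SparkSession
data = spark.createDataFrame([[float(e)] for e in y], schema=schema)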

Next, we use the QuantileDiscretizer model to split our continuous variable into five buckets (the numBuckets parameter):

import pyspark.ml.feature as ft
discretizer = ft.QuantileDiscretizer(numBuckets=5, inputCol='continuous_var', outputCol='discretized')
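To make the idea of quantile buckets concrete, here is a minimal standard-library sketch of equal-frequency bucketing on the same synthetic curve. The helper names `quantile_splits` and `bucketize` are made up for illustration, and the cut points are exact quantiles, whereas QuantileDiscretizer computes approximate ones, so the split values Spark picks may differ slightly.

```python
import bisect
import math
import statistics

# Same synthetic curve as above, rebuilt with the standard library only.
xs = [i / 100.0 * math.pi * 4 for i in range(100)]
ys = [x * math.sin(x / 1.764) + 20.1234 for x in xs]


def quantile_splits(values, num_buckets):
    """Return the num_buckets - 1 interior cut points (hypothetical helper)."""
    return statistics.quantiles(values, n=num_buckets, method='inclusive')


def bucketize(value, splits):
    """Map a value to a bucket index in 0 .. len(splits).

    A value equal to a cut point falls into the upper bucket.
    """
    return bisect.bisect_right(splits, value)


splits = quantile_splits(ys, 5)
buckets = [bucketize(v, splits) for v in ys]
counts = [buckets.count(b) for b in range(5)]

print(splits)  # four cut points
print(counts)  # number of observations that landed in each bucket
```

Because the cut points sit at the 20th, 40th, 60th and 80th percentiles, each bucket receives roughly the same number of observations, regardless of how unevenly the raw values are spread.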

Let's fit the discretizer and see what we get:

data_discretized = discretizer.fit(data).transform(data)

Our function now looks as follows:

We can now treat this variable as categorical and encode it with OneHotEncoder for future use.
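As a rough illustration of what one-hot encoding does to a bucket index, here is a small pure-Python sketch. The `one_hot` function is a made-up helper that returns a dense list, whereas PySpark's OneHotEncoder produces sparse vectors. Its `drop_last` flag mimics the encoder's dropLast parameter, which defaults to True in PySpark and represents the last category as an all-zeros vector.

```python
def one_hot(index, num_categories, drop_last=False):
    """Encode a bucket index as a list of 0.0/1.0 indicators (hypothetical helper)."""
    if not 0 <= index < num_categories:
        raise ValueError("index must be in the range 0 .. num_categories - 1")
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if position == index else 0.0 for position in range(size)]


print(one_hot(2, 5))                  # [0.0, 0.0, 1.0, 0.0, 0.0]
print(one_hot(4, 5, drop_last=True))  # [0.0, 0.0, 0.0, 0.0]
```

Dropping the last category avoids a redundant column: with five buckets, four indicators already determine the fifth.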

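Standardizing continuous variables

Standardizing continuous variables not only helps in understanding the relationships between features (interpreting the coefficients becomes easier), it also aids computational efficiency and guards against some numerical traps. Here is how to do it with PySpark ML.

First, we need to create a vector representation of our continuous variable, because it is only a single float and StandardScaler operates on vector columns:

vectorizer = ft.VectorAssembler(inputCols=['continuous_var'], outputCol='continuous_vec')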

Next, we build our normalizer and the pipeline. By setting withMean and withStd to True, the method removes the mean and scales the data to unit variance:

from pyspark.ml import Pipeline
normalizer = ft.StandardScaler(
    inputCol=vectorizer.getOutputCol(), outputCol='normalized',
    withMean=True, withStd=True
)
pipeline = Pipeline(stages=[vectorizer, normalizer])
data_standardized = pipeline.fit(data).transform(data)

This is what the transformed data looks like:

As you can see, the data now oscillates around 0 and has unit variance (the green line).
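The arithmetic behind this transformation is simple enough to sketch with the standard library alone: subtract the mean, then divide by the standard deviation. The `standardize` helper below is hypothetical, and it uses the sample (n - 1) standard deviation on the assumption that this matches what StandardScaler does.

```python
import math
import statistics

# Same synthetic curve as above, rebuilt with the standard library only.
xs = [i / 100.0 * math.pi * 4 for i in range(100)]
ys = [x * math.sin(x / 1.764) + 20.1234 for x in xs]


def standardize(values):
    """Centre values on 0 and scale them to unit sample variance (hypothetical helper)."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mean) / std for v in values]


zs = standardize(ys)

# The mean should be (numerically) zero and the standard deviation one.
print(round(statistics.fmean(zs), 6), round(statistics.stdev(zs), 6))
```

The shape of the curve is unchanged; only its location and scale differ, which is why the standardized series still oscillates the same way.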

Classification

So far we have only used the LogisticRegression model from PySpark ML. In this section, we use the RandomForestClassifier to model the infants' chances of survival once again.

Before we can do that, we need to cast the label column to DoubleType:

import pyspark.sql.functions as func
import pyspark.sql.types as typ
births = births.withColumn(
    'INFANT_ALIVE_AT_REPORT',
    func.col('INFANT_ALIVE_AT_REPORT').cast(typ.DoubleType())
)
births_train, births_test = births.randomSplit([0.7, 0.3], seed=666)
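A random split of this kind assigns rows by chance, so the resulting proportions are only approximately 70/30. The following sketch shows the idea in plain Python. The `random_split` function is a made-up stand-in that draws one random number per row; it is not Spark's implementation, and the same seed will not reproduce Spark's partitioning.

```python
import random


def random_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction (hypothetical helper)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train, test = random_split(rows, 0.7, seed=666)

# The sizes add up to 1000, but the split is close to 70/30 rather than exact.
print(len(train), len(test))
```

Fixing the seed makes the split reproducible across runs, which matters when comparing models on the same test set.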

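Now that the label is a double, we are ready to build our model. We proceed much as before, except that we reuse the encoder and featuresCreator from earlier in the chapter. The numTrees parameter specifies how many decision trees the random forest should contain, and the maxDepth parameter limits the depth of each tree:

import pyspark.ml.classification as cl
classifier = cl.RandomForestClassifier(numTrees=5, maxDepth=5, labelCol='INFANT_ALIVE_AT_REPORT')
pipeline = Pipeline(stages=[encoder, featuresCreator, classifier])
model = pipeline.fit(births_train)
test = model.transform(births_test)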

Let's now see how the RandomForestClassifier model performs compared to the LogisticRegression model:

import pyspark.ml.evaluation as ev
evaluator = ev.BinaryClassificationEvaluator(labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test, {evaluator.metricName: "areaUnderROC"}))
print(evaluator.evaluate(test, {evaluator.metricName: "areaUnderPR"}))
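For intuition about the areaUnderROC metric, it can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a tiny made-up sample. The `area_under_roc` helper and the labels and scores are illustrative only, and BinaryClassificationEvaluator computes the curve differently, so its values on real data may differ slightly.

```python
def area_under_roc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and model scores, purely for illustration.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

# 7 of the 9 positive/negative pairs are ordered correctly.
print(area_under_roc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect one, which is the scale on which the two models are compared.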

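Evaluating the random forest on the test set gives the following results:

As you can see, the results are better than those of the logistic regression model by roughly 3 percentage points. Let's test how well a model with a single tree performs: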

classifier = cl.DecisionTreeClassifier(maxDepth=5, labelCol='INFANT_ALIVE_AT_REPORT')
pipeline = Pipeline(stages=[encoder, featuresCreator, classifier])
model = pipeline.fit(births_train)
test = model.transform(births_test)
evaluator = ev.BinaryClassificationEvaluator(labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test, {evaluator.metricName: "areaUnderROC"}))
print(evaluator.evaluate(test, {evaluator.metricName: "areaUnderPR"}))

The preceding code gives the following results: