当前位置: 首页 > news >正文

广东网站建设价格宣传推广计划

广东网站建设价格,宣传推广计划,java视频网站开发技术,做题网站中计算多项式的值怎么做文章目录 一、使用 Java API 和 JavaRDD<Row> 在 Spark SQL 中向数据帧添加新列二、foreachPartition 遍历 Dataset三、Dataset 自定义 Partitioner四、Dataset 重分区并且获取分区数 一、使用 Java API 和 JavaRDD 在 Spark SQL 中向数据帧添加新列 在应用 mapPartition…

文章目录

      • 一、使用 Java API 和 JavaRDD<Row> 在 Spark SQL 中向数据帧添加新列
      • 二、foreachPartition 遍历 Dataset
      • 三、Dataset 自定义 Partitioner
      • 四、Dataset 重分区并且获取分区数

一、使用 Java API 和 JavaRDD 在 Spark SQL 中向数据帧添加新列

  在应用 mapPartition 函数后创建一个新的数据框:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;public class Handler implements Serializable {public void handler(Dataset<Row> sourceData) {Dataset<Row> rowDataset = sourceData.where("rowKey = 'abcdefg_123'").selectExpr("split(rowKey, '_')[0] as id","name","time").where("name = '小强'").orderBy(functions.col("id").asc(), functions.col("time").desc());FlatMapFunction<Iterator<Row>,Row> mapPartitonstoTime = rows->{Int count = 0; // 只能在每个分区内自增,不能保证全局自增String startTime = "";String endTime = "";List<Row> mappedRows=new ArrayList<Row>();while(rows.hasNext()){count++;Row next = rows.next();String id = next.getAs("id");if (count == 2) {startTime = next.getAs("time");endTime = next.getAs("time");}Row mappedRow= RowFactory.create(next.getString(0), next.getString(1), next.getString(2), endTime, startTime);mappedRows.add(mappedRow);}return mappedRows.iterator();};JavaRDD<Row> sensorDataDoubleRDD=rowDataset.toJavaRDD().mapPartitions(mapPartitonstoTime);StructType oldSchema=rowDataset.schema();StructType newSchema =oldSchema.add("startTime",DataTypes.StringType,false).add("endTime",DataTypes.StringType,false);System.out.println("The new schema is: ");newSchema.printTreeString();System.out.println("The old schema is: ");oldSchema.printTreeString();Dataset<Row> sensorDataDoubleDF=spark.createDataFrame(sensorDataDoubleRDD, newSchema);sensorDataDoubleDF.show(100, false);}
}

打印结果:

The new schema is: 
root|-- id: string (nullable = true)|-- name: string (nullable = true)|-- time: string (nullable = true)The old schema is: 
root|-- id: string (nullable = true)|-- name: string (nullable = true)|-- time: string (nullable = true)|-- startTime: string (nullable = true)|-- endTime: string (nullable = true)+-----------+---------+----------+----------+----------+
|id         |name     |time      |startTime |endTime   |
+-----------+---------+----------+----------+----------+
|abcdefg_123|xiaoqiang|1693462023|1693462023|1693462023|
|abcdefg_321|xiaoliu  |1693462028|1693462028|1693462028|
+-----------+---------+----------+----------+----------+

参考:
java - 使用 Java API 和 JavaRDD 在 Spark SQL 中向数据帧添加新列
java.util.Arrays$ArrayList cannot be cast to java.util.Iterator

二、foreachPartition 遍历 Dataset

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;import java.io.IOException;
import java.io.Serializable;
import java.util.Iterator;public class Handler implements Serializable {public void handler(Dataset<Row> sourceData) {JavaRDD<Row> dataRDD = rowDataset.toJavaRDD();dataRDD.foreachPartition(new VoidFunction<Iterator<Row>>() {@Overridepublic void call(Iterator<Row> rowIterator) throws Exception {while (rowIterator.hasNext()) {Row next = rowIterator.next();String id = next.getAs("id");if (id.equals("123")) {String startTime = next.getAs("time");// 其他业务逻辑}}}});// 转换为 lambda 表达式dataRDD.foreachPartition((VoidFunction<Iterator<Row>>) rowIterator -> {while (rowIterator.hasNext()) {Row next = rowIterator.next();String id = next.getAs("id");if (id.equals("123")) {String startTime = next.getAs("time");// 其他业务逻辑}}});}
}

三、Dataset 自定义 Partitioner

参考:spark 自定义 partitioner 分区 java 版

import org.apache.commons.collections.CollectionUtils;
import org.apache.spark.Partitioner;
import org.junit.Assert;import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;/*** Created by lesly.lai on 2018/7/25.*/
public class CuxGroupPartitioner extends Partitioner {private int partitions;/*** map<key, partitionIndex>* 主要为了区分不同分区*/private Map<Object, Integer> hashCodePartitionIndexMap = new ConcurrentHashMap<>();public CuxGroupPartitioner(List<Object> groupList) {int size = groupList.size();this.partitions = size;initMap(partitions, groupList);}private void initMap(int size, List<Object> groupList) {Assert.assertTrue(CollectionUtils.isNotEmpty(groupList));for (int i=0; i<size; i++) {hashCodePartitionIndexMap.put(groupList.get(i), i);}}@Overridepublic int numPartitions() {return partitions;}@Overridepublic int getPartition(Object key) {return hashCodePartitionIndexMap.get(key);}public boolean equals(Object obj) {if (obj instanceof CuxGroupPartitioner) {return ((CuxGroupPartitioner) obj).partitions == partitions;}return false;}
}

查看分区分布情况工具类:
(1)Scala:

import org.apache.spark.sql.{Dataset, Row}/*** Created by lesly.lai on 2017/12FeeTask/25.*/
class SparkRddTaskInfo {def getTask(dataSet: Dataset[Row]) {val size = dataSet.rdd.partitions.lengthprintln(s"==> partition size: $size " )import scala.collection.Iteratorval showElements = (it: Iterator[Row]) => {val ns = it.toSeqimport org.apache.spark.TaskContextval pid = TaskContext.get.partitionIdprintln(s"[partition: $pid][size: ${ns.size}] ${ns.mkString(" ")}")}dataSet.foreachPartition(showElements)}
}

(2)Java:

import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;public class SparkRddTaskInfo {public static void getTask(Dataset<Row> dataSet) {int size = dataSet.rdd().partitions().length;System.out.println("==> partition size:" + size);JavaRDD<Row> dataRDD = dataSet.toJavaRDD();dataRDD.foreachPartition((VoidFunction<Iterator<Row>>) rowIterator -> {List<String> mappedRows = new ArrayList<String>();int count = 0;while (rowIterator.hasNext()) {Row next = rowIterator.next();String id = next.getAs("id");String partitionKey = next.getAs("partition_key");String name = next.getAs("name");mappedRows.add(id + "/" + partitionKey+ "/" + name);}int pid = TaskContext.get().partitionId();System.out.println("[partition: " + pid + "][size: " + mappedRows.size() + "]" + mappedRows);});}
}

调用方式:

import com.vip.spark.db.ConnectionInfos;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;import java.util.List;
import java.util.stream.Collectors;/*** Created by lesly.lai on 2018/7/23.*/
public class SparkSimpleTestPartition {public static void main(String[] args) throws InterruptedException {SparkSession sparkSession = SparkSession.builder().appName("Java Spark SQL basic example").getOrCreate();// 原始数据集Dataset<Row> originSet = sparkSession.read().jdbc(ConnectionInfos.TEST_MYSQL_CONNECTION_URL, "people", ConnectionInfos.getTestUserAndPasswordProperties());originSet.selectExpr("split(rowKey, '_')[0] as id","concat(split(rowKey, '_')[0],'_',split(rowKey, '_')[1]) as partition_key","split(rowKey, '_')[1] as name".createOrReplaceTempView("people");// 获取分区分布情况工具类SparkRddTaskInfo taskInfo = new SparkRddTaskInfo();Dataset<Row> groupSet = sparkSession.sql(" select partition_key from people group by partition_key");List<Object> groupList = groupSet.javaRDD().collect().stream().map(row -> row.getAs("partition_key")).collect(Collectors.toList());// 创建pairRDD 目前只有pairRdd支持自定义partitioner,所以需要先转成pairRddJavaPairRDD pairRDD = originSet.javaRDD().mapToPair(row -> {return new Tuple2(row.getAs("partition_key"), row);});// 指定自定义partitionerJavaRDD javaRdd = pairRDD.partitionBy(new CuxGroupPartitioner(groupList)).map(new Function<Tuple2<String, Row>, Row>(){@Overridepublic Row call(Tuple2<String, Row> v1) throws Exception {return v1._2;}});Dataset<Row> result = sparkSession.createDataFrame(javaRdd, originSet.schema());// 打印分区分布情况taskInfo.getTask(result);}
}

四、Dataset 重分区并且获取分区数

        System.out.println("1-->"+rowDataset.rdd().partitions().length);System.out.println("1-->"+rowDataset.rdd().getNumPartitions());Dataset<Row> hehe = rowDataset.coalesce(1);System.out.println("2-->"+hehe.rdd().partitions().length);System.out.println("2-->"+hehe.rdd().getNumPartitions());

运行结果:

1-->29
1-->29
2-->2
2-->2

注意:在使用 repartition() 时两次打印的结果相同:

print(rdd.getNumPartitions())
rdd.repartition(100)
print(rdd.getNumPartitions())

产生上述问题的原因有两个:
  首先 repartition() 是惰性求值操作,需要执行一个 action 操作才可以使其执行。
  其次,repartition() 操作会返回一个新的 rdd,并且新的 rdd 的分区已经修改为新的分区数,因此必须使用返回的 rdd,否则将仍在使用旧的分区。
  修改为:rdd2 = rdd.repartition(100)

参考:repartition() is not affecting RDD partition size

http://www.hkea.cn/news/223437/

相关文章:

  • 增城做网站要多少钱推广普通话手抄报
  • 石家庄网站系统开发智能搜索引擎
  • 迅速网站网络营销平台推广方案
  • 学前端要逛那些网站微信引流主动被加软件
  • 韩国flash网站免费手机网站建站平台
  • 东莞做网站卓诚网络昆明长尾词seo怎么优化
  • WordPress个性萌化插件郑州seo优化哪家好
  • 专业手机移动网站建设免费的seo优化
  • 西安网站建设王永杰域名注册 万网
  • 网站营销优化方案北京做的好的seo公司
  • 企业网站排名提升软件优化南宁seo优化
  • 创意合肥网站建设杭州seo公司排名
  • 网站专题页是什么中国十大关键词
  • 五月天做网站网络策划与营销
  • 高校网站如何建设论文谷歌官网下载
  • 做网站内容软件个人网站怎么做
  • 收废铁的做网站有优点吗海南百度推广开户
  • wordpress 二维码插件下载信阳搜索引擎优化
  • 个人网站二级域名做淘宝客企业推广策略
  • 厦门做网站seo的seo服务公司招聘
  • 安徽池州做企业网站百度搜索官方网站
  • 芜湖商城网站建设青岛百度快速优化排名
  • 我找伟宏篷布我做的事ko家的网站seoul怎么读
  • 即墨做网站优书网首页
  • 网站建设实践报告3000字放单平台
  • 中华人民共和国城乡住房建设厅网站seo技术外包
  • 网站做销售是斤么工作东莞网站营销推广
  • 做网站现在还行吗宁德市疫情
  • 响应式网站首页百度搜索资源
  • 工人找工作哪个网站好福州百度seo