1.8 Getting Started with Spark Programming
1.8.1 Creating a Spark Project in IDEA
Note: the project-creation steps are omitted here; they were covered in the Scala chapter, so we assume the project is already created. Import the following dependencies in the POM file.

<!-- Common properties -->
<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.12.8</scala.version>
    <spark.version>3.1.2</spark.version>
    <hadoop.version>3.2.1</hadoop.version>
    <scala.compat.version>2.12</scala.compat.version>
</properties>
<!-- Common dependencies -->
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>
<!-- Build configuration -->
<build>
    <!-- Source directory -->
    <sourceDirectory>src/main/scala</sourceDirectory>
    <!-- Build plugins -->
    <plugins>
        <!-- Compiles Scala sources to class files -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- Packages the application -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <!-- Exclude signature files from the jar to avoid packaging errors
                             caused by duplicate dependency references -->
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <!-- Main-class entry point for an executable jar -->
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass></mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
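With this build configuration in place, packaging is a standard Maven run. A minimal sketch follows; the jar name hn-bigdata-1.0.jar is inferred from the original-hn-bigdata-1.0.jar used in section 1.8.3 and actually depends on your artifactId and version, and the user/host in the copy step are assumptions:

# Build the project; the shade plugin runs in the package phase and
# keeps the unshaded jar under an "original-" prefix in target/
mvn clean package

# Copy the jar to a cluster node for submission (user and host assumed)
scp target/original-hn-bigdata-1.0.jar root@qianfeng01:/home/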
1.8.2 WordCount in Scala
package com.qianfeng.sparkcore
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Word count with Spark
 */
object Demo01_SparkWC {
  def main(args: Array[String]): Unit = {
    // 1. Create the Spark context. local[n]: n is the number of CPU cores to use;
    //    * means all available cores. If you package the jar to run on the cluster,
    //    comment out .setMaster().
    val conf = new SparkConf().setAppName("spark-wc").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 2. Load the input data
    val rdd = sc.textFile("/Users/liyadong/data/sparkdata/test.txt")

    // 3. Transform the data (the comparison operator was garbled in the source; ">" is assumed)
    val sumRDD = rdd.filter(_.length > 10)
      .flatMap(_.split("\t"))
      .map((_, 1))
      .reduceByKey(_ + _)

    // 4. Print the results
    println(sumRDD.collect().toBuffer)
    sumRDD.foreach(println(_))

    // 5. Stop the SparkContext
    sc.stop()
  }
}
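Note that the demo above hardcodes a local input path and only prints the result, while the spark-submit command in 1.8.3 passes HDFS input and output paths as arguments. A sketch of a cluster-friendly variant that uses those arguments (the object name and structure here are illustrative, not from the original):

package com.qianfeng.sparkcore

import org.apache.spark.{SparkConf, SparkContext}

/**
 * WordCount variant that reads input/output paths from args,
 * matching the spark-submit invocation in 1.8.3.
 */
object Demo02_SparkWCCluster {
  def main(args: Array[String]): Unit = {
    // No setMaster() here: the master is supplied by spark-submit (--master yarn)
    val conf = new SparkConf().setAppName("spark-wc-cluster")
    val sc = new SparkContext(conf)

    // e.g. args(0) = hdfs://qianfeng01:9820/words, args(1) = hdfs://qianfeng01:9820/output/0901
    val Array(input, output) = args

    sc.textFile(input)
      .flatMap(_.split("\t"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(output) // the output directory must not already exist

    sc.stop()
  }
}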
1.8.3 Packaging and Submitting to the Cluster

Submit the job from the bin directory of the Spark installation:
spark-submit \
--class com.qianfeng.sparkcore.Demo01_SparkWC \
--master yarn \
--deploy-mode client \
/home/original-hn-bigdata-1.0.jar hdfs://qianfeng01:9820/words hdfs://qianfeng01:9820/output/0901
Note: if a data file already exists on the HDFS cluster, use it directly; otherwise upload one from the Linux filesystem with "hdfs dfs -put /home/words /". After the job finishes, inspect the result on the cluster with "hdfs dfs -tail /output/0901/*".
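For reference, the same job can also be submitted in cluster deploy mode, where the driver runs inside YARN rather than on the submitting machine; as the code comment in 1.8.2 warns, .setMaster() must be removed from the code first, since a programmatically set master overrides the --master flag. A sketch using the same paths as above:

spark-submit \
--class com.qianfeng.sparkcore.Demo01_SparkWC \
--master yarn \
--deploy-mode cluster \
/home/original-hn-bigdata-1.0.jar hdfs://qianfeng01:9820/words hdfs://qianfeng01:9820/output/0901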