
    Spark Big Data Analysis and Practice: HDFS File Operations

    1. Installing Hadoop and Spark

    The detailed installation steps are covered in my earlier posts; you can follow the links below.

    Setting Up a Basic Linux Environment (CentOS 7) - Installing Hadoop
    Setting Up a Basic Linux Environment (CentOS 7) - Installing Scala and Spark

    2. Starting Hadoop and Spark

    Check the processes on the three nodes:

    (Screenshots omitted: process listings on the master, slave1, and slave2 nodes.)

    The spark-shell command-line interface and the Spark web UI pages:

    (Screenshots omitted.)

    3. Common HDFS Operations

    (1) Start Hadoop and create the user directory "/user/hadoop" in HDFS.

    Shell commands:

    [root@master ~]# hadoop fs -mkdir /user/hadoop
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    

    (Run screenshots omitted.)
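    The SLF4J lines above are only warnings: the command's classpath contains an slf4j-log4j12 binding from both the Hadoop and the HBase lib directories, so SLF4J reports the duplicates and then picks one. The commands still run normally. A possible way to silence the noise (a sketch based on the jar paths printed in the warning; it may affect HBase's own logging, so back the jar up rather than deleting it) is shown below, together with two handy variants of the mkdir step: -p creates missing parent directories, and -ls verifies the result.

    # Optional: move the duplicate SLF4J binding out of the HBase lib directory
    mv /usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar \
       /usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar.bak

    # -p creates any missing parent directories; -ls confirms the new directory
    hadoop fs -mkdir -p /user/hadoop
    hadoop fs -ls /user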

    (2) Create a text file named test.txt in the local Linux directory "/home/hadoop", type some content into it, and upload it to the HDFS directory "/user/hadoop".

    Shell commands:

    [root@master ~]# vim /usr/hadoop/test.txt
    [root@master ~]# hadoop fs -put /usr/hadoop/test.txt  /user/hadoop
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    [root@master ~]# hadoop fs -cat /user/hadoop/test.txt
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    I love hadoop and spark!
    John Zhuang!
    

    (Run screenshots omitted.)
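    To double-check the upload, list the target directory. If test.txt already exists in HDFS, -put refuses to overwrite it; on Hadoop 2.7 the -f option forces an overwrite (hadoop fs -help put lists the exact options for your version).

    # Confirm the file arrived and check its size
    hadoop fs -ls /user/hadoop
    # Re-upload, overwriting an existing copy
    hadoop fs -put -f /usr/hadoop/test.txt /user/hadoop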

    (3) Download the test.txt file from the HDFS directory "/user/hadoop" to the local Linux directory "/home/hadoop/下载".

    Shell commands:

    [root@master ~]# mkdir /usr/hadoop/download
    [root@master ~]# hadoop fs -get /user/hadoop/test.txt /usr/hadoop/download/
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    [root@master ~]# cat /usr/h
    hadoop/ hbase/  hive/   
    [root@master ~]# cat /usr/hadoop/test.txt 
    I love hadoop and spark!
    John Zhuang!
    

    (Run screenshots omitted.)
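    Note that the transcript above downloads to /usr/hadoop/download rather than the "/home/hadoop/下载" directory named in the task. A sketch that follows the task's wording (assuming the local directory is created first) looks like this; -copyToLocal is an equivalent alternative to -get.

    # Create the target directory named in the task, then download into it
    mkdir -p "/home/hadoop/下载"
    hadoop fs -get /user/hadoop/test.txt "/home/hadoop/下载/"
    # Equivalent alternative (use one or the other; it fails if the local file already exists)
    # hadoop fs -copyToLocal /user/hadoop/test.txt "/home/hadoop/下载/"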

    (4) Print the contents of the test.txt file in the HDFS directory "/user/hadoop" to the terminal.

    Shell commands:

    [root@master ~]# hadoop fs -text /user/hadoop/test.txt
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    I love hadoop and spark!
    John Zhuang!
    

    (Run screenshots omitted.)
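    For a plain text file, -text prints exactly the same output as -cat. The difference is that -text also decodes compressed files and SequenceFiles before printing, so it is the safer choice when the file format is unknown. The compressed path below is only a hypothetical example.

    # Same output as -text for an uncompressed file
    hadoop fs -cat /user/hadoop/test.txt
    # -text would also decompress, e.g. a gzip file (hypothetical path)
    # hadoop fs -text /user/hadoop/data.txt.gz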

    (5) Create a subdirectory named input under "/user/hadoop" in HDFS, and copy the test.txt file from "/user/hadoop" into "/user/hadoop/input".

    Shell commands:

    [root@master ~]# hadoop fs -mkdir /user/hadoop/input
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    [root@master ~]# hadoop fs -cp /user/hadoop/test.txt /user/hadoop/input
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    

    (Run screenshots omitted.)
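    -cp copies data entirely inside HDFS, without staging it on the local file system. A quick listing confirms the copy landed in the new subdirectory.

    # Verify the copy
    hadoop fs -ls /user/hadoop/input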

    (6) Delete the test.txt file from the HDFS directory "/user/hadoop", and delete the input subdirectory under "/user/hadoop" together with all of its contents.

    Shell commands:

    [root@master ~]# hadoop fs -rm -r /user/hadoop/input/
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    21/03/18 17:22:54 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
    Deleted /user/hadoop/input
    [root@master ~]# hadoop fs -ls /user/hadoop/
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/hbase/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    Found 1 items
    -rw-r--r--   2 root supergroup         38 2021-03-18 17:14 /user/hadoop/test.txt
    

    (Run screenshots omitted.)
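    The log line "Deletion interval = 0 minutes" shows that the HDFS trash is disabled on this cluster, so -rm -r removes the directory permanently. Also note that the final listing still shows test.txt: the transcript only deletes the input directory, so one more command is needed to finish the task.

    # Remove the remaining file to complete task (6)
    hadoop fs -rm /user/hadoop/test.txt
    # General form for immediate, permanent deletion when the trash is enabled:
    #   hadoop fs -rm -r -skipTrash <path>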

    4. Reading Data from File Systems with Spark

    (1) In spark-shell, read the local Linux file "/home/hadoop/test.txt" and count the number of lines in it.

    Shell commands:

    [root@master spark-2.4.0-bin-hadoop2.7]# spark-shell --master spark://master:7077 --executor-memory 512M
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://master:4040
    Spark context available as 'sc' (master = spark://master:7077, app id = app-20210318230638-0000).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
          /_/
             
    Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> var file=sc.textFile("file:///usr/hadoop/test.txt")
    file: org.apache.spark.rdd.RDD[String] = file:///usr/hadoop/test.txt MapPartitionsRDD[1] at textFile at <console>:24
    
    scala> val length = file.count()
    length: Long = 2
    

    (Run screenshots omitted.)
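    One pitfall with file:/// paths: because spark-shell is attached to the standalone cluster (--master spark://master:7077), the executors that read the file may run on the worker nodes, so the local file must exist at the same path on every node that runs an executor. If the count fails with a FileNotFoundException, copying the file to the workers fixes it (slave1 and slave2 are the worker hostnames assumed from section 2).

    # Make the local file available on every worker node
    scp /usr/hadoop/test.txt slave1:/usr/hadoop/
    scp /usr/hadoop/test.txt slave2:/usr/hadoop/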

    (2) In spark-shell, read the HDFS file "/user/hadoop/test.txt" (create it first if it does not exist) and count the number of lines in it.

    Shell commands:

    [root@master ~]# spark-shell --master spark://master:7077 --executor-memory 512M
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://master:4040
    Spark context available as 'sc' (master = spark://master:7077, app id = app-20210318231630-0000).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
          /_/
             
    Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> var hdfsfile=sc.textFile("/user/hadoop/test.txt")
    hdfsfile: org.apache.spark.rdd.RDD[String] = /user/hadoop/test.txt MapPartitionsRDD[1] at textFile at <console>:24
    
    scala> val length = hdfsfile.count()
    length: Long = 2         
    

    (Run screenshots omitted.)
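    A path like "/user/hadoop/test.txt" carries no scheme, so it is resolved against fs.defaultFS from the Hadoop configuration, which is why it reads from HDFS here. The command below prints the default file system URI so you can write the fully qualified form if you prefer (the 9000 port in the comment is only an assumption; check your core-site.xml).

    # Print the default file system URI, e.g. hdfs://master:9000
    hdfs getconf -confKey fs.defaultFS
    # The fully qualified path would then look like:
    #   hdfs://master:9000/user/hadoop/test.txt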

    (3) Write a standalone application that reads the HDFS file "/user/hadoop/test.txt" (create it first if it does not exist) and counts the number of lines in it; use sbt to compile and package the application into a JAR, and submit the JAR to Spark with spark-submit.

    Tip: if you have not yet set up a Spark project in IDEA, see the following post:

    Building a Spark Project with Maven in IDEA: https://blog.csdn.net/weixin_47580081/article/details/115435536

    Source code:

    package com.John.Sparkstudy.SparkTest.Test01
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    /**
     * @author John
     * @Date 2021/3/19 7:27
     */
    object CountTest01 {
      def main(args: Array[String]): Unit = {
        // Create the Spark configuration and context
        val conf = new SparkConf().setAppName("test_Count")
        val sc = new SparkContext(conf)
    
        // Load the file from HDFS
        val file = sc.textFile("hdfs:///user/hadoop/test.txt")
    
        // Count the lines in the file
        val num = file.count()
    
        // Print the result
        println("File " + file.name + " contains " + num + " lines")
      }
    
    }
    

    Shell commands:

    [root@master jarFile]# rz -E
    rz waiting to receive.
    [root@master jarFile]# ll
    总用量 8
    -rw-r--r--. 1 root root 6394 3月  19 07:53 original-SparkTest-1.0-SNAPSHOT.jar
    [root@master jarFile]# cd /usr/spark/spark-2.4.0-bin-hadoop2.7/
    [root@master spark-2.4.0-bin-hadoop2.7]# bin/spark-submit --class com.John.Sparkstudy.SparkTest.Test01.CountTest01 --master spark://master:7077 /usr/testFile/jarFile/original-SparkTest-1.0-SNAPSHOT.jar
    

    (Run screenshots omitted.)
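    The JAR above was built with Maven in IDEA and uploaded with rz. The assignment wording asks for sbt instead; a minimal sbt route might look like the sketch below. The versions match the Spark 2.4.0 / Scala 2.11.12 shown earlier, and the output JAR name is sbt's default naming convention, so treat both as assumptions to adjust for your project.

    # build.sbt (sketch)
    #   name := "SparkTest"
    #   version := "1.0"
    #   scalaVersion := "2.11.12"
    #   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"

    # Package and submit
    sbt package
    spark-submit --class com.John.Sparkstudy.SparkTest.Test01.CountTest01 \
      --master spark://master:7077 \
      target/scala-2.11/sparktest_2.11-1.0.jar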
