Hive实战项目——影音网站数据分析

0.需求描述

统计 谷粒视频 网站的常规指标,各种 TopN 指标:

--统计视频观看数 Top10
--统计视频类别热度 Top10
--统计视频观看数 Top20 所属类别
--统计视频观看数 Top50 所关联视频的所属类别 Rank
--统计每个类别中的视频热度 Top10
--统计每个类别中视频流量 Top10
--统计上传视频最多的用户 Top10 以及他们上传的视频
--统计每个类别视频观看数 Top10

1.数据预处理

对将要处理的数据先进行一次数据清洗,过滤掉不合格的脏数据,同时调整数据的格式


image.png
1.1 将影片类型一栏->People & Blogs 中的空格去掉
1.2 将第9列以后的数据(关联的其他影片ID)用“&“符号连接 (因为在建表时这两列都需要设置为array类型,所以要统一分隔字符)
1.3 建立maven工程

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.victor</groupId>
    <artifactId>ETLUtils</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <build>
    <plugins><plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin></plugins></build>
    
    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-core</artifactId>
            <version>2.8.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.8</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
    </dependencies>
</project>

ETLUtilMapper.java

package com.victor.ETLUtils;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ETLUtilsMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
//  PF_ZMlw4rHs DisneyUnleashed 729 Film&Animation 288 96 3.5 2 0 1 xbSFrHzFQ0 4
//  VP4qSjDNQs RJgGeYiJrj0

    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context)
            throws IOException, InterruptedException {
        String line = value.toString();

        // 将第四个字段中的空格去掉
        String newline = util(line);
        if (StringUtils.isNotBlank(newline)) {
            k.set(newline);
            context.write(k, NullWritable.get());
        }
    }

    private String util(String line) {

        StringBuffer sb = new StringBuffer();
        String[] fields = line.split("\t");
        if (fields.length < 9) {
            return null;
        }
        fields[3] = fields[3].replaceAll(" ", "");
        for (int i = 0; i < fields.length; i++) {
            // 如果字段小于9 ,丢弃

            if (i < 9) {
                sb.append(fields[i] + "\t");

            } else {
                if (fields.length-1 == i) {
                    sb.append(fields[i]);
                } else {
                    sb.append(fields[i] + "&");
                }
            }

        }

        return sb.toString();
    }

}

ETLUtilDriver.java

package com.victor.ETLUtils;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.sun.jersey.core.impl.provider.entity.XMLJAXBElementProvider.Text;

public class ETLUtilsDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        //args = new String[] {"2.txt","output"};
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(ETLUtilsDriver.class);
        job.setMapperClass(ETLUtilsMapper.class);
        
        
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        
        
        
        job.setNumReduceTasks(0);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        boolean result = job.waitForCompletion(true);
        System.exit(result?0:1);
    }

}
1.4 启动集群,将工程打成Jar包,并与数据文件一并上传至集群中(省略)
1.5 清洗数据
[hadoop@hadoop131 software]$ yarn jar ETLUtils-0.0.1-SNAPSHOT.jar com.victor.ETLUtils.ETLUtilsDriver /videotable /videoOutput

处理前数据


image.png

处理后数据


image.png

2.Hive建表

思路:先分别建立video和user 的原始表,再建立video和user的ORC压缩表以提高传输执行效率

2.1 建立原始表

gulivideo_ori

CREATE TABLE gulivideo_ori(
videoId string, 
uploader string, 
age int, 
category array<string>, 
length int, 
views int, 
rate float, 
ratings int, 
comments int,
relatedId array<string>
)
row format delimited 
fields terminated by '\t'
collection items terminated by '&';

guli_user_ori

create table guli_user_ori(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\t"
stored as textfile;
2.2 加载数据
load data inpath '/videotable' into table gulivideo_ori;
load data inpath '/usertable' into table guli_user_ori;
2.3 建立ORC表
create table gulivideo_orc(
videoId string, 
uploader string, 
age int, 
category array<string>, 
length int, 
views int, 
rate float, 
ratings int, 
comments int,
relatedId array<string>
)
row format delimited 
fields terminated by '\t'
collection items terminated by '&'
stored as orc;
create table guli_user_orc(
uploader string,
videos int,
friends int)
row format delimited
fields terminated by "\t"
stored as orc;
2.4 向ORC表加载数据
insert into gulivideo_orc
select * from gulivideo_ori;
insert into guli_user_orc
select * from guli_user_ori;

3. 业务分析

3.1 统计视频观看数 Top10
select videoid,views 
from gulivideo_orc 
order by views desc 
limit 10;
+--------------+-----------+--+
|   videoid    |   views   |
+--------------+-----------+--+
| dMH0bHeiRNg  | 42513417  |
| 0XxI-hvPRRA  | 20282464  |
| 1dmVU08zVpA  | 16087899  |
| RB-wUgnyGv0  | 15712924  |
| QjA5faZF1A8  | 15256922  |
| -_CSo1gOd48  | 13199833  |
| 49IDp76kjPw  | 11970018  |
| tYnn51C3X_w  | 11823701  |
| pv5zWaTEVkI  | 11672017  |
| D2kJZOfq7zk  | 11184051  |
+--------------+-----------+--+
3.2 统计视频类别热度 Top10

3.2.1 将表中category字段数组行转列
select views,hot from
gulivideo_orc lateral view explode (category) category_t as hot;t1

3.2.2 统计每个类别的观看总数
select hot,count(*)
from t1
group by hot;t2

3.2.3 获取观看前10的类别
select hot,total_view
from ()t2
order by total_view desc limit 10;

SELECT   hot, 
         total_view 
FROM     ( 
                  SELECT   hot, 
                           Count(*) total_view 
                  FROM     ( 
                                  SELECT views, 
                                         hot 
                                  FROM   gulivideo_orc lateral view explode (category) category_t as hot)t1
                  GROUP BY hot)t2 
ORDER BY total_view DESC limit 10;
+----------------+-------------+--+
| category_name  | total_view  |
+----------------+-------------+--+
| Music          | 179049      |
| Entertainment  | 127674      |
| Comedy         | 87818       |
| Film           | 73293       |
|  Animation     | 73293       |
| Sports         | 67329       |
|  Games         | 59817       |
| Gadgets        | 59817       |
| People         | 48890       |
|  Blogs         | 48890       |
+----------------+-------------+--+
3.3 统计视频观看数 Top20 所属类别

3.3.1 观看数top20视频
select views,category
from gulivideo_orc
order by views desc
limit 20;t1
3.3.2 所属类别
select views,category
from t1 lateral view explode(category)ct as category_name;

SELECT views, 
       category_name 
FROM   ( 
                SELECT   views, 
                         category 
                FROM     gulivideo_orc 
                ORDER BY views DESC limit 20)t1 lateral VIEW explode(category)ct AS category_name ;
+-----------+----------------+--+
|   views   | category_name  |
+-----------+----------------+--+
| 42513417  | Comedy         |
| 20282464  | Comedy         |
| 16087899  | Entertainment  |
| 15712924  | Entertainment  |
| 15256922  | Music          |
| 13199833  | People         |
| 13199833  |  Blogs         |
| 11970018  | Comedy         |
| 11823701  | Music          |
| 11672017  | Music          |
| 11184051  | People         |
| 11184051  |  Blogs         |
| 10786529  | Entertainment  |
| 10334975  | Entertainment  |
| 10107491  | Comedy         |
| 9579911   | Music          |
| 9566609   | Comedy         |
| 8825788   |  UNA           |
| 7533070   | Music          |
| 7456875   | Entertainment  |
| 7066676   | Comedy         |
| 6952767   | Entertainment  |
+-----------+----------------+--+
3.4 统计视频观看数 Top50 所关联视频的所属类别 Rank

3.4.1 观看数top10,关联视频
select
videoid,views,category,relatedid
from
gulivideo_orc
order by
views desc
limit 50;t1

3.4.2 关联视频行转列
select distinct(r_id)
from
t1 lateral view explode(relatedid) relatedtable as r_id;t2

3.4.3 视频所属类别
select r_id,g.category
from
t2.join gulivideo_orc g on r_id = g.videoid;t3

select r_id,g.category
from
t2 join gulivideo_orc g on r_id = g.videoid;t3

3.4.4 类别展开
select category_name
from ()t3 lateral view explode(category)t as category_name;t4

3.4.5 统计类别个数
select category_name,count(*) hot
from
t4 group by category_name,t_sum;t5

3.4.6 所属类别排名
select * from
t5
order by hot desc;t6

SELECT   * 
FROM     ( 
                  SELECT   category_name, 
                           Count(*) hot 
                  FROM     ( 
                                  SELECT category_name 
                                  FROM   ( 
                                                SELECT r_id, 
                                                       g.category 
                                                FROM   ( 
                                                                       SELECT DISTINCT(r_id) 
                                                                       FROM            ( 
                                                                                                SELECT   videoid,
                                                                                                         views,
                                                                                                         category,
                                                                                                         relatedid
                                                                                                FROM     gulivideo_orc
                                                                                                ORDER BY views DESC limit 50)t1 lateral VIEW explode(relatedid) relatedtable as r_id)t2
                                                JOIN   gulivideo_orc g 
                                                ON     r_id = g.videoid)t3 lateral VIEW explode(category)t as category_name )t4
                  GROUP BY category_name)t5 
ORDER BY hot DESC;
+-------------------+---------+--+
| t6.category_name  | t6.hot  |
+-------------------+---------+--+
| Comedy            | 14      |
| Entertainment     | 11      |
| Music             | 10      |
| Film              | 3       |
|  Animation        | 3       |
| People            | 2       |
|  Blogs            | 2       |
| Travel            | 1       |
| Sports            | 1       |
| Howto             | 1       |
|  Places           | 1       |
|  DIY              | 1       |
+-------------------+---------+--+
3.5 统计每个类别中的视频热度 Top10
SELECT category_name, 
       views,
       videoid 
FROM   ( 
                SELECT   category_name, 
                         videoid, 
                         views,
                         Rank() OVER(partition BY category_name order by views) rank_no 
                FROM     ( 
                                  SELECT   category_name, 
                                           videoid, 
                                           views 
                                  FROM     gulivideo_orc lateral view explode(category) t as category_name
                                  GROUP BY category_name, 
                                           videoid, 
                                           views 
                                  ORDER BY category_name, 
                                           views DESC)t1 
                ORDER BY category_name)t2 
WHERE  rank_no<=10;
3.6 统计每个类别中视频流量 Top10
3.7 统计上传视频最多的用户 Top10 以及他们上传的视频

1.找出上传前10的用户
select uploader,
videos
from
guli_user_orc
order by videos desc
limit 10;t1

2.找到上传的所有视频
select t1.uploader,
videoid,
views
from
()t1 join gulivideo_orc g
on
t.uploader=g.uploader
order by uploader,views desc; t2

SELECT t1.uploader, 
       videoid, 
       views 
FROM   (SELECT uploader, 
               videos 
        FROM   guli_user_orc 
        ORDER  BY videos DESC 
        LIMIT  10)t1 
       JOIN gulivideo_orc g 
         ON t1.uploader = g.uploader 
order by uploader,views desc; 
3.8 统计每个类别视频观看数 Top10

1.统计所有类别对应的视频
select
category_name,videoid,views
from
gulivideo_orc
lateral view explode(category) t as category_name;t1

2.对每个类观看数排名
select *,rank() over(partition by category_name order by views desc) rank_no
from
()t1;t2

3.取前十
select * from
()t2
where rank_no<=10;

SELECT * 
FROM   ( 
                SELECT   *, 
                         Rank() OVER(partition BY category_name ORDER BY views DESC) rank_no 
                FROM     ( 
                                SELECT category_name, 
                                       videoid, 
                                       views 
                                FROM   gulivideo_orc lateral view explode(category) t as category_name)t1 )t2
WHERE  rank_no<=10;
+-------------------+--------------+-----------+-------------+--+
| t2.category_name  |  t2.videoid  | t2.views  | t2.rank_no  |
+-------------------+--------------+-----------+-------------+--+
|  Animals          | 2GWPOPSXGYI  | 3660009   | 1           |
|  Animals          | xmsV9R8FsDA  | 3164582   | 2           |
|  Animals          | 12PsUW-8ge4  | 3133523   | 3           |
|  Animals          | OeNggIGSKH8  | 2457750   | 4           |
|  Animals          | WofFb_eOxxA  | 2075728   | 5           |
|  Animals          | AgEmZ39EtFk  | 1999469   | 6           |
|  Animals          | a-gW3RbJd8U  | 1836870   | 7           |
|  Animals          | 8CL2hetqpfg  | 1646808   | 8           |
|  Animals          | QmroaYVD_so  | 1645984   | 9           |
|  Animals          | Sg9x5mUjbH8  | 1527238   | 10          |
|  Animation        | sdUUx5FdySs  | 5840839   | 1           |
|  Animation        | 6B26asyGKDo  | 5147533   | 2           |
|  Animation        | H20dhY01Xjk  | 3772116   | 3           |
|  Animation        | 55YYaJIrmzo  | 3356163   | 4           |
|  Animation        | JzqumbhfxRo  | 3230774   | 5           |
|  Animation        | eAhfZUZiwSE  | 3114215   | 6           |
|  Animation        | h7svw0m-wO0  | 2866490   | 7           |
|  Animation        | tAq3hWBlalU  | 2830024   | 8           |
|  Animation        | AJzU3NjDikY  | 2569611   | 9           |
|  Animation        | ElrldD02if0  | 2337238   | 10          |

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 160,277评论 4 364
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 67,777评论 1 298
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 109,946评论 0 245
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 44,271评论 0 213
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 52,636评论 3 288
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 40,767评论 1 221
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 31,989评论 2 315
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 30,733评论 0 204
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 34,457评论 1 246
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 30,674评论 2 249
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 32,155评论 1 261
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 28,518评论 3 258
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 33,160评论 3 238
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 26,114评论 0 8
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 26,898评论 0 198
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 35,822评论 2 280
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 35,705评论 2 273

推荐阅读更多精彩内容