Still using user variables for multi-dimensional grouped ranking? You're out of date!

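Contents
I. What is a window function
II. Ranking within groups using window functions
III. An efficient paging scheme for batch processing based on window functions
IV. Paging with a composite primary key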


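I. What is a window function
A window function (also called an analytic or OLAP function) is a common kind of OLAP function. Unlike an aggregate function, which collapses each group into a single row, a window function computes a value for every row over a set of related rows, and each window can be partitioned and ordered along its own dimensions. This greatly simplifies the SQL needed for complex analytical scenarios. Most mainstream standalone databases support window functions; TiDB (since v3.0) and MySQL (since 8.0) support them as well.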
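To make the contrast with aggregate functions concrete, here is a minimal sketch. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for TiDB/MySQL, and the table name score is hypothetical; the rows are a small subset of the score data used later in this article.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for TiDB/MySQL; "score" is a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE score (stuname TEXT, course TEXT, courscore REAL)")
conn.executemany(
    "INSERT INTO score VALUES (?, ?, ?)",
    [
        ("SpongeBob", "Physics", 65.0),
        ("MonkeyDLuffy", "Physics", 63.5),
        ("PatrickStar", "Physics", 6.0),
        ("SpongeBob", "English", 79.0),
        ("PatrickStar", "English", 60.0),
    ],
)

# Aggregate function: each course collapses into a single output row.
aggregated = conn.execute(
    "SELECT course, MAX(courscore) FROM score GROUP BY course ORDER BY course"
).fetchall()

# Window function: every input row is kept and annotated with
# the maximum of its course and its rank inside that course.
windowed = conn.execute(
    """
    SELECT stuname, course, courscore,
           MAX(courscore) OVER (PARTITION BY course) AS course_max,
           RANK() OVER (PARTITION BY course ORDER BY courscore DESC) AS rank_
    FROM score
    ORDER BY course, rank_
    """
).fetchall()

print(aggregated)
for row in windowed:
    print(row)
```

The aggregate query returns one row per course, while the window query returns all five rows, each carrying its own rank.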

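II. Ranking within groups using window functions
Grouping rows and then ranking them within each group is a typical use case for window functions.
First we create a student score table containing the student name, student number, course, and course score, and load some data into it: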

mysql> select * from class_score;
+--------------+-----------+-------------------------+-----------+
| stuname      | stuno     | course                  | courscore |
+--------------+-----------+-------------------------+-----------+
| SpongeBob    | 201903001 | LinearAlgebra           |      60.5 |
| SpongeBob    | 201903001 | AdvancedMathematics     |      55.0 |
| SpongeBob    | 201903001 | Physics                 |      65.0 |
| SpongeBob    | 201903001 | ProbabilityTheory       |      87.0 |
| SpongeBob    | 201903001 | PrincipleofStatistics   |      90.0 |
| SpongeBob    | 201903001 | OperatingSystem         |      95.0 |
| SpongeBob    | 201903001 | FundamentalsofCompiling |      43.0 |
| SpongeBob    | 201903001 | DiscreteMathematics     |      72.0 |
| SpongeBob    | 201903001 | PrinciplesofDatabase    |      88.0 |
| SpongeBob    | 201903001 | English                 |      79.0 |
| SpongeBob    | 201903001 | OpBasketball            |      92.0 |
| SpongeBob    | 201903001 | OpTennis                |      94.0 |
| PatrickStar  | 201903011 | LinearAlgebra           |       6.5 |
| PatrickStar  | 201903011 | AdvancedMathematics     |       5.0 |
| PatrickStar  | 201903011 | Physics                 |       6.0 |
| PatrickStar  | 201903011 | ProbabilityTheory       |      12.0 |
| PatrickStar  | 201903011 | PrincipleofStatistics   |      20.0 |
| PatrickStar  | 201903011 | OperatingSystem         |      36.0 |
| PatrickStar  | 201903011 | FundamentalsofCompiling |       2.0 |
| PatrickStar  | 201903011 | DiscreteMathematics     |      14.0 |
| PatrickStar  | 201903011 | PrinciplesofDatabase    |       9.0 |
| PatrickStar  | 201903011 | English                 |      60.0 |
| PatrickStar  | 201903011 | OpTableTennis           |      12.0 |
| PatrickStar  | 201903011 | OpPiano                 |      99.0 |
| MonkeyDLuffy | 201803015 | LinearAlgebra           |      92.5 |
| MonkeyDLuffy | 201803015 | AdvancedMathematics     |      95.5 |
| MonkeyDLuffy | 201803015 | Physics                 |      63.5 |
| MonkeyDLuffy | 201803015 | ProbabilityTheory       |      76.0 |
| MonkeyDLuffy | 201803015 | PrincipleofStatistics   |      69.0 |
| MonkeyDLuffy | 201803015 | OperatingSystem         |      90.5 |
| MonkeyDLuffy | 201803015 | FundamentalsofCompiling |      88.0 |
| MonkeyDLuffy | 201803015 | DiscreteMathematics     |      89.0 |
| MonkeyDLuffy | 201803015 | PrinciplesofDatabase    |      60.5 |
| MonkeyDLuffy | 201803015 | English                 |      43.0 |
| MonkeyDLuffy | 201803015 | OpSwimming              |      67.0 |
| MonkeyDLuffy | 201803015 | OpFencing               |      76.0 |
+--------------+-----------+-------------------------+-----------+
36 rows in set (0.01 sec)

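Business requirement 1: for each course, find the name, student number, and score of the top two students
This requirement is hard to meet with aggregate functions. Because MySQL lacked window functions for a long time, the MySQL community generally recommended implementing it with user variables, as follows: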

mysql> SET @z := NULL;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @ROW_NUM := 0;
Query OK, 0 rows affected (0.00 sec)

mysql> select course, stuname, stuno, courscore from (select course, @ROW_NUM := IF(course = @z, @ROW_NUM + 1, 1) as ROW_NUM, @z := course AS z, stuname, courscore, stuno FROM (select * from class_score order by course, courscore desc) t1) t2 where t2.ROW_NUM<=2;
+-------------------------+--------------+-----------+-----------+
| course                  | stuname      | stuno     | courscore |
+-------------------------+--------------+-----------+-----------+
| AdvancedMathematics     | MonkeyDLuffy | 201803015 |      95.5 |
| AdvancedMathematics     | SpongeBob    | 201903001 |      55.0 |
| DiscreteMathematics     | MonkeyDLuffy | 201803015 |      89.0 |
| DiscreteMathematics     | SpongeBob    | 201903001 |      72.0 |
| English                 | SpongeBob    | 201903001 |      79.0 |
| English                 | PatrickStar  | 201903011 |      60.0 |
| FundamentalsofCompiling | MonkeyDLuffy | 201803015 |      88.0 |
| FundamentalsofCompiling | SpongeBob    | 201903001 |      43.0 |
| LinearAlgebra           | MonkeyDLuffy | 201803015 |      92.5 |
| LinearAlgebra           | SpongeBob    | 201903001 |      60.5 |
| OpBasketball            | SpongeBob    | 201903001 |      92.0 |
| OpFencing               | MonkeyDLuffy | 201803015 |      76.0 |
| OpPiano                 | PatrickStar  | 201903011 |      99.0 |
| OpSwimming              | MonkeyDLuffy | 201803015 |      67.0 |
| OpTableTennis           | PatrickStar  | 201903011 |      12.0 |
| OpTennis                | SpongeBob    | 201903001 |      94.0 |
| OperatingSystem         | SpongeBob    | 201903001 |      95.0 |
| OperatingSystem         | MonkeyDLuffy | 201803015 |      90.5 |
| Physics                 | SpongeBob    | 201903001 |      65.0 |
| Physics                 | MonkeyDLuffy | 201803015 |      63.5 |
| PrincipleofStatistics   | SpongeBob    | 201903001 |      90.0 |
| PrincipleofStatistics   | MonkeyDLuffy | 201803015 |      69.0 |
| PrinciplesofDatabase    | SpongeBob    | 201903001 |      88.0 |
| PrinciplesofDatabase    | MonkeyDLuffy | 201803015 |      60.5 |
| ProbabilityTheory       | SpongeBob    | 201903001 |      87.0 |
| ProbabilityTheory       | MonkeyDLuffy | 201803015 |      76.0 |
+-------------------------+--------------+-----------+-----------+
26 rows in set (0.01 sec)

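Two user variables are defined: one detects the switch to the next group, and the other hands out row numbers, so that each group gets its own numbering in a nested-loop fashion. The drawbacks are that this cannot handle ties (students with the same score sharing a rank), the query is deeply nested and its logic is hard to follow, and the variables must be re-initialized before every run. It also depends on the order in which user-variable assignments are evaluated within a statement, which MySQL does not guarantee.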

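Now look at the window function implementation: a single SQL statement with one subquery yields the top two students of every course. Note that the rank() function used here recognizes ties. If two students tie for first place in a course, the SQL below fairly shows both of them as first, which is hard to achieve with user variables.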

mysql> select course, stuname, stuno, courscore from (select *, rank() over(partition by course order by course, courscore desc) as RANK_ from class_score) t where t.RANK_<=2;
+-------------------------+--------------+-----------+-----------+
| course                  | stuname      | stuno     | courscore |
+-------------------------+--------------+-----------+-----------+
| AdvancedMathematics     | MonkeyDLuffy | 201803015 |      95.5 |
| AdvancedMathematics     | SpongeBob    | 201903001 |      55.0 |
| DiscreteMathematics     | MonkeyDLuffy | 201803015 |      89.0 |
| DiscreteMathematics     | SpongeBob    | 201903001 |      72.0 |
| English                 | SpongeBob    | 201903001 |      79.0 |
| English                 | PatrickStar  | 201903011 |      60.0 |
| FundamentalsofCompiling | MonkeyDLuffy | 201803015 |      88.0 |
| FundamentalsofCompiling | SpongeBob    | 201903001 |      43.0 |
| LinearAlgebra           | MonkeyDLuffy | 201803015 |      92.5 |
| LinearAlgebra           | SpongeBob    | 201903001 |      60.5 |
| OpBasketball            | SpongeBob    | 201903001 |      92.0 |
| OpFencing               | MonkeyDLuffy | 201803015 |      76.0 |
| OpPiano                 | PatrickStar  | 201903011 |      99.0 |
| OpSwimming              | MonkeyDLuffy | 201803015 |      67.0 |
| OpTableTennis           | PatrickStar  | 201903011 |      12.0 |
| OpTennis                | SpongeBob    | 201903001 |      94.0 |
| OperatingSystem         | SpongeBob    | 201903001 |      95.0 |
| OperatingSystem         | MonkeyDLuffy | 201803015 |      90.5 |
| Physics                 | SpongeBob    | 201903001 |      65.0 |
| Physics                 | MonkeyDLuffy | 201803015 |      63.5 |
| PrincipleofStatistics   | SpongeBob    | 201903001 |      90.0 |
| PrincipleofStatistics   | MonkeyDLuffy | 201803015 |      69.0 |
| PrinciplesofDatabase    | SpongeBob    | 201903001 |      88.0 |
| PrinciplesofDatabase    | MonkeyDLuffy | 201803015 |      60.5 |
| ProbabilityTheory       | SpongeBob    | 201903001 |      87.0 |
| ProbabilityTheory       | MonkeyDLuffy | 201803015 |      76.0 |
+-------------------------+--------------+-----------+-----------+
26 rows in set (0.01 sec)
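The sample data above happens to contain no ties, so the difference between numbering rows and ranking them does not show up in the output. The following minimal sketch illustrates it, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for TiDB/MySQL. The four students and their scores are hypothetical and are not part of the table above; the helper top_two is also made up for illustration.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for TiDB/MySQL.
# Hypothetical data: StudentB and StudentC tie for second place.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class_score (stuname TEXT, course TEXT, courscore REAL)")
conn.executemany(
    "INSERT INTO class_score VALUES (?, ?, ?)",
    [
        ("StudentA", "Physics", 95.0),
        ("StudentB", "Physics", 90.0),
        ("StudentC", "Physics", 90.0),
        ("StudentD", "Physics", 70.0),
    ],
)


def top_two(rank_function):
    """Return the names whose position within the course is <= 2."""
    sql = f"""
        SELECT stuname FROM (
            SELECT stuname,
                   {rank_function}() OVER (
                       PARTITION BY course ORDER BY courscore DESC
                   ) AS rank_
            FROM class_score
        ) t
        WHERE rank_ <= 2
        ORDER BY stuname
    """
    return [name for (name,) in conn.execute(sql)]


# ROW_NUMBER() behaves like the user-variable approach: exactly two rows,
# and which of the tied students is kept is arbitrary.
by_row_number = top_two("ROW_NUMBER")

# RANK() gives both tied students rank 2, so three rows come back.
by_rank = top_two("RANK")

print(by_row_number)
print(by_rank)
```

With row_number(), one of the two tied students is silently dropped; with rank(), both are reported.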

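Business requirement 2: for each course, compute the score gap between first and second place
TiDB provides the lead() and lag() functions to fetch a column value from the next or previous row within an ordered group. Here lead() fetches the score from the next row, and a subquery then computes the gap between first and second place. Courses taken by only one student have no next row, so their delta is NULL: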

mysql> select course, courscore, courscore - lead_ as delta from (select *, lead(courscore,1) over(partition by course order by course, courscore desc) as lead_, rank() over(partition by course order by course, courscore desc) as RANK_ from class_score) t where t.RANK_=1;
+-------------------------+-----------+-------+
| course                  | courscore | delta |
+-------------------------+-----------+-------+
| AdvancedMathematics     |      95.5 |  40.5 |
| DiscreteMathematics     |      89.0 |  17.0 |
| English                 |      79.0 |  19.0 |
| FundamentalsofCompiling |      88.0 |  45.0 |
| LinearAlgebra           |      92.5 |  32.0 |
| OpBasketball            |      92.0 |  NULL |
| OpFencing               |      76.0 |  NULL |
| OpPiano                 |      99.0 |  NULL |
| OpSwimming              |      67.0 |  NULL |
| OpTableTennis           |      12.0 |  NULL |
| OpTennis                |      94.0 |  NULL |
| OperatingSystem         |      95.0 |   4.5 |
| Physics                 |      65.0 |   1.5 |
| PrincipleofStatistics   |      90.0 |  21.0 |
| PrinciplesofDatabase    |      88.0 |  27.5 |
| ProbabilityTheory       |      87.0 |  11.0 |
+-------------------------+-----------+-------+
16 rows in set (0.00 sec)
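The same lead() pattern can be tried on a small scale. The sketch below assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for TiDB/MySQL and loads only the Physics and OpPiano rows from the table above, so that one course has a runner-up and the other does not.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for TiDB/MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE class_score (stuname TEXT, course TEXT, courscore REAL)")
conn.executemany(
    "INSERT INTO class_score VALUES (?, ?, ?)",
    [
        ("SpongeBob", "Physics", 65.0),
        ("MonkeyDLuffy", "Physics", 63.5),
        ("PatrickStar", "Physics", 6.0),
        ("PatrickStar", "OpPiano", 99.0),
    ],
)

# LEAD(courscore, 1) reads the score of the next row inside the same course,
# ordered from highest to lowest score. Keeping only rank 1 leaves the
# first-place row, whose "next row" is the runner-up.
gaps = conn.execute(
    """
    SELECT course, courscore, courscore - lead_ AS delta
    FROM (
        SELECT course, courscore,
               LEAD(courscore, 1) OVER (
                   PARTITION BY course ORDER BY courscore DESC
               ) AS lead_,
               RANK() OVER (
                   PARTITION BY course ORDER BY courscore DESC
               ) AS rank_
        FROM class_score
    ) t
    WHERE rank_ = 1
    ORDER BY course
    """
).fetchall()

for row in gaps:
    print(row)
```

OpPiano has a single student, so lead() finds no next row and the delta is NULL (None in Python). If two students tied for first place, the rank_ = 1 filter would return both of them; that case does not occur in the sample data.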

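III. An efficient paging scheme for batch processing based on window functions
As an advanced analytical feature, window functions are useful for much more than ranking within groups. In this case we use a window function to greatly speed up the paging step of a batch job.
We use sysbench to create a table and load some data, and use that table to simulate the batch processing logic.
First we initialize a table sbtest1 with the following structure, where id is an integer primary key: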

mysql> desc sbtest1;
+-------+-----------+------+------+---------+----------------+
| Field | Type      | Null | Key  | Default | Extra          |
+-------+-----------+------+------+---------+----------------+
| id    | int(11)   | NO   | PRI  | NULL    | auto_increment |
| k     | int(11)   | NO   | MUL  | 0       |                |
| c     | char(120) | NO   |      |         |                |
| pad   | char(60)  | NO   |      |         |                |
+-------+-----------+------+------+---------+----------------+
4 rows in set (0.00 sec)

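One million rows were loaded at initialization. We then deleted some of them so that the id values are no longer contiguous, which keeps the paging logic from relying on the id values themselves. About 900,000 rows remain in the table: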

mysql> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
|   899997 |
+----------+
1 row in set (0.65 sec)

A preview of the data in the table:

mysql> select * from sbtest1 limit 6;
+--------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| id     | k      | c                                                                                                                       | pad                                                         |
+--------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
| 170713 | 502585 | 68207710198-92682096687-30191949979-36606876762-68131108662-05227395575-42775011851-25226186240-86628605904-92905646658 | 92965159868-07234410731-39167064470-14286085716-15715308680 |
| 170715 | 594870 | 03482720054-50379763215-87903836122-97559417898-49419423256-08561919665-14395666373-04552411341-51225532045-80056729812 | 14534783486-12748024297-66217900494-07062661389-59419864770 |
| 170716 | 618106 | 17284178744-35252021030-57793972189-12648949390-90678614158-50453793363-79361198568-92739087625-90147799094-56275382145 | 96022213702-57054390589-17717245768-83668730988-26655128451 |
| 170717 | 498071 | 55266913813-66118089063-10841700714-78346894223-87037025257-46356741961-50684103191-23859048041-87607902200-58092836685 | 85952977843-18323978167-65380568194-90178704467-17391816925 |
| 170718 | 500843 | 81176419361-91278769025-45575469479-70005546210-57581523030-24528178176-84655463505-48851510236-43885747093-01732211221 | 56651630364-99235825673-25852643818-33561663285-01699695675 |
| 170719 | 499063 | 87982690236-17188898588-98406118277-04805507744-90184035670-09591916010-78045349706-89374841792-79952082330-08177876709 | 16885918921-25441055158-88415348869-22003000705-82198521530 |
+--------+--------+-------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------+
6 rows in set (0.01 sec)

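A typical paged update sorts by the primary key or a unique index so that adjacent pages neither leave gaps nor overlap, and uses the convenient offset part of MySQL's limit syntax to split the data into pages of a fixed number of rows. Each page is wrapped in its own transaction, so the data can be updated page by page or in batches as needed.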

begin;
update sbtest1 set pad='new_value' where id in (select id from sbtest1 order by id limit 0,10000);
commit;
begin;
update sbtest1 set pad='new_value' where id in (select id from sbtest1 order by id limit 10000,10000);
commit;
begin;
update sbtest1 set pad='new_value' where id in (select id from sbtest1 order by id limit 20000,10000);
commit;

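This approach is easy to follow and the SQL is easy to write, but it has a clear drawback. Because every statement has to sort by the primary key or unique index, the later the page, the more rows take part in the sort, the more data TiKV has to scan, and the lower the overall throughput of the batch. When the total data volume is large, the job can consume excessive compute resources, or even hit a performance bottleneck and affect online business traffic.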

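The following case is an improved approach. The window function row_number() assigns a row number to each row in primary key order, and an aggregate query then groups the row numbers by the chosen page size to compute the minimum and maximum key of each page.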

mysql> select min(t.id) as start_key, max(t.id) as end_key, count(*) as page_size from (select *, row_number() over(order by id) as row_num from sbtest1) t group by floor((t.row_num-1)/50000) order by start_key;
+-----------+---------+-----------+
| start_key | end_key | page_size |
+-----------+---------+-----------+
|         1 |   55556 |     50000 |
|     55557 |  111111 |     50000 |
|    111112 |  166667 |     50000 |
|    166668 |  222222 |     50000 |
|    222223 |  277778 |     50000 |
|    277779 |  333333 |     50000 |
|    333335 |  388889 |     50000 |
|    388890 |  444445 |     50000 |
|    444446 |  500000 |     50000 |
|    500001 |  555556 |     50000 |
|    555557 |  611111 |     50000 |
|    611112 |  666667 |     50000 |
|    666668 |  722223 |     50000 |
|    722225 |  777779 |     50000 |
|    777780 |  833335 |     50000 |
|    833336 |  888891 |     50000 |
|    888892 |  944447 |     50000 |
|    944448 | 1000000 |     49997 |
+-----------+---------+-----------+
18 rows in set (1.87 sec)

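This result set serves as the metadata for the batch job. During the batch phase, each page is selected with between...and..., and multiple pages can be updated concurrently. Because the metadata was computed by sorting on the primary key or unique index and row_number() assigned a unique sequence number to every row, adjacent pages have neither gaps nor overlaps.
This approach avoids the cost of the frequent, large sorts described above, which substantially improves the overall efficiency of the batch job.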

mysql> update sbtest1 set pad='new_value' where id between 1 and 55556;
Query OK, 50000 rows affected (3.51 sec)
Rows matched: 50000  Changed: 50000  Warnings: 0

mysql> update sbtest1 set pad='new_value' where id between 55557 and 111111;
Query OK, 50000 rows affected (2.14 sec)
Rows matched: 50000  Changed: 50000  Warnings: 0

mysql> update sbtest1 set pad='new_value' where id between 111112 and 166667;
Query OK, 50000 rows affected (2.21 sec)
Rows matched: 50000  Changed: 50000  Warnings: 0
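The two phases above (compute page boundaries once, then update page by page with between...and...) can be sketched end to end at a small scale. This is a minimal illustration, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in for TiDB; the table holds 100 ids with every tenth one deleted, and the page size of 25 is an arbitrary choice for the demo.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for TiDB; sizes are scaled down.
PAGE_SIZE = 25

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sbtest1 (id INTEGER PRIMARY KEY, pad TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO sbtest1 (id, pad) VALUES (?, 'old_value')",
    [(i,) for i in range(1, 101)],
)
# Punch holes into the id sequence so that ids are no longer contiguous.
conn.execute("DELETE FROM sbtest1 WHERE id % 10 = 0")
conn.commit()

# Phase 1: one pass over the table yields the key range of every page.
# In SQLite, integer / integer is integer division, so no FLOOR() is needed.
pages = conn.execute(
    """
    SELECT MIN(t.id) AS start_key, MAX(t.id) AS end_key, COUNT(*) AS page_size
    FROM (SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS row_num FROM sbtest1) t
    GROUP BY (t.row_num - 1) / ?
    ORDER BY start_key
    """,
    (PAGE_SIZE,),
).fetchall()

# Phase 2: update each page in its own transaction using a key range.
updated = []
for start_key, end_key, _ in pages:
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE sbtest1 SET pad = 'new_value' WHERE id BETWEEN ? AND ?",
            (start_key, end_key),
        )
        updated.append(cur.rowcount)

remaining_old = conn.execute(
    "SELECT COUNT(*) FROM sbtest1 WHERE pad = 'old_value'"
).fetchone()[0]

print(pages)
print(updated, remaining_old)
```

Each page's update touches exactly the number of rows reported in the metadata, and no row is left untouched, which is the no-gap, no-overlap property described above. In a real batch job the pages could be handed to concurrent workers instead of a sequential loop.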

IV. Paging with a composite primary key

  1. Build the metadata table
mysql> SELECT floor(( t1.row_num - 1 )/ 600000 )+1 rn, min(mvalue),max(mvalue),count(*) FROM (SELECT concat( '(''', customer_id, ''',''', customer_idno, ''')' ) AS mvalue, row_number() over ( ORDER BY customer_id, customer_idno ) AS row_num FROM findpt.customer) t1  GROUP BY floor(( t1.row_num - 1 )/ 600000 )  ORDER BY rn;
+----+--------------------------------------+--------------------------------------+----------+
| rn | min(mvalue)                          | max(mvalue)                          | count(*) |
+----+--------------------------------------+--------------------------------------+----------+
|  1 | ('10000000001','351421198512031871') | ('10000600000','541420198607276566') |   600000 |
|  2 | ('10000600001','410727197307043818') | ('10001200000','221518199305165132') |   600000 |
|  3 | ('10001200001','521527198406224414') | ('10001800000','320209197609305969') |   600000 |
|  4 | ('10001800001','220304197912193073') | ('10002400000','230504197308067651') |   600000 |
|  5 | ('10002400001','121711197208214015') | ('10003000000','430112199003258074') |   600000 |
|  6 | ('10003000001','330609198706142725') | ('10003600000','520519197407128506') |   600000 |
|  7 | ('10003600001','621108199508175476') | ('10004200000','631319197203254252') |   600000 |
|  8 | ('10004200001','350406198608214809') | ('10004800000','500827199406068657') |   600000 |
|  9 | ('10004800001','450311198612295355') | ('10005400000','430713199601229738') |   600000 |
| 10 | ('10005400001','640608199311094703') | ('10006000000','131222199007068025') |   600000 |
| 11 | ('10006000001','110724197808158121') | ('10006600000','410909199902088607') |   600000 |
| 12 | ('10006600001','371802199909286692') | ('10007200000','331616199104157617') |   600000 |
| 13 | ('10007200001','631618198707015770') | ('10007800000','311424198409271703') |   600000 |
| 14 | ('10007800001','450212199805062337') | ('10008400000','141520197703176129') |   600000 |
| 15 | ('10008400001','130920197811106553') | ('10009000000','640206197509055077') |   600000 |
| 16 | ('10009000001','151822197801136758') | ('10009600000','810620197505228665') |   600000 |
| 17 | ('10009600001','230109198906203721') | ('10010000000','340408198312036321') |   400000 |
+----+--------------------------------------+--------------------------------------+----------+
17 rows in set (26.42 sec)
  2. Operate on one page
delete from customer where  (customer_id, customer_idno) >= ('10000000001','351421198512031871') and  (customer_id, customer_idno) <= ('10000600000','541420198607276566') order by customer_id,customer_idno;
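The composite-key variant can also be sketched at a small scale. The example below assumes Python's built-in sqlite3 module (SQLite 3.25 or later, which supports row-value comparisons) as a stand-in for TiDB. The seven key pairs are hypothetical. Instead of concatenating the two key columns into a string as in the metadata query above, this sketch keeps them as separate columns and reads the first and last key of each page with first_value() and last_value(); that is a substitution made for illustration, not the original method.

```python
import sqlite3

# Assumption: SQLite (>= 3.25) stands in for TiDB; keys are hypothetical.
PAGE_SIZE = 3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        customer_id   TEXT NOT NULL,
        customer_idno TEXT NOT NULL,
        PRIMARY KEY (customer_id, customer_idno)
    )
    """
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("c1", "a"), ("c1", "b"), ("c1", "c"),
     ("c2", "a"), ("c2", "b"),
     ("c3", "a"), ("c3", "b")],
)
conn.commit()

# Step 1: metadata. Number the rows in composite-key order, derive a page
# number, then take the first and last key pair of every page.
pages = conn.execute(
    """
    SELECT DISTINCT page,
           FIRST_VALUE(customer_id)   OVER p AS start_id,
           FIRST_VALUE(customer_idno) OVER p AS start_idno,
           LAST_VALUE(customer_id)    OVER p AS end_id,
           LAST_VALUE(customer_idno)  OVER p AS end_idno
    FROM (
        SELECT customer_id, customer_idno,
               (ROW_NUMBER() OVER (ORDER BY customer_id, customer_idno) - 1) / ?
                   AS page
        FROM customer
    )
    WINDOW p AS (
        PARTITION BY page ORDER BY customer_id, customer_idno
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    )
    ORDER BY page
    """,
    (PAGE_SIZE,),
).fetchall()

# Step 2: delete page by page. The row-value comparison treats the two
# columns as one ordered key, so page 1 stops at ('c3','a') and does not
# swallow ('c3','b'), which belongs to page 2.
deleted = []
for _, start_id, start_idno, end_id, end_idno in pages:
    with conn:
        cur = conn.execute(
            """
            DELETE FROM customer
            WHERE (customer_id, customer_idno) >= (?, ?)
              AND (customer_id, customer_idno) <= (?, ?)
            """,
            (start_id, start_idno, end_id, end_idno),
        )
        deleted.append(cur.rowcount)

left = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(pages)
print(deleted, left)
```

Keeping the boundary columns separate avoids relying on the string order of a concatenated key matching the order of the key columns, which only holds when the values have a fixed width.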

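Alternatively, for tables that do not use an integer primary key as the row handle, the hidden column _tidb_rowid can be used as the paging key.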
