本文讨论apache spark的window
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
这篇文章说的比书上清楚,虽然写书的也是这个公司的
书是指 Spark – The Definitive Guide – Big data processing made simple
请自行购买,参考第七章 Window Functions部分
这篇博客不是全面的描述,因为别人已经说的很清楚了,是基于书和网文的基础上的思考和试验,先读上面相关网址和书籍有助于理解本文
Every input row can have a unique frame associated with it
每一行都有一个不同的frame,那么现在的问题是,处理了windows函数之后,结果的行数是多少?是和原来一样?还是分组之后的行数?
原始数据如下,我们把它注册为productRevenue表
+-------+--------+-------+ |product|category|revenue| +-------+--------+-------+ | p1| c1| 1| | p2| c1| 2| | p2| c2| 3| | p3| c1| 4| | p3| c2| 5| | p3| c3| 6| +-------+--------+-------+
执行如下sql语句,功能:计算每个类别下的每个产品销售排名
SELECT product, category, revenue, dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank FROM productRevenue
结果
+-------+--------+-------+----+ |product|category|revenue|rank| +-------+--------+-------+----+ | p3| c1| 4| 1| | p2| c1| 2| 2| | p1| c1| 1| 3| | p3| c3| 6| 1| | p3| c2| 5| 1| | p2| c2| 3| 2| +-------+--------+-------+----+
行数没有变,还是6行
下面写另外一功能,每个产品和该类别下的销售冠军的差别
google搜spark UNBOUNDEDPRECEDING
select product,category,revenue, max(revenue) over (PARTITION BY category order by revenue rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) - revenue as diff from productRevenue
结果
+-------+--------+-------+----+ |product|category|revenue|rank| +-------+--------+-------+----+ | p3| c1| 4| 1| | p2| c1| 2| 2| | p1| c1| 1| 3| | p3| c3| 6| 1| | p3| c2| 5| 1| | p2| c2| 3| 2| +-------+--------+-------+----+
行数还是6行
这里可以 select product,category,revenue,每一列,如果是一般的group by,只能选择group by使用的列,否则必须使用聚合函数,还有这里max… – revenue
这里减去的并不是一个group by使用的列,而是一个平常的列
所以猜想,window函数不会把原始的每一组变成一行,而是分到不同的组中,然后全部展示,就有点像,按照category排序一样,但是各种计算会在同一组中分别进行,比排序order by和排名rank等等
参考网文中的
there was no way to both operate on a group of rows while still returning a single value for every input row
估计window使用后行数并不会减少,如下sql会报错:
select sum(value2) over (partition by group2 order by value2) from df1 group by group2
报错:
org.apache.spark.sql.AnalysisException: expression 'df1.`value2`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; Project [sum2(value2) OVER (PARTITION BY group2 ORDER BY value2 ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#1301] +- Project [value2#38, group2#37, sum2(value2) OVER (PARTITION BY group2 ORDER BY value2 ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#1301, sum2(value2) OVER (PARTITION BY group2 ORDER BY value2 ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#1301] +- Window [sum2(value2#38, ch7Agg.Sum2@5d71b500, 0, 0) windowspecdefinition(group2#37, value2#38 ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS sum2(value2) OVER (PARTITION BY group2 ORDER BY value2 ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)#1301], [group2#37], [value2#38 ASC NULLS FIRST] +- Aggregate [group2#37], [value2#38, group2#37] +- SubqueryAlias `df1` +- Project [_1#34 AS group2#37, _2#35 AS value2#38] +- LocalRelation [_1#34, _2#35]
貌似window和group by不能同时使用
书有个好处就是比较系统,当你看到了一个知识点后,就可以上网去搜
RANGE frame
网页上的描述看的让人心惊肉跳还是不明白,但是一看下面的图就明白了
关于range frame和rows frame
刚才的语句可以改为
select product,category,revenue, max(revenue) over (PARTITION BY category order by revenue range ws between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING) - revenue as diff from productRevenue
结果一样
说明range和rows后面使用的范围写法是一样的
区别在于range是找取值范围,而rows是找行数范围
来看
select product,category,revenue, max(revenue) over (PARTITION BY category order by revenue range between 0 PRECEDING and 1 FOLLOWING) - revenue as diff from productRevenue
结果
+-------+--------+-------+----+ |product|category|revenue|diff| +-------+--------+-------+----+ | p1| c1| 1| 1| | p2| c1| 2| 0| | p3| c1| 4| 0| | p3| c3| 6| 0| | p2| c2| 3| 0| | p3| c2| 5| 0| +-------+--------+-------+----+
为什么只有第一行是1?
因为看revenue那一列,1和2相差1,和1 FOLLOWING导致2也在窗口的范围内,就是说,revenue的第一行和第二行的值相差1,而1 FOLLOWING说明了只有相差的值小于或等于1,才把这一行分到当前的窗口中
再说明白一点就是
如果当前行的值是1,那么只有取值为1到2的行才属于当前行的窗口
如果当前行的值是2,那么只有取值为2到3的行才属于当前行的窗口
如果当前行的值是4,那么只有取值为4到5的行才属于当前行的窗口
类推
而后面2和4相差2,4和6相差2等等,都不在1 FOLLOWING的范围内,这也进一步说明了range是基于取值的
再看
select product,category,revenue, max(revenue) over (PARTITION BY category order by revenue rows between 0 PRECEDING and 1 FOLLOWING) - revenue as diff from productRevenue
结果
+-------+--------+-------+----+ |product|category|revenue|diff| +-------+--------+-------+----+ | p1| c1| 1| 1| | p2| c1| 2| 2| | p3| c1| 4| 0| | p3| c3| 6| 0| | p2| c2| 3| 2| | p3| c2| 5| 0| +-------+--------+-------+----+
这里看revenuey列,因为4 6 5都是当前类别下最大的值,所以和冠军差别为0,其余都是和下一行进行差值比较
加群入久伴博客官方微信群
加入久伴官方微信群有啥优势?
1.文章内部资源由于时间久远可能失效,加群,群内可以第一时间反馈
2.网站中教程不一定适用任何人,加群可以第一时间咨询并解决您得疑惑
3.你游荡于互联网,也需要群体,加群一起互动交流,畅所欲言
扫描下方二维码即可加入交流讨论

虽然全部看完,但是没看懂,来自技术渣的悲哀啊!
第一次发表,自然要装一装门面,后面会写简单易懂的