Table of Contents
- Background
- Dependencies between RDDs
- Co-Located and Co-Partitioned RDDs
- References and Resources
Background
As we know, the RDD (Resilient Distributed Dataset) is the core abstraction of the Spark computing framework. An RDD is a read-only data model: a computation can only transform an existing RDD into a new one. The new and old RDDs are therefore linked by lineage, usually described as a parent-child relationship.
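As a minimal illustration (assuming a spark-shell session where `sc`, a SparkContext, is predefined; the variable names are mine):

```scala
val nums    = sc.parallelize(1 to 5)   // the parent RDD
val doubled = nums.map(_ * 2)          // a transformation yields a new, child RDD
println(doubled.toDebugString)         // prints the lineage chain back to `nums`
```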
In this article, I will walk through exactly what a wide dependency (ShuffleDependency) is and what a narrow dependency (NarrowDependency) is.
Related reading:
- https://www.cnblogs.com/upupfeng/p/12344963.html
- https://www.interhorse.cn/a/2301943088/
Dependencies between RDDs
- Narrow dependency (NarrowDependency): the data in a parent RDD partition is not split; each parent partition feeds at most one child partition.
- Shuffle dependency (ShuffleDependency): the data in a parent RDD partition is split across multiple child partitions, which requires a shuffle. (A short sketch after this list shows how to inspect both kinds.)
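To make the distinction concrete, here is a minimal sketch (again assuming a spark-shell `sc`; the sample data is illustrative) that inspects each RDD's dependencies:

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val mapped = pairs.mapValues(_ + 1)     // narrow: each child partition reads exactly one parent partition
val summed = mapped.reduceByKey(_ + _)  // wide: parent data is split across child partitions

println(mapped.dependencies) // e.g. List(org.apache.spark.OneToOneDependency@...)
println(summed.dependencies) // e.g. List(org.apache.spark.ShuffleDependency@...)
```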
So, does joining two RDDs that use the same partitioner cause a shuffle? No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala:
```scala
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      // Input already uses the output partitioner: a narrow, one-to-one dependency.
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      // Otherwise this input has to be shuffled into the output partitioning.
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
```
Note, however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it’s something to keep in mind. Co-location can improve performance, but is hard to guarantee.
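To check this yourself, here is a minimal sketch (assuming a spark-shell `sc`; the partitioner size and sample data are illustrative): both inputs are pre-partitioned with the same HashPartitioner, so the join step itself introduces no new shuffle.

```scala
import org.apache.spark.HashPartitioner

val part  = new HashPartitioner(4)
val left  = sc.parallelize(1 to 100).map(x => (x % 10, x)).partitionBy(part)
val right = sc.parallelize(1 to 100).map(x => (x % 10, -x)).partitionBy(part)

// Each partitionBy shuffles its own input once, but because both sides share
// `part`, the join adds only one-to-one (narrow) dependencies.
val joined = left.join(right)
println(joined.toDebugString) // lineage shows no extra shuffle stage for the join
```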
Co-Located and Co-Partitioned RDDs
Co-located means the partitions reside in the same locations in memory. If two RDDs are materialized in memory by the same job and have the same partitioner, they are co-located, which benefits CoGroupedRDD-based functions such as cogroup and join.
Co-partitioned simply means the RDDs have the same partitioner.
The original snippet boils down to the following idea; a minimal sketch (the data, partitioner size, and cache calls are illustrative assumptions):

```scala
import org.apache.spark.HashPartitioner

val part = new HashPartitioner(4)
val rddA = sc.parallelize(1 to 100).map(x => (x % 10, x)).partitionBy(part).cache()
val rddB = sc.parallelize(1 to 100).map(x => (x % 10, 2 * x)).partitionBy(part).cache()

// After the count below runs, rddA and rddB are co-located: a single job
// materializes both in memory under the same partitioner.
// If you uncomment the two separate counts instead, each RDD is materialized
// by its own job, and they end up merely co-partitioned.
// rddA.count()
// rddB.count()
rddA.join(rddB).count()
```
References and Resources
1. Understanding Co-partitions and Co-Grouping In Spark: https://amithora.com/understanding-co-partitions-and-co-grouping-in-spark/