Table of Contents
- Background
- Dependencies between RDDs
- Co-Located and Co-Partitioned RDDs
- References and Resources
Background
As we know, the RDD (Resilient Distributed Dataset) is the core abstraction of the Spark computing framework. An RDD is a read-only data model: a computation can only transform an existing RDD into a new one. The new and old RDDs are therefore linked by lineage, usually described as a parent-child relationship.
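As a minimal illustration (assuming a spark-shell session where `sc`, a SparkContext, is predefined; the variable names are mine):

```scala
val nums    = sc.parallelize(1 to 5)   // the parent RDD
val doubled = nums.map(_ * 2)          // a transformation yields a new, child RDD
println(doubled.toDebugString)         // prints the lineage chain back to `nums`
```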
In this article, I will walk through exactly what a wide dependency (ShuffleDependency) is and what a narrow dependency (NarrowDependency) is.
Related reading:
- https://www.cnblogs.com/upupfeng/p/12344963.html
- https://www.interhorse.cn/a/2301943088/
Dependencies between RDDs
- Narrow dependency (NarrowDependency): the data in a parent RDD partition is not split; each parent partition feeds at most one child partition.
- Shuffle dependency (ShuffleDependency): the data in a parent RDD partition is split across multiple child partitions, which requires a shuffle. (A short sketch after this list shows how to inspect both kinds.)
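To make the distinction concrete, here is a minimal sketch (again assuming a spark-shell `sc`; the sample data is illustrative) that inspects each RDD's dependencies:

```scala
val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val mapped = pairs.mapValues(_ + 1)     // narrow: each child partition reads exactly one parent partition
val summed = mapped.reduceByKey(_ + _)  // wide: parent data is split across child partitions

println(mapped.dependencies) // e.g. List(org.apache.spark.OneToOneDependency@...)
println(summed.dependencies) // e.g. List(org.apache.spark.ShuffleDependency@...)
```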
So, does joining two RDDs that use the same partitioner cause a shuffle? No. If two RDDs have the same partitioner, the join will not cause a shuffle. You can see this in CoGroupedRDD.scala:
```scala
override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      // Input already uses the output partitioner: a narrow, one-to-one dependency.
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      // Otherwise this input has to be shuffled into the output partitioning.
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
```
Note, however, that the lack of a shuffle does not mean that no data will have to be moved between nodes. It's possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located).
This situation is still better than doing a shuffle, but it’s something to keep in mind. Co-location can improve performance, but is hard to guarantee.
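To check this yourself, here is a minimal sketch (assuming a spark-shell `sc`; the partitioner size and sample data are illustrative): both inputs are pre-partitioned with the same HashPartitioner, so the join step itself introduces no new shuffle.

```scala
import org.apache.spark.HashPartitioner

val part  = new HashPartitioner(4)
val left  = sc.parallelize(1 to 100).map(x => (x % 10, x)).partitionBy(part)
val right = sc.parallelize(1 to 100).map(x => (x % 10, -x)).partitionBy(part)

// Each partitionBy shuffles its own input once, but because both sides share
// `part`, the join adds only one-to-one (narrow) dependencies.
val joined = left.join(right)
println(joined.toDebugString) // lineage shows no extra shuffle stage for the join
```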
Co-Located and Co-Partitioned RDDs
Co-located means the partitions reside in the same locations in memory. If two RDDs are materialized in memory by the same job and have the same partitioner, they are co-located, which benefits CoGroupedRDD-based functions such as cogroup and join.
Co-partitioned simply means the RDDs have the same partitioner.
The original snippet boils down to the following idea; a minimal sketch (the data, partitioner size, and cache calls are illustrative assumptions):

```scala
import org.apache.spark.HashPartitioner

val part = new HashPartitioner(4)
val rddA = sc.parallelize(1 to 100).map(x => (x % 10, x)).partitionBy(part).cache()
val rddB = sc.parallelize(1 to 100).map(x => (x % 10, 2 * x)).partitionBy(part).cache()

// After the count below runs, rddA and rddB are co-located: a single job
// materializes both in memory under the same partitioner.
// If you uncomment the two separate counts instead, each RDD is materialized
// by its own job, and they end up merely co-partitioned.
// rddA.count()
// rddB.count()
rddA.join(rddB).count()
```
References and Resources
1. Understanding Co-partitions and Co-Grouping In Spark: https://amithora.com/understanding-co-partitions-and-co-grouping-in-spark/