Spark’s performance advantage over MapReduce is largely due to Spark’s in-memory persistence and memory management. Rather than writing to disk between each pass over the data, Spark has the option of keeping the data on the executors loaded into memory. That way, the data in each partition is available in memory each time it needs to be accessed.
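As a minimal sketch of this reuse (the input path data/events.txt is hypothetical), calling cache() lets the second action below read the partitions already held in executor memory instead of re-reading and re-parsing the file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheReuse {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-reuse").setMaster("local[*]"))

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    val parsed = sc.textFile("data/events.txt") // hypothetical input path
      .map(_.split(","))
      .cache()

    val total = parsed.count()                      // computes and caches the partitions
    val wide  = parsed.filter(_.length > 5).count() // reuses the in-memory partitions
    println(s"$wide of $total records have more than five fields")
    sc.stop()
  }
}
```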
Spark offers three options for memory management, each with different space and time trade-offs:
- In memory as deserialized Java objects
- As serialized data
- On disk
In memory as deserialized Java objects
The most intuitive way to store objects in RDDs is as the original deserialized Java objects defined by the driver program. This form of in-memory storage is the fastest, since it avoids the cost of serializing and deserializing the data; however, it may not be the most memory-efficient, since it requires the data to be stored as full objects.
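For instance, persisting with StorageLevel.MEMORY_ONLY (the level cache() uses under the hood) stores partitions as plain deserialized objects; nums below stands in for an already-defined RDD of integers:

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY keeps the partitions as deserialized Java objects
// on the executors; reads are fast, but each object carries the
// usual JVM memory overhead.
val squares = nums.map(n => n * n).persist(StorageLevel.MEMORY_ONLY)
squares.count() // first action materializes the partitions in RAM
```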
As serialized data
Spark objects are converted into streams of bytes as they are moved around the network, using the standard Java serialization library by default. This approach may be slower, since serialized data is more CPU-intensive to read than deserialized data, but it is more memory-efficient, since it allows the user to choose a more compact representation. While Java serialization is more space-efficient than storing full objects, Kryo serialization can be much more efficient than Java serialization.
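Here is a sketch of serialized in-memory storage, assuming a simple Event case class: switching the serializer to Kryo and registering the class typically produces a more compact byte representation than default Java serialization:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

case class Event(id: Long, kind: String) // hypothetical record type

val conf = new SparkConf()
  .setAppName("serialized-cache")
  .setMaster("local[*]")
  // use Kryo instead of the default Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Event]))
val sc = new SparkContext(conf)

val events = sc.parallelize(Seq(Event(1L, "click"), Event(2L, "view")))
  .persist(StorageLevel.MEMORY_ONLY_SER) // stored as serialized bytes in memory
events.count()
```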
On disk
RDDs whose partitions are too large to be stored in RAM on each of the executors can be written to disk. This strategy is obviously slower for repeated computations, but it can be more fault-tolerant for long sequences of transformations, and it may be the only feasible option for enormous computations.
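Assuming hugeRdd stands in for an RDD whose partitions exceed the available executor memory, a disk-backed storage level might look like this:

```scala
import org.apache.spark.storage.StorageLevel

// DISK_ONLY writes every persisted partition to the executors' local disks.
hugeRdd.persist(StorageLevel.DISK_ONLY)
// Alternatively, MEMORY_AND_DISK keeps what fits in RAM and spills the rest:
// hugeRdd.persist(StorageLevel.MEMORY_AND_DISK)
hugeRdd.count() // materializes the partitions under the chosen level
```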
The persist() function in the RDD class lets the user control how the RDD is stored. By default, persist() stores an RDD as deserialized objects in memory, but the user can pass one of the numerous storage options to persist() to control how the RDD is stored. We will cover the different options for RDD reuse in “Types of RDD Reuse: Cache, Persist, Checkpoint, Shuffle Files”. When persisting RDDs, the default implementation evicts the least recently used partition (called LRU caching) if the space it takes is required to compute or to cache a new partition. However, you can change this behavior and control Spark’s memory prioritization with the persistencePriority() function in the RDD class.
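To make the persist() options concrete, here is a short sketch (lineage stands in for an already-defined RDD) that overrides the default storage level and then releases the cached partitions explicitly rather than waiting for LRU eviction:

```scala
import org.apache.spark.storage.StorageLevel

// Override the default (deserialized, in-memory) behavior with an
// explicit storage level: serialized in memory, spilling to disk.
val reused = lineage.persist(StorageLevel.MEMORY_AND_DISK_SER)
reused.count()     // first action materializes and stores the partitions
reused.collect()   // later actions read the persisted copy
reused.unpersist() // free the space once the RDD is no longer needed
```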