Spark is extremely slow on wide data (more than 15k columns and more than 15k rows). Most genomics and transcriptomics data is wide, because the number of genes is usually >20k, and often the number of samples as well. The very popular GTEx dataset is a good example (see, for instance, the RNA-Seq data at https://storage.googleapis.com/gtex_analysis_v7/rna_seq_data where a .gct file is just a .tsv file with two comment lines at the beginning). Everything done on wide tables (even a simple "describe" applied to all the gene columns) either takes hours or freezes (because of lost executors), irrespective of memory and number of cores. The same operations run fast (in minutes) and without problems in pure pandas, with no Spark involved.
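For reference, the pandas side of this comparison can be sketched roughly as follows. This is a minimal illustration, not a benchmark: the tiny in-memory table stands in for a real GTEx .gct file, and the column names `Name` and `Description`, the gene and sample identifiers, and the helper names `load_gct` and `describe_genes` are assumptions made for the example.

```python
import io

import pandas as pd

# Tiny synthetic stand-in for a .gct file: two leading lines, then a TSV table
# with one row per gene and one column per sample (all values are made up).
GCT_TEXT = (
    "#1.2\n"
    "3\t2\n"
    "Name\tDescription\tsample_1\tsample_2\n"
    "ENSG_A\tgeneA\t1.0\t3.0\n"
    "ENSG_B\tgeneB\t2.0\t4.0\n"
    "ENSG_C\tgeneC\t0.0\t10.0\n"
)


def load_gct(source):
    """Read a GCT-like table, skipping the two leading non-table lines."""
    return pd.read_csv(source, sep="\t", skiprows=2, index_col="Name")


def describe_genes(expr):
    """Transpose so genes become columns, then summarize every gene column."""
    values = expr.drop(columns=["Description"]).T  # rows: samples, columns: genes
    return values.describe()


expr = load_gct(io.StringIO(GCT_TEXT))  # a file path or URL would also work here
summary = describe_genes(expr)
print(summary)
```

With a real file, the same two calls apply to a table with tens of thousands of gene columns; whether it fits in memory on a single machine depends on the file and the hardware.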