
Wednesday, 30 August 2017

Choose your data storage format In Hadoop Eco System


Shall I go with Text, Avro, ORC, Sequence, or Parquet format?
Believe me, it all depends on the type and size of the data you are working with.

Tools Compatibility:
Sometimes the choice depends on the compatibility of the tools you are working with.
Impala does not understand ORC --> then choose Parquet or RC format instead.
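To make this concrete, here is a small sketch of picking formats that every engine in a pipeline can read. The mapping is an assumption based on this post's claim that Impala does not read ORC (true as of 2017); check your distribution's documentation for the current state.

```python
# Illustrative format-support matrix; entries are assumptions based on the
# post's claim (circa 2017) that Impala cannot read ORC -- verify against
# your own distribution before relying on this.
SUPPORTED_FORMATS = {
    "hive":   {"text", "sequence", "avro", "rc", "orc", "parquet"},
    "impala": {"text", "sequence", "avro", "rc", "parquet"},  # no ORC
}

def usable_formats(tools):
    """Return the formats readable by every tool in the pipeline, sorted."""
    common = set.intersection(*(SUPPORTED_FORMATS[t] for t in tools))
    return sorted(common)

print(usable_formats(["hive", "impala"]))  # ORC drops out of the intersection
```

If both Hive and Impala must query the same tables, the intersection rules ORC out and Parquet becomes the natural columnar choice.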

Memory Perspective:
1. Parquet/ORC with Snappy compression shrinks the file to roughly a quarter of its original size.
2. Avro with Deflate compression likewise shrinks files to roughly one fourth.
3. Some compression codecs make the file non-splittable, which defeats the very purpose of HDFS: a non-splittable file cannot be processed in parallel across blocks.
4. Sequence, Avro, Parquet, and ORC remain splittable regardless of the compression codec, because compression is applied per block or record inside the container.
5. If you go with text or CSV format, parsing overhead compromises retrieval time.
6. The Sequence file format is mainly designed to exchange data between chained MapReduce jobs.
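A quick, self-contained way to get a feel for compression ratios is to gzip some CSV-like rows with the standard library. The sample row is made up, and highly repetitive data like this compresses far better than typical real data, so the exact ratio here is not representative of Snappy on Parquet; the point is the order of magnitude.

```python
import gzip

# Hypothetical CSV-style log row, repeated to simulate a text-format file.
row = b"2017-08-30,click,user_42,/products/widget,200\n"
data = row * 10_000  # ~450 KB of uncompressed rows

compressed = gzip.compress(data)
ratio = len(compressed) / len(data)
print(f"original={len(data)} compressed={len(compressed)} ratio={ratio:.3f}")

# Caveat tied to point 3 above: a single gzip stream like this one is NOT
# splittable -- one mapper would have to read the whole file. Container
# formats (Sequence/Avro/Parquet/ORC) avoid this by compressing per block.
```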

Querying Speed:
Columnar formats like Parquet and ORC have an advantage in querying speed when a table has many columns but your analysis needs only a few of them, since the engine reads just the requested columns. That advantage is lost if your use case (such as search) still needs all the columns of every row, so decide based on your access pattern.
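The row-versus-column trade-off can be sketched in plain Python: with row storage, summing one column still iterates whole records, while a columnar layout keeps each column contiguous so the query never touches the other fields. The field names are made up for illustration.

```python
# Hypothetical records with four fields each.
rows = [
    {"id": i, "name": f"user_{i}", "country": "IN" if i % 2 else "US",
     "spend": i * 1.5}
    for i in range(1000)
]

# Row-oriented: every record (all four fields) is touched to sum one column.
row_total = sum(r["spend"] for r in rows)

# Column-oriented: each column is a contiguous list; summing "spend"
# never reads "name" or "country" at all.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_total = sum(columns["spend"])

print(row_total == col_total)  # same answer, very different I/O pattern
```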

Schema Evolution
The underlying file structure may change over time: a column's data type, the addition or removal of columns, or a column being altered.
Text file : does not store the schema.
Parquet/Avro : store the schema along with the data.
Parquet : only allows new columns to be appended at the end and does not handle removal of columns.
Avro : is quite flexible; it allows addition, deletion, and renaming of multiple columns.

Now choose the format based on your project's schema-evolution needs.
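Avro's flexibility comes from resolving the writer's schema against the reader's schema at read time. The sketch below imitates that idea in plain Python: added fields are filled from defaults and removed fields are dropped. The field names are illustrative, and real Avro additionally supports renames via aliases.

```python
def resolve(record, reader_fields):
    """Avro-style resolution sketch: read an old record under a new schema.

    reader_fields: list of (name, default) pairs for the reader schema.
    Fields missing from the record get their default; fields absent from
    the reader schema are silently dropped.
    """
    return {name: record.get(name, default) for name, default in reader_fields}

# Record written with an old schema (has "legacy_flag", lacks "email").
old_record = {"id": 7, "name": "asha", "legacy_flag": True}

# New reader schema: "legacy_flag" removed, "email" added with a default.
new_schema = [("id", None), ("name", None), ("email", "unknown")]

print(resolve(old_record, new_schema))
# -> {'id': 7, 'name': 'asha', 'email': 'unknown'}
```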

Scenarios:
If your use case is writing fast and you seldom query the huge datasets, go for a text format.
If your use case is retrieving data fast, go for a columnar format; here write time is compromised by some extra processing.
If your schema keeps evolving, go for Avro.
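The scenarios above can be distilled into a tiny decision helper. The rules and return values simply mirror this post's recommendations, not any official guidance, and a real decision would also weigh tool compatibility and compression.

```python
def choose_format(fast_writes=False, selective_reads=False,
                  evolving_schema=False):
    """Toy decision helper mirroring the scenarios in this post."""
    if evolving_schema:
        return "avro"         # most flexible schema evolution
    if selective_reads:
        return "parquet/orc"  # columnar: fast selective reads, slower writes
    if fast_writes:
        return "text"         # cheap to write, costly to parse later
    return "depends on workload"

print(choose_format(evolving_schema=True))  # -> avro
```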


Tests were performed on the five data formats below against Hive and Impala:
– Text
– Sequence
– Avro
– Parquet
– ORC

See the URL below for the results of those tests:
http://www.svds.com/dataformats/
