Blog

Aspects that a manager should consider when inheriting a team

Background: It is ideal when a manager gets to build a product from scratch. When this happens, the manager knows the vision and hires the team, and it might be a little easier to lead than when inheriting a team. I have had similar experiences where I got the chance to build teams.…

Apache Kafka

In this blog, I will share my experience of working with Apache Kafka. I will talk about the different use cases and the basic architecture, followed by a quick example. At one of the companies I worked for, I was asked to replace the existing message queueing framework because it was not scaling. Apache…

Apache Mesos Framework

Apache Mesos is a cluster manager that provides efficient resource sharing across distributed applications. One of the advantages of Apache Mesos is linear scalability. Companies like Twitter and Airbnb have used Mesos and built their own frameworks on top of it. Let’s take a couple of examples: Mesos can also be used as a cluster…

Managing with a Growth Mindset

I joined LinkedIn a few months back and had the opportunity to attend one of the growth mindset trainings. It was a very good training, especially from a manager's point of view. I generally believe that smart people find ways to be good at the job given to them. This is also true when…

HDFS Formats: Parquet vs AVRO

AVRO: Row-based storage format. Its schema is stored with the data, and it has robust support for data schemas that change over time, i.e. schema evolution. Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. When to use: Data from the landing zone is usually…

Hadoop Distributed File System Introduction

Introduction: Built for scale. Fault-tolerant and designed to be deployed on low-cost hardware. Hundreds or thousands of server machines, each storing part of the file system’s data. Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS. Designed for batch processing rather than interactive use by users. High…

Spark Introduction

Introduction: Fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, etc. The core data abstraction is the Resilient Distributed Dataset (RDD), which provides efficient data sharing between computations. It automatically distributes the data across the cluster and parallelizes the required operations. Integrates with many storage systems (e.g., HDFS, Cassandra,…

