Posts

Azure DataBricks Accessing Data Lake (Using Access Key)

Introduction: In this blog post we are going to discuss accessing Azure Data Lake Gen2.

How can we access Azure Data Lake Gen2? Access can be done by:
- Using a storage access key
- Using a shared access signature (SAS token)
- Using a service principal
- Using Azure Active Directory credential passthrough
- Using Unity Catalog

In this post we are going to discuss accessing Azure Data Lake Gen2 by using an access key.

Authenticate Data Lake with Access Key:
- Each storage account comes with 2 access keys
- Each access key is 512 bits
- An access key gives full access to the storage account
- Consider it a super user
- Keys can be rotated (re-generated)

Access Key Spark Configuration: Here we take myschool as the Azure Data Lake Gen2 storage account, which has a container named bronze. The bronze container has a CSV file named school.csv.

spark.conf.set("
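Completed as a minimal PySpark sketch (myschool, bronze, and school.csv come from the example above; the access key value is a placeholder, and in practice you would read it from a Databricks secret scope rather than hard-coding it):

    # Runs in a Databricks notebook, where `spark` is predefined.
    # Authenticate to ADLS Gen2 with the storage account access key.
    # The key value below is a placeholder, not a real key.
    spark.conf.set(
        "fs.azure.account.key.myschool.dfs.core.windows.net",
        "<your-storage-account-access-key>"
    )

    # Read school.csv from the bronze container over the abfss protocol.
    df = spark.read.csv(
        "abfss://bronze@myschool.dfs.core.windows.net/school.csv",
        header=True
    )
    df.show()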

Azure DataBricks Architecture Part-2 (DataBricks Cluster)

Introduction: In this blog post we are going to discuss the DataBricks cluster types and the options that need to be selected when creating a DataBricks cluster.

Types of DataBricks Cluster: Mainly we can divide them into two types.
- All-purpose cluster
- Job cluster

Difference between them:

All-purpose Cluster                         Job Cluster
Created manually                            Created by a job
Persisted (can be terminated manually)      Non-persisted (terminated when the job ends)
Suitable for interactive workloads          Suitable for automated workloads
Shared among many users                     Isolated, just for the job
Expensive to run                            Cheaper to run

Note: We cannot create a job cluster manually. It is created automatically when a job runs, as the sketch after this preview shows.

Cluster Configuration Option Details: We need to select these options when creating a cluster.

- Single/Multi Node: In Single Node mode there are no workers; the driver runs the Spark jobs on a single machine
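To make the "created by a job" point concrete, here is a sketch of a Databricks Jobs API 2.1 payload in Python; every name, notebook path, runtime version, and node type below is an assumption for illustration, not a value from the post:

    # Hypothetical Jobs API 2.1 payload: the "new_cluster" block asks the job
    # to create (and later terminate) its own job cluster for this run.
    job_spec = {
        "name": "example-job",                              # assumed job name
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {
                    "notebook_path": "/Repos/example/notebook"  # assumed path
                },
                "new_cluster": {                # job cluster: created by the job
                    "spark_version": "13.3.x-scala2.12",    # assumed runtime
                    "node_type_id": "Standard_DS3_v2",      # assumed VM size
                    "num_workers": 2,
                },
            }
        ],
    }
    # An all-purpose cluster, by contrast, would be created manually in the UI
    # and referenced via "existing_cluster_id" instead of "new_cluster".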

Azure DataBricks Architecture Part-1

Introduction: We have decided to provide several blog posts for learning Azure DataBricks step by step. This blog post is useful for beginners who want to get started with Azure DataBricks.

Azure DataBricks High Level Diagram: Azure DataBricks is an analytics platform optimized for the Microsoft Azure cloud services platform. It provides a collaborative environment with Apache Spark-based analytics that enables big data processing, real-time analytics, and machine learning tasks. Here are some key features of Azure DataBricks:
- Optimized Apache Spark Environment: It allows you to set up an Apache Spark environment quickly, with autoscaling and auto-termination to optimize costs.
- Collaborative Workspace: It supports multiple languages like Python, Scala, R, Java, and SQL, and integrates with tools like GitHub and Azure DevOps for version control.
- Machine Learning Capabilities: Azure DataBricks integrates with Azure Machine Learning to provide automated

Vertipaq Engine – Column & Segment Elimination

Introduction: As we are discussing the power of the Vertipaq Engine, in this post we are going to discuss another beautiful technique the Vertipaq Engine uses to make our queries faster: Column Elimination and Segment Elimination. Hope it will be interesting.

Case Study: We have a table which is imported into the Vertipaq engine. Table Name: Order_Details

Column Elimination: As the Vertipaq Engine stores data in a columnar fashion, it looks like this. Now we need to fetch data for customer 'ABC Company' and want to know the total Qty the specified company purchased. For that we need only the columns named Customer and Qty; all other columns are simply skipped. This limits the data read by eliminating unneeded columns, and it is called column elimination.

Segment Elimination: This is done by horizontal partitioning. Power Pivot uses segments of 1 million rows, while SSAS Tabular uses segments of 8 million rows per partition. Now how it works
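A toy Python sketch of the idea (illustrative only, not Vertipaq internals; the sample data is made up): each segment keeps min/max metadata per column, and any segment whose range cannot contain the filter value is skipped without being scanned.

    # Each segment stores lightweight min/max metadata for the Customer column.
    segments = [
        {"min": "ABC Company", "max": "Fabrikam",
         "rows": [("ABC Company", 10), ("Contoso", 5)]},
        {"min": "Litware", "max": "Northwind",
         "rows": [("Litware", 7), ("Northwind", 3)]},
        {"min": "ABC Company", "max": "Contoso",
         "rows": [("ABC Company", 4)]},
    ]

    target = "ABC Company"
    total_qty = 0
    for seg in segments:
        # Segment elimination: skip segments whose min/max range excludes the value.
        if not (seg["min"] <= target <= seg["max"]):
            continue
        # Column elimination happens implicitly: only Customer and Qty are touched.
        total_qty += sum(qty for customer, qty in seg["rows"] if customer == target)

    print(total_qty)  # 14 -- the middle segment was never scanned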

Vertipaq Engine – Columnar Database

Introduction: In our previous post, we discussed Vertipaq data compression, one of the best features of the Vertipaq engine by Microsoft. In this post we are going to discuss the Vertipaq columnar database. The ultimate feature of the Vertipaq engine is to speed up query performance. Hope it will be interesting.

What's Special in the Vertipaq Engine: Vertipaq is an in-memory columnar database. Being in-memory means that all of the data handled by a model resides in RAM.

How a Database Works with a Table: To understand the columnar database we have to understand how data is retrieved from a table in a traditional database like MS SQL Server. Let's take an example of a table. Table Name: tbl_OrderDetails. The data in a database table is stored in 8 KB pages named data pages. Eight data pages combine to form an extent. When we search for a specific product to understand how much quantity has been sold
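A small, purely illustrative Python sketch of the difference (the sample data is made up): a row store touches every column of every row during a scan, while a column store reads only the columns the query needs.

    # The same data laid out row-wise vs column-wise.
    row_store = [
        {"Product": "Pen", "Qty": 10, "Price": 2.0},
        {"Product": "Pencil", "Qty": 5, "Price": 1.0},
        {"Product": "Pen", "Qty": 3, "Price": 2.0},
    ]

    column_store = {
        "Product": ["Pen", "Pencil", "Pen"],
        "Qty": [10, 5, 3],
        "Price": [2.0, 1.0, 2.0],
    }

    # Row store: every row (all columns, including Price) is touched.
    qty_row = sum(r["Qty"] for r in row_store if r["Product"] == "Pen")

    # Column store: only the Product and Qty columns are read; Price is untouched.
    qty_col = sum(
        q for p, q in zip(column_store["Product"], column_store["Qty"]) if p == "Pen"
    )

    assert qty_row == qty_col == 13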

Vertipaq Engine – Data Compression

Introduction: Microsoft introduced a very powerful engine for fast queries named Vertipaq, which is used in Power BI and the SSAS Tabular model. Here we are not going to discuss Power BI or SSAS Tabular; our main concern is to identify how the Vertipaq Engine works and how it provides such fast queries. The Vertipaq Engine stores a table in a columnar format and establishes relations between columns. In this blog post we are going to discuss the data compression mechanism and try to explain it.

Case Study: We have a table which is imported into the Vertipaq engine. Table Name: Order_Details

How Vertipaq Works on this Table

Vertipaq Compresses Text in a Table: Text takes a lot of space in a database, so Vertipaq needs to compress it. It maintains another table with the distinct text values and an ID number, like this. This is also called dictionary/hash compression.

Product_ID_Dictionary

Value Encoding: Here Vertipaq tries to convert each value into one digit where possible. To understand
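A toy Python sketch of the dictionary (hash) compression described above (illustrative, not Vertipaq's actual implementation; the sample values are made up): distinct text values are stored once in a dictionary, and the column itself keeps only small integer IDs.

    # Toy dictionary encoding: store each distinct text once, keep integer IDs.
    products = ["Pen", "Pencil", "Pen", "Eraser", "Pen", "Pencil"]

    dictionary = {}   # text value -> ID (the Product_ID_Dictionary idea)
    encoded = []      # the compressed column: IDs instead of strings
    for value in products:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        encoded.append(dictionary[value])

    print(dictionary)   # {'Pen': 0, 'Pencil': 1, 'Eraser': 2}
    print(encoded)      # [0, 1, 0, 2, 0, 1]

    # Decoding: invert the dictionary to recover the original values.
    reverse = {v: k for k, v in dictionary.items()}
    assert [reverse[i] for i in encoded] == products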

Working with Spark – Spark RDD

Introduction: In my previous blog post, we discussed Hadoop MapReduce. Now in this blog post we are going to discuss Spark fundamental concepts like the RDD. RDD stands for Resilient Distributed Dataset; these are the elements that run and operate on multiple nodes to do parallel processing on a cluster. Hope it will be interesting.

What is an RDD: According to the Apache Spark documentation - "Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat". RDDs are immutable elements, which means once we create an RDD we cannot change it. RDDs are fault tolerant as well; hence, in case of any failure, they recover automatically
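Both creation paths from the quoted documentation are easy to show in PySpark; a minimal sketch (the app name is arbitrary and the HDFS path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # 1. Parallelize an existing collection in the driver program.
    numbers = sc.parallelize([1, 2, 3, 4, 5])
    squared = numbers.map(lambda x: x * x)   # transformation: returns a new RDD
    print(squared.collect())                 # action: [1, 4, 9, 16, 25]

    # 2. Reference a dataset in external storage (placeholder path).
    # lines = sc.textFile("hdfs:///path/to/file.txt")

    # RDDs are immutable: map() did not change `numbers`, it produced `squared`.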