Posts

Azure DataBricks Accessing Data Lake (Using Access Key)

Introduction: In this blog post we are going to discuss accessing Azure Data Lake Gen2.

How can we access Azure Data Lake Gen2? Access can be done by:
· Using a storage access key
· Using a shared access signature (SAS token)
· Using a service principal
· Using Azure Active Directory authentication pass-through
· Using Unity Catalog

In this post we are going to discuss accessing Azure Data Lake Gen2 by using an access key.

Authenticate Data Lake with Access Key:
· Each storage account comes with 2 access keys
· Each access key is 512 bits
· An access key gives full access to the storage account
· Consider it a super user
· Keys can be rotated (re-generated...
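Below is a minimal sketch of how an access key is typically used from a Databricks notebook, assuming a hypothetical storage account, container, and secret scope name; `spark`, `dbutils`, and `display` are provided by the Databricks notebook environment.

```python
# Minimal sketch (placeholder names): set the storage account access key on the
# Spark session, then read a file from ADLS Gen2 over the abfss:// endpoint.

storage_account = "mystorageaccount"          # placeholder storage account name

# Keep the key in a secret scope rather than hard-coding it in the notebook.
access_key = dbutils.secrets.get(scope="demo-scope", key="storage-access-key")

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    access_key
)

df = (spark.read
      .option("header", "true")
      .csv(f"abfss://demo@{storage_account}.dfs.core.windows.net/raw/orders.csv"))

display(df)
```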

Azure DataBricks Architecture Part-2 (DataBricks Cluster)

Introduction: In this blog post we are going to discuss DataBricks cluster types and the options that need to be selected when creating a DataBricks cluster.

Types of DataBricks Cluster: Mainly we can divide them into two types.
· All-purpose cluster
· Job cluster

Difference between them:
· Creation: an all-purpose cluster is created manually; a job cluster is created by a job.
· Lifetime: an all-purpose cluster is persisted (can be terminated manually); a job cluster is non-persisted (terminated when the job ends).
· Workload: an all-purpose cluster is suitable for interactive workloads; a job cluster is suitable for automated workloads.
· Sharing: an all-purpose cluster is shared among many users; a job cluster is isolated just for its job.
· Cost: an all-purpose cluster is expensive to run; a job cluster is cheaper to run.

Note: We cannot create a job cluster ourselves. It is automatically created when a job runs.

Cluster Configuration option Details: We need to select ...
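As a reference point for the configuration options mentioned above, here is a hypothetical sketch of creating a small all-purpose cluster through the Databricks Clusters REST API; the workspace URL, token, runtime version, and VM size are placeholders.

```python
# Hypothetical sketch: create an all-purpose cluster via the Databricks REST API.
import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                                     # placeholder

cluster_spec = {
    "cluster_name": "demo-all-purpose",
    "spark_version": "13.3.x-scala2.12",   # Databricks runtime version (placeholder)
    "node_type_id": "Standard_DS3_v2",     # Azure VM size (placeholder)
    "num_workers": 2,
    "autotermination_minutes": 30,         # terminate after 30 idle minutes
}

resp = requests.post(
    f"{workspace_url}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
print(resp.json())  # returns the new cluster_id on success
```

A job cluster, by contrast, is typically described inside the job definition itself (e.g. a `new_cluster` block in the Jobs API) and is spun up and torn down by the job run.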

Azure DataBricks Architecture Part-1

Introduction: We have decided to provide a series of blog posts for learning Azure DataBricks step by step. This blog post is useful for beginners who want to start with Azure DataBricks.

Azure DataBricks High Level Diagram: Azure DataBricks is an analytics platform optimized for the Microsoft Azure cloud services platform. It provides a collaborative environment with Apache Spark-based analytics that enables big data processing, real-time analytics, and machine learning tasks. Here are some key features of Azure DataBricks:
· Optimized Apache Spark Environment: It allows you to set up an Apache Spark environment quickly, with autoscaling and auto-termination to optimize costs.
· Collaborative Workspace: It supports multiple languages like Python, Scala, R, Java, and SQL, and integrates with tools like GitHub and Azure DevOps for version control.
· ...

Vertipaq Engine – Column & Segment Elimination

Introduction: As we are discussing the power of the Vertipaq Engine, in this post we are going to discuss another beautiful technique that the Vertipaq Engine uses to make our queries faster: Column Elimination and Segment Elimination. Hope it will be interesting.

Case Study: We have a table which is imported into the Vertipaq engine. Table Name: Order_Details

Column Elimination: As the Vertipaq Engine stores data in a columnar fashion, it looks like this. Now we need to fetch data for the customer 'ABC Company' and want to know the total quantity the specified company purchased. For that we need only the columns named Customer and Qty. All other columns are simply eliminated. It limits the data by eliminating columns, and this is called column elimination.

Segment Elimination: This is done by horizontal partitioning. For Power Pivot it takes 1 million rows per partition and for SSAS Tabular it takes 8 million per Partit...
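The following is a conceptual sketch only (plain Python, not Vertipaq internals, with made-up sample values) of how reading just the Customer and Qty columns, and skipping segments that cannot contain 'ABC Company', answers the query.

```python
# Conceptual sketch: each column is stored separately and split into segments.
segments = [
    {"Customer": ["ABC Company", "XYZ Ltd"], "Qty": [10, 5], "Product": ["P1", "P2"]},
    {"Customer": ["XYZ Ltd", "XYZ Ltd"],     "Qty": [7, 3],  "Product": ["P3", "P1"]},
    {"Customer": ["ABC Company", "DEF Inc"], "Qty": [4, 8],  "Product": ["P2", "P4"]},
]

def total_qty(customer):
    total = 0
    for seg in segments:
        # Segment elimination: skip segments that cannot contain this customer.
        if customer not in set(seg["Customer"]):
            continue
        # Column elimination: only Customer and Qty are scanned; Product is never read.
        for cust, qty in zip(seg["Customer"], seg["Qty"]):
            if cust == customer:
                total += qty
    return total

print(total_qty("ABC Company"))  # 14
```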

Vertipaq Engine – Columnar Database

Introduction: In our previous post, we discussed Vertipaq data compression, one of the best features of the Vertipaq engine by Microsoft. In this post we are going to discuss the Vertipaq columnar database. The ultimate purpose of the Vertipaq engine is to speed up query performance. Hope it will be interesting.

What is Special about the Vertipaq Engine: Vertipaq is an in-memory columnar database. Being in-memory means that all of the data handled by a model resides in RAM.

How a Database Works with a Table: To understand the columnar database we have to understand how data is retrieved from a table in a traditional database like MS SQL Server. Let's take an example of a table. Table Name: tbl_OrderDetails. The data in a database table is stored in 8 KB pages named data pages. Eight data pages combined together form an extent. When we search for a specified product to understand how many ...
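As a rough illustration of the difference, here is a conceptual sketch only (plain Python, with made-up data) contrasting a row-wise layout, where scanning one attribute still touches every field of every row, with a column-wise layout, where a query reads just the column it needs.

```python
# Row store: every row carries all columns on the page.
row_store = [
    {"OrderID": 1, "Product": "Bike",  "Qty": 2},
    {"OrderID": 2, "Product": "Chain", "Qty": 5},
    {"OrderID": 3, "Product": "Bike",  "Qty": 1},
]

# Column store: each column lives in its own contiguous array.
column_store = {
    "OrderID": [1, 2, 3],
    "Product": ["Bike", "Chain", "Bike"],
    "Qty":     [2, 5, 1],
}

# Counting orders for one product: the row store walks whole rows,
# while the column store only reads the Product column.
bike_orders_rowwise = sum(1 for row in row_store if row["Product"] == "Bike")
bike_orders_colwise = column_store["Product"].count("Bike")

print(bike_orders_rowwise, bike_orders_colwise)  # 2 2
```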

Vertipaq Engine – Data Compression

Introduction: Microsoft introduced a very powerful engine for fast queries named Vertipaq, which is used in Power BI and the SSAS Tabular model. Here we are not going to discuss Power BI or the SSAS Tabular model; our main concern is to identify how the Vertipaq engine works and how it provides such fast queries. The Vertipaq engine stores a table in a columnar format and establishes relationships between columns. In this blog post we are going to discuss the data compression mechanism and try to explain it.

Case Study: We have a table which is imported into the Vertipaq engine. Table Name: Order_Details

How Vertipaq Works on this Table: Vertipaq compresses text in a table. Text takes a lot of space in a database, so Vertipaq needs to compress it. It maintains another table with the distinct text values and ID numbers, like this. It is also called dictionary/hash compression.

Product_ID_Dictionary              Value Encoding H...
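Here is a conceptual sketch only (plain Python, made-up product names) of the dictionary/hash encoding idea described above: distinct text values go into a dictionary and the column itself stores only small integer IDs.

```python
products = ["Mountain Bike", "Road Bike", "Mountain Bike", "Helmet", "Road Bike"]

dictionary = {}       # text value -> ID
encoded_column = []   # the column after encoding: only integer IDs

for value in products:
    if value not in dictionary:
        dictionary[value] = len(dictionary)   # assign the next available ID
    encoded_column.append(dictionary[value])

print(dictionary)      # {'Mountain Bike': 0, 'Road Bike': 1, 'Helmet': 2}
print(encoded_column)  # [0, 1, 0, 2, 1]

# Decoding is a simple reverse lookup.
reverse = {v: k for k, v in dictionary.items()}
print([reverse[i] for i in encoded_column])
```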

Working with Spark – Spark RDD

Introduction: In my previous blog post, we discussed Hadoop MapReduce. Now in this blog post we are going to discuss a fundamental Spark concept: the RDD. RDD stands for Resilient Distributed Dataset; these are the elements that run and operate on multiple nodes to do parallel processing on a cluster. Hope it will be interesting.

What is an RDD: According to the Apache Spark documentation - "Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat". RDDs are immutable elements, which means once we create an RDD we cannot change it. RDDs are fault tolerant as well, hence in case of any failure, they ...
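A small PySpark sketch of the two creation paths quoted above, assuming a local Spark installation; the text file path is a placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# 1) Parallelize an existing collection in the driver program.
numbers = sc.parallelize([1, 2, 3, 4, 5])
print(numbers.map(lambda x: x * x).collect())   # [1, 4, 9, 16, 25]

# 2) Reference a dataset in an external storage system (local file, HDFS, etc.).
lines = sc.textFile("/tmp/sample.txt")          # placeholder path
print(lines.count())

sc.stop()
```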