Posts

Showing posts from April, 2024

Azure DataBricks Accessing Data Lake (Using Access Key)

Introduction: In this blog post we discuss how to access Azure Data Lake Gen2.

How can we access Azure Data Lake Gen2: Access can be granted in any of the following ways:
· Using a storage access key
· Using a shared access signature (SAS token)
· Using a service principal
· Using Azure Active Directory authentication passthrough
· Using Unity Catalog

In this post we cover accessing Azure Data Lake Gen2 by using an access key.

Authenticate Data Lake with Access Key:
· Each storage account comes with 2 access keys
· Each access key is 512 bits
· An access key gives full access to the storage account
· Consider it a superuser credential
· Keys can be rotated (regenerated)

Access Key Spark configuration: Here we take myschool as the Azure Data Lake Gen2 storage account, which has a container named bronze. The bronze container holds a CSV file named school.csv.

spark.conf.set( "
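As a minimal sketch of the access-key setup described above: it assumes the standard ABFS driver setting `fs.azure.account.key.<account>.dfs.core.windows.net` and the `abfss://` URI scheme. The helper function names and the secret scope and key names in the comments are hypothetical and only there for illustration, as is the assumption that school.csv has a header row.

```python
def account_key_conf(storage_account: str) -> str:
    """Name of the Spark setting that holds the access key for one storage account."""
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"


def abfss_uri(storage_account: str, container: str, path: str = "") -> str:
    """Build an abfss:// URI for a path inside a Data Lake Gen2 container."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{path.lstrip('/')}"


def configure_access_key(spark, storage_account: str, access_key: str) -> None:
    """Register the access key on the given Spark session."""
    spark.conf.set(account_key_conf(storage_account), access_key)


# In a Databricks notebook, where `spark` and `dbutils` are predefined:
#   key = dbutils.secrets.get(scope="my-scope", key="myschool-key")  # hypothetical secret names
#   configure_access_key(spark, "myschool", key)
#   df = spark.read.csv(abfss_uri("myschool", "bronze", "school.csv"), header=True)  # header assumed
```

Because the access key acts like a superuser credential, reading it from a secret scope rather than pasting it into the notebook is usually the safer choice.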

Azure DataBricks Architecture Part-2 (DataBricks Cluster)

Introduction: In this blog post we discuss the DataBricks cluster types and the options that need to be selected when creating a DataBricks cluster.

Types of DataBricks Cluster: Clusters fall into two main types.
· All-purpose Cluster
· Job Cluster

Difference between them:

All-purpose Cluster | Job Cluster
Created manually | Created by a job
Persisted (can be terminated manually) | Not persisted (terminated when the job ends)
Suitable for interactive workloads | Suitable for automated workloads
Shared among many users | Isolated, just for the job
Expensive to run | Cheaper to run

Note: We cannot create a job cluster manually. It is created automatically when a job runs.

Cluster Configuration option Details: These options need to be selected when creating a cluster.

· Single/Multi Node: In Single nod
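To make the difference between the two cluster types concrete, here is a minimal sketch of how a job task can point at either kind of compute. It assumes the Databricks Jobs API 2.1 task fields `new_cluster` (a job cluster created for the run) and `existing_cluster_id` (an all-purpose cluster created beforehand). The job name, task key, notebook path, cluster id, runtime version, and node type are placeholders, and the helper functions are hypothetical.

```python
import json


def notebook_task(task_key: str, notebook_path: str) -> dict:
    """A job task that runs a notebook; no compute is attached yet."""
    return {"task_key": task_key, "notebook_task": {"notebook_path": notebook_path}}


def on_job_cluster(task: dict, spark_version: str, node_type_id: str, num_workers: int) -> dict:
    """Attach a job cluster: created when the run starts, terminated when it ends."""
    cluster = {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }
    return {**task, "new_cluster": cluster}


def on_all_purpose_cluster(task: dict, cluster_id: str) -> dict:
    """Attach an existing all-purpose cluster that was created manually."""
    return {**task, "existing_cluster_id": cluster_id}


task = notebook_task("load_school", "/Shared/load_school")  # hypothetical names
automated = on_job_cluster(task, "13.3.x-scala2.12", "Standard_DS3_v2", 2)  # placeholder values
interactive = on_all_purpose_cluster(task, "0000-000000-example")  # placeholder cluster id

# Payload shape for a job whose single task runs on its own job cluster.
print(json.dumps({"name": "school-job", "tasks": [automated]}, indent=2))
```

The `automated` variant matches the Job Cluster column of the comparison, while `interactive` reuses a shared all-purpose cluster that keeps running until someone terminates it.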

Azure Data Bricks Architecture Part-1

Introduction: We have decided to publish a series of blog posts for learning Azure DataBricks step by step. This post is aimed at beginners who want to get started with Azure DataBricks.

Azure DataBricks High Level Diagram:

Azure DataBricks is an analytics platform optimized for the Microsoft Azure cloud services platform. It provides a collaborative environment with Apache Spark-based analytics that enables big data processing, real-time analytics, and machine learning tasks. Here are some key features of Azure DataBricks:
· Optimized Apache Spark Environment: It allows you to set up an Apache Spark environment quickly, with autoscaling and auto-termination to optimize costs.
· Collaborative Workspace: It supports multiple languages like Python, Scala, R, Java, and SQL, and integrates with tools like GitHub and Azure DevOps for version control.
· Machine Learning Capabilities: Azure DataBricks integrates with Azure Machine Learning to provide au