Exploratory Data Analysis (EDA) on Databricks

The Cambridge Analytica-Facebook scandal of 2018 exposed the misuse of personal data belonging to as many as 87 million users. It is staggering to consider how those datasets were collected, stored, analyzed, and mined for patterns that were used predominantly for political advertising!

When you think of data, an Excel sheet usually comes to mind, with alphanumeric values in its rows and columns. In recent years, however, the volume of data companies handle can no longer fit into an Excel file or any other everyday BI tool. There must be an alternative!

The answer is Data Science.

1. Introduction to Exploratory Data Analysis

EDA (Exploratory Data Analysis) is the stepping stone of Data Science: the process involves investigating a dataset and discovering the patterns that underlie it. EDA is a step-by-step approach, from identifying the types of variables to summarizing the data in visual graphs.

Let us dive into the technical nitty-gritty.

2. Why Databricks?

Generally, Databricks is preferable when the data scale is too large for a single machine. Small datasets can be handled with the pandas library, while larger datasets need parallel processing to run scientific computations much faster. The Apache Spark community provides PySpark, including pyspark.sql.functions, to combine the power of Spark with Python. It helps us work with RDDs (Resilient Distributed Datasets) and DataFrames at a much faster pace.
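As a minimal sketch of that idea (the session setup, column names, and sample rows below are illustrative assumptions, not from the original post):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a `spark` session already exists; this line only matters
# when running the sketch outside a Databricks notebook.
spark = SparkSession.builder.appName("eda-sketch").getOrCreate()

# Illustrative data: three rows, one deliberately missing a salary
df = spark.createDataFrame(
    [("Alice", 34, 52000.0), ("Bob", 41, 61000.0), ("Cara", 29, None)],
    ["name", "age", "salary"],
)

# The aggregation executes in parallel across the cluster
df.select(F.avg("age").alias("avg_age")).show()
```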

Here is the step-by-step approach to EDA:

  1. Importing data into Databricks
  2. Descriptive statistics
  3. Missing value imputation
  4. Graphical representation

Let us look at each one of these individually.

3. Importing data into Databricks

  • Small .csv or other data files can be imported into DBFS using the UI.
  • For large datasets, reading data on cluster nodes with Spark is preferable, as shown in the sketch below.
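Below is a minimal sketch of reading an uploaded file, assuming it landed in the default /FileStore/tables location; the file name is hypothetical:

```python
# Read a CSV uploaded to DBFS via the UI; path and file name are illustrative
df = spark.read.csv(
    "/FileStore/tables/sales.csv",
    header=True,       # first row holds column names
    inferSchema=True,  # let Spark infer column types
)
df.printSchema()
```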

4. Descriptive Statistics

  • Summary statistics, or descriptive statistics, summarize the characteristics of a dataset, whether it is a sample or the entire population. The first step is to categorize the type of data each column contains; refer to the chart below for the types of variables and the values they hold.
    [Chart: types of variables and the values they contain]
  • The next step is quantitative research into the data; for example, this may include operations such as deriving the average of one variable or the relation between two variables. A typical summary query returns the count, mean, standard deviation, min, and max.
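A minimal sketch of such a query, reusing the illustrative DataFrame from the earlier sketch (df and its columns are assumptions, not from the original post):

```python
# describe() reports count, mean, stddev, min, and max per numeric column
display(df.describe())

# summary() additionally reports the 25%, 50%, and 75% quartiles
display(df.summary())
```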

5. Missing value imputation

  1. Dropping or imputing rows with missing values
    -> Handling missing values in a DataFrame mostly depends on the type of data present. If most of the columns in a particular row are missing, the ideal way to handle it is to drop the row. If only a small number of columns are missing in a row, they can be replaced with the mean, median, or mode, depending on the type of the variable (see the sketch after this list).
    -> Visit the official Spark documentation for more information on the query.
  2. Imputation method for categorical columns:
    -> Replacing missing values in a categorical column works differently from replacing them in a numeric one. The missing values can be imputed with the most frequent category in the column. If the number of missing values is large enough, they can instead be replaced with a new category such as “not specified”.
    -> The advantage of this approach is minimal data loss: introducing the new category keeps the affected rows in the dataset.
  3. Using Algorithms that support missing values:
    -> There are other, more advanced imputation methods, such as KNNImputer and last observation carried forward (LOCF).
    Visit this documentation for an implementation guide.
    -> Other classification and regression algorithms, such as Naïve Bayes and Random Forest, can also be used to predict the missing values in a dataset. They work best on categorical and non-linear data.
    Here are some links that can help you:
    1.9. Naive Bayes — scikit-learn 1.0.2 documentation
    sklearn.ensemble.RandomForestClassifier — scikit-learn 1.0.2 documentation
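Here is a minimal sketch of the dropping and imputation approaches from items 1 and 2 above, assuming hypothetical salary (numeric) and city (categorical) columns:

```python
from pyspark.sql import functions as F

# 1. Drop rows where most fields are missing
#    (here: keep only rows with at least 3 non-null values)
df_clean = df.dropna(thresh=3)

# 2. Numeric imputation: fill missing salaries with the column mean
mean_salary = df_clean.select(F.mean("salary")).first()[0]
df_clean = df_clean.fillna({"salary": mean_salary})

# 3. Categorical imputation: fill with the most frequent category;
#    when many values are missing, a constant like "not specified" also works
top_city = (
    df_clean.groupBy("city").count()
    .orderBy(F.desc("count"))
    .first()["city"]
)
df_clean = df_clean.fillna({"city": top_city})
```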

6. Graphical Representation

  • Databricks supports various visualizations out of the box using the display and displayHTML functions. Calling display on a DataFrame gives a generalized profiling visualization of the entire dataset.

    The display function supports a rich set of graphs for visualization. Refer to the table below for implementing graphs and visualizing data based on the data type of the columns:
    [Table: recommended graph types by column data type]
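A minimal sketch of both functions inside a Databricks notebook, reusing the hypothetical columns from the imputation sketch:

```python
# display() renders an interactive table; the chart picker in the results
# pane switches between bar, line, scatter, histogram, and more
display(df_clean.groupBy("city").agg(F.avg("salary").alias("avg_salary")))

# displayHTML() renders arbitrary HTML for custom visuals or captions
displayHTML("<h3>Average salary by city</h3>")
```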


7. Conclusion

When there is a need to process and transform massive quantities of data and to explore the underlying patterns through machine learning models, Databricks is a go-to tool. It works well with both large and small datasets, and its parallel processing delivers faster execution on larger datasets, simplifying the EDA process.

It also offers effortless connectivity with Amazon S3: with just one line of code, you can import datasets and store the resulting datasets back to S3, as in the sketch below.
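For instance, a sketch of that one-line import (the bucket and paths are assumptions):

```python
# Read a dataset straight from S3 in one line...
df = spark.read.csv("s3a://my-bucket/raw/sales.csv", header=True, inferSchema=True)

# ...and write the results back to S3
df.write.mode("overwrite").parquet("s3a://my-bucket/curated/sales/")
```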


About CloudThat

CloudThat is a leading provider of Cloud Training and Consulting services with a global presence in India, the USA, Asia, Europe, and Africa. Specializing in AWS, Microsoft Azure, GCP, VMware, Databricks, and more, the company serves mid-market and enterprise clients, offering comprehensive expertise in Cloud Migration, Data Platforms, DevOps, IoT, AI/ML, and more.

CloudThat is the first Indian company to win the prestigious Microsoft Partner 2024 Award and is recognized as a top-tier partner with AWS and Microsoft, including the prestigious ‘Think Big’ partner award from AWS and the Microsoft Superstars FY 2023 award in Asia & India. Having trained 850k+ professionals in 600+ cloud certifications and completed 500+ consulting projects globally, CloudThat is an official AWS Advanced Consulting Partner, Microsoft Gold Partner, AWS Training Partner, AWS Migration Partner, AWS Data and Analytics Partner, AWS DevOps Competency Partner, AWS GenAI Competency Partner, Amazon QuickSight Service Delivery Partner, Amazon EKS Service Delivery Partner, AWS Microsoft Workload Partner, Amazon EC2 Service Delivery Partner, Amazon ECS Service Delivery Partner, AWS Glue Service Delivery Partner, Amazon Redshift Service Delivery Partner, AWS Control Tower Service Delivery Partner, AWS WAF Service Delivery Partner, Amazon CloudFront Service Delivery Partner, Amazon OpenSearch Service Delivery Partner, AWS DMS Service Delivery Partner, AWS Systems Manager Service Delivery Partner, Amazon RDS Service Delivery Partner, AWS CloudFormation Service Delivery Partner, AWS Config, Amazon EMR, and many more.

WRITTEN BY Bineet Singh Kushwah

Bineet Singh Kushwah works as an Associate Architect at CloudThat. His work revolves around data engineering, analytics, and machine learning projects. He is passionate about providing analytical solutions to business problems and deriving insights that enhance productivity. In a quest to learn and work with recent technologies, he spends most of his time on upcoming data science trends and cloud platform services, keeping up with their advancements.
