What is AWS Glue?
AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that makes it easy to move data between various data sources for analytics and data processing. It provides a centralized data catalog and can automatically discover and profile data, making it easier for data analysts and scientists to find the data they need for their work. In this post, we use AWS Glue to store the table schema we generate.
What is AWS Glue Crawler?
AWS Glue Crawler analyzes the structure, schema, and metadata of your data to create a unified and searchable data catalog. In this post, we do not use the crawler; instead, we infer the schema with the Pandas library, running the code in AWS Lambda or an AWS Glue ETL job.
What is AWS Glue Catalog?
AWS Glue Catalog is a metadata repository that stores references to data sources, tables, and schemas. It provides a unified view of all data assets across different services and can be used to query and access data across other data stores.
What is AWS Glue Database?
AWS Glue Database is a logical container that holds related tables in the AWS Glue Catalog. A database in AWS Glue Catalog is similar to a database in a traditional relational database management system (RDBMS).
What is AWS Glue Table?
AWS Glue Table is a data structure that defines the data schema in a particular data store. Tables are stored in the AWS Glue Catalog databases and can be used by ETL jobs to transform data and load it into other systems.
What is Amazon S3 (Simple Storage Service)?
Amazon S3 (Simple Storage Service) is a cloud storage service that provides scalable, durable, and secure storage for data. Amazon S3 can be used as a data store for various data types and can be accessed through a simple web interface or programmatically using APIs. It is commonly used as a data lake to store raw data for analytics and processing.
Steps to get Schema
Step 1 – Get the dataset from the link below.
After downloading, extract the archive and upload the file (winequality-red.csv) to Amazon S3, as shown in the image below.
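If you prefer to upload the file programmatically rather than through the console, a minimal boto3 sketch could look like the following (the bucket name demo-bucket-for-schema is the one used in the Lambda code later; replace it with your own bucket):

import boto3

s3 = boto3.client('s3')

# Upload the extracted CSV to the bucket that the Lambda code will read from
s3.upload_file(
    Filename='winequality-red.csv',   # local path to the extracted file
    Bucket='demo-bucket-for-schema',  # your S3 bucket name
    Key='winequality-red.csv'         # object key referenced in the Lambda code
)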
Step 2 – Create an AWS Glue database named testing-database.
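The database can be created from the AWS Glue console or programmatically. A minimal boto3 sketch, assuming the database name testing-database used throughout this post:

import boto3

glue = boto3.client('glue')

# Create the logical container that will hold the table registered later
glue.create_database(
    DatabaseInput={
        'Name': 'testing-database',
        'Description': 'Holds the table whose schema is inferred with Pandas'
    }
)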
Step 3 – Create an AWS Lambda function.
Give the function's execution role the necessary permissions, such as Amazon S3 access and AWS Glue access.
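The exact permissions depend on your environment. As a rough sketch, the AWS managed policies for S3 read access and AWS Glue could be attached to the function's execution role with boto3 like this (the role name below is a placeholder for your Lambda's actual execution role):

import boto3

iam = boto3.client('iam')

role_name = 'lambda-glue-schema-role'  # placeholder: your Lambda execution role

for policy_arn in [
    'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',           # read the CSV from S3
    'arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole',  # create the Glue table
]:
    iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)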
Enter the following code in the AWS Lambda function:
import pandas as pd
import boto3
import io
import json

def lambda_handler(event, context):
    s3_bucket = 'demo-bucket-for-schema'
    s3_key = 'winequality-red.csv'
    database_name = 'testing-database'
    table_name = 'testing-table'

    # Read the CSV from S3 into a Pandas DataFrame
    s3 = boto3.client('s3')
    response = s3.get_object(Bucket=s3_bucket, Key=s3_key)
    df = pd.read_csv(io.BytesIO(response['Body'].read()), encoding='utf8')
    print(df)

    # Map Pandas dtypes to AWS Glue (Hive) column types
    glue_columns = []
    for column in df.columns:
        column_type = df[column].dtype.name
        if column_type == 'object':
            glue_column = {'Name': column, 'Type': 'string'}
        elif column_type == 'int64':
            glue_column = {'Name': column, 'Type': 'bigint'}
        elif column_type == 'float64':
            glue_column = {'Name': column, 'Type': 'double'}
        else:
            raise ValueError(f"Unsupported column type: {column_type}")
        glue_columns.append(glue_column)
    print(glue_columns)

    # Register the table in the AWS Glue Data Catalog
    glue = boto3.client('glue')
    table_input = {
        'Name': table_name,
        'StorageDescriptor': {
            'Columns': glue_columns,
            'Location': f's3://{s3_bucket}/{s3_key}',
            'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
                'Parameters': {
                    'field.delim': ','
                }
            }
        },
        'PartitionKeys': [],
        'TableType': 'EXTERNAL_TABLE'
    }
    glue.create_table(DatabaseName=database_name, TableInput=table_input)
Step 4 – Create a Lambda layer to add the Pandas library.
Step 5 – Upload the Pandas library zip file to Amazon S3, copy its object URL, and paste it into the layer configuration.
If you need a reference for how to package the Pandas library, see How to Install Pandas on AWS Lambda Function – YouTube.
Scroll down to the bottom of the AWS Lambda function page, where you will find the Layers section, and add the Pandas layer created earlier.
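As an alternative to the console, the layer can also be published and attached programmatically. A rough boto3 sketch, assuming the zip has already been uploaded to S3 (the layer, bucket, key, and function names below are placeholders):

import boto3

lam = boto3.client('lambda')

# Publish a layer version from the Pandas zip previously uploaded to S3
layer = lam.publish_layer_version(
    LayerName='pandas-layer',
    Content={'S3Bucket': 'demo-bucket-for-schema', 'S3Key': 'pandas-layer.zip'},
    CompatibleRuntimes=['python3.9']
)

# Attach the new layer version to the Lambda function
lam.update_function_configuration(
    FunctionName='create-glue-schema',  # placeholder function name
    Layers=[layer['LayerVersionArn']]
)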
Step 6 – Execute the Lambda function. If you get a timeout error, increase the timeout in the function's configuration to 1 minute.
Step 7 – Refresh the tables view, and you will find testing-table in the AWS Glue database.
As we can see, the schema has been created in the new table.
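To confirm programmatically that the schema landed in the Data Catalog, a quick boto3 check (names match the ones used above):

import boto3

glue = boto3.client('glue')

# Fetch the table created by the Lambda and print the inferred columns
table = glue.get_table(DatabaseName='testing-database', Name='testing-table')
for col in table['Table']['StorageDescriptor']['Columns']:
    print(col['Name'], '->', col['Type'])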
Now go to AWS Glue ETL Jobs, create a job, click on Script, enter the same code in the script editor, and click Save.
Run the job to see the results in the AWS Glue table.
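One caveat when reusing the script as a Glue job: Glue executes the script from top to bottom and never calls lambda_handler, so you would either drop the function wrapper or invoke it explicitly at the end of the same script, for example:

# Appended at the end of the same script when run as a Glue job:
# Glue runs the file top to bottom, so call the handler explicitly.
lambda_handler(event=None, context=None)

Glue Python shell jobs typically ship with Pandas preinstalled, so the layer step from the Lambda setup is not needed there.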
Conclusion
We can easily create the schema with the help of the Pandas library, without needing the AWS Glue Crawler, as we have seen above. We can use AWS Lambda or AWS Glue ETL Jobs to run this task. Pandas infers the column types from the CSV data stored in Amazon S3, and the script maps them to AWS Glue column types and registers the table in the Data Catalog.
About CloudThat
CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As a Microsoft Solutions Partner, AWS Advanced Tier Training Partner, and Google Cloud Platform Partner, CloudThat has empowered over 850,000 professionals through 600+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 12 awards in the last 8 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, IoT, and cutting-edge technologies like Gen AI & AI/ML. It has delivered over 500 consulting projects for 250+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.
FAQs
1. Is the Pandas library open source?
ANS: – Yes, Pandas is open source and free to use. You pay only for the AWS Lambda invocations or AWS Glue ETL job runs that execute the code.
2. Which should you use, the Pandas library or AWS Glue Crawler?
ANS: – It depends on factors such as resources, cost, and use case. If you want a fully managed experience with no custom code, go with AWS Glue Crawler, as it handles schema discovery for you.

WRITTEN BY Suresh Kumar Reddy
Suresh is a highly skilled and results-driven Generative AI Engineer with over three years of experience and a proven track record in architecting, developing, and deploying end-to-end LLM-powered applications. His expertise covers the full project lifecycle, from foundational research and model fine-tuning to building scalable, production-grade RAG pipelines and enterprise-level GenAI platforms. Adept at leveraging state-of-the-art models, frameworks, and cloud technologies, Suresh specializes in creating innovative solutions to address complex business challenges.