Introduction
Online examinations and remote hiring platforms have become essential in today’s distributed digital ecosystem. However, ensuring fairness, detecting suspicious behavior, and verifying candidate authenticity remain significant challenges. We developed an AI-driven proctoring system that leverages real-time computer vision using MediaPipe and YOLOv8 to address these concerns.
This intelligent system tracks eye movements and head orientation, detects multiple faces, and identifies the presence of mobile phones, all from a single video feed. It is optimized for hiring assessments and exams and performs efficiently without requiring cloud-based inference engines. The solution is open-source, scalable, and highly accurate across varied environments.
Overview of Components
Our system combines the strengths of two powerful models:
- MediaPipe – For head pose estimation and eye tracking using facial landmarks.
- YOLOv8 – For multi-face detection and mobile phone detection in video streams.
Each component identifies behavior that might indicate cheating or a guideline violation during an exam or interview session.
Head and Eye Tracking using MediaPipe
MediaPipe Face Mesh provides 468 facial landmarks. From these, we compute the head’s pitch, yaw, and roll, which tells us how much the candidate moves or turns their head. Simultaneously, iris tracking shows where the user is looking, which is useful for detecting eye-movement patterns that may suggest distraction or off-screen focus.
For example, if the candidate consistently looks away from the screen or shows frequent head turns, these behaviors can be flagged for review.
Sample Code:
import cv2
import mediapipe as mp

# refine_landmarks=True adds the iris landmarks needed for eye tracking
mp_face_mesh = mp.solutions.face_mesh.FaceMesh(refine_landmarks=True)

cap = cv2.VideoCapture('video.mp4')
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    results = mp_face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        # Calculate pitch, yaw, roll from the facial landmarks
        # Track iris landmarks for eye movement
        pass
cap.release()
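Face Mesh returns normalized landmark coordinates rather than pose angles, so pitch, yaw, and roll have to be derived. One common approach, sketched below, solves a Perspective-n-Point problem between a few 2D landmarks and a generic 3D face model using OpenCV. The landmark indices and 3D reference coordinates here are illustrative assumptions, not values from our production code:

import cv2
import numpy as np

# Generic 3D reference points for a neutral face (arbitrary units).
# These coordinates and the landmark indices below are illustrative
# assumptions; production systems typically calibrate them.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),        # nose tip
    (0.0, -63.6, -12.5),    # chin
    (-43.3, 32.7, -26.0),   # left eye outer corner
    (43.3, 32.7, -26.0),    # right eye outer corner
    (-28.9, -28.9, -24.1),  # left mouth corner
    (28.9, -28.9, -24.1),   # right mouth corner
], dtype=np.float64)

LANDMARK_IDS = [1, 152, 33, 263, 61, 291]  # matching Face Mesh indices

def head_pose(landmarks, width, height):
    """Estimate (pitch, yaw, roll) in degrees from normalized landmarks."""
    image_points = np.array(
        [(landmarks[i].x * width, landmarks[i].y * height)
         for i in LANDMARK_IDS], dtype=np.float64)
    # Approximate intrinsics: focal length ~ image width, no lens distortion
    camera_matrix = np.array([[width, 0, width / 2],
                              [0, width, height / 2],
                              [0, 0, 1]], dtype=np.float64)
    ok, rvec, _ = cv2.solvePnP(MODEL_POINTS, image_points,
                               camera_matrix, np.zeros(4))
    rotation, _ = cv2.Rodrigues(rvec)
    # RQ decomposition yields the Euler angles directly, in degrees
    angles, *_ = cv2.RQDecomp3x3(rotation)
    pitch, yaw, roll = angles
    return pitch, yaw, roll

In the loop above, landmarks would be results.multi_face_landmarks[0].landmark, and width and height come from the frame shape.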
Face Detection using YOLOv8
The YOLOv8 model (we used yolov8n.pt) provides fast and accurate object detection. We used it to detect:
- Faces – To identify when more than one person appears in the frame.
- Mobile Phones – To flag if the candidate uses or has access to a phone during the test.
The model processes all video frames and saves annotated evidence whenever a person and a cell phone are detected simultaneously.
Sample Code:
from ultralytics import YOLO
import cv2

# yolov8n.pt is the lightweight nano variant of YOLOv8
model = YOLO('yolov8n.pt')

frame = cv2.imread('frame.jpg')
results = model(frame)[0]

for box in results.boxes:
    cls_id = int(box.cls)
    label = model.names[cls_id]
    if label in ['person', 'cell phone']:
        # Save frame or draw bounding boxes
        pass
This allows for powerful, lightweight on-device detection without relying on third-party APIs or cloud services. Only relevant frames with multiple faces or phone presence are stored as evidence, reducing storage overhead.
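To keep only the relevant frames, the detection results can drive a simple flag-and-save routine. The sketch below assumes the same yolov8n.pt model; the function name and output path are illustrative:

import os
import cv2
from ultralytics import YOLO

model = YOLO('yolov8n.pt')

def save_if_suspicious(frame, frame_idx, out_dir='evidence'):
    """Save an annotated frame when multiple people or a phone appear."""
    results = model(frame)[0]
    labels = [model.names[int(box.cls)] for box in results.boxes]
    if labels.count('person') > 1 or 'cell phone' in labels:
        os.makedirs(out_dir, exist_ok=True)
        # plot() returns a copy of the frame with bounding boxes drawn
        annotated = results.plot()
        cv2.imwrite(os.path.join(out_dir, f'frame_{frame_idx}.jpg'), annotated)
        return True
    return False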
Technical Challenges and Optimizations
While developing this proctoring solution, a few key challenges had to be addressed for efficiency and accuracy:
- Frame Sampling: Analyzing every frame of a long video is resource-heavy. To optimize, the system can sample fewer frames per second (e.g., 5 FPS instead of 30), reducing computation without missing critical evidence (see the sketch after this list).
- Memory Management: Long videos risk memory overflow if frames aren’t released promptly. To counter this, processed frames are freed immediately, and lightweight buffers keep resource usage low.
- Threshold Calibration: Movement thresholds were carefully tuned to reduce false positives:
  - Head Yaw/Pitch: Movements beyond 40° are flagged.
  - Eye Movement: Normalized iris displacement beyond 0.4 is flagged.
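Put together, sampling and the calibrated thresholds reduce to a short loop. The sketch below uses a hypothetical analyze() helper standing in for the MediaPipe pipeline shown earlier; the 5 FPS target and the 40°/0.4 thresholds come from the calibration described above:

import cv2

TARGET_FPS = 5           # analyze roughly 5 frames per second
ANGLE_LIMIT = 40.0       # degrees, head yaw/pitch threshold from above
IRIS_LIMIT = 0.4         # normalized iris displacement threshold

cap = cv2.VideoCapture('video.mp4')
source_fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if FPS is unknown
step = max(1, round(source_fps / TARGET_FPS))

frame_idx = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    if frame_idx % step == 0:
        # analyze() is a hypothetical wrapper around the MediaPipe pipeline
        # shown earlier; assume it returns head angles and iris displacement.
        pitch, yaw, iris_shift = analyze(frame)
        if abs(yaw) > ANGLE_LIMIT or abs(pitch) > ANGLE_LIMIT \
                or iris_shift > IRIS_LIMIT:
            print(f'Flagged frame {frame_idx}')
    # skipped frames go out of scope immediately, keeping memory usage flat
    frame_idx += 1
cap.release()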
These optimizations ensured the system remained accurate, scalable, and efficient even for extended video sessions.
Conclusion
This AI-based proctoring system merges the efficiency of MediaPipe and the power of YOLOv8 to create a holistic solution for monitoring online exams and interviews.
Whether you’re building a next-gen interview bot, an online assessment tool, or an AI invigilator for remote exams, this system can be a plug-and-play proctoring module.
Drop a query if you have any questions regarding MediaPipe or YOLOv8 and we will get back to you quickly.
FAQs
1. What are pitch, yaw, and roll?
ANS: –
- Pitch: Up and down movement of the head.
- Yaw: Left and right rotation of the head.
- Roll: Tilting of the head sideways.
2. Why combine MediaPipe and YOLOv8?
ANS: – MediaPipe is extremely efficient for tracking fine facial movements like gaze and head orientation, while YOLOv8 excels in real-time object detection. Together, they offer a comprehensive monitoring solution.
3. Can this be used in real-time scenarios?
ANS: – Yes. With basic threading, reduced resolution, and frame skipping strategies, this solution can run in near real-time even on modest hardware.

WRITTEN BY Ahmad Wani
Ahmad works as a Research Associate in the Data and AIoT Department at CloudThat. He specializes in Generative AI, Machine Learning, and Deep Learning, with hands-on experience in building intelligent solutions that leverage advanced AI technologies. Alongside his AI expertise, Ahmad also has a solid understanding of front-end development, working with technologies such as React.js, HTML, and CSS to create seamless and interactive user experiences. In his free time, Ahmad enjoys exploring emerging technologies, playing football, and continuously learning to expand his expertise.