# Batch vs Real-Time vs Serverless: Inference Types You Must Know for the AWS Certified AI Practitioner (AIF-C01)

> If you can quickly map an AI workload to the right inference type (batch, real-time, async, serverless), you’ll score easy exam points—and design cheaper, faster systems at work.

- Author: Jamie Wright (Founder at Upcert.io)
- Date: 2026-01-16
- Reading time: 8 min
- Tags: AWS, AIF-C01
- Category: Domain 1: Fundamentals of AI and ML

---

## Why “Inference” Matters (for the AIF-C01 Exam and Real AWS Projects)

Inference is the moment your model stops being a science project and starts being useful. Training is where you teach a model patterns (like showing it thousands of labeled photos). Inference is where you ask it to do the job in the real world (like: ‘is this photo a cat?’). On the AIF-C01 exam, that distinction matters because AWS wants you to recognize what kind of system you’re building: a one-time offline scoring job, a snappy app experience, or something in between.

Here’s the practical punchline: inference is where latency, cost, and user experience show up—loudly. A model that’s 2% more accurate doesn’t help much if it takes 12 seconds to respond and users abandon the page. And a ‘cheap’ deployment isn’t actually cheap if you keep a big endpoint running 24/7 just to handle a few requests per hour.

Think of inference like food service. A sit-down restaurant (real-time) is great when customers are waiting at the table. A catering order (batch) is great when you know you’ll serve 500 meals at 6pm. A food truck (serverless) is great when crowds come and go unpredictably. The exam loves this kind of mapping.
If a question describes “interactive,” “immediate,” “low-latency,” or “customer-facing,” you should hear a little bell ring for real-time. If it says “overnight,” “offline,” “large dataset,” or “backfill,” batch should jump to mind. Once you can label the workload correctly, the architecture choices (and the right AWS options) get much easier.

## Basic AI Concepts in Plain Language: Model, Training vs Inference, Endpoint, Latency

A lot of AI jargon is just normal software ideas wearing a lab coat.

- **Model:** the packaged ‘brain’ you can run. It’s basically a file (plus code) that takes input and returns output. Example: you send customer text; it returns ‘positive’ or ‘negative’ sentiment.
- **Training vs inference:** training is the learning phase; inference is the using phase. Training is like studying for a test with an answer key. Inference is like taking the test—no hints, just answers.
- **Endpoint:** a network address you can call to get predictions. If your app needs to ask the model questions all day long, you typically expose the model behind an endpoint so your app can send requests and get responses.
- **Latency:** how long it takes to get an answer back. People often mix up throughput (how many predictions per second) and latency (how fast a single prediction returns). On the exam, latency is usually the deciding factor for whether a workload needs real-time inference.

A quick real-world picture: imagine you’re building a checkout flow that flags fraud. You can’t say, ‘Hang tight, we’ll tell you tomorrow.’ That’s an inference call with a strict latency expectation. Now flip it: you’re scoring your entire customer list once a week to decide who gets a retention email. Nobody is sitting there waiting for a single prediction—so you care more about bulk processing and cost efficiency than about milliseconds.

One more term you’ll see a lot: payload. That’s just the input you send to the model (like JSON or CSV).
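To make ‘payload’ concrete, here’s a minimal sketch in plain Python. The `{"text": ...}` schema is an assumption for illustration, not a real model’s API:

```python
import json

# Hypothetical sentiment model that expects a JSON payload shaped like
# {"text": "..."} -- the schema is made up for illustration.
review = "The checkout flow was fast and painless."
payload = json.dumps({"text": review})

# Alongside the payload bytes, a request normally declares a content type
# (e.g. "application/json" or "text/csv") so the model server knows how
# to deserialize what it receives.
content_type = "application/json"

# The model server decodes the payload back into structured input:
decoded = json.loads(payload)
print(decoded["text"])  # prints the original review text
```

If the model expects `{"text": ...}` and you send `{"review": ...}`, serialization still ‘works’—but the model sees the wrong field.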
If your payload format is wrong, the model may still be brilliant… and still return garbage. If you keep these four words straight—model, training vs inference, endpoint, latency—you’ll understand what most exam questions are really asking.

## What You Need to Know: The 4 Main Inference Types (Batch, Real-Time, Serverless, Async)

Choosing an inference type is basically choosing what kind of ‘waiting’ your system can tolerate. On AWS (and in SageMaker-style thinking), you’ll see four common inference modes: batch, real-time, serverless, and async. The exam isn’t asking you to memorize every knob—it’s testing whether you can match workload needs to the right shape of deployment.

### 1) Batch inference (offline scoring)

Batch is for when you have a pile of records and you want predictions for all of them, but nobody needs answers instantly. Think: nightly churn scoring, weekly demand forecasts, or re-scoring an entire catalog after you retrain. The big idea: you run a job, it chews through data in bulk, and it writes outputs somewhere you can use later. In SageMaker terms, Batch Transform is designed for large datasets when you don’t need a persistent endpoint, and it can keep input records associated with their outputs (so you can trace predictions back to rows).

[When batch transform is the right fit and how it returns bulk outputs](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html)

### 2) Real-time inference (interactive)

Real-time is for ‘user is waiting’ scenarios: search ranking, fraud checks, product recommendations on a page load, chatbot responses, and so on. Here, you deploy the model to a persistent endpoint and invoke it for immediate responses—low latency is the whole point.
[What real-time endpoints are meant for and why they’re used for low-latency predictions](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html)

### 3) Serverless inference (real-time, but pay-per-use vibes)

Serverless inference is what you reach for when traffic is spiky or unpredictable and you don’t want to manage (or pay for) always-on instances. It’s like keeping a taxi on standby only when you actually need rides. One exam-friendly gotcha: serverless is a different deployment option than instance-based real-time endpoints. You can’t just ‘convert’ an existing instance-based real-time endpoint into a serverless endpoint—you deploy it as serverless from the start.

[How serverless inference differs from instance-based endpoints and what you can’t convert](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html)

### 4) Asynchronous inference (request/response, but longer running)

Async sits between real-time and batch. You still send a request, but you don’t block the user waiting for the model to finish. Instead, you typically hand the work off, let it process, and pick up the result later. Use async when a single inference can take longer (think: bigger payloads, more complex processing, or higher variance), and you’d rather not tie up a real-time endpoint experience.

### How to pick quickly (the exam version)

Ask three questions:

1. Does a human/app need the answer now (latency)?
2. Is traffic steady or spiky (capacity planning)?
3. Do you need an always-on endpoint, or can this be a job or queued workflow?

If you can answer those, you can usually eliminate two options immediately.

## Practical Scenarios: Which Inference Mode Would You Use (and Why)?

If you can picture the workload, the right inference mode usually picks itself.

**Scenario 1:** ‘Check this transaction for fraud before approving the purchase.’ That’s real-time. The customer is literally waiting at checkout, and the business impact is immediate.
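A hedged sketch of what that call can look like from app code. The endpoint name, payload schema, and response shape are all hypothetical, and the actual boto3 call is commented out because it needs AWS credentials and a live endpoint:

```python
import json

# Hypothetical fraud-check request: the transaction details are the payload.
transaction = {"amount": 942.50, "card_country": "DE", "ip_country": "US"}
payload = json.dumps(transaction)

# With boto3, this payload would go to a deployed real-time endpoint.
# Commented out because it requires credentials and a live endpoint:
#
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(
#       EndpointName="fraud-detector",      # hypothetical endpoint name
#       ContentType="application/json",
#       Body=payload,
#   )
#   result = json.loads(response["Body"].read())

# Simulated response for illustration. The app blocks until this returns,
# which is why latency is the whole point of real-time inference.
result = {"fraud_score": 0.87, "decision": "review"}
approve = result["fraud_score"] < 0.5
print(approve)  # False -- flag for review instead of auto-approving
```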
You deploy behind an endpoint, your app calls it, and you optimize for fast responses and predictable performance.

**Scenario 2:** ‘Every night, score all customers and write a risk score back to Redshift.’ That’s batch. Nobody is waiting for one prediction, and you’re operating over a large dataset. It’s often cheaper and simpler to run bulk scoring, store outputs, and let analytics tools (like QuickSight) read the results the next day.

**Scenario 3:** ‘We have a doc-processing pipeline: invoices arrive in S3, we extract text, then classify and route them.’ This is where batch or async often shines. The work is naturally event-driven, the processing time can vary by document length/quality, and you generally care more about throughput and reliability than sub-second latency.

**Scenario 4:** ‘Our internal tool gets 10 requests per minute… until a marketing campaign, when it jumps to 2,000.’ This is a strong serverless candidate. You don’t want to pay for peak capacity all month just to survive one busy week. Serverless helps when traffic is bursty and you’d rather let AWS handle scaling decisions.

**Scenario 5:** ‘Users upload a 2-minute audio file, we transcribe and summarize it.’ If you try to force this into classic real-time, you’ll annoy users or time out clients. Async is a better mental model: accept the request, process in the background, then return the result when it’s ready (or notify the user).

One simple rule to remember while studying: real-time is about conversations; batch is about spreadsheets; serverless is about unpredictability; async is about patience without blocking. If you can say which of those four your workload feels like, you’re already most of the way to the correct answer.

## Exam Tips + Common Mistakes: Monitoring, Pipelines, Data Formats, and “Trick” Distinctions

The exam doesn’t just test ‘what option exists’—it tests whether you’d run it sanely in production.

**Monitoring:** for real-time systems, watch latency like a hawk.
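In practice, ‘watching latency’ often means a CloudWatch alarm on the endpoint’s latency metric. A hedged sketch, with a hypothetical endpoint name and threshold (SageMaker real-time endpoints publish a `ModelLatency` metric, in microseconds, to the `AWS/SageMaker` namespace):

```python
# Parameters for a p95 model-latency alarm on a hypothetical SageMaker
# endpoint named "fraud-detector". Thresholds here are illustrative.
alarm_params = {
    "AlarmName": "fraud-detector-p95-latency",  # hypothetical alarm name
    "Namespace": "AWS/SageMaker",
    "MetricName": "ModelLatency",               # reported in microseconds
    "Dimensions": [
        {"Name": "EndpointName", "Value": "fraud-detector"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "ExtendedStatistic": "p95",                 # p95, not just Average
    "Period": 60,                               # seconds per datapoint
    "EvaluationPeriods": 5,
    "Threshold": 200_000,                       # 200 ms, in microseconds
    "ComparisonOperator": "GreaterThanThreshold",
}

# With boto3 (commented out -- requires AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Note the p95 statistic: averages hide the slow tail that actually annoys users.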
If the question mentions CloudWatch alarms, p95 latency, or load testing, it’s nudging you toward an operational mindset (and often toward real-time endpoints).

**Pipelines:** if you need consistent preprocessing (like tokenization, normalization, or feature transforms), think about chaining steps so training-time logic matches inference-time logic. This is a classic place teams accidentally introduce bugs: ‘it worked in the notebook’ because preprocessing wasn’t identical.

**Data formats:** a lot of inference failures are boring—wrong content type, mismatched JSON fields, unexpected CSV columns. On the exam, don’t overthink it: if the payload can’t be serialized cleanly, the model can’t help you.

**Trick distinction to memorize:** serverless vs instance-based endpoints. They’re not the same thing, and you don’t treat them as a simple switch you flip later.

Quick recap:

- Real-time: interactive, low latency.
- Batch: offline, bulk scoring.
- Serverless: spiky traffic, pay-per-use feel.
- Async: longer processing without blocking the caller.

If you keep those four buckets straight, a surprising number of AIF-C01 questions become ‘spot the pattern’ instead of ‘memorize the service.’

---

## About Upcert

Upcert.io provides industry-leading, high-quality practice exams for cloud certifications. Its platform lets users study more efficiently by focusing on the content they still need to learn and skipping what they already know. It also provides highly customized exam and certification readiness checks.

Not sure if you're ready for your AWS exam? Create a free account to get access to 100 practice questions and 3 mock exams to help you find out. No credit card required.

Sign up for free: https://upcert.io/signup