Background processing is where Pega applications quietly do most of their heavy lifting — sending notifications, calling downstream systems, running nightly reconciliations, and absorbing bursts of asynchronous work. It is also where problems hide, because nothing is on a user's screen when a queue processor silently stops draining or a job scheduler never fires. This guide explains how the modern background-processing stack works and gives you a repeatable method for diagnosing items stuck in the queue, schedulers that never run, and the Stream-service failures that quietly break everything downstream.
The modern background-processing model
Pega Infinity replaced the legacy Agents model with two purpose-built rule types. Understanding the split is the first step to debugging either one.
- Queue Processor rules consume messages from a queue and process them one at a time. They are event-driven: something calls Queue-For-Processing, a message lands on the Stream, and the processor picks it up. Use queue processors for asynchronous, per-item work.
- Job Scheduler rules run on a time schedule (a recurring cron-like cadence or a one-off time). They are the modern replacement for time-based agents. Use job schedulers for periodic batch work: cleanups, aggregations, polling.
The two are complementary: a job scheduler often enqueues work that a queue processor then drains. Knowing which of the two is involved tells you immediately where to look.
| Aspect | Queue Processor | Job Scheduler |
|---|---|---|
| Trigger | Message enqueued (Queue-For-Processing) | Time schedule / cron |
| Backing store | Stream service (Kafka) | Database (no Stream needed) |
| Unit of work | One queued message at a time | One scheduled run |
| Replaces | Standard agents (queue-based) | Time-based agents |
| Typical use | Async per-item processing | Periodic batch jobs |
Queue Processors: standard vs. dedicated
There are two flavors of queue processor, and choosing wrong is itself a common cause of trouble.
- Standard queue processors share the platform-provided queues (such as pyProcessNotifications). They are simple to use (call Queue-For-Processing and you are done), but you share throughput and back-pressure with everything else on that standard queue.
- Dedicated queue processors get their own named queue and their own configuration: concurrency, retry counts, and message-handling behavior. Use a dedicated queue processor for any high-volume or business-critical flow so it is isolated from noisy neighbors.
You place work on a queue with the Queue-For-Processing method (or its smart-shape equivalent). The call writes the message to the Stream and returns immediately:
// Enqueue an item for asynchronous processing by a dedicated queue processor
Queue-For-Processing
QueueProcessor: ProcessPaymentEvent
// The page passed becomes the message payload the QP receives as pyMessage
RequestPage: PaymentEvent
// Items can be queued for immediate pickup or delayed/scheduled execution.
// Delayed execution stamps a "process after" time; the QP ignores the
// message until that time passes.
Queue processors support both immediate execution (drain as fast as nodes allow) and delayed/scheduled execution (process the item only after a specified time — useful for retry backoff or "do this in 24 hours" logic).
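The difference between immediate and delayed pickup can be modelled in a few lines of Python. This is only an illustrative sketch, not Pega code: the `QueuedMessage` class, the `process_after` field, and the `eligible_messages` helper are hypothetical names invented for the example, and the platform's real scheduling internals are not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class QueuedMessage:
    payload: str
    # None means "immediate"; otherwise the earliest time the item may run.
    process_after: Optional[datetime] = None


def eligible_messages(queue: List[QueuedMessage], now: datetime) -> List[QueuedMessage]:
    """Return the messages a consumer would be allowed to pick up at `now`."""
    return [m for m in queue if m.process_after is None or m.process_after <= now]


now = datetime(2024, 1, 1, 12, 0)
queue = [
    QueuedMessage("send-receipt"),                                       # immediate
    QueuedMessage("retry-payment", process_after=now + timedelta(minutes=5)),
    QueuedMessage("follow-up", process_after=now + timedelta(hours=24)),
]

ready_now = [m.payload for m in eligible_messages(queue, now)]
ready_later = [m.payload for m in eligible_messages(queue, now + timedelta(minutes=10))]
```

In this model a delayed item is not stuck; it is simply not yet eligible, which is why a delayed item waiting for its time can look like a stalled queue at first glance.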
Job Schedulers and node classification
Job schedulers do not use the Stream at all — they read their schedule from the database and execute on whichever nodes are configured to run them. This is where node types (node classification) become critical.
In a multi-node cluster, you assign each node a node type (for example, BackgroundProcessing, WebUser, Search, Stream). Background work — both queue processors and job schedulers — runs only on nodes of the appropriate background type. The number one reason a job scheduler or queue processor "never runs" is that no node in the cluster is classified to run it. If every node is a WebUser node, your background rules sit idle forever, with no error, because there is simply no consumer.
# Node startup property that sets node classification (conceptual)
-DNodeType=BackgroundProcessing,Search
# Verify in Admin Studio > Resources > Nodes that at least one node
# carries the node type your QP / Job Scheduler targets.
Always confirm node classification before debugging anything else. It is a five-second check that resolves a large share of "nothing is processing" incidents.
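The check itself is simple enough to express as code. The sketch below is a conceptual model in Python, assuming you have already collected each node's node types (for example, by reading them off the Nodes view); the node names and the `nodes_able_to_run` helper are hypothetical, not a Pega API.

```python
from typing import Dict, List, Set


def nodes_able_to_run(cluster: Dict[str, Set[str]], required_type: str) -> List[str]:
    """Return the names of nodes that carry the node type a rule targets."""
    return sorted(name for name, types in cluster.items() if required_type in types)


# A cluster where every node serves users: background rules have no consumer.
cluster = {
    "web-1": {"WebUser"},
    "web-2": {"WebUser", "Search"},
}
before = nodes_able_to_run(cluster, "BackgroundProcessing")

# Classifying one node for background work gives the rules somewhere to run.
cluster["bg-1"] = {"BackgroundProcessing", "Search"}
after = nodes_able_to_run(cluster, "BackgroundProcessing")
```

An empty result is the silent failure described above: nothing errors, because nothing ever tries to run.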
Managing queues in Admin Studio
Admin Studio is your operational console for background processing. Under the queue-processor and job-scheduler views you can:
- See each queue processor's status, the node(s) running it, and live throughput.
- Inspect queue depth — how many messages are waiting.
- View broken items that failed processing and landed in the broken-item queue.
- Trace, retry, or delete individual items.
- Confirm a job scheduler's last run time and next scheduled run.
If Admin Studio shows a queue processor with messages waiting but no node assigned, you have a node-classification or Stream problem, not a logic problem.
Broken items and the broken-process queue
When a message fails processing and exhausts its retries, Pega moves it to the broken-item queue rather than discarding it. This is a safety net, but it is also where work goes to die if nobody watches it.
A broken item carries the failure context — the exception, the stack, and the original payload — so you can diagnose the root cause. The typical loop is:
- Open the broken item in Admin Studio and read the error.
- Fix the underlying cause (bad data, a downstream outage, a logic bug).
- Requeue the item so it processes again, or delete it if it is no longer valid.
If broken items pile up faster than you fix them, that is a signal of a systemic problem — a downstream system is down, or a code path throws on a whole class of payloads (a poison message pattern, discussed below).
Retries, error handling, and commit semantics
Queue processors have built-in retry behavior. On failure, the platform can retry the message a configured number of times — optionally with delay — before moving it to the broken queue. Configure retry count and delay on a dedicated queue processor to match the failure profile of the work (transient network errors deserve retries; deterministic data errors do not).
Commit semantics matter more than people expect:
- Each message is processed in its own transaction. If the processing activity completes successfully, the work commits.
- If it throws, the transaction rolls back, and the message is retried or broken — so partial side effects from a half-finished run are undone, provided you let the framework manage the commit rather than committing mid-activity.
- Avoid manual Commit calls inside queue-processor logic unless you fully understand the consequence; a premature commit defeats the rollback safety net and can leave the system in a half-updated state on retry.
# Conceptual lifecycle of one queued message
ENQUEUED --> (node picks up) --> PROCESSING
success --> COMMIT --> done
failure --> ROLLBACK --> retry (n times, optional delay)
retries exhausted --> BROKEN (broken-item queue)
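The same lifecycle can be traced with a small Python simulation. It is a sketch under stated assumptions rather than a description of the platform's internals: `process_with_retries` and `make_flaky` are hypothetical helpers, and the model assumes `max_retries` counts attempts after the first one, which may not match how a given queue processor counts.

```python
from typing import Callable, List, Tuple


def process_with_retries(
    handler: Callable[[str], None], payload: str, max_retries: int
) -> Tuple[str, List[str]]:
    """Run one message through the lifecycle and return (final status, history)."""
    history = ["ENQUEUED"]
    retries_used = 0
    while True:
        history.append("PROCESSING")
        try:
            handler(payload)
        except Exception:
            history.append("ROLLBACK")        # failed attempt leaves no side effects
            if retries_used >= max_retries:
                history.append("BROKEN")      # retries exhausted
                return "BROKEN", history
            retries_used += 1
            continue
        history.append("COMMIT")              # success commits the work
        return "DONE", history


def make_flaky(failures: int) -> Callable[[str], None]:
    """Build a handler that fails `failures` times, then succeeds."""
    remaining = {"n": failures}

    def handler(payload: str) -> None:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("transient failure")

    return handler


ok_status, ok_history = process_with_retries(make_flaky(1), "msg-1", max_retries=2)
bad_status, bad_history = process_with_retries(make_flaky(99), "msg-2", max_retries=2)
```

A transient failure recovers on retry, while a handler that always throws ends in the broken state after the first attempt plus two retries.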
The Stream service (Kafka) dependency
This is the dependency that surprises teams: queue processors are backed by the Stream service, which is built on Apache Kafka. Queue-For-Processing writes to a Stream topic, and the queue processor consumes from it. Job schedulers do not depend on the Stream; only queue processors do.
That means:
- If the Stream service is down or unhealthy, Queue-For-Processing may fail or messages may not be consumed, and queue processors stall, even though the rules and nodes are perfectly fine.
- Stream nodes must be classified (Stream node type) and healthy. In Admin Studio under the Stream/services view, every Stream node should report healthy.
- A common production incident is losing Stream quorum during a partial cluster restart; queue processing freezes until the Stream cluster recovers.
When queue processors are stuck but job schedulers run fine, suspect the Stream service first — that asymmetry is a strong diagnostic signal.
A troubleshooting playbook
Work through these in order; each step rules out a class of cause.
- Items stuck in "scheduled" / not processing: check whether they are delayed items whose process-after time has not arrived yet (expected), versus genuinely stuck. Confirm in Admin Studio.
- No node running the queue processor: verify node classification. If no BackgroundProcessing node exists for that rule, classify one. This is the most common root cause.
- Stream service down: check Stream node health. If Stream is unhealthy, queue processors stall while job schedulers keep running. Restore Stream quorum.
- Deadlocks: long-running processing that locks the same case or database rows can deadlock under concurrency. Reduce queue-processor concurrency, shorten transactions, and avoid locking the same object from multiple messages simultaneously.
- Poison messages: a single malformed payload that throws every time consumes all its retries, lands in the broken queue, and (if it blocks a partition) can stall throughput. Identify it from the broken-item queue, fix or quarantine the payload, and add defensive validation so one bad message cannot stall a whole flow.
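Defensive validation for the poison-message case might look like the following Python sketch. It illustrates the idea only: the field names (`payment_id`, `amount`) and the `validate_payload` and `triage` helpers are hypothetical, and in a real application the equivalent check would live in the processing activity or a validation rule.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical required fields for an illustrative payment-event payload.
REQUIRED_FIELDS = ("payment_id", "amount")


def validate_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload is safe to process."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if payload.get(f) in (None, "")]
    amount = payload.get("amount")
    if amount not in (None, "") and not isinstance(amount, (int, float)):
        problems.append("amount is not numeric")
    return problems


def triage(payloads: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split payloads into (accepted, quarantined) instead of letting bad ones throw."""
    accepted, quarantined = [], []
    for p in payloads:
        (quarantined if validate_payload(p) else accepted).append(p)
    return accepted, quarantined


good = {"payment_id": "P-1", "amount": 10.0}
bad_amount = {"payment_id": "P-2", "amount": "ten"}
missing_id = {"amount": 5}

accepted, quarantined = triage([good, bad_amount, missing_id])
```

Rejecting a malformed payload up front, with a clear reason, is cheaper than letting it burn through every retry before it reaches the broken-item queue.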
A quick decision aid:
- Queue processor stuck, job scheduler fine → Stream or node classification.
- Both stuck → node classification (no background node at all).
- Items processing but failing → broken-item queue / retries / logic bug.
- Job scheduler never fires, queue processors fine → node classification or schedule configuration.
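For teams that keep runbooks as code, the decision aid above can be written as a small lookup. This Python sketch mirrors the four rules as stated; the function name and its boolean parameters are hypothetical, and the output is a list of suspects to check, not a diagnosis.

```python
from typing import List


def likely_causes(
    queue_processors_stuck: bool, job_schedulers_stuck: bool, items_failing: bool = False
) -> List[str]:
    """Map observed symptoms to the first places to look."""
    if queue_processors_stuck and job_schedulers_stuck:
        return ["node classification (no background node at all)"]
    if queue_processors_stuck:
        return ["Stream service", "node classification"]
    if job_schedulers_stuck:
        return ["node classification", "schedule configuration"]
    if items_failing:
        return ["broken-item queue", "retries", "logic bug"]
    return []


qp_only = likely_causes(queue_processors_stuck=True, job_schedulers_stuck=False)
both = likely_causes(queue_processors_stuck=True, job_schedulers_stuck=True)
js_only = likely_causes(queue_processors_stuck=False, job_schedulers_stuck=True)
failing = likely_causes(False, False, items_failing=True)
```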
Monitoring and alerts
The goal is to find these problems before users do. Build observability around three signals:
- Queue depth and age — a growing backlog or rising oldest-message age means consumption is falling behind. Watch these in Admin Studio and, where possible, export them to your monitoring stack.
- Broken-item count — a nonzero and climbing broken count is an actionable alert. A flat small number may be acceptable; a steady climb is not.
- Stream and node health — alert on Stream node health and on the disappearance of background-processing nodes. PDC (Pega Predictive Diagnostic Cloud) and the platform's alert/log infrastructure surface many of these automatically — wire them to your on-call channel.
Set explicit thresholds (for example, "alert if any dedicated queue's oldest message exceeds N minutes" and "alert if broken items increase for three consecutive checks") so a stalled queue pages someone instead of silently accumulating overnight.
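The two example thresholds can be sketched as plain Python checks. This assumes you already export queue age and broken-item counts to your own monitoring stack as numbers; the function names are hypothetical, and "three consecutive checks" is interpreted here as three successive increases, which needs four samples.

```python
from typing import Sequence


def oldest_message_alert(oldest_age_minutes: float, threshold_minutes: float) -> bool:
    """Alert when the oldest waiting message is older than the threshold."""
    return oldest_age_minutes > threshold_minutes


def broken_items_climbing(samples: Sequence[int], consecutive: int = 3) -> bool:
    """Alert when the broken-item count rose on each of the last `consecutive` checks."""
    if len(samples) < consecutive + 1:
        return False  # not enough history to see that many increases
    recent = samples[-(consecutive + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


stale = oldest_message_alert(oldest_age_minutes=42, threshold_minutes=15)
climbing = broken_items_climbing([2, 2, 3, 4, 5])
flat = broken_items_climbing([5, 5, 5, 5])
dipped = broken_items_climbing([4, 5, 4, 6, 7])
```

A flat count does not fire, matching the guidance that a small stable number of broken items may be acceptable while a steady climb is not.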
Key takeaways
- Two rule types, two failure modes. Queue processors are event-driven and Stream-backed; job schedulers are time-driven and database-backed. Know which one you are debugging.
- Node classification is the first check. No background-processing node means nothing runs, with no error to point the way.
- The Stream service is the hidden dependency. Queue processors stall when the Kafka-backed Stream service is unhealthy even though the rules are fine. Job schedulers are unaffected, and that asymmetry is your clue.
- Broken items are a safety net, not a graveyard. Read the error, fix the cause, and requeue — and treat a climbing broken count as a real alert.
- Respect commit/rollback. Let the framework own the transaction so failed messages roll back cleanly and retry safely.
- Watch for poison messages and deadlocks under concurrency, and tune retries and concurrency to the work's failure profile.