Debugging Stuck Queue Processors and Job Schedulers in Pega

Background processing is where Pega applications do most of their heavy lifting: sending notifications, calling downstream systems, running nightly reconciliations, and absorbing bursts of asynchronous work. It is also where problems hide, because nothing appears on a user's screen when a queue processor stops draining or a job scheduler never fires. This guide explains how the modern background-processing stack works and gives you a repeatable method for diagnosing items stuck in the queue, schedulers that never run, and Stream-service failures that break everything downstream.

The modern background-processing model

Pega Infinity replaced the legacy Agents model with two purpose-built rule types. Understanding the split is the first step to debugging either one.

Queue Processor rules consume messages from a queue and process them one at a time. They are event-driven: something calls Queue-For-Processing, a message lands on the Stream, and the processor picks it up. Use queue processors for asynchronous, per-item work.
Job Scheduler rules run on a time schedule (a recurring cron-like cadence or a one-off time). They are the modern replacement for time-based agents. Use job schedulers for periodic batch work — cleanups, aggregations, polling.

The two are complementary: a job scheduler often enqueues work that a queue processor then drains. Knowing which of the two is involved tells you immediately where to look.

Aspect	Queue Processor	Job Scheduler
Trigger	Message enqueued (`Queue-For-Processing`)	Time schedule / cron
Backing store	Stream service (Kafka)	Database (no Stream needed)
Unit of work	One queued message at a time	One scheduled run
Replaces	Standard agents (queue-based)	Time-based agents
Typical use	Async per-item processing	Periodic batch jobs

Queue Processors: standard vs. dedicated

There are two flavors of queue processor, and choosing wrong is itself a common cause of trouble.

Standard queue processors share the platform-provided standard queue (pzStandardProcessor). They are simple to use (call Queue-For-Processing and you are done), but you share throughput and back-pressure with everything else on that standard queue.
Dedicated queue processors get their own named queue and their own configuration: concurrency, retry counts, and message-handling behavior. Use a dedicated queue processor for any high-volume or business-critical flow so it is isolated from noisy neighbors.

You place work on a queue with the Queue-For-Processing method (or its smart-shape equivalent). The call writes the message to the Stream and returns immediately:

// Enqueue an item for asynchronous processing by a dedicated queue processor
Queue-For-Processing
  QueueProcessor: ProcessPaymentEvent
  // The page passed becomes the message payload the QP receives as pyMessage
  RequestPage:    PaymentEvent

// Items can be queued for immediate pickup or delayed/scheduled execution.
// Delayed execution stamps a "process after" time; the QP ignores the
// message until that time passes.

Queue processors support both immediate execution (drain as fast as nodes allow) and delayed/scheduled execution (process the item only after a specified time — useful for retry backoff or "do this in 24 hours" logic).
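The difference between immediate and delayed execution can be illustrated with a small in-memory model. This is only a sketch of the "process after" idea described above, not Pega's implementation; the DelayedQueue class, its methods, and the payload shapes are hypothetical names introduced for illustration.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(order=True)
class QueuedItem:
    process_after: datetime                 # earliest time the item may run
    payload: dict = field(compare=False)    # message body, ignored when ordering


class DelayedQueue:
    """Toy queue: an item becomes visible only once its process-after time passes."""

    def __init__(self):
        self._heap = []

    def enqueue(self, payload, now, delay=timedelta(0)):
        # Immediate items use a zero delay, so they are due right away.
        heapq.heappush(self._heap, QueuedItem(now + delay, payload))

    def ready(self, now):
        """Pop and return every payload whose process-after time has arrived."""
        due = []
        while self._heap and self._heap[0].process_after <= now:
            due.append(heapq.heappop(self._heap).payload)
        return due


start = datetime(2024, 1, 1, 9, 0)
queue = DelayedQueue()
queue.enqueue({"id": "immediate"}, now=start)
queue.enqueue({"id": "delayed"}, now=start, delay=timedelta(hours=24))

first_pass = queue.ready(now=start)                         # only the immediate item
second_pass = queue.ready(now=start + timedelta(hours=24))  # the delayed item is now due
```

The point for debugging is that a delayed item sitting in the queue is expected behavior until its time arrives, which is why the playbook later asks you to rule this case out first.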

Job Schedulers and node classification

Job schedulers do not use the Stream at all — they read their schedule from the database and execute on whichever nodes are configured to run them. This is where node types (node classification) become critical.

In a multi-node cluster, you assign each node a node type (for example, BackgroundProcessing, WebUser, Search, Stream). Background work — both queue processors and job schedulers — runs only on nodes of the appropriate background type. The number one reason a job scheduler or queue processor "never runs" is that no node in the cluster is classified to run it. If every node is a WebUser node, your background rules sit idle forever, with no error, because there is simply no consumer.

# Node startup property that sets node classification (conceptual)
-DNodeType=BackgroundProcessing,Search

# Verify in Admin Studio > Resources > Nodes that at least one node
# carries the node type your QP / Job Scheduler targets.

Always confirm node classification before debugging anything else. It is a five-second check that resolves a large share of "nothing is processing" incidents.
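If you keep an inventory of your cluster outside Pega (for example, in deployment configuration), the same check can be scripted. The sketch below assumes a hypothetical dictionary of node names to node types; it does not call any Pega API, and nodes_able_to_run is an illustrative helper name.

```python
def nodes_able_to_run(nodes, required_type):
    """Return the names of nodes whose classification includes required_type."""
    return sorted(name for name, types in nodes.items() if required_type in types)


# Hypothetical cluster inventory: every node is web-facing, none runs background work.
cluster = {
    "node-a": {"WebUser"},
    "node-b": {"WebUser", "Search"},
}

before = nodes_able_to_run(cluster, "BackgroundProcessing")

# Classifying one node for background work gives the rules a consumer.
cluster["node-c"] = {"BackgroundProcessing", "Search"}
after = nodes_able_to_run(cluster, "BackgroundProcessing")
```

An empty result is the scripted equivalent of the silent failure described above: no error, simply no node eligible to run the rule.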

Managing queues in Admin Studio

Admin Studio is your operational console for background processing. Under the queue-processor and job-scheduler views you can:

See each queue processor's status, the node(s) running it, and live throughput.
Inspect queue depth — how many messages are waiting.
View broken items that failed processing and landed in the broken-item queue.
Trace, retry, or delete individual items.
Confirm a job scheduler's last run time and next scheduled run.

If Admin Studio shows a queue processor with messages waiting but no node assigned, you have a node-classification or Stream problem, not a logic problem.

Broken items and the broken-process queue

When a message fails processing and exhausts its retries, Pega moves it to the broken-item queue rather than discarding it. This is a safety net, but it is also where work goes to die if nobody watches it.

A broken item carries the failure context — the exception, the stack, and the original payload — so you can diagnose the root cause. The typical loop is:

Open the broken item in Admin Studio and read the error.
Fix the underlying cause (bad data, a downstream outage, a logic bug).
Requeue the item so it processes again, or delete it if it is no longer valid.

If broken items pile up faster than you fix them, that is a signal of a systemic problem — a downstream system is down, or a code path throws on a whole class of payloads (a poison message pattern, discussed below).

Retries, error handling, and commit semantics

Queue processors have built-in retry behavior. On failure, the platform can retry the message a configured number of times — optionally with delay — before moving it to the broken queue. Configure retry count and delay on a dedicated queue processor to match the failure profile of the work (transient network errors deserve retries; deterministic data errors do not).

Commit semantics matter more than people expect:

Each message is processed in its own transaction. If the processing activity completes successfully, the work commits.
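If it throws, the transaction rolls back and the message is retried or moved to the broken queue, so partial database changes from a half-finished run are undone, provided you let the framework manage the commit rather than committing mid-activity. Side effects outside the transaction, such as a call already made to a downstream system, are not rolled back, so keep those calls idempotent.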
Avoid manual Commit calls inside queue-processor logic unless you fully understand the consequence; a premature commit defeats the rollback safety net and can leave the system in a half-updated state on retry.
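The cost of a premature commit is easiest to see with a plain database transaction. The sketch below uses Python's built-in sqlite3 module as a stand-in for the platform's transaction handling; the orders table, the process_message function, and the commit_early flag are all hypothetical and exist only to show the effect.

```python
import sqlite3


def process_message(conn, order_id, fail_midway, commit_early=False):
    """Apply two updates for one message; roll back if anything throws."""
    try:
        conn.execute("UPDATE orders SET status = 'PAID' WHERE id = ?", (order_id,))
        if commit_early:
            conn.commit()  # premature commit: the first update can no longer be undone
        if fail_midway:
            raise RuntimeError("downstream call failed")
        conn.execute("UPDATE orders SET notified = 1 WHERE id = ?", (order_id,))
        conn.commit()      # single commit at the end of a successful run
        return True
    except RuntimeError:
        conn.rollback()    # undoes only the work that has not been committed yet
        return False


def fresh_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, notified INTEGER)")
    conn.execute("INSERT INTO orders VALUES (1, 'NEW', 0)")
    conn.commit()
    return conn


def state_of(conn):
    return conn.execute("SELECT status, notified FROM orders WHERE id = 1").fetchone()


safe = fresh_db()
process_message(safe, 1, fail_midway=True)
safe_state = state_of(safe)        # row is untouched after the rollback

unsafe = fresh_db()
process_message(unsafe, 1, fail_midway=True, commit_early=True)
unsafe_state = state_of(unsafe)    # status changed, notification flag did not
```

In the second run the row is left half-updated, and a retry would start from that inconsistent state, which is the situation the guidance above is meant to prevent.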

# Conceptual lifecycle of one queued message
ENQUEUED  --> (node picks up) --> PROCESSING
  success --> COMMIT --> done
  failure --> ROLLBACK --> retry (n times, optional delay)
              retries exhausted --> BROKEN (broken-item queue)
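The lifecycle above can be modeled in a few lines to make the retry arithmetic concrete. This is a simplified sketch, assuming one initial attempt plus a configured number of retries and ignoring retry delay; drain, FlakyHandler, and always_fails are hypothetical names, not platform APIs.

```python
def drain(message, handler, max_retries=3):
    """Run handler on one message; return (final_state, attempts_made)."""
    attempts = 0
    while attempts <= max_retries:      # one initial attempt plus max_retries retries
        attempts += 1
        try:
            handler(message)
            return "DONE", attempts     # success: the work would commit here
        except Exception:
            continue                    # failure: the work would roll back, then retry
    return "BROKEN", attempts           # retries exhausted: goes to the broken-item queue


class FlakyHandler:
    """Fails a fixed number of times, then succeeds (a transient error)."""

    def __init__(self, failures_before_success):
        self.remaining = failures_before_success

    def __call__(self, message):
        if self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError("transient downstream error")


def always_fails(message):
    """Throws on every attempt (a deterministic data error)."""
    raise ValueError("malformed payload")


transient_result = drain({"id": 1}, FlakyHandler(2), max_retries=3)
deterministic_result = drain({"id": 2}, always_fails, max_retries=3)
```

The transient failure recovers on the third attempt, while the deterministic one burns every retry before it is marked broken, which is why retry counts should match the failure profile of the work.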

The Stream service (Kafka) dependency

This is the dependency that surprises teams: queue processors are backed by the Stream service, which is built on Apache Kafka. Queue-For-Processing writes to a Stream topic, and the queue processor consumes from it. Job schedulers do not depend on the Stream; only queue processors do.

That means:

If the Stream service is down or unhealthy, Queue-For-Processing may fail or messages may not be consumed, and queue processors stall — even though the rules and nodes are perfectly fine.
Stream nodes must be classified (Stream node type) and healthy. In Admin Studio under the Stream/services view, every Stream node should report healthy.
A common production incident is losing Stream quorum during a partial cluster restart; queue processing freezes until the Stream cluster recovers.

When queue processors are stuck but job schedulers run fine, suspect the Stream service first — that asymmetry is a strong diagnostic signal.

A troubleshooting playbook

Work through these in order; each step rules out a class of cause.

Items stuck in "scheduled" / not processing — check whether they are delayed items whose process-after time has not arrived yet (expected), versus genuinely stuck. Confirm in Admin Studio.
No node running the queue processor — verify node classification. If no BackgroundProcessing node exists for that rule, classify one. This is the most common root cause.
Stream service down — check Stream node health. If Stream is unhealthy, queue processors stall while job schedulers keep running. Restore Stream quorum.
Deadlocks — long-running processing that locks the same case or database rows can deadlock under concurrency. Reduce queue-processor concurrency, shorten transactions, and avoid locking the same object from multiple messages simultaneously.
Poison messages — a single malformed payload that throws every time consumes all its retries, lands in the broken queue, and (if it blocks a partition) can stall throughput. Identify it from the broken-item queue, fix or quarantine the payload, and add defensive validation so one bad message cannot stall a whole flow.
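The defensive validation mentioned for poison messages might look like the following sketch: check each payload before processing and quarantine anything malformed instead of letting it consume retries. The field names (payment_id, amount) and the validate and triage helpers are assumptions made for illustration, not part of any Pega rule.

```python
REQUIRED_FIELDS = ("payment_id", "amount")  # hypothetical payload contract


def validate(payload):
    """Return a list of problems; an empty list means the payload is acceptable."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if name not in payload]
    amount = payload.get("amount")
    if amount is not None and (not isinstance(amount, (int, float)) or amount <= 0):
        problems.append("amount must be a positive number")
    return problems


def triage(payloads):
    """Split payloads into those safe to process and those to quarantine."""
    accepted, quarantined = [], []
    for payload in payloads:
        problems = validate(payload)
        if problems:
            quarantined.append((payload, problems))  # keep the reasons for diagnosis
        else:
            accepted.append(payload)
    return accepted, quarantined


batch = [
    {"payment_id": "P1", "amount": 25.0},
    {"payment_id": "P2"},                  # missing amount
    {"payment_id": "P3", "amount": -5},    # invalid amount
]
accepted, quarantined = triage(batch)
```

Recording the reason alongside each quarantined payload keeps the diagnosis step short, in the same way a broken item carries its failure context.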

A quick decision aid:

Queue processor stuck, job scheduler fine → Stream or node classification.
Both stuck → node classification (no background node at all).
Items processing but failing → broken-item queue / retries / logic bug.
Job scheduler never fires, queue processors fine → node classification or schedule configuration.
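The decision aid can also be written down as a small lookup, which is handy in a runbook or an on-call script. This is a direct transcription of the four cases above; the function name and its boolean parameters are hypothetical.

```python
def likely_causes(queue_processors_stuck, job_schedulers_stuck, items_failing=False):
    """Map observed symptoms to the areas worth checking first."""
    if queue_processors_stuck and job_schedulers_stuck:
        return ["node classification"]
    if queue_processors_stuck:
        return ["Stream service", "node classification"]
    if job_schedulers_stuck:
        return ["node classification", "schedule configuration"]
    if items_failing:
        return ["broken-item queue", "retry configuration", "logic bug"]
    return []


# Queue processors stalled while schedulers keep running: the Stream asymmetry.
print(likely_causes(queue_processors_stuck=True, job_schedulers_stuck=False))
```

The order of the checks matters: the "both stuck" case is tested first so that it is not mistaken for a Stream-only problem.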

Monitoring and alerts

The goal is to find these problems before users do. Build observability around three signals:

Queue depth and age — a growing backlog or rising oldest-message age means consumption is falling behind. Watch these in Admin Studio and, where possible, export them to your monitoring stack.
Broken-item count — a nonzero and climbing broken count is an actionable alert. A flat small number may be acceptable; a steady climb is not.
Stream and node health — alert on Stream node health and on the disappearance of background-processing nodes. PDC (Pega Predictive Diagnostic Cloud) and the platform's alert/log infrastructure surface many of these automatically — wire them to your on-call channel.

Set explicit thresholds (for example, "alert if any dedicated queue's oldest message exceeds N minutes" and "alert if broken items increase for three consecutive checks") so a stalled queue pages someone instead of silently accumulating overnight.
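The two example thresholds can be expressed as simple checks over metrics you have already exported. The sketch below assumes you collect the oldest-message age in minutes and a series of broken-item counts from successive checks; how those values are exported is outside its scope, and both function names are hypothetical.

```python
def oldest_message_alert(oldest_age_minutes, threshold_minutes):
    """True when the oldest waiting message is older than the allowed age."""
    return oldest_age_minutes > threshold_minutes


def broken_items_climbing(counts, consecutive=3):
    """True when the broken-item count rose on each of the last `consecutive` checks."""
    if len(counts) < consecutive + 1:
        return False  # not enough history to see that many increases
    recent = counts[-(consecutive + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


steady_climb = broken_items_climbing([2, 2, 3, 5, 8])   # three increases in a row
flat_small = broken_items_climbing([4, 4, 4, 4])        # flat count, no alert
stale_queue = oldest_message_alert(oldest_age_minutes=45, threshold_minutes=30)
```

Comparing consecutive checks rather than a fixed count is what separates a flat, tolerable number of broken items from a steady climb.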

Key takeaways

Two rule types, two failure modes. Queue processors are event-driven and Stream-backed; job schedulers are time-driven and database-backed. Know which one you are debugging.
Node classification is the first check. No background-processing node means nothing runs, with no error to point the way.
The Stream service is the hidden dependency. Queue processors stall when the Kafka-backed Stream service is unhealthy even though the rules are fine. Job schedulers are unaffected, and that asymmetry is your clue.
Broken items are a safety net, not a graveyard. Read the error, fix the cause, and requeue — and treat a climbing broken count as a real alert.
Respect commit/rollback. Let the framework own the transaction so failed messages roll back cleanly and retry safely.
Watch for poison messages and deadlocks under concurrency, and tune retries and concurrency to the work's failure profile.

If you are wrestling with a queue that will not drain or a scheduler that never fires, walk through it with someone who has debugged it before. Explore hands-on Pega mentorship or reach out via our contact page to get unblocked and build a monitoring setup that catches the next stall early.
