[Edit: I found a better analogy to explain what happened. See the bottom of this post]
Over the last three weeks we’ve been hit by an intermittent outage that knocked our SCCM infrastructure essentially offline for hours at a time. It would mysteriously start, and after 2-4 hours it would mysteriously stop. During that time you could PXE boot a machine, but it would not find any available task sequences and WinPE would reboot out from under the system. We ran down leads on network problems, DNS issues, WINS problems, SCCM infrastructure problems, just about everything under the sun. The problem would correct itself, though, before we could get anywhere.
We started combing through the status messages during the last outage and found something unusual. During the outage there was a flood of 5101 status messages (“Policy Provider successfully updated the policy and the policy assignment that were created from package…”) from the SMS_POLICY_PROVIDER log. A flood to the tune of nearly 8000 per hour. It appeared that just about every package in every task sequence we had was having its policy updated.
Our next problem was finding out what triggered all of these policy updates. Digging deeper into the status messages, we found that immediately before the 5101 messages started pouring in there was a 30001 message (“User domain\user modified the Package Properties of a package…”) showing that someone had modified the properties of one of our task sequences in development.
That someone was me.
At this point things fell into place. The morning of the latest outage I had been working on a new development OSD task sequence. We use NomadBranch from 1E in our environment. The product has extensions that add a “Nomad” tab to the property page, which allows you to configure the software’s settings. On the properties of a task sequence you can ensure that all packages referenced will be configured correctly.
That morning I enabled the “Enable Nomad” check box. The Nomad extensions then cycled through all 112 packages referenced by the task sequence and ensured that setting was enabled on each and every one. A very convenient option. It prevents us from having to manually check each and every package to ensure that the Alternate Content Provider is set.
Great, except that modifying the Alternate Content Provider is one of those package property changes that triggers a policy update in SCCM. And if a package requires a policy update, SCCM will cycle through all references to that package and update the policies for all deployments/advertisements for those references.
So for each of those 112 packages SCCM then found every other task sequence that referenced them. And then for each of those instances it would initiate a policy update for each and every deployment.
This problem is not a NomadBranch issue though. You can accomplish the same thing with any mass manipulation of package properties. If you use a script to alter the “Disconnect users from distribution points” option (found on the Data Access tab) on a series of packages, SCCM will start cross-referencing each and every package, find all of the task sequence deployments that reference that package, and update the policy on them. Then it will repeat the process for the next package and so on and so on and so on….
This is easy to duplicate.
[Warning, do not attempt this in a production environment!]
Within the SCCM 2012 console open the Monitoring node. Then expand the System Status branch and select Status Message Queries. Right-click on the All Status Messages query and select Show Messages.
Now, select a package that you know is in a couple of task sequences. Perhaps the OS image, or a driver package. Something that will be referenced by multiple sequences. Open the properties of that package, select the Data Access tab and toggle the “Disconnect users from the distribution point” option and click Apply.
Go back to your status message query and refresh (F5). You should right away see the 30001 and 23xx messages showing that you have updated a package and it is being processed by SCCM. Within a few moments the 5101 messages should appear, one for every package+sequence+deployment combination. Now, imagine that multiplied by every package within your task sequence.
What’s the moral of this story?
Use caution when doing any kind of mass update of package properties.
In our situation the flood of policy updates appears to have overwhelmed our Management Points. They were too busy fielding the policy updates to handle policy requests from the systems attempting to start OS deployments.
Is this Nomad’s Fault or SCCM’s Fault?
Neither. We shot ourselves in the foot and brought this down on ourselves. What did us in was the vast number of deployments/advertisements we have out there. If we had been better stewards of SCCM and cleaned up after ourselves, this wouldn’t have knocked us out of the water. We had hundreds of stale, out of date deployments that had never been cleaned up after they were done. It was those old deployments that acted as the gas being poured on the fire.
When explaining what happened I came up with a better explanation of what exactly caused the problem.
It was this vast number of deployments that brought the house down. It was like a series of nested FOREACH statements….
FOREACH package in the task sequence
    FIND all other task sequences that reference the package
    FOREACH of those task sequences
        FIND each deployment for that task sequence
        FOREACH of those deployments
            UPDATE the policy for that deployment
That’s a lot of multipliers there. That’s what ultimately killed us: the large number of deployments (~500) we had lingering around, most of which were out of date.
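To get a feel for how quickly those multipliers stack up, here is a rough Python sketch of the same nested loops run against a toy model. Everything in it is made up for illustration: the class names, the package and deployment names, and the "stale" flag are assumptions, and none of it talks to SCCM. It also counts the edited task sequence's own deployments, which is my assumption about how the fan-out behaves. It only counts how many package+sequence+deployment combinations would be touched, with and without the stale deployments.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Deployment:
    name: str
    stale: bool = False  # hypothetical flag: an old deployment nobody cleaned up


@dataclass
class TaskSequence:
    name: str
    packages: set[str]
    deployments: list[Deployment] = field(default_factory=list)


def count_policy_updates(edited: TaskSequence,
                         all_sequences: list[TaskSequence],
                         include_stale: bool = True) -> int:
    """Count package+sequence+deployment combinations for one mass edit."""
    updates = 0
    for package in edited.packages:            # FOREACH package in the task sequence
        for ts in all_sequences:               # every sequence referencing that package
            if package not in ts.packages:     # (assumed to include the edited one)
                continue
            for dep in ts.deployments:         # FOREACH deployment of that sequence
                if include_stale or not dep.stale:
                    updates += 1               # one policy update per combination
    return updates


# Toy data: three packages shared across three task sequences.
dev = TaskSequence("dev", {"os-image", "drivers", "agent"},
                   [Deployment("dev-test")])
prod_a = TaskSequence("prod-a", {"os-image", "drivers"},
                      [Deployment("a-1"),
                       Deployment("a-old", stale=True),
                       Deployment("a-older", stale=True)])
prod_b = TaskSequence("prod-b", {"os-image", "agent"},
                      [Deployment("b-1"), Deployment("b-old", stale=True)])
sequences = [dev, prod_a, prod_b]

total_all = count_policy_updates(dev, sequences)
total_active = count_policy_updates(dev, sequences, include_stale=False)
print(f"with stale deployments:    {total_all}")
print(f"without stale deployments: {total_active}")
```

Even in this tiny model, a single edit to a three-package task sequence fans out to 13 updates, and removing the stale deployments cuts that to 7. Scale the same shape up to 112 packages and roughly 500 deployments and the flood of 5101 messages is not surprising.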
Had we been better about cleaning up old, out of date deployments I don’t think we would have ever had an issue.