Catching an Edge - On the Trail to Microsoft Certified Master: Don’t give me your Cooties – When a Domain Controller Refuses to Replicate

If you were to walk onto an airplane during the height of flu season wearing a surgical mask and vehemently hacking and coughing, the guy sitting next to you is likely to scan around for an empty seat to move to, as far away from you as possible. Face it: no one wants your cooties.

A Domain Controller (DC) has a similar aversion to getting sick, and when a replication partner shows signs of illness, the healthy DC doesn’t hesitate to quarantine the sick DC by refusing to replicate from it.

So, Doctor, what symptoms does the healthy DC check for before it declares its replication partner persona non grata?

Symptom Number One – Schema Version Mismatch

The AD Schema is similar to a building code. If I want to build a home in Fort Collins, Larimer County, Colorado where I live, I can’t just march down to Home Depot, load up the truck with supplies, and start framing a home. There are rules to adhere to. So I start off by finding out what the county building code is and then pay for a building permit which legally requires me to adhere to the code. The local code dictates the rules for building the structure - ceilings have to be so high, electrical outlets placed so many feet apart, the roof pitched so many degrees to support the area’s snow load, etc.

Similarly, the AD Schema is the building code for an AD Forest in that all of the Forest’s domains have to abide by the same set of rules for object creation. Each DC in the forest stores a local copy of the AD Schema in its Active Directory database file, loads it into memory and then consults the Schema when new objects are added to the Forest. The Schema basically dictates the types of objects that can be created and what attributes are allowed for those objects, the same way a city’s building code dictates the types of dwellings that can be built and their allowed characteristics.

What would happen if you built your home using an old version of the building code that had an outdated electrical compliance section? Simple - when the home inspector checks your home’s compliance with code, you would be told to pull out all the wiring and start over again; otherwise you are not getting a certificate of occupancy.

In AD, a DC will not replicate data from a partner DC if the partner is using a different version of the building code (schema). Before replication occurs, the partner DCs will exchange schema version level information. If the DCs don’t have the same schema version, replication of AD Forest Objects between those partners will be held up until the schema itself is fully replicated and both DCs are at the same Schema version.
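The ordering described above can be sketched as a toy model. This is only an illustration, not how AD is implemented: the ToyDC class, the replicate_from function, and the returned status strings are all hypothetical, and the version numbers are just sample integers.

```python
from dataclasses import dataclass


@dataclass
class ToyDC:
    """Hypothetical stand-in for a domain controller; not a real AD API."""
    name: str
    schema_version: int


def replicate_from(destination: ToyDC, source: ToyDC) -> str:
    """Pull changes from source, dealing with the schema before any other objects."""
    if source.schema_version > destination.schema_version:
        # Partner has a newer schema: bring the schema across first and
        # hold everything else until the versions match.
        destination.schema_version = source.schema_version
        return "schema replicated first"
    if source.schema_version < destination.schema_version:
        # Partner is behind: nothing but the schema moves until it catches up.
        return "object replication held"
    return "objects replicated"


dc1 = ToyDC("DC1", schema_version=44)  # sample values for illustration only
dc2 = ToyDC("DC2", schema_version=47)
print(replicate_from(dc1, dc2))  # schema replicated first
print(replicate_from(dc1, dc2))  # objects replicated
```

The first pull only closes the schema gap; ordinary objects flow on the next pull, which is the replication delay in miniature.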

In reality, this is not a symptom of illness in itself; it just means the destination DC will wait until its schema version is brought up to par with its partner’s. It does, however, result in a replication delay.

Symptom Number Two – Detection of Lingering Objects

Now this just sounds nasty. Before you consult the medical dictionary, let me first tell you what a lingering object is.

When you create an AD object like a user object, the object will be assigned attributes or properties, such as common name, password, group memberships, etc. The object then replicates to other DCs in the same domain. AD Replication happens at the attribute level, meaning that when an attribute changes, only that attribute and not the entire object will be replicated.

A little more background is required before you can understand how a DC comes down with a case of lingering objects.

When you delete an object in AD, it becomes a tombstone object and remains in the AD database for a period of time before it is permanently removed. This “obituary” period is known as the tombstone lifetime (TSL), and it gives other DCs time to learn of the object’s demise through replication. The default TSL for Windows Server 2008 is 180 days, although an Enterprise Admin can set it to whatever they want.

Now let’s say that you purposely delete employee Joe’s user object because he just won the lottery and he told you what he really thought of his job before resigning from the company. According to the TSL, 180 days later, object user Joe should be permanently removed from AD.

But what happens if there is a forgotten DC from the domain that has been neglected and has not replicated for 190 days, longer than TSL? That DC never received Joe’s tombstone obituary notice through replication and therefore did not delete its local copy of the user object Joe, maintaining like some Elvis fanatics that user object Joe is still “alive”. A DC not replicating for 190 days sounds like an unlikely scenario, but trust me, it happens.

To continue my scenario, let’s say you send a junior admin who was foolish not to participate in the office lottery pool to the branch office to put things in order. Said Junior Admin fixes the replication problem but also decides out of boredom to run a script that sets the “favorite drink” attribute on all domain user objects to Jolt Cola.

When live-user object Joe’s favorite drink attribute is changed and the branch DC notifies its partner DCs of the change, the other DCs will detect during the replication attempt that the branch DC has a lingering object and will halt inbound replication.

How do they detect this? Because the other DCs will notice that they are trying to replicate “Joe’s favorite drink” attribute for an object they have no knowledge of since they permanently deleted Joe a long time ago. The other DCs are essentially saying: “You are trying to tell me Joe’s favorite drink? – I don’t even know who Joe is!”
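Here is a rough sketch of that reaction, assuming each DC's database is nothing more than a dictionary of objects. The function name, the exception, and the favoriteDrink attribute are made up for the story; real AD tracks objects by GUID and does far more bookkeeping.

```python
class LingeringObjectSuspected(Exception):
    """Raised when an update arrives for an object the local DC does not hold."""


def apply_attribute_update(local_objects: dict, object_id: str,
                           attribute: str, value: str) -> None:
    """Apply one replicated attribute change to the local (toy) database."""
    if object_id not in local_objects:
        # "You are trying to tell me Joe's favorite drink? I don't know Joe!"
        raise LingeringObjectSuspected(
            f"no local object {object_id!r} for attribute {attribute!r}"
        )
    local_objects[object_id][attribute] = value


# The healthy DC garbage-collected Joe long ago, so only Alice remains.
healthy_dc = {"alice": {"favoriteDrink": "Tea"}}

apply_attribute_update(healthy_dc, "alice", "favoriteDrink", "Jolt Cola")

try:
    apply_attribute_update(healthy_dc, "joe", "favoriteDrink", "Jolt Cola")
    outcome = "applied"
except LingeringObjectSuspected:
    outcome = "inbound replication halted"

print(outcome)  # inbound replication halted
```

The point of the sketch is that the healthy DC does not need to know why Joe is missing; an attribute change with no object to attach it to is enough to raise the alarm.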

Meanwhile flesh and blood Joe is putting his lottery money to work on a beach in Jamaica where his new favorite drink is now a Pina Colada.

Symptom Number Three – Last replication is greater than Tombstone Lifetime

Fortunately, a DC doesn’t have to try to replicate changes to a lingering object to notice something is wrong. A Windows Server 2008 DC simply keeps track of when its partner DC last replicated, and if that was longer ago than the TSL, inbound replication from the loafer DC is halted. Two errors will show up in the DC’s Directory Service log: Error 1864 and Error 2042.
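The check itself is simple date arithmetic. A minimal sketch, assuming the DC has a timestamp for the last successful replication with its partner (the function name and the sample dates are invented for illustration):

```python
from datetime import datetime, timedelta


def partner_is_stale(last_replication: datetime, now: datetime,
                     tsl_days: int = 180) -> bool:
    """Return True when the partner last replicated longer ago than the TSL."""
    return now - last_replication > timedelta(days=tsl_days)


now = datetime(2010, 7, 10)              # arbitrary sample date
forgotten = now - timedelta(days=190)    # the neglected branch DC
recent = now - timedelta(days=1)         # a well-behaved partner

print(partner_is_stale(forgotten, now))  # True
print(partner_is_stale(recent, now))     # False
```

With the default of 180 days, the branch DC that sat idle for 190 days is flagged, while a partner that replicated yesterday is not.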

In the following example, I configured DC1 and DC2 as domain controllers in the same domain. I also manually set the TSL to 2 days and then shut down DC2. Three days later, I brought DC2 back online, but because it had gone longer than the TSL without replicating, partner DC1 refused to replicate from it. Errors 1864 and 2042 were recorded in DC1’s Directory Service log, as can be seen in the following graphics.

Symptom Number Four – Detection of USN Rollback

Each DC maintains a copy of the AD database file NTDS.DIT and assigns the database a unique ID called the Invocation ID.

At the same time, when a DC creates an object or modifies an object’s attributes in the database, it assigns that LDAP write transaction a unique, locally generated number called an Update Sequence Number (USN).

Each DC maintains its own USN counter and the counter increment is not synchronized with other DCs. Think of the Invocation ID as a Journal and the USN counter as sequential changes to the database recorded in that journal. Each DC maintains its own system of Journals and USNs.

So let’s examine a simple AD database change on DC2 that then replicates to DC1 and see how each DC keeps track of the changes.

Let’s say DC2 starts off the day with a USN counter of 1000 as last recorded in Journal # 1 (its invocation ID). If DC2 then modified three different object attributes at different times of the day, each LDAP write transaction would be sequentially assigned DC2’s next available local USN of 1001, 1002, and 1003 respectively, all recorded in Journal # 1.

DC1 will then replicate those three attributes and record that the last change it received from DC2 was Journal 1/USN 1003.

USN rollback occurs when a DC’s database is improperly restored in a way that “undoes” previous changes in the database. For example, suppose I restore DC2 from a virtual machine snapshot that was taken right before those three attributes were modified. After the improper restore, DC2 is back to believing its current USN counter is 1000 on Journal # 1, effectively forgetting that it had assigned USNs 1001-1003. Its partner DC1, however, still believes the last update it received from DC2 was Journal 1/USN 1003.

When DC1 detects that it has knowledge of a higher USN for DC2 (Journal 1/USN 1003) than DC2 has for itself (Journal 1/USN 1000), DC1 informs DC2 it is in USN rollback and DC2 halts all inbound and outbound replication and also pauses its NETLOGON service.
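The Journal/USN walkthrough can be replayed in a few lines. This is a simplified model under my own assumptions: the ToyDC class, its field names, and the status strings are hypothetical, and the real bookkeeping in AD is considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDC:
    """Hypothetical stand-in for a domain controller; not a real AD API."""
    name: str
    invocation_id: str   # the "journal"
    highest_usn: int     # local USN counter
    # What this DC has received from partners: journal -> highest USN seen.
    seen_from_partners: dict = field(default_factory=dict)

    def write(self) -> int:
        """Record one local change and return the USN assigned to it."""
        self.highest_usn += 1
        return self.highest_usn

    def pull_from(self, partner: "ToyDC") -> str:
        """Replicate from partner unless its USN has gone backwards."""
        known = self.seen_from_partners.get(partner.invocation_id, 0)
        if partner.highest_usn < known:
            # We remember a higher USN for this journal than the partner does.
            return "usn rollback detected"
        self.seen_from_partners[partner.invocation_id] = partner.highest_usn
        return "replicated"


dc1 = ToyDC("DC1", invocation_id="journal-A", highest_usn=500)
dc2 = ToyDC("DC2", invocation_id="journal-1", highest_usn=1000)

for _ in range(3):          # three attribute changes: USNs 1001, 1002, 1003
    dc2.write()
print(dc1.pull_from(dc2))   # replicated

dc2.highest_usn = 1000      # improper snapshot restore: same journal, old USN
print(dc1.pull_from(dc2))   # usn rollback detected
```

Notice that the comparison only works because the journal stayed the same after the restore: DC1 still files DC2 under journal-1 and so can see the counter running backwards.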

This will be recorded in the Directory Service Log on DC2 as errors 2095 and 2103 as seen in the following graphics.

Run, Forest, Run

So now you know why a healthy DC runs in the opposite direction from one that appears ill, but how do you fix one of these problems and nurse the ill DC back to health? Antibiotic therapy and defibrillation steps will be covered in part 2. Stay tuned.