SharePoint: Common NTLM Authentication Issues, aka: Consider Ditching NTLM
Posted On January 26, 2021
Update 4/1/22: Added Important note to Issues #2 and #6
Update 1/26/21: Added Issue #7
NTLM authentication is not great.
It’s not the fastest. In most cases, that honor would go to Kerberos.
It’s not the most secure. Again, Kerberos.
It’s not all that flexible. For example, it doesn’t work well for extranets or anything cross-firewall. In those scenarios, Trusted Provider auth (SAML / WS-Fed) works well. See: AD FS.
It doesn’t work well with mobile clients, especially iPhone, iPad, etc. Just search the Interwebs for “ios ntlm prompt” and you’ll see what I mean. Some of this is due to the fact that those devices are not joined to the Active Directory domain, and some of it is because NTLM is a Microsoft technology and others are not great at implementing it client-side. Regardless, the best solution is to use Trusted Provider authentication, which is usually cookie-based and works well for all clients.
If you’re apprehensive about changing your authentication scheme within SharePoint just to appease your “mobile” users, you could use a Web Application Proxy (WAP) front-end as described here. In that case, authentication is cookie-based between the client and WAP, but still uses Windows Integrated (Kerberos in this case) between WAP and SharePoint, meaning you don’t have to do any user migration within SharePoint.
So why do so many still use it?
It’s the old stand-by. It works well enough, and there’s typically nothing extra you need to configure to get it to work. You just turn it on and it works. Unless it doesn’t, which is what this post is about.
Problems with NTLM usually manifest themselves in one of two ways:
1. Users cannot log in at all. They receive authentication prompts and then a 401 – Access Denied.
2. Users receive (seemingly) random authentication prompts when browsing SharePoint sites.
One thing to keep in mind when troubleshooting NTLM issues with SharePoint is that the problem is almost always external to SharePoint. Aside from turning it on or off, there’s not really anything you can configure inside of SharePoint to make NTLM work better or worse. To enable NTLM, this is all you do within Central Administration | Manage Web Applications | <Your web app> | Authentication Providers:
And this is the resulting configuration in IIS Manager | <Your Site> | Authentication | Windows Authentication | Providers:
Here are some known issues with NTLM in no particular order:
The network load balancer (NLB) is bouncing the client between web-front-ends (WFEs) in the middle of the “NTLM Handshake”.
Note: See “other troubleshooting tips” section below for details on the “NTLM Handshake”.
I know there’s some documentation out there that suggests that session persistence / affinity / “sticky sessions” is no longer required with the advent of Distributed Cache in SharePoint 2013 and above. However, that is not the case, at least not as long as you’re using NTLM. Staying on the same WFE is vital to any challenge / response authentication process (like NTLM).
Clearly, if the NTLM challenge comes from one WFE, but we send the response to another, that’s not going to work.
See this: https://en.wikipedia.org/wiki/Challenge–response_authentication “A more interesting challenge–response technique works as follows. Say, Bob is controlling access to some resource. Alice comes along seeking entry. Bob issues a challenge, perhaps “52w72y”. Alice must respond with the one string of characters which “fits” the challenge Bob issued. The “fit” is determined by an algorithm “known” to Bob and Alice. (The correct response might be as simple as “63x83z” (each character of response one more than that of challenge), but in the real world, the “rules” would be much more complex.) Bob issues a different challenge each time, and thus knowing a previous correct response (even if it isn’t “hidden” by the means of communication used between Alice and Bob) is of no use. A part of Alice’s response might convey that it is Alice who is seeking authentication.”
Now consider the above “Bob and Alice” scenario without session persistence (sticky sessions). Bob issues the challenge. Alice sends the response to Fred, who has no idea what she’s talking about. Authentication fails.
Configure your NLB for “sticky sessions” so that a given client stays on a given WFE, at least throughout the authentication process.
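To make the “Bob, Alice, and Fred” problem concrete, here is a minimal sketch of a generic challenge / response exchange. It is not real NTLM: the HMAC-based “fit”, the `WebFrontEnd` class, and the server names are all made up for illustration. The only point is that the server that issued the challenge is the only one that can check the response.

```python
import hashlib
import hmac
import os


class WebFrontEnd:
    """Toy server that remembers the challenges it issued (NOT real NTLM)."""

    def __init__(self, name, shared_secret):
        self.name = name
        self._secret = shared_secret
        self._pending = {}  # client id -> challenge this server handed out

    def issue_challenge(self, client_id):
        challenge = os.urandom(8)
        self._pending[client_id] = challenge
        return challenge

    def verify(self, client_id, response):
        challenge = self._pending.pop(client_id, None)
        if challenge is None:
            # This server never challenged this client, so it has
            # nothing to compare the response against.
            return False
        expected = hmac.new(self._secret, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


def client_response(secret, challenge):
    """The client's answer: a keyed hash of the challenge."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def authenticate(challenger, verifier, secret, client_id="alice"):
    """Get a challenge from one server and send the response to another."""
    challenge = challenger.issue_challenge(client_id)
    return verifier.verify(client_id, client_response(secret, challenge))


secret = b"hypothetical-shared-secret"
wfe1 = WebFrontEnd("WFE1", secret)
wfe2 = WebFrontEnd("WFE2", secret)

sticky_result = authenticate(wfe1, wfe1, secret)   # same WFE throughout
bounced_result = authenticate(wfe1, wfe2, secret)  # NLB switched WFEs

print("sticky session:", sticky_result)
print("bounced mid-handshake:", bounced_result)
```

With the same server on both ends the check succeeds. When the response lands on the second server it fails, even though both servers know the same secret, because the challenge state lives only on the server that issued it.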
Users are denied access due to settings in the local security policy on the WFEs.
Reproduce the problem and take a look at the Security Event Log on the WFE. You may see a logon failure event like this:
A logon type of “3” is a network logon. The failure reason tells us that there is something in the local security policy (possibly set by Group Policy) that is not allowing the user to logon.
Important: If you are not seeing logon failures in the Security event log, it could be because logon event auditing is not enabled. Open the Local Security Policy (secpol.msc) on the machine and go to Local Policies | Audit Policy | Audit logon events. Make sure at least “Failure” is selected.
Run SecPol.msc from the Run prompt or command line. Check Local Policies | User Rights Assignment. These two policies should be your focus:
Access this computer from the network
Deny access to this computer from the network
Check all group memberships for your problem user(s) to make sure they are allowed access from the network and not explicitly denied via those two policies.
By default, there are no users or groups listed in “Deny access to this computer from the network”, and the following groups normally have the “Access this computer from the network” privilege:
Administrators
Backup Operators
Everyone
Users
The LmCompatibilityLevel setting is not compatible between the client, the WFE, and the domain controllers (DCs).
From a Fiddler / IIS Log / data capture perspective, this one can be difficult to diagnose. IIS logs may just show 401.0, 401.1, 401.1, with the last 401.1 showing a “sc-win32-status” of “2148074252”, meaning “The logon attempt failed”, which is not overly helpful. However, if you go look at the registry or group policy editor on the applicable machines as described below, it should be easy to spot a problem.
Check the LmCompatibilityLevel Registry key for client, WFE, and DCs. Make sure the value is compatible between the three.
LmCompatibilityLevel is located here: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa
Note: This setting can be controlled by Group Policy (GPO), so you should check that to make sure any registry changes you make do not get reverted the next time group policy is applied. If you run gpedit.msc, you’ll find it under Computer Configuration | Windows Settings | Security Settings | Local Policies | Security Options:
If these are being set by GPO, you’ll need to change that on the domain controller and reapply group policy.
Important: You may have to reboot before changes take effect.
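What “compatible” means here can be sketched in a few lines. The table below is my simplified reading of the documented 0 to 5 values for the “Network security: LAN Manager authentication level” policy: lower levels send the older LM / NTLM responses, levels 3 and up send NTLMv2 only, level 4 refuses LM, and level 5 refuses both LM and NTLM. The function names are made up, and the sketch ignores session security and other details, so treat it as a sanity check rather than a definitive rule.

```python
def sent_by_client(level):
    """Response types a client sends at a given LmCompatibilityLevel (simplified)."""
    if level in (0, 1):
        return {"LM", "NTLM"}
    if level == 2:
        return {"NTLM"}
    return {"NTLMv2"}  # levels 3, 4 and 5


def accepted_by_authenticator(level):
    """Response types the authenticating machine accepts (simplified)."""
    if level <= 3:
        return {"LM", "NTLM", "NTLMv2"}
    if level == 4:
        return {"NTLM", "NTLMv2"}  # refuses LM
    return {"NTLMv2"}              # level 5 refuses LM and NTLM


def levels_compatible(client_level, authenticator_level):
    """True if the client sends at least one response type that is accepted."""
    for level in (client_level, authenticator_level):
        if level not in range(6):
            raise ValueError(f"LmCompatibilityLevel must be 0-5, got {level}")
    sent = sent_by_client(client_level)
    accepted = accepted_by_authenticator(authenticator_level)
    return bool(sent & accepted)


# A client that only sends NTLM (level 2) against a DC that only
# accepts NTLMv2 (level 5) is the classic mismatch.
print(levels_compatible(2, 5))
print(levels_compatible(3, 5))
```

For domain accounts, the machine doing the accepting is the domain controller, which is why the client, WFE, and DC values all need checking.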
DNS / Domain Trust problems.
This is most likely to occur for users that are in a remote domain or trusted forest. If DNS is not configured properly, the SharePoint WFE will not be able to get the proper IP address for a remote domain controller.
This one is a little harder to nail down. It can take a network trace with Netmon or Wireshark to fully diagnose. However, a good indication of the problem may lie in your IIS logs.
Check the IIS log for the problem SharePoint site. You may see that the final request that includes the whole NTLM token receives a 401.1 with a sc-win32-status of 2148074257.
Fix your DNS so that the SharePoint servers get the proper IPs for remote domain controllers. You should also verify your domain and forest trusts.
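The sc-win32-status values in the IIS log are just SSPI error codes written in decimal. Converting them to hex makes them much easier to look up. Here is a small sketch that does that for the four codes that show up in this post. The `describe_win32_status` helper is a made-up name, and the table is deliberately limited to those four codes.

```python
# SSPI error codes that appear in this post: hex value -> (name, meaning)
SSPI_ERRORS = {
    0x80090308: ("SEC_E_INVALID_TOKEN",
                 "The token supplied to the function is invalid"),
    0x8009030C: ("SEC_E_LOGON_DENIED",
                 "The logon attempt failed"),
    0x8009030E: ("SEC_E_NO_CREDENTIALS",
                 "No credentials are available in the security package"),
    0x80090311: ("SEC_E_NO_AUTHENTICATING_AUTHORITY",
                 "No authority could be contacted for authentication"),
}


def describe_win32_status(value):
    """Turn a decimal sc-win32-status from an IIS log into hex plus a name."""
    code = int(value) & 0xFFFFFFFF
    name, text = SSPI_ERRORS.get(code, ("UNKNOWN", "not in this small table"))
    return f"0x{code:08X} {name}: {text}"


for status in ("2148074257", "2148074252", "2148074254", "2148074248"):
    print(status, "->", describe_win32_status(status))
```

So the 2148074257 above is 0x80090311, SEC_E_NO_AUTHENTICATING_AUTHORITY, which fits a WFE that cannot reach a domain controller for the user’s domain.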
The MaxConcurrentAPI limit is being reached.
This is a bit of a complicated topic, but you can sum it up like this: There is a finite number of Netlogon process threads available for NTLM authentication on both the SharePoint WFEs and the domain controllers. When that number is exceeded, authentication requests can fail. This typically happens in large environments with heavy NTLM traffic, and especially when that authentication occurs across domain trusts.
Switch SharePoint (and other applications) to use Kerberos authentication.
This cuts down significantly on Netlogon service traffic, in most cases relieving the bottleneck. However, keep in mind that Kerberos authentication can still be impacted by MaxConcurrentAPI if there is a significant amount of it requiring PAC verification, or if NTLM authentication from other applications is saturating available threads.
Another option is cutting down authentication traffic by making more resources available anonymously.
For example, within an out-of-box SharePoint site, all supporting files (CSS, JS, images, etc.) are stored on the file system and are available anonymously (most are in the _layouts folder). However, some customizations and branding may store supporting files within a document library where an authentication request must occur for each file request. The result can be a dozen or more NTLM authentication requests for each page load. Moving those supporting files to their own folder in _layouts, or otherwise making them anonymously accessible, will drastically reduce total authentication traffic when browsing the site.
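One rough way to see where the authentication traffic is coming from is to count the 401 responses per top-level folder. The sketch below assumes you have already pulled (URL, status) pairs out of your IIS log for a single page load. The sample entries, including the `/SiteAssets` library path and file names, are invented for illustration.

```python
from collections import Counter


def count_challenges_by_folder(entries):
    """Count 401 responses grouped by the first path segment of the URL."""
    counts = Counter()
    for uri_stem, status in entries:
        if status == 401:
            first_segment = uri_stem.strip("/").split("/")[0]
            counts["/" + first_segment] += 1
    return counts


# Hypothetical page load: branding files live in a document library,
# out-of-box files live in _layouts.
page_load = [
    ("/SiteAssets/brand.css", 401),
    ("/SiteAssets/brand.css", 401),
    ("/SiteAssets/brand.css", 200),
    ("/SiteAssets/logo.png", 401),
    ("/SiteAssets/logo.png", 401),
    ("/SiteAssets/logo.png", 200),
    ("/_layouts/15/core.js", 200),
]

challenges = count_challenges_by_folder(page_load)
for folder, total in challenges.most_common():
    print(folder, total)
```

Folders that rack up 401s on every page load are the candidates for moving into _layouts or otherwise making anonymously accessible.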
Selective Authentication enabled on the domain trust
Note: this would typically result in a scenario where users in the same domain as the SharePoint servers can authenticate successfully, but users in trusted domains cannot.
Check the IIS log for the problem SharePoint site. You may see that the final request that includes the whole NTLM token receives a 401.1 with a sc-win32-status of 2148074252.
Sc-win32-status “2148074252” means: SEC_E_LOGON_DENIED – The logon attempt failed. That is not very helpful, so we need to keep looking.
Look at the Security event log on the web-front-end. You should see an audit failure for a logon event like this:
Important: As shown in Issue #2 above, if you are not seeing logon failures in the Security event log, it could be because logon event auditing is not enabled. Open the Local Security Policy (secpol.msc) on the machine and go to Local Policies | Audit Policy | Audit logon events. Make sure at least “Failure” is selected.
The failure message is “An Error occured during Logon”, which is also not helpful. However, looking up the Status code “0xC0000413” reveals:
Logon Failure: The machine you are logging onto is protected by an authentication firewall. The specified account is not allowed to authenticate to the machine.
Either remove selective authentication from the domain trust, or grant the “Allowed to Authenticate” permission to the users on the SharePoint web-front-end computer objects.
The client IP changes mid-NTLM handshake, which invalidates the handshake.
This is similar to issue #1, but occurs on the client-side of the NTLM handshake. It may manifest itself as an intermittent prompt for credentials. You would see a sequence like the following in the IIS log:
There are 21 seconds between the second and third GET, which would indicate the user was prompted and then entered their credentials.
So the second request is the one of interest, as that’s where the authentication prompt is happening.
The result was 401.1 with “sc-win32-status” of 2148074248
2148074248 means: SEC_E_INVALID_TOKEN — The token supplied to the function is invalid.
Notice that the Client IP of the first request is 192.168.0.33, but the client IP for the second request (and third and fourth) is 192.168.100.56.
This is a problem. Here’s why:
The first 401.1 with sc-win32-status 2148074254 (which means: “No credentials are available in the security package”) is expected because it’s an anonymous request. That’s when the client is challenged by the server. The Client IP for that challenge is 192.168.0.33.
The next request actually comes from the same user / same client machine, but shows as coming from Client IP 192.168.100.56 and contains the response to that challenge. But since it comes from a different Client IP, it is not seen as a valid challenge response by the server, so 401.1 is sent again with sc-win32-status 2148074248 (which means: “The token supplied to the function is invalid”). That’s when the credential prompt occurs.
You would need to talk to your networking team to understand why this is happening.
Generally, a reverse proxy allows you to either pass-through the original client machine IP, or substitute the reverse proxy IP as the client IP. Either one should work as long as the client IP stays consistent throughout the user’s session.
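If you need to hunt for this pattern in a larger log, something like the sketch below can help. It reads W3C-format log text, then flags any 401 with sc-win32-status 2148074248 that directly follows a 401 for the same URL from a different client IP. The field names are standard IIS W3C fields, but the sample log is invented: only the two IPs and the two status codes come from the example above, and the timestamps, URL, and final 200 line are made up. It also assumes the log has already been filtered down to one user’s requests, since a busy log interleaves many clients.

```python
SAMPLE_LOG = """\
#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-substatus sc-win32-status
2021-01-26 14:00:01 192.168.0.33 GET /SitePages/Home.aspx 401 1 2148074254
2021-01-26 14:00:01 192.168.100.56 GET /SitePages/Home.aspx 401 1 2148074248
2021-01-26 14:00:22 192.168.100.56 GET /SitePages/Home.aspx 200 0 0
"""

INVALID_TOKEN = "2148074248"  # SEC_E_INVALID_TOKEN


def parse_w3c(text):
    """Parse W3C log text into a list of dicts keyed by the #Fields names."""
    fields = []
    rows = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            fields = line.split()[1:]
        elif line and not line.startswith("#"):
            rows.append(dict(zip(fields, line.split())))
    return rows


def find_ip_changes(rows):
    """Return (old_ip, new_ip, url) for each handshake broken by an IP change."""
    findings = []
    for prev, cur in zip(rows, rows[1:]):
        same_page = prev["cs-uri-stem"] == cur["cs-uri-stem"]
        both_401 = prev["sc-status"] == "401" and cur["sc-status"] == "401"
        invalid_token = cur["sc-win32-status"] == INVALID_TOKEN
        if same_page and both_401 and invalid_token and prev["c-ip"] != cur["c-ip"]:
            findings.append((prev["c-ip"], cur["c-ip"], cur["cs-uri-stem"]))
    return findings


rows = parse_w3c(SAMPLE_LOG)
findings = find_ip_changes(rows)
for old_ip, new_ip, url in findings:
    print(f"{url}: challenge sent to {old_ip}, response arrived from {new_ip}")
```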
Other troubleshooting tips:
Test it outside of SharePoint:
This is a good isolation technique. The idea is to see if NTLM is working at all on your SharePoint web-front-ends.
Create a file share on the WFE. From a client machine that is having problems authenticating to SharePoint, try to access the file share using the WFE’s IP address. Example: \\192.168.0.33\Share. You must use IP and not the server name to force NTLM. If you use the server name, Kerberos will normally be used to authenticate to the share, which is not the test we’re going for. Does accessing the share by IP work? If you get prompted for credentials and can’t authenticate, you should probably leave your SharePoint admins alone and start talking to your AD admins.
Note: This test may not be conclusive on Windows Server 2016 or other platforms where accessing a file share by IP is prohibited.
Use your tools:
As we saw in the above sections, IIS logs, the Security Event Log in Event Viewer, and Network traces can assist in diagnosing these problems. In this section, I’ll walk you through using Fiddler to view the authentication traffic. The purpose is to show what a successful NTLM authentication looks like. You can use that to compare to your own trace of a failure.
NTLM authentication is done in a three-step process known as the “NTLM Handshake”.
The first request is normally made anonymously. This is true of Kerberos as well.
The site requires authentication, so the SharePoint server responds with a 401 – Unauthorized and a “WWW-Authenticate: NTLM” header. That header is how the server tells the client which authentication methods to try.
The client makes a second request for the same page. This time it includes half of the NTLM token. The server again responds with a 401 (unauthorized) and issues an NTLM challenge.
The client makes a third request with the whole NTLM token, is successfully authenticated, and receives a 200 OK for home.aspx.
Note: The NTLM Handshake is not really a half-token / full-token situation, but for the purposes of simplifying the NTLM Handshake process, I find that explanation works well enough. I think it helps to differentiate the second request (notice the client NTLM authorization header is fairly short) from the final request (NTLM header is much longer). If you see your client send the full NTLM token, but the server still responds with 401 – unauthorized, then you need to look closer at the known issues described above.
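If you would rather not judge tokens by length, you can decode them. Each NTLM message is base64 in the header value, starts with the signature “NTLMSSP” plus a null byte, and then carries a 4-byte little-endian message type: 1 (negotiate, the short “half token” from the client), 2 (the server’s challenge), or 3 (authenticate, the long “whole token”). The sketch below reads that type from a header value copied out of Fiddler. The function names are made up, and the demo headers are synthetic tokens padded with zero bytes, not real captures.

```python
import base64
import struct

NTLM_MESSAGE_TYPES = {1: "NEGOTIATE", 2: "CHALLENGE", 3: "AUTHENTICATE"}


def ntlm_message_type(header_value):
    """Return the NTLM message type name for a value like 'NTLM TlRMTVNT...'."""
    scheme, _, token = header_value.strip().partition(" ")
    if scheme.upper() != "NTLM" or not token:
        raise ValueError("not an NTLM header value with a token")
    raw = base64.b64decode(token)
    if len(raw) < 12 or raw[:8] != b"NTLMSSP\x00":
        raise ValueError("missing NTLMSSP signature")
    (msg_type,) = struct.unpack("<I", raw[8:12])
    return NTLM_MESSAGE_TYPES.get(msg_type, f"UNKNOWN({msg_type})")


def build_demo_header(msg_type, payload_len):
    """Build a synthetic header value for demonstration only (zero-filled body)."""
    raw = b"NTLMSSP\x00" + struct.pack("<I", msg_type) + bytes(payload_len)
    return "NTLM " + base64.b64encode(raw).decode("ascii")


second_request = build_demo_header(1, 28)    # client Authorization header
server_challenge = build_demo_header(2, 200) # server WWW-Authenticate header
third_request = build_demo_header(3, 400)    # client Authorization header

for label, value in [("second request", second_request),
                     ("server challenge", server_challenge),
                     ("third request", third_request)]:
    print(label, "->", ntlm_message_type(value), f"({len(value)} chars)")
```

A handy shortcut that falls out of this: the base64 text of the three message types begins with TlRMTVNTUAAB, TlRMTVNTUAAC, and TlRMTVNTUAAD respectively, so you can often tell them apart at a glance in Fiddler.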