Managing Dynamics 365 Performance and Scalability

System performance is a critical non-functional requirement in any business application, yet we seldom take it into account until after there’s a problem. The worst thing that can happen after investing months in developing a system is for it to fail after being deployed to users. User feedback and political fallout can be quite brutal when a system fails to meet expectations, and the consequences of a poorly performing system can be severe, even to the point that people lose their jobs. While system performance is often viewed as an intangible aspect of application development, it’s based on design patterns and distributed system architecture principles that are addressable. This means that while we may not be able to catch every possible scenario, we can be proactive in preventing many issues before they occur if we know what to look for in the system.

Performance vs. Scalability

Performance is defined as the ability to perform (with efficiency) or the manner in which a mechanism performs. The majority of systems appear to perform well during project development. While testing comes in different flavors, performance testing isn’t something many project teams have mastered in the Dynamics domain.

The challenge with performance testing is that without knowing what to look for, we’re left to either test everything in an attempt to be thorough or test a few key areas in the hope that we’ve guessed correctly. Performance testing is hard to do when no one really knows the usage patterns of users as they begin to put a real load on a system. This is the proverbial chicken-and-egg scenario: we want to know how the system will perform, but we have to deploy it to find out. While we can’t predict with 100% accuracy the load users will place on a system, we can be pragmatic about how we design it.

Scalability, on the other hand, is the capability of a system to handle the demands of additional users. The question is: will the system continue to perform as we add more users and data? While performance may be great for developers and the first few hundred users in the system, the question remains whether it will stay that way when we onboard thousands or tens of thousands. Even in the case of a system with 10,000 users that initially runs well, we still have to ask whether it will continue to do so as more data is added.

After system deployment, the most common tactic is to throw additional hardware at a poorly performing system as a stop-gap to buy the development team time. I like to call this the duct tape and pray strategy. Imagine you’re in a sinking boat and all you have is some tape to try to close the hole where the water is coming in. While managers are beefing up the hardware, developers are sent in like commandos hunting for problems. Sometimes the hardware scaling strategy will work for a time, but often performance issues are baked into the design, which means stopping the immediate bleeding doesn’t solve the problem long term. The fact is you must first design the system to be scalable. While the system may perform well initially, performance isn’t an indicator of scalability, which is the reason so many customers are caught off guard when performance suddenly degrades.

The Accidental Architecture

The Accidental Architecture pattern is a phenomenon that results from the nature of software development. Unlike building architecture, where all of the design happens in advance, software development is more of a discovery process and a continual evolution over time. Users know what they want, but they tend not to fully understand what they need. As a development team releases software, the assumptions that went into that development cycle are tested. Feedback received can be factored into roadmap decisions about implementation priorities, what features to add, what needs to change, or possibly what to remove altogether. This means that the system is always in some state of flux.

Regardless of the delivery methodology used in creating software the system architecture eventually ends up as a layered set of functionality based on past decisions and current design decisions. The various components added or changed in the implemented architecture may introduce unanticipated friction with other system components.

Feature Bugs vs. Systemic Bugs

Almost anyone who has been around the IT space has heard about software bugs. The bugs that we normally think about are feature or functional bugs, where software doesn’t behave correctly per one or more business requirements. Business stakeholders and testers can more easily detect feature bugs because they are more observable. Most testing efforts are focused on feature testing because stakeholders can see whether features work or not.

Systemic bugs, however, are a dysfunction of the features working together, which I call functional antagonism. Not only are systemic bugs much less observable, but the possible variations of system testing scenarios grow exponentially as more features are added to the system. When performing systemic testing, you have to take into account that each feature needs to be tested with others, and each feature may have multiple functional parts to be tested along with variations in calling hierarchy. When it comes to Dynamics 365 systems, any business logic that calls the platform API is a performance variable that should be taken into consideration.
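To put rough numbers on how quickly the testing scenarios multiply, here is a minimal sketch in Python. It assumes the simplest possible model, which is mine and not a Dynamics 365 measurement: every feature is either active or inactive, and any group of two or more active features is a potential interaction to test. The function names are hypothetical, and the model ignores the extra variations in calling hierarchy mentioned above, so real counts would be higher.

```python
from math import comb


def pairwise_scenarios(feature_count: int) -> int:
    """Number of distinct feature pairs that could be tested together."""
    return comb(feature_count, 2)


def interaction_scenarios(feature_count: int) -> int:
    """Number of feature groups of size two or more that could interact.

    All subsets (2**n) minus the empty set and the n single-feature subsets.
    """
    return 2 ** feature_count - feature_count - 1


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(f"{n} features: {pairwise_scenarios(n)} pairs, "
              f"{interaction_scenarios(n)} possible interacting groups")
```

Under this model, going from 10 to 20 features moves the count from roughly a thousand interacting groups to over a million, which is why testing everything stops being an option early on.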

An example of systemic collaboration is a tug-of-war competition, where two teams pull a rope in opposite directions. If the members of a particular team aren’t working in unison, they will have a much harder time performing well as a team. In the worst case, if a team has members pulling in the same direction as the opposing team, it is far more likely to fail.

Performance is a systemic concern, and as such we don’t typically notice performance problems until the system as a whole is placed under stress. To make it even more exciting, problems may not occur until certain feature subsets happen to be activated simultaneously. Depending on the usage scenario, one or more of these feature subsets may be used less often, making issues appear random.

Functional Affinity vs. Functional Antagonism

Functional affinity is the degree to which two or more systemic functions can operate symbiotically. From a project stakeholder perspective, there is an unspoken, natural assumption of functional affinity. When requirements for various features come up, stakeholders automatically assume the features will work nicely together. The smaller the business automation footprint of a system, the less there is to worry about when it comes to performance. As the business automation footprint increases, however, we need to be much more careful about how we implement business requirements.

Functional antagonism is the degree to which two or more systemic functions oppose each other. If you consider for a moment how software projects are typically developed, the concept of functional antagonism shouldn’t be much of a surprise. You have a group of developers who are assigned various system features to build. Each developer or development group focuses on building out their particular part of the system. Another developer may create a piece of functionality that works fine when tested independently. These features may even work fine together in standard testing scenarios. But what happens when 3,000 users simultaneously activate those features together? Now take an enterprise system with dozens or more features that all need to operate together within the system.
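As a back-of-the-envelope illustration of why features that pass individual testing can still collide under load, here is a small Python sketch. It rests on assumptions of my own: that the features all need a brief exclusive hold on the same shared resource (for example, one heavily used record), that every request arrives at the same moment, and that requests are served one at a time. The function name and the 20 ms hold time are hypothetical, not measured Dynamics 365 values.

```python
def worst_case_wait_ms(concurrent_users: int,
                       features_per_user: int,
                       lock_hold_ms: float) -> float:
    """Wait time for the last request when all requests serialize on one lock."""
    total_requests = concurrent_users * features_per_user
    # The last request in line waits for every request ahead of it to finish.
    return (total_requests - 1) * lock_hold_ms


if __name__ == "__main__":
    # One tester exercising two features: the second call waits 20 ms.
    print(worst_case_wait_ms(1, 2, 20.0))
    # 3,000 simultaneous users exercising the same two features.
    print(worst_case_wait_ms(3000, 2, 20.0) / 1000, "seconds")
```

In this simplified model a single tester sees a 20 ms delay that nobody would notice, while the unluckiest of 3,000 simultaneous users waits about two minutes for exactly the same code. Real systems have timeouts, retries, and more than one contended resource, so treat this as a way to reason about the shape of the problem and not as a prediction.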

Understanding Load Anti-Patterns

During project development, it is the technical lead’s (or architect’s) job to oversee the system architecture and functional affinity as a whole. While developers are focused on the implementation of specific features, the technical lead keeps a watchful eye on the big picture, ensuring the harmony of the parts. An effective technical lead must be able to analyze code and be well versed in design patterns and distributed system architecture. Over the years I’ve observed a common set of load patterns that turned out to be the cause of most major system performance issues. The key to preventing performance issues is first to understand what these load patterns are and then to spot them during the development phase of a project, before deployment. The lesser alternative is to address them after the fact, under duress from angry users and business stakeholders.

It’s Going to Get Worse Before It Gets Better

When I look at the expansion of the Microsoft business platform, I feel a sense of awe at the speed at which the product teams are moving. Back in the day you only had to worry about a single platform most of the time. As the Dynamics integrations into the Microsoft ecosystem continue to grow, we’re presented with an increasing number of implementation scenarios. Between Microsoft Flow, the Common Data Service, Power Apps, external add-ons, and everything else, we have what seems like an infinite number of performance variables to consider. Many solution architects don’t have a full appreciation for the variables that play into system performance, so as the Microsoft Dynamics Universe (MDU) expands I expect to hear more stories of performance failure.

At the moment we have to accept the fact that it is easier to analyze performance issues in an on-premises environment because we have direct access to the database. While Dynamics CRM Online customers can create support tickets to request performance traces, it is a more convoluted process during what is usually a time of urgency. I expect that the Organization Insights tool will eventually provide a high degree of insight into database operations, allowing customers to self-diagnose performance issues happening in their organization databases. Until then, our best bet is to prevent performance issues before they happen. The unfortunate fact is that it’s much harder to appreciate a problem that never happened because it was already addressed.
