
Debugging UBIK


Revision as of 12:22, 28 June 2023 by NWE

One of the most complex challenges in any software project is debugging unintended behavior. In UBIK®, every project has an inherent structure, which we can exploit for debugging. Let's find out how.


A general policy for debugging

Our immediate goal in debugging is not to fix the issue. Instead, we want to find out why the system behaves the way it does. Additionally, we must learn what the designated behavior is, which may turn out to be more complex than originally anticipated. Only then can we change the underlying code or configuration to achieve the desired behavior. We can condense this insight into a general policy for debugging:

  1. Find out how to reproduce the issue reliably
    1. Ask the reporter how they reproduce it
    2. Test it ourselves and improve the reproduction if possible
  2. Find the cause for the current behavior
    1. If we get an error message, we can try to search the internet for it. Maybe somebody else has had the same problem.
    2. If this doesn't help, we need to investigate ourselves.
    3. Try to visualize what steps the algorithm goes through in the code
    4. Create a working hypothesis of what's going on: "I think what's going on is...!"
    5. Find a good entry point for debugging in the code
    6. Attach the debugger of an IDE (like Visual Studio) to the process if possible
    7. If this is not possible, try to generate log output or add debug output to the UI
    8. Inspect the steps the algorithm goes through (either by creating log entries or by stepping through with the debugger)
    9. Inspect the state of the involved variables throughout the algorithm (either by creating log entries or by looking at the variables in the debugger)
    10. Now we adapt our hypothesis, refine the debugging and repeat the process until we understand what is happening.
  3. Find out what the desired behavior is, as opposed to what is currently happening
    1. In some cases, this is complex. We mustn't be afraid to think this through thoroughly, and ask responsible persons if we are not in the position to decide it.
    2. Define the functional design (i.e., a suggestion for the desired behavior) as clearly and simply as possible.
  4. Create a technical design for the solution
    1. Once we know the designated behavior, we can describe how to achieve it technically.
    2. That mostly means:
      1. Basic idea
      2. What modules are involved?
      3. Changes to the data model
      4. Changes to the algorithm (i.e., workflow logic)
    3. Define the technical design as clearly and simply as possible.
  5. Implement a fix
  6. Retest the fix using our reproduction

This is basically independent of the product or framework you're using. With UBIK®, we can get more concrete.

Debugging UBIK®

The first step, namely to find a reproduction, stays the same as in the general case described above: Ask, test and refine. The general approach to finding the cause, namely by improving your hypothesis and inspecting what's going on, is still valid, too.

However, there are some considerations we can specify with respect to UBIK®.



Visualizing the algorithm

In order to find out what's going on and to debug efficiently, we must be able to imagine the workflow and architecture of the use-case.

In UBIK®, the behavior of any project's use-case can be distributed across multiple products: the client application with its UI customizing; the server products, including the database, the Enterprise Service, the UBIK® Web Services and UBIK® Studio; any plugins; and the server customizing, consisting of the data model, configuration objects and custom code.

Figure: The UBIK platform architecture

A good next step is to find out how the affected use-case is implemented. Some use-cases are very simple, but in many cases, quite a few modules and steps are involved. We want to answer the questions: Which products and modules are involved, and how do they interact?

Nearly all use-cases in UBIK® projects are related either to the mobile client or to interfacing with 3rd party systems. Though specific implementations can differ considerably, the general flow of information through the UBIK® modules is almost always similar. If there is a problem, it must occur in one of the steps listed below, caused by one of the listed dependencies.

Mobile client

  1. The mobile client requests data from the Content web service, using
    • Hardware (network, client hardware)
    • Profile settings
    • Credentials
  2. The web service establishes a connection to the UBIK Environment, using
    • The network
    • The web service configuration
    • The database
    • Injected UBIK® Plugins
    • The programmatic customizing
  3. The web service tries to perform the requested action based on
    • Hardware (network, server hardware)
    • The data model
    • The programmatic customizing
    • The View configuration
    • Content data
  4. The client receives the result and tries to display it depending on
    • Hardware (network, server & client hardware)
    • The XAML customizing
    • The ACM meta definitions
    • Content data

Interfacing

Similarly, interfaces to other systems like SAP or Comos usually follow this workflow:

  1. Somebody (or something) configures a UBIK® Enterprise Service task using its web service interface or a configuration file.
  2. An ES run is triggered, most likely using the Windows Task Scheduler.
  3. The ES establishes a connection to the UBIK Environment, using
    • The network
    • The web service configuration
    • The database
    • Injected UBIK® Plugins
    • The programmatic customizing
  4. The ES tries to perform the requested action, usually based on
    • Hardware (network, server hardware)
    • The data model
    • The programmatic customizing
    • The Proxy configuration
    • Content data
    • The external system

In this case, the UBIK® Proxy mechanism is an additional source of complexity; but there's a separate article for that.

Hypothesizing

If we know what our basic algorithm looks like, we can try to formulate an idea of what could have gone wrong. Ideally, we then look for proof and watch it happen in action, but it's always good to know the potential sources of errors. In general, there are several common types of problems and, from another perspective, a set of common sources for such problems.

Types of problems

Performance issues

Performance issues can be caused by:

  • Hardware problems, e.g., slow network or weak devices
  • Huge loads of data
  • Inefficient algorithms (with poor scalability)

A combination of the above is not uncommon.

Sometimes, weak hardware is something we cannot easily change. In many cases, however, we can optimize our algorithm so that it performs well even on weak hardware. In other cases, we can restructure the problem or the data in a way that makes it easier to process. Before we can solve the problem, though, we have to find the cause, which for now means finding the bottleneck: Is the network connection too slow, are we working with too much data, is our interface algorithm taking too long, or are we sending too many web requests?
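To find the bottleneck, it helps to measure rather than guess. The following minimal sketch (in Python for brevity; in C# custom code, `System.Diagnostics.Stopwatch` serves the same purpose) times each step of a use-case and reports the slowest one. The steps `load_data` and `transform` are hypothetical stand-ins, with a sleep simulating a slow network request.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> accumulated duration in seconds

@contextmanager
def timed(step_name):
    """Measure how long the wrapped block takes and add it to timings[step_name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step_name] = timings.get(step_name, 0.0) + (time.perf_counter() - start)

# Hypothetical stand-ins for the real steps of a use-case.
def load_data():
    time.sleep(0.05)  # pretend: slow network request
    return list(range(1000))

def transform(items):
    return [i * 2 for i in items]  # pretend: cheap local processing

with timed("load_data"):
    items = load_data()
with timed("transform"):
    result = transform(items)

# Print the steps from slowest to fastest; the slowest is the bottleneck candidate.
for step, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{step:<10} {seconds * 1000:8.1f} ms")
bottleneck = max(timings, key=timings.get)
```

Measuring each step separately keeps you from optimizing code that isn't actually the problem.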

Crashes

Crashes are often caused by bugs in the software, but in some cases the customizing can crash the app, too. In case of a bug in UBIK®, once you have identified the faulty module and pinned down the reproduction, please notify the development department by creating a support ticket. Until the bug is fixed, you might want to find a workaround to avoid the crash.

If the customizing is at fault, the most likely source of a crash is an unhandled exception, e.g., a NullReferenceException: the program tries to access a property or method of an object that is actually null.

Missing or erroneous data

Data that doesn't look the way we expect can make our customizing (both on the server and on the client) misbehave. The simplest example is a value being null when we assume it isn't, which in C# usually leads to a NullReferenceException (and potentially even a crash). But there are many other potential problems, for example a value outside of an expected range, or a value that is technically acceptable but semantically wrong. For instance, we could end up with a document "chair.jpg" that actually shows a table, because the import mixed up a mapping.

Missing or erroneous data can be caused by:

  • The data having been imported erroneously from an external system
  • The data having been entered incorrectly by hand
  • A misconfiguration transforming the data or making the data unavailable on the client (or other presenting module)
    • User rights
    • Faulty filters
    • Incomplete ACM/View customizing
  • A bug in the UI customizing, so the data is just hidden or presented wrong

Other misbehavior

The app or interface just doesn't behave as expected. The technical concept is sound, the input data looks fine, but the result is wrong. In this case it's a good approach to get a complete list of prerequisites and check one thing after another, separately. Surprisingly often, it's something like a typo in the settings.
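Checking prerequisites one after another can be as simple as a list of named checks that are evaluated separately, so that one failing check doesn't hide another. The sketch below assumes a hypothetical settings dictionary and hypothetical checks; the deliberately misspelled URL stands for the kind of typo that is easy to overlook.

```python
# Hypothetical settings of an interface run; note the typo "contnet" in the URL.
settings = {
    "service_url": "https://example.invalid/contnet",
    "timeout_seconds": 30,
    "proxy_enabled": True,
}

# Each prerequisite is a (description, check) pair; the checks are illustrative only.
prerequisites = [
    ("service URL is set", lambda s: bool(s.get("service_url"))),
    ("service URL points to the content service", lambda s: s["service_url"].endswith("/content")),
    ("timeout is positive", lambda s: s["timeout_seconds"] > 0),
    ("proxy is enabled", lambda s: s["proxy_enabled"] is True),
]

def check_all(settings, prerequisites):
    """Run every check separately and return the descriptions of the failed ones."""
    failed = []
    for description, check in prerequisites:
        try:
            ok = check(settings)
        except Exception as exc:  # a crashing check counts as a failure, too
            ok = False
            description += f" (check raised {type(exc).__name__})"
        print(("OK   " if ok else "FAIL ") + description)
        if not ok:
            failed.append(description)
    return failed

failed = check_all(settings, prerequisites)
```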

If you've been looking for ages and still can't find the error, you probably need some distance from the problem. Take a break, sleep on it, do something else. Then try to get an overview and make a plan before you get back to debugging. Even the most hopeless problems usually look different after a good night's sleep.

Finally, if you can't come up with a good hypothesis, that's not a problem. Try to inspect what's actually happening instead, and the hypotheses will follow.

Modules and problem sources

Different types of problems are not the only categories we can think in. UBIK® is a complex ecosystem with multiple products and many modules, and it uses other products and frameworks to do its job. Hence, a problem can have many different sources. It can even be caused in one place and, as a consequence, surface in another. Here's a list of potential causes (a combination of several is possible):

  • Infrastructure
    • (Network) hardware problem
    • Network security restriction
    • User rights restriction
  • Client App
    • Erroneous data (unexpected values provoke the problem)
    • Wrong configuration (the profile or a configuration object coming from the server is misconfigured)
    • UI customizing (some XAML contains an error)
    • Core implementation (the app itself has a bug)
  • Web Service, Studio or Enterprise Service
    • A manual step was forgotten (rebuilding the custom code, releasing the ACM meta definitions, restarting the web service, ...)
    • Erroneous data (unexpected values provoke the problem)
    • Wrong configuration (a configuration file or an object is misconfigured, potentially including ACM and Proxy/IF configuration)
    • Plugin code (a standard or customer plugin has a bug)
    • Custom code (custom code of meta classes or the custom code library has a bug)

Inspection

Somehow, we must see what's really going on under the hood. No matter how good your hypothesis is, it's of no use if you can't verify or falsify it. Quite frequently, a hypothesis turns out to be wrong and you have to come up with a better one, ideally based on hard facts. So how do we get more information about the problem?

The key is inspection: we look at the state of the program as it performs critical steps of the algorithm. Basically, we want to know:

  • When the algorithm makes a decision, which decision does it make and why?
  • Where is the first wrong decision made, and how does it end up in the observable erroneous state?

Mostly, this means outputting the current values of variables, along with the current module and method, at a given point in the algorithm. It can also mean inspecting the input data or parameters of our algorithm to improve our hypothesis. The state of a UBIK® system can be inspected in the following ways:

Inspect the mobile client

  • Use the Developer Mode to inspect the currently visible view models and their values.
  • Inspect the log files of the mobile client, including the web service client log.

Inspect the web services or the Enterprise Service

  • Inspect the log files of the web service or Enterprise Service
  • Modify your plugin or programmatic customizing to output log messages containing the state of your program at critical points
  • Use a Who-Bert script to test a specific setup and output log messages to the console.
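When adding log output to custom code, it pays off to log each decision together with its reason, plus the current module and method. Here is a minimal sketch using Python's standard `logging` module as a stand-in for whatever logging facility your plugin or customizing uses; `select_documents`, the logger name and the document fields are hypothetical.

```python
import logging

# %(name)s and %(funcName)s add the current module (logger) and method to every line.
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(name)s.%(funcName)s: %(message)s")
log = logging.getLogger("custom_code")  # hypothetical logger name

def select_documents(documents, required_status):
    """Hypothetical filter step that logs its input, every decision and its result."""
    log.debug("Start: %d documents, required_status=%r", len(documents), required_status)
    selected = []
    for doc in documents:
        if doc.get("status") != required_status:
            log.debug("Skipping %s: status=%r", doc.get("name"), doc.get("status"))
            continue
        selected.append(doc)
    log.debug("Done: selected %d of %d", len(selected), len(documents))
    return selected

docs = [
    {"name": "P-101.pdf", "status": "released"},
    {"name": "P-102.pdf", "status": "Released"},  # wrong casing: easy to miss in the UI
]
selected = select_documents(docs, "released")
```

In the UI, one document would simply be missing; the log line `Skipping P-102.pdf: status='Released'` points directly at the casing mismatch.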

Solving: Performance Problems

If you're in the technical design stage, you've already found the reason for the performance issues. In case of a hardware or infrastructure bottleneck, you can either try to improve the circumstances or adapt to them by optimizing your solution.

Often, the bottleneck is the network connection or the mobile application. However, optimization is also required if the infrastructure is fine but UBIK® takes too long to process the use-case.

In both scenarios, we want to apply the following measures:

  • Leverage strengths instead of weak points
  • Perform as few processing steps as possible
  • Partition the problem into several smaller problems

Leverage strengths

Usually, the server is strong and fast, the mobile device not so much, and the network is a performance graveyard. If you wanted to waste as much performance and time as possible, you would maximize the number of network interactions and shift all the workload to the client application. Conversely, leveraging the strengths in UBIK® means shifting all the calculation and preparation to the server and delivering the results to the client in the most compact form, in one request-response cycle. Often, this means creating a new data model on the server that reflects what you want to show on the client, and using programmatic customizing to prepare it. Even if you already have all the data on the server side, it often pays off to restructure it just for the client, so the client can use basic features to simply show the data. So, the rule of thumb is: The less client customizing you need, the better.
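To illustrate why round trips matter, the following sketch counts request-response cycles for two approaches: fetching every object individually and preparing the list on the client, versus letting the server prepare a compact list in a single request. `FakeServer` and its methods are hypothetical and only count calls; they do not model the UBIK® web service API.

```python
class FakeServer:
    """Hypothetical stand-in for a web service; counts request-response cycles."""

    def __init__(self, items):
        self.items = items  # id -> raw record
        self.round_trips = 0

    def get_item(self, item_id):
        self.round_trips += 1
        return self.items[item_id]

    def get_prepared_list(self, item_ids):
        """One request: the server filters and shapes the data for the client."""
        self.round_trips += 1
        return [
            {"id": i, "label": f"{self.items[i]['tag']} ({self.items[i]['state']})"}
            for i in item_ids
            if self.items[i]["state"] != "archived"
        ]

items = {i: {"tag": f"P-{i:03}", "state": "archived" if i % 10 == 0 else "active"}
         for i in range(1, 101)}

# Client-heavy: fetch every item separately, then filter and format on the client.
chatty = FakeServer(items)
client_side = []
for i in range(1, 101):
    raw = chatty.get_item(i)
    if raw["state"] != "archived":
        client_side.append({"id": i, "label": f"{raw['tag']} ({raw['state']})"})

# Server-heavy: one request; the client only displays the prepared list.
compact = FakeServer(items)
server_side = compact.get_prepared_list(range(1, 101))

print(chatty.round_trips, compact.round_trips)
```

Both approaches produce the same 90 entries, but the client-heavy variant needs 100 round trips instead of one, and on a slow mobile network each of them adds its full latency.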

Minimize processing steps

If an algorithm scales badly, even a super computer can be too slow to perform well.

The idea of time complexity in computer science is:

  • Every step in a program consumes a certain amount of processing time. If we have N steps, and one step takes x seconds, we have x * N seconds in total.
  • Most programs have loops or recursions (often even loops inside of loops).
  • This can lead to the situation where the number of steps N is not a fixed number but a (possibly complicated) function of the size n of your input data: N = f(n).
  • E.g., if you have two loops inside of each other, for each of which you iterate over all n inputs, then N = n * n, because we're doing n iterations in the outer loop and for every one of those, we do n iterations in the inner loop.
  • In this case, we can say the time complexity of our program is O(n²).

The simple consequence is that we should avoid situations where the number of combinations grows this quickly, such as nested loops over large inputs.
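The nested-loop case can be made tangible with a little arithmetic and a step counter. Assuming, for illustration, that one step takes x = 1 µs, n = 10,000 inputs give N = n² = 100,000,000 steps, i.e., about 100 seconds, and doubling n quadruples the running time. The sketch below just counts how often the inner loop body runs:

```python
def count_steps_nested(inputs):
    """Two nested loops over the same n inputs: the inner body runs n * n times."""
    steps = 0
    for a in inputs:
        for b in inputs:
            steps += 1  # stands for one unit of work, e.g., comparing a with b
    return steps

for n in (10, 100, 1000):
    print(n, count_steps_nested(range(n)))
# prints: 10 100, then 100 10000, then 1000 1000000
```

Ten times more input means a hundred times more steps, which is exactly what O(n²) describes.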

Fortunately, there is a common way to solve this: caching. Caching means remembering the results of things you already did so you don't have to do them again.

Our goal is to find the things we do repeatedly, extract them, do them only once at the beginning, and reuse the result later. In the simplest case, this means creating a variable to store the reusable result. In more complex cases, you can use a data structure that lets you collect and retrieve your data efficiently. Quite often, a hash map or Dictionary does the job, but depending on what you need, other structures can of course be better.
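As an example of caching with a dictionary, consider assigning documents to equipment by tag. The first function searches the whole equipment list again for every document; the second builds a dictionary once and reuses it for every lookup. The record layout and function names are hypothetical, and the sketch assumes that tags are unique.

```python
def assign_nested(documents, equipment):
    """O(n * m): searches the whole equipment list again for every document."""
    result = {}
    for doc in documents:
        for eq in equipment:
            if eq["tag"] == doc["equipment_tag"]:
                result[doc["name"]] = eq["name"]
                break
    return result

def assign_cached(documents, equipment):
    """O(n + m): builds a tag -> equipment dictionary once, then reuses it."""
    by_tag = {eq["tag"]: eq for eq in equipment}  # done once up front (assumes unique tags)
    result = {}
    for doc in documents:
        eq = by_tag.get(doc["equipment_tag"])  # constant-time lookup on average
        if eq is not None:
            result[doc["name"]] = eq["name"]
    return result

equipment = [{"tag": f"P-{i}", "name": f"Pump {i}"} for i in range(500)]
documents = [{"name": f"doc{i}", "equipment_tag": f"P-{i % 600}"} for i in range(1000)]
print(len(assign_cached(documents, equipment)))
```

With n documents and m equipment objects, the first variant needs up to n · m comparisons, the second roughly n + m steps, while both return the same assignment.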

Partition the problem

As an engineer, one wants to provide the best solution for a user. Often, this means as few clicks and navigation steps as possible, and all the required information on one page. However, this can be very expensive, because we have to aggregate so much data in one place. Also, there might be an even more user-friendly approach, because sometimes, too much at once isn't the best solution.

Here's a made-up, very abstract and admittedly silly example: Consider the requirement that a user wants to choose a pair of shoes to wear, with a showcase video for every shoe. Let's imagine the user has a huge number of shoes, like, thousands. Showing them all at once might be computationally expensive, and it would also be a bit overwhelming. Instead, maybe we can group the shoes into the (overlapping) categories indoor/outdoor, elegant/functional, and color. The user has to perform a few additional navigation steps, but on the other hand, they have to make that choice anyway. We even help them select a pair of shoes by leading them through the right choices. As a nice side effect, the result consists of far fewer shoes, so it's computationally cheaper to load all the videos. Ideally, the filter parameters can be inferred without the user entering them explicitly, e.g., by looking at the weather and the user's calendar (sunny weather, hiking trip: probably not the rain boots).

Anyway, in some cases the use-case can be rearranged so the amount of data and information presented to the user at one point in time is smaller.
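The shoe example translates into a simple pattern: index the items by category once, then load details (here: videos) only for the candidates that match the chosen or inferred categories. Everything in this sketch, including the tags and the `candidates` helper, is hypothetical.

```python
from collections import defaultdict

# Hypothetical shoe records; the tags follow the categories from the example.
shoes = [
    {"name": "Rain boots", "tags": {"outdoor", "functional", "rainproof"}},
    {"name": "Hiking boots", "tags": {"outdoor", "functional"}},
    {"name": "Oxfords", "tags": {"indoor", "elegant", "black"}},
    {"name": "Slippers", "tags": {"indoor", "functional"}},
]

def build_index(items):
    """Group items by tag once; an item appears in every group it has a tag for."""
    index = defaultdict(list)
    for item in items:
        for tag in item["tags"]:
            index[tag].append(item)
    return index

def candidates(index, chosen_tags, excluded_tags=()):
    """Return the items that have all chosen tags and none of the excluded ones."""
    if not chosen_tags:
        raise ValueError("choose at least one category")
    chosen, excluded = set(chosen_tags), set(excluded_tags)
    # Start from the smallest matching group to keep the scan short.
    smallest = min((index[tag] for tag in chosen), key=len)
    return [item for item in smallest
            if chosen <= item["tags"] and not (excluded & item["tags"])]

index = build_index(shoes)
# Categories inferred rather than asked for: "hiking trip" and "sunny weather".
picked = candidates(index, {"outdoor", "functional"}, excluded_tags={"rainproof"})
print([shoe["name"] for shoe in picked])  # only these videos need to be loaded
```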

Solving: Crashes

As explained in the hypothesizing section, crashes usually happen because of an unhandled exception being thrown by some module.

The basic approach to solve crashes consists of two measures:

  • Avoid the crash by checking for the problematic circumstance provoking the crash (e.g., check for a null-reference)
  • Find out where the problem originally comes from. For this reason, also log all relevant details when the above check prevents a crash.

The second point implies that the situation leading to the crash is not the real problem. The real problem is either that the situation shouldn't occur in the first place or that the program cannot deal with that case; maybe it's a buggy dependency or erroneous input data.
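Both measures can be combined in one place: check for the problematic circumstance, and log enough context to trace it back to its origin. Here is a minimal sketch in Python, where indexing into None raises a TypeError, roughly the counterpart of C#'s NullReferenceException; `equipment_name`, the logger name and the record fields are hypothetical.

```python
import logging

log = logging.getLogger("custom_code")  # hypothetical logger name

def equipment_name(document):
    """Hypothetical helper returning the name of the equipment a document belongs to."""
    equipment = document.get("equipment")
    if equipment is None:
        # Measure 1: avoid the crash. Without this check, equipment["name"] would
        # raise TypeError ('NoneType' object is not subscriptable).
        # Measure 2: log enough context to find out where the missing reference comes from.
        log.warning("Document %r (id=%r) has no equipment; check the import or mapping.",
                    document.get("name"), document.get("id"))
        return "<no equipment>"
    return equipment["name"]

print(equipment_name({"id": 1, "name": "P-101.pdf", "equipment": {"name": "Pump P-101"}}))
print(equipment_name({"id": 2, "name": "P-102.pdf", "equipment": None}))
```

The fallback keeps the app running, but the warning is what leads you to the actual problem, such as a faulty import mapping.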

Solving: Faulty data

For faulty data, we have to find out where it comes from and solve the problem at its source (or as close to it as possible). Nearly always, it's much harder to deal with erroneous data when processing it than at the point where it originates, because it is much harder to infer the correct data from erroneous data than to prevent the data from becoming wrong in the first place. Consider the following fictional example: In a project, we import a lot of text, but somehow during the import, all line breaks get lost. Afterwards, it's hard to find out where the line breaks should be. In that case, it's better to fix the corrupting element and repeat the import.

The rule of thumb here is: Don't try to cope with the faulty data when processing or showing it. Instead, fix the problem at the source and repair the data by reimporting.
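Fixing the problem at the source often means validating data during the import, before it is written, and aborting instead of importing corrupted records. The following sketch picks up the line-break example; the field names `DOC_NAME` and `DESCR`, the mapping functions and the validation rules are hypothetical.

```python
def map_record(source):
    """Hypothetical mapping from an external record to target fields."""
    return {"name": source["DOC_NAME"], "description": source["DESCR"]}

def buggy_map(source):
    """A faulty mapping that silently loses line breaks."""
    return {"name": source["DOC_NAME"], "description": " ".join(source["DESCR"].splitlines())}

def validate(source, mapped):
    """Return a list of problems; these checks are examples, not a complete rule set."""
    problems = []
    if not mapped["name"]:
        problems.append("empty name")
    if source["DESCR"].count("\n") != mapped["description"].count("\n"):
        problems.append("line breaks lost in description")
    return problems

def run_import(sources, map_fn=map_record):
    """Map and validate all records first; abort the whole import if any record is faulty."""
    mapped_records, errors = [], {}
    for src in sources:
        mapped = map_fn(src)
        problems = validate(src, mapped)
        if problems:
            errors[src["DOC_NAME"]] = problems
        mapped_records.append(mapped)
    if errors:
        raise ValueError(f"Import aborted, faulty records: {errors}")
    return mapped_records  # only now would the records be written to the target system

sources = [{"DOC_NAME": "Manual.pdf", "DESCR": "Line 1\nLine 2"}]
imported = run_import(sources)
try:
    run_import(sources, map_fn=buggy_map)
except ValueError as exc:
    print(exc)
```

With the correct mapping, the import passes; with `buggy_map`, it is aborted before any corrupted text reaches the system, so there is nothing to repair afterwards.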

Solving: Other misbehavior

Maybe the issue is a simple typo or wrong setting and you can fix the problem with a simple measure. Since you're reading this, the solution might not be so simple and we have to approach it conceptually.

This depends a lot on the specific problem, but most misbehavior arises because the implementation currently in place is not well-designed architecturally.

Interestingly enough, this in turn is often caused by a poor functional design: if the behavior hasn't been defined clearly, a clean technical solution can't be defined either. The rule of thumb here is to re-evaluate what the target behavior should be (as opposed to what the designated behavior is currently believed to be).

If the designated behavior is clear but the implementation still misbehaves, the implementation probably has to be redesigned for clarity and architectural correctness.


See also