Tutorial - Applying troubleshooting strategies

IMPORTANT: Tutorials are intended to give you hands-on experience working with a limited set of DC/OS features with no implied or explicit warranty of any kind. None of the information provided--including sample scripts, commands, or applications--is officially supported by Mesosphere. You should not use this information in a production environment without independent testing and validation.

General Strategy: Debugging Application Deployment on DC/OS

Now that we have defined a set of tools for debugging applications on DC/OS, let us consider a general, step-by-step strategy for applying those tools in an application debugging scenario. Once we have gone over this general strategy, we will work through a few concrete scenarios in the practice section.

Unless something specific to your scenario points elsewhere, a reasonable approach to debugging an application deployment issue is to apply the debugging tools in the following order:

Step 1: Check the GUIs

Start by checking the status of the task in the DC/OS GUI or with the CLI. If the task has an associated health check, also check the task’s health status.

If needed, also check the Mesos GUI or the Exhibitor/ZooKeeper GUI for further debugging information.
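As a minimal sketch of this status check, the Python helper below flags tasks that are not running or that report a failed health check. It assumes the task list is available as JSON (for example from the CLI's JSON output, which depends on your CLI version) and that each record carries the Mesos-style fields name, state, and statuses with an optional healthy flag. The function name and the sample records are hypothetical.

```python
import json


def flag_suspect_tasks(tasks):
    """Return (name, state, healthy) for tasks that are not running or report unhealthy."""
    suspects = []
    for task in tasks:
        state = task.get("state", "UNKNOWN")
        # Assumption: statuses are ordered oldest to newest, so the last
        # entry holds the most recent health-check result (if any).
        statuses = task.get("statuses") or []
        healthy = statuses[-1].get("healthy") if statuses else None
        if state != "TASK_RUNNING" or healthy is False:
            suspects.append((task.get("name"), state, healthy))
    return suspects


# Hypothetical sample data, not output captured from a real cluster.
SAMPLE_TASKS = json.loads("""
[
  {"name": "web", "state": "TASK_RUNNING",
   "statuses": [{"state": "TASK_RUNNING", "healthy": true}]},
  {"name": "api", "state": "TASK_RUNNING",
   "statuses": [{"state": "TASK_RUNNING", "healthy": false}]},
  {"name": "batch", "state": "TASK_FAILED",
   "statuses": [{"state": "TASK_FAILED"}]}
]
""")

if __name__ == "__main__":
    for name, state, healthy in flag_suspect_tasks(SAMPLE_TASKS):
        print(name, state, healthy)
```

A healthy value of None here means that no health-check result was reported, which is different from a failed check.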

Step 2: Check the Task Logs

If the GUIs do not provide sufficient information, check the task logs next, using the DC/OS GUI or the CLI. The logs give a better understanding of what might have happened to the application. If the issue is that the app is not deploying (for example, the task stays in a waiting state indefinitely), look at the ‘Debug’ page, which can help you understand which resources Mesos is offering.
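When a task stays in a waiting state, the usual question is whether any offer is large enough for the app. The sketch below is a deliberately simplified model of that comparison: it checks only the scalar resources cpus, mem, and disk, and ignores roles, ports, and placement constraints, which a real scheduler also evaluates. The function name and the sample numbers are hypothetical.

```python
def unmet_resources(required, offered):
    """Return {resource: (required, offered)} for each resource the offer falls short on."""
    shortfalls = {}
    for name, needed in required.items():
        # A resource missing from the offer is treated as zero available.
        available = offered.get(name, 0)
        if available < needed:
            shortfalls[name] = (needed, available)
    return shortfalls


# Hypothetical numbers: what the app asks for versus one offer.
app_needs = {"cpus": 2.0, "mem": 4096, "disk": 0}
offer = {"cpus": 1.5, "mem": 8192, "disk": 10240}

if __name__ == "__main__":
    print(unmet_resources(app_needs, offer))  # {'cpus': (2.0, 1.5)}
```

If every offer falls short on the same resource, the app definition is probably asking for more than any single agent can provide.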

Step 3: Check the Scheduler Logs

If there is a deployment problem and the task logs do not provide enough information to fix it, double-check the app definition. After confirming the app definition, check the Marathon log or GUI to understand how the app was scheduled, or why it was not.
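Double-checking the app definition can be partly mechanical. The following sketch runs a few sanity checks on a Marathon-style app definition loaded as a Python dictionary. These are heuristics chosen for illustration, not Marathon's own validation rules. The function name and sample definitions are hypothetical, while the field names id, cmd, args, container, cpus, mem, and instances are those used in Marathon app definitions.

```python
def check_app_definition(app):
    """Return a list of human-readable problems found in an app definition dict."""
    problems = []
    if not app.get("id"):
        problems.append("missing 'id'")
    # An app needs something to run: a command, arguments, or a container.
    if not (app.get("cmd") or app.get("args") or app.get("container")):
        problems.append("needs one of 'cmd', 'args', or 'container'")
    for field in ("cpus", "mem"):
        value = app.get(field)
        # Only flag values that are present; omitted fields fall back to defaults.
        if value is not None and value <= 0:
            problems.append(f"'{field}' is zero or negative")
    if app.get("instances", 1) < 0:
        problems.append("'instances' is negative")
    return problems


# Hypothetical app definitions for illustration.
good_app = {"id": "/web", "cmd": "python3 -m http.server 8080",
            "cpus": 0.5, "mem": 128, "instances": 1}
bad_app = {"cpus": 0, "mem": 64}

if __name__ == "__main__":
    print(check_app_definition(good_app))
    print(check_app_definition(bad_app))
```

An empty list only means that these basic checks passed. The Marathon log remains the authority on why an app was or was not scheduled.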

Step 4: Check the Agent Logs

The Mesos Agent logs provide information about how the task and its environment are being started. Recall that increasing the log level can, in some cases, yield more information to work with.
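Agent logs can be long, so it often helps to narrow them to one task. The sketch below assumes the log lines carry the usual glog prefix (severity letter, date, time, thread ID, source file and line) and keeps only warnings and errors that mention a given task ID. If your log collector adds its own prefix, the pattern needs adjusting. The function name, task IDs, and sample lines are hypothetical and are not real Mesos output.

```python
import re

# glog-style prefix: severity letter + mmdd, time, thread id, file:line] message
GLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+) ([^:\s]+):(\d+)\] (.*)$"
)


def agent_lines_for_task(log_text, task_id, severities="WEF"):
    """Return (severity, time, message) for matching lines that mention task_id."""
    hits = []
    for line in log_text.splitlines():
        match = GLOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not carry the expected prefix
        severity, timestamp, message = match.group(1), match.group(3), match.group(7)
        if severity in severities and task_id in message:
            hits.append((severity, timestamp, message))
    return hits


# Hypothetical log excerpt for illustration only.
SAMPLE_LOG = """\
I0522 10:02:11.123456  4321 slave.cpp:1701] Launching task 'web.abc123'
W0522 10:02:12.000001  4321 slave.cpp:2200] Ignoring update for task 'web.abc123'
E0522 10:02:13.500000  4321 slave.cpp:3456] Container for task 'web.abc123' failed to start
E0522 10:02:14.000000  4321 slave.cpp:3456] Unrelated error for task 'other.def456'
"""

if __name__ == "__main__":
    for hit in agent_lines_for_task(SAMPLE_LOG, "web.abc123"):
        print(hit)
```

Passing severities="IWEF" includes informational lines as well, which is useful after raising the log level.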

Step 5: Test the Task Interactively

The next step is to look interactively at the task running inside the container. If the task is still running, dcos task exec or docker exec can start an interactive debugging session. If the application is based on a Docker container image, starting it manually with docker run and then attaching with docker exec can also point you in the right direction.
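The commands named in this step can be assembled as argument lists, which is handy when scripting a debugging session. The sketch below only builds the command lines and does not run them. It assumes the --interactive and --tty options of dcos task exec and the standard docker exec and docker run options, so verify them against your installed versions. The helper names, task ID, container name, and image name are hypothetical.

```python
import shlex


def dcos_task_exec_argv(task_id, command=("/bin/sh",)):
    """Command line for an interactive session inside a running DC/OS task."""
    return ["dcos", "task", "exec", "--interactive", "--tty", task_id, *command]


def docker_exec_argv(container, command=("/bin/sh",)):
    """Command line for an interactive session inside a running Docker container."""
    return ["docker", "exec", "-it", container, *command]


def docker_run_argv(image, name):
    """Command line that starts the image detached under a known name,
    so that docker exec can attach to it afterwards."""
    return ["docker", "run", "--detach", "--name", name, image]


if __name__ == "__main__":
    # Hypothetical identifiers for illustration.
    print(shlex.join(dcos_task_exec_argv("web.abc123")))
    print(shlex.join(docker_run_argv("example/web:latest", "debug-web")))
    print(shlex.join(docker_exec_argv("debug-web")))
```

Each list can be passed to subprocess.run on a machine where the corresponding CLI is installed and configured. The default /bin/sh is a guess at what the image ships, so substitute another shell if needed.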

Step 6: Check the Master Logs

If you want to understand why a particular scheduler has received certain resources or a particular status, the master logs can be very helpful. Recall that the master forwards all status updates between the agents and the scheduler, so its logs can help even when the agent node is not reachable (for example, because of a network partition or node failure).

Step 7: Ask the Community

As mentioned above, the community, reachable through the DC/OS Slack or the mailing list, can be very helpful in debugging further.