5 Tip for Troubleshooting Data Wrangling Problems

I’ve never found a great way to explain what I do. Analogies help right? The first analogous thing that pops into my head is a Rubix cube. My day to day work feels like trying to solve a Rubix cube over and over, and the cube is a big pile of dirty, messy data. Thus, much of my day to day is troubleshooting data wrangling problems.

Some days, it’s incredibly satisfying, as solving a Rubix cube is. Other days, I want to rip the stickers off and throw it out the window. When I drafted this post, I was in “rip the stickers off” mode. I was frustrated with how long it took to solve data wrangling problems. When I get in that mindset, I want to share what I learn so other people don’t suffer as much. Thus, this post discusses my top 5 ways to approach troubleshooting data problems. These methods will apply no matter what application you are working in or if the problem is related to data wrangling or logic problems. Read on for details.

My 99 Problems

As some readers know, I’m now using Alteryx workflows for many data wrangling tasks. When I drafted this post, I was troubleshooting issues originating in Alteryx using a combination of Alteryx tools and Spotfire visualizations. More specifically, I was pulling data out of WellView for our Completions team and attempting to join up the following data sets.

  • Well header data.
  • Completion Job — Pull all “initial completion” job information.
  • Daily Operations Reports — Pull all the daily report data for well’s “initial completion” job.
  • Time Logs — Pull the time log data for each report. There will be multiple time log entries per report.
  • Phases — Pull what phase a well was in when the report happened but also show the report and time log data if a well was not in a particular phase.
  • Rigs — Know what rig was on the well at the time of the report. Make sure to get the right rig if more than one rig has worked the well.
  • Pull in the well grouping information too.

Yeah, it was a bit complicated. I built the workflow but was having issues getting each and every record and eliminating duplicates. For those who are Alteryx users and appreciate a large workflow, here’s what my end result looks like. The yellowish containers are QAQC containers, most of which are actually minimized in this screenshot.

Before I get into how to troubleshoot, I want to touch on troubleshooting efficiency.

Efficient Troubleshooting

In last week’s post, I talked about how time is our most important asset and waiting destroys innovation. Thus, it makes sense to talk about efficient troubleshooting. There are steps you can take to make sure feedback is received as quickly as possible.

  1. List or define test cases or use cases.
  2. List or define the wells (or widgets or customers) that meet the test cases.
  3. Create clear column names.
  4. Minimize the feedback loop time.

List Test Cases

The test cases or use cases are all the sticky bits where data could possibly go wrong. These are the things you want to make sure are correct. If not, they turn into gremlins in your data wrangling. Early on in my troubleshooting process, I little problems that I fixed as I went along, but because I didn’t list out my test cases, I fixed one thing and broke something else. That’s what you want to avoid.

Here are some of my test cases, for example:

  • I want to make sure that if a well has a report outside of a phase, I still grab that report.
    • Ex. A well is in the PreFrac phase from 1/4/19 to 1/10/19 and the Frac phase from 1/15/19 to 1/30/19. I want to make sure I also bring in all the reports that happen between 1/10 and 1/15. If I modify a piece of logic, I need to make sure these records still come in.
  • If wells have more than one rig, I need to make sure I grab the rig that was on the well at the time of the report. When a well has 2 rigs, records get duplicated, so I have to make sure I am removing the right records.
  • If a well’s phase doesn’t have an end date, I need to make sure I don’t lose that record. If I modify a piece of logic, I need to make sure I don’t lose these records.
  • If a well’s phase occurs after all reports are filed, I need to make sure I still have a record showing the phase.
    • Ex. A well stops daily reports on 1/31/19. The Flowback phase starts on 2/4/19.

List Content to Test

With the test cases defined, I can now figure out which wells meet these criteria. I will check these wells and their records at each step in the workflow and anytime I make logic changes.

Create Clear Column Names

A few years ago, after a particularly complicated Spotfire project, I wrote a post entitled — There Are Only Two Hard Things in Computer Science. I wrote it because my biggest regret in this project was not creating a column naming convention and sticking to it. Alteryx makes this really, really easy and there is no good reason not to create and consistently use a column naming convention.

As it relates to troubleshooting, clear column names will make understanding where data is coming from much easier. Rather than having 4 columns named “start date”, I have — “rig start date”, “report start date”, “phase start date”, and “time log start date”. Thus, I know exactly what I am looking at. It’s easy to waste time trying to figure out your data. Make it easy on yourself.

Minimize the Feedback Loop Time

Time is our greatest asset. Therefore, we should attempt to save time whenever possible.

In Alteryx I made sure to…

  • Use the “Cache and Run Workflow option” so the workflow isn’t querying the data source every run.
  • Compartmentalize the workflow with containers.
  • With my data in containers, I could close any unnecessary containers or close any containers to the right of where I was troubleshooting.
  • Place QAQC containers at test points in the workflow to filter down to my test wells.
Here I have a QAQC container attached to a step to remove duplicate records. I have filtered to specific wells and count records at this point.

Now, the reason you came to this post….methods for troubleshooting.

Methods for Troubleshooting

These are the methods I have used over and over to solve logic, calculation, and data wrangling problems regardless of the platform I am working on.

1. Explain it to Someone

“Verbally explaining a concept really does help you to better grasp it…That’s because we all have an intuitive sense of what makes a thorough explanation, but we often neglect to generate one for ourselves. The query of an outsider forces us to replace our false feeling of understanding with actual reasoning.”

Live Science

2. Partner Up

Partnering up is a little bit different than explaining it to someone. This method for troubleshooting actually involves collaboration on a problem. By partnering up, you can bring a new and fresh perspective to the table or that person can help you think differently to solve the issue. When I’m just stuck on something, there are a handful of people I can call on to work with. Calling those people used to feel like a failure, but this method works. Now, I realize it’s just more efficient.

3. Draw a Picture

Sometimes, when working thru architecture problems, I like to draw out what exactly is happening. In the example below, I used PPT to illustrate how data moves from one table to another. This helps me think in architecture blocks rather than rows and columns. It’s also handy documentation later on.

4. Put It Into Words

When struggling with logic problems, (case statements and IF logic), I write out the logical paths in words rather than code or a calculation. This helps me focus on capturing all the possible paths rather than correct syntax. We don’t think in code (at least most of us don’t), so when it’s complicated, work with words that come more naturally.

5. Make It Smaller

For complex calculations, it really helps to break things down into their smallest parts. One calculation or calculated column for each piece of the puzzle. I do this in Spotfire frequently to make sure each piece is providing the desired result. Here is an example of a complex calculation that I would break into at least 3 calculated columns.

sum(If([c.rank]=Min([c.rank]) over ([job idwell]),If([job end date] is not null,1,0),0)) over ([master pad name])

Here’s how it could be broken into smaller parts.

sum(

If([c.rank]=Min([c.rank]) over ([job idwell]),

If([job end date] is not null,1,0),0))

over ([master pad name])

Furthermore, consider leaving those broken down parts in the project. Hide them if you can. They may be helpful for future QAQC even if your actual work is connected to the “complicated” version.

Before wrapping up, I want to say a word about preventing rework.

Preventing Rework

It’s not uncommon to work thru something complex, finish it, celebrate, and move on. However, maybe it happens that your logic wasn’t as flawless as first thought, and it needs to be revisited. This trip down memory lane can be costly in terms of time. But, if you document the problems solved and how you got there, it’ll be easy to pick up. You’ll know what not to try again and how everything fits together.

Lastly, these types of references are just good documentation. I’ve built several major projects that I handover to users. I don’t work with them day-to-day, and so I forget how they are built. When the user asks a question 6 months later or wants something modified, these references are solid GOLD.

I hope you found these tips useful and feel better prepared when troubleshooting data wrangling problems. Please share this link and/or comment and share your own tips for troubleshooting.

If You Liked This

If you enjoyed this content, take a look at my series on Documenting Spotfire projects. Here is a link to the first post in the series.

1 thought on “5 Tip for Troubleshooting Data Wrangling Problems”

  1. Pingback: Exploring the Spotfire API with IronPython » The Analytics Corner

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.