Many of my automation projects revolve around getting data out of PDFs. I’m working on a large PDF processing project right now that will generate several how-to posts. I think this will be a fun way to show the details of a large automation project. In the first post, I’ll focus on how to bring a column of keywords into PAD from Excel, transform that data from a table into a list, and remove nulls. This lays the groundwork to search PDFs for keywords using OCR, which I’ll dive into in the next blog. Check out this post for the details.
Use Case for Searching PDFs
To begin, let me explain my project. I’m working with our Tax & Accounting group to make sure we properly classify costs. We have a transactional finance system and processing on every line item detail in every invoice. But, if vendors don’t put detail in the right place on an invoice, it can be difficult.
Without automation, analysts have to open hundreds of invoices and manually look for keywords. Clearly, that’s terrible.
Now, this is a work in progress. So, there might be additional automation steps to download or grab PDFs from a different place. For now, I start with a folder of PDFs. Power Automate Desktop will perform the following tasks.
- Pull in a list of keywords from Excel
- Open PDFs one at a time
- Use OCR to convert each page to an image and search the image for text
- Loop thru the text from each image to identify keywords
- Write the results of the search in Excel
This blog post will focus on the keyword list in terms of where it originates, how I get it into PAD, and how it works.
First, I created a simple Excel file with one column for tax analysts to enter the keywords they want to search for. Now, I can provide as many rows as they want. And, they don’t have to use all of them. But, I must have a defined number of rows going into PAD. Since this is still a work in progress, I provided 10 spaces.
High-Level Steps for Keyword List
The automation will perform the following steps to create the keyword list in PAD.
- Create a blank list in PAD.
- Open Excel.
- Get the first free row in Excel to know how many keywords have been populated.
- Set a variable to hold the keyword count.
- Read keywords from Excel into PAD.
- Close Excel.
- Transfer the keywords from a data table into the list.
- Remove nulls from the list with a loop using the keyword count.
Here’s what these steps look like in PAD.
Also, I use a few input variables, one of which is called Available_Keywords. This is the maximum number of keywords that can be processed. As mentioned, I provided 10 cells in Excel for a maximum of 10 keywords. I’ll explain the input variables as I use them.
Create a List
Now, the first step creates a list called Keyword_List to hold the keywords from Excel.
Work With Excel
Next, I go thru 5 steps to get the keywords from Excel into PAD. Step one launches Excel using a file path stored in a variable named Keyword_List_Location.
Next, I ask PAD to find the first free row in column A. The result is stored in the variable First_Free_Row, and it will be a number. So, if there are 5 keywords in my 10 available cells, then First_Free_Row will be the number 6.
Now, I know how many keywords exist, so I can set a new variable called Keyword_Count to hold the number of keywords. I calculate it by subtracting one from the First_Free_Row.
Then, I tell PAD to read from Excel, starting at A1 and ending at A10 using a reference to my Available_Keywords variable.
When PAD reads data from Excel, it creates a data table variable called Keyword_Data to store the data. I know it is a data table because I can inspect the flow variable and see the type of object or variable created. But, I don’t want a table. I want a list. We’ll fix this.
With the data in PAD, I can close Excel without saving the file.
Convert Data Table Into List
Now, I have the data in PAD. But, it’s stored in a data table. I need it in a list because I want to loop thru all the values in a list. Looping thru a data table is much harder. So, I use this step to convert the data table column into a list called Keyword_List.
Next, I loop around the Remove item from list action to remove nulls from the list. And, I’ll admit that I tried several other methods and actions to remove the nulls. This is the only thing that worked. I wish I had documented some of the things I tried that failed, but I didn’t.
So, my loop condition looks at the count of items in the Keyword_List. That number will start at 10 because I provided 10 keyword placeholders. The count of items will decrease as items are removed via the loop. Once the count of items becomes greater than the Keyword_Count of 5, the loop stops.
For clarity, you can see the data table brought into PAD from Excel with nulls.
And this is the result after the loop and remove steps. In theory, I could have left the nulls in the list, but that would needlessly increase my processing time.
What Did We Learn?
So, to wrap up…This post showed you how to….
- Move data from Excel to Power Automate Desktop
- Transform the structure from a data table to a list
- Remove nulls from that list
In the next post, I’ll show you how to use OCR to get text out of PDFs and search PDFs for specific keywords.