Yes, all my fellow “geeks” out there in the world probably expected the phrase “the power of the emperor,” or perhaps, “the power of the dark side.” But, stick with me here, and something I learned this week about a great text editor that supports regular expressions might help you in the future.
This past week at work, I found myself dealing with a data issue for our Miami, FL operation. Basically, we’ve got some data catch-up and cleanup efforts happening for our submissions to one of our contracting agencies. We upload text files created by an export from our practice management software. But, due to a change in our relationship with the agency (we’re now a subcontractor of an agency that now stands between us and the final recipient of the data), we’ve had to change the way we submit our data. Certain assessments and outcome data that we used to submit directly to the final recipient of the data through their web portal we now must submit via these FTP text files. So, we had to work with our third-party software development company, Intrepid Services, LLC, to build one new export and update 3 others. The problem was, there is a fourth export we didn’t touch…and we needed it this week.
Basically, the export builds a simple text file in order to update the Social Security Number (SSN) of one of our clients in the contracting agency’s database. If we incorrectly keyed an SSN, we use this export to replace that incorrect SSN with the correct one. Each line item has four pieces of information: A Contractor ID , the old SSN, the new SSN, and the Provider ID for the agency serving the client. In the past, we would have put the same ID in for Contractor ID and Provider ID because we were a direct contractor with the contracting agency. Now, we have to put our ID as the Provider ID since we’re a subcontractor, and the ID of the agency for which we subcontract in as the Contractor ID.
So, I was in a jam. I had to get data uploaded on Friday for us to stay anywhere close to on schedule, and my developer couldn’t build anything fast enough to make a difference. So, I pulled out my trusty text editor, created an SSN update file from our existing export, and was ready to do a find and replace to correct all the incorrect Contractor ID values in my gigantic text file.
But, OH NO! Each line has our ID number TWICE; once INCORRECTLY for the Contractor ID, and once CORRECTLY for the Provider ID. A find and replace would replace BOTH instances of our ID. What to do!?
Then, I noticed something. I use a text editor called TextPad, made by Helios Software Solutions. This editor does a LOT of great things, such as allow you to display line numbers, and it will highlight code for PHP and other languages in such a way that you can tell if you’ve properly terminated strings and ended lines, etc. You can buy a single user license of this editor for $27, and I’d say it’s a bargain at that price. What I noticed was a checkbox in the “Replace” wizard of TextPad with a label that reads, “Regular expression.” In the course of learning to program in PHP, I learned a little about using regular expressions. Regular expressions are “a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters (Wikipedia).” I was thrilled.
To make a long story short (I know, too late), I was able to search for ONLY the FIRST instance of the string I wanted to replace on each line by using the following in the “Find what” box of the Find/Replace wizard:
^[0-9-]\{10\}
Let me break down what that means in regular expression (also called “regex” for short) terms:
^: matches the start of the string the regex is applied to.
[0-9-]: Anything in between square brackets ([ ]) defines a character class. Here, I’m saying a number from 0 to 9 or a hyphen.
\: The escape character. This is a little strange. Normally, the escape character suppresses the behavior of a special character in regex. But, because of the way TextPad is implementing regex, I had to escape the curly brackets ({ }).
{10}: Forget the escape character for now. In regex terms, {x} means the preceding defined item repeats x number of times. So, Here I’m saying that any number from 0 to 9 or a hyphen repeated 10 times*.
* These notes on regular expressions borrow from text at http://www.regular-expressions.info/reference.html.
So, ^[0-9-]\{10\} all together means, “Match starting at the beginning of the string any 10 characters so long as the characters are 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or –. Since our ID is in the format of a business tax ID (e.g., 99-9999999) that’s what I needed to match.
In the “Replace with” field of the text editor, I just put the ID of the agency for which we are now a subcontractor. Then, I clicked “Replace All.”
TextPad treated each line as a separate string. It replaced the first instance of our ID in each line (which, you’re probably inferred, is at the very beginning of each line), but left the second alone.
Then, in the course of the same project, I realized that I needed to query our database to get some data about a large set of our clients. The agency had sent us a report with the client IDs for our practice management sofware included. So, I pasted the list of client IDs into TextPad. Those client IDs take the form of a letter, followed by 5 digits. I had something that looked like this:
F00001
F98765
F54637
F65432
And so on…but a lot longer. To query our database, I needed to get this into a SQL query looking more like
(
‘F00001’,
‘F98765’,
‘F54637’,
‘F65432’
)
But, with this regex feature, that was NO PROBLEM. I used ^ to match the beginning of the string, and replaced it with ‘. That didn’t replace the first character, but instead shoved the ‘ in front of the existing string. Then I used $ to match the end of the string and replaced it with ‘,. The $ sign is the regex symbol for matching the end of a string.
That got my list so that each line looked something like: ‘F00001’,
All I had to do was go to the last line, and delete the comma. Then I had strings encased in single quotes, with each line followed by a comma except the last, which I pasted into Microsoft Query Analyzer on our database server to query for the client data I needed.
I could have spent time carefully cutting and pasting to fix this problem, but it would have taken a long time and I might have made mistakes selecting characters to paste over. I probably could have gotten creative with Excel and used the data import features of that software to separate the lines of my text file into fixed width chunks ported to separate columns that I then could have updated. But, regex in a text editor was the solution I stumbled upon that made the most sense.
If you find yourself in a similar spot, give it a shot. If you aren’t familiar with regex at all, start with a website like http://www.regular-expressions.info/reference.html and learn some of the basics. You’ll find yourself using this pattern-matching method to solve the most confounding problems with data. I know I have over the years!