Analytics Blog

Dsg Subscribe Mail Web

Blogs (Analytics Blog) (Analytics Blog)

ETL and Python Integration

Erik Chen
4 Months Ago

My background is in computer science and I started here at Dunn Solutions roughly back in September. This is my first contribution to the BI blog here and I was excited to pick a topic that relates back to my major. Since then, I’ve had a chance to work with a variety of things and have noticed that if you work in BI, then you’ve likely worked with SAP Data Services. It is a visually driven program with drag-n-drop interface, which facilitates quick building of solutions without having to dive into writing SQL to accomplish what you need to be done. While such interfaces may be simple and powerful enough to get the job done, it isn’t always easy coaxing a program to do exactly what you want it to.

Let me give an example that I have recently come across while working with a client. Let’s say that in SAP Data Services you needed to process a large dataset. This data set potentially contains unprintable characters, remnants of HTML tags, or profanities. The data will ultimately get thrown into a report and you wouldn’t want HTML tags, strange symbols, or profanities appearing in it.

In SAP Data Services, there are lots of table transforms and scripting functions available for use. Of special note is the ability to create a user defined transform using Python scripting. However, as a programmer, working with Python from within Data Services is very frustrating. The problem with Python scripting in SAP Data Services is that in order to use data from columns, they must be mapped as inputs in the transform. There is also the lack of support for third party modules. I have not been able to find anything indicating how one could install and use third party modules in the built in python editor.  In a regular python dev environment, it would be as simple as installing it with pip and adding the import line to the top of your file. Compatibility is also limited to Python 2.7, as that is what the Data Services environment supports.

If you do decide to do your Python scripting outside of Data Services, there is a great third party module called pymssql that allows you to connect to a Microsoft SQL Server DB to execute queries. With this and the other built in python modules, it is possible to perform complex processing of your dataset while interacting with the database.

The above is all it takes to connect to a database, execute a query, and view the results using pymssql. The “autocommit” flag indicates that changes to the database will be committed immediately if the query succeeds. The “data” parameter of the execute statement is a tuple of values that matches to a string formatter in the SQL statement. For example, every “%s” in the SQL will map to one element in the “data” tuple. One can also use the “as_dict” parameter in the cursor creation if they wished to access results by name. After execution of the query, results can be obtained by iterating over the cursor.

While it is nice being able to freely manipulate your data, you take a performance hit when large numbers of records must be persisted to the DB. Pymssql does not support caching but it does support cooperative multi tasking. The only difference between using execute() and executemany() is that the latter accepts a list of tuples and will call a single execute for each tuple in the list. In other words, you would have to build your own SQL if you wished to aggregate multiple queries into one. A good tradeoff between performance and versatility is to use your ETL tool of choice in conjunction with an external Python script. The ETL tool will handle major transforms and DB interactions while all the heavy processing is done in your script.

This is but a peek of what is possible with integrating Python into your ETL jobs. I hope you find the Python language as great as I do.