Thursday, December 25, 2008

Performance issues in querying XML in SQL Server

One of the challenges in the Canadian broadcasting industry is to prove to the CRTC that what a broadcaster aired on television is actually what they wanted to air, and if a discrepancy exists, have the ability to explain it. To do so, broadcast trafficking systems like S4M and SDS generate playlists and asrun logs, which must be reconciled per CRTC regulations. I was tasked with developing a simple application to reconcile these playlists and asrun logs with each other. Since the playlists were provided as XML files, I stored the playlists in a SQL Server XML column so when provided with an asrun log, I could always go back to the database and find the associated playlist easily with a SQL query. Playlists were linked to asrun log files via an event id, which was an attribute in the XML playlist file but a field on a flat text asrun log. This meant that every time I needed to verify the event record in the asrun log had a matching playlist event, I would have to parse out the event id from a flat text row, then query the XML column for an event XML node with an eventid attribute with a matching value.

That was the business problem I was attempting to address. To translate it into SQL syntax, I had to execute the following query every time I wanted to reconcile an event record:

SELECT Playlist.query('for $A in /dlprog/dldate/dlevent
where $A/@eventid[.="79576616"]
$A '
FROM [Corporate_Applications].[dbo].[BIP_Playlists]

(Many thanks to Yihong Xu, our trustworthy data architect, in providing this query).

The Performance Issue:
If one examines the above SELECT statement query, it will become obvious that the records are not filtered with a WHERE clause. Instead, all the records are subjected to a XPath query in the SELECT clause. So when I innocently applied this query in my code, my reconciliation of the asrun log against the playlist took approximately 14 minutes.

The Solution:
Performance issues are normally rooted in architecture, so I revisited my application and in particular, my data architecture. Obviously, the first thing I focused on was reducing the time taken for the query. Since all the querying was done in the XPath expression in the SELECT statement, I had to pull that logic out and put it into a WHERE clause. However, the existing data architecture would not allow me.

Therefore, the only choice I had was to create another table to hold the eventid information I would be querying against in its own column, as opposed to buried in the XML data in the XML column, as well as the related values I would need to SELECT from the dlevent node. I then populated this table as I read the playlist files, as at that point I was inserting the entire playlist file as a XML field into the SQL Server table anyway. Next, I adjusted the above SELECT statement to query against the eventid column of my new table, and the resulting performance improvement was staggering. The resulting reconciliations now took roughly 1 minute, meaning a performance improvement of about 14 times.

Lesson learned
XPath queries embedded in SELECT clauses are very inefficient. If there is an attribute in a piece of XML data you will constantly be querying, create a separate table that holds that attribute value as a column, along with foreign keys to link back to the original table with the XML column.

Looking back on the problem, I would have done a deeper analysis of the data architecture to ensure that the table structures were optimized for SELECT queries, as opposed to just aiming for simplicity. I would also have used a different content management system to archive the playlists, such as Documentum, since it more naturally supports management and chunking of XML data, as well as indexing the XML data for fast retrieval.