Background
A couple of weeks back, one of my colleagues specializing in Data Platform was working with a customer who uses Azure to support many of their customer-facing applications. They use Azure Redis Cache for one of these applications, and the challenge they were facing was keeping the cache updated with part of the information coming from a legacy mainframe application.
The data was being exported as a CSV file. The challenge was to speed up the initial upload of the large file, containing ~120 million records, into the cache. This process was taking nearly 10 hours with a console application built from sample code found on the Internet, running on their local network.
There were also subsequent updates (smaller files) that were expected to run multiple times a day.
Upon evaluating the code, we determined that the program was writing only one key at a time and had an over-engineered use of threading. They also had a slow outbound network connection to Azure to contend with.
So we decided to address these limitations in a single solution using Azure Functions.
Solution
Using an Azure Function with a Blob trigger, the cache update process starts as soon as a new extracted file is available in Blob storage.
When the Azure Function detects the change, the blob is provided as a file stream, which the function:
- Reads line by line,
- Converts each line into a key-value pair,
- Batches the key-value pairs into sets of a configurable size, and
- Writes each batch to Azure Redis Cache (a sketch of this flow is shown below).
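To make the flow concrete, here is a minimal sketch of the batching idea in Python using the redis-py client and its pipeline feature. The connection settings, the assumed CSV layout (key in the first column, value in the second), and the REDIS_HOST/REDIS_KEY environment variables are illustrative assumptions; the actual implementation on GitHub is an Azure Function and differs in the details.

```python
import csv
import io
import os

import redis  # redis-py client

BATCH_SIZE = 2000  # configurable batch size; 2,000 worked well in our tests


def load_csv_into_cache(blob_stream: io.TextIOBase) -> int:
    """Read a CSV stream line by line and write key-value pairs to the cache in batches."""
    # Connection details are assumptions; Azure Redis Cache listens on port 6380 with SSL.
    client = redis.Redis(
        host=os.environ["REDIS_HOST"],
        port=6380,
        password=os.environ["REDIS_KEY"],
        ssl=True,
    )

    written = 0
    # A non-transactional pipeline queues commands and sends a whole batch in one round trip.
    pipe = client.pipeline(transaction=False)

    for row in csv.reader(blob_stream):
        if len(row) < 2:
            continue  # skip malformed lines
        key, value = row[0], row[1]  # assumed layout: key, value
        pipe.set(key, value)
        written += 1

        if written % BATCH_SIZE == 0:
            pipe.execute()  # flush the current batch to the cache

    pipe.execute()  # flush any remaining items
    return written
```

In the real solution the stream comes from the Blob trigger binding rather than a local file, but the batching pattern is the same: instead of one network round trip per key, each pipeline flush sends a whole batch, and running the function in Azure keeps it close to the cache rather than behind a slow outbound link.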
Result
The result was quite encouraging. With a batch size of 2,000 items, the 120 million records were processed in under 20 minutes, a sustained rate of over 100,000 items per second.
Code
Considering this a common scenario for applications using Azure Redis Cache, and a solution that others may find useful, I have made the code available on GitHub.