Jul 24, 2013
For many merchants, working with suppliers is a daunting and time-consuming task. Take any domain and chances are you’ll stumble across an industry either deeply stuck in the past or, at the very least, struggling to keep up with modern technology. A merchant may achieve some success attracting customers and converting carts with inventory at a certain level X, but they know that if they could just add more inventory, the customers and sales might ramp up too.
Suppliers often give merchants inventory feeds that let them list more inventory in their shop than they keep on hand. Most of the time these feeds are available as XML or CSV and are suitable for examination in a spreadsheet or text editor. Sadly, most merchants remain unable to use these feeds because they do not magically import into Shopify, or any other e-commerce platform for that matter. The data stays out in the wild until some computing is applied to turn it into something that can be added to a shop.
A Pattern for Processing Data
Suppliers can often provide data feeds to a merchant via FTP or, more rarely, as an online feed that can be accessed over HTTP. To bridge the gap I provide merchants with access to a custom Dropbox where they can upload their feeds in whatever format the supplier provides. That way the merchant can function at an even lower level, such as a feed that only ever arrives via email.
With a data feed at my disposal I can download it using a script and parse it to tease out the nuggets or jewels contained within. If the file is CSV text, it can be parsed using Ruby’s CSV library. If the feed is XML, Nokogiri is excellent at beating the XML into submission. It remains rare to have access to JSON, which is unfortunate, but it signifies how most suppliers still depend on outdated enterprise platforms incapable of pumping out JSON.
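As a rough illustration of the CSV case, here is a minimal sketch using Ruby’s standard CSV library. The sample feed, the column names sku and qty, and the helper parse_feed are all assumptions made up for the example; a real supplier feed will have its own headers and quirks. An XML feed would get a similar pass with Nokogiri.

```ruby
require "csv"

# Hypothetical supplier feed. The column names "sku" and "qty" are
# assumptions for this sketch, not something a real supplier guarantees.
SAMPLE_FEED = <<~TEXT
  sku,qty
  ABC-1,12
  ABC-2,0
  XYZ-9,7
TEXT

# Turn the CSV text into a hash of SKU => quantity, skipping rows
# that have no SKU at all.
def parse_feed(text)
  CSV.parse(text, headers: true).each_with_object({}) do |row, quantities|
    sku = row["sku"].to_s.strip
    next if sku.empty?

    quantities[sku] = row["qty"].to_i
  end
end

feed_quantities = parse_feed(SAMPLE_FEED)
```

The result is a plain hash keyed by SKU, which is easy to compare against whatever the shop actually carries.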
Often the supplier’s data is dirty and needs cleaning. A quick example: they may provide 10,000 inventory quantity numbers for variants, along with a SKU that can be used to find each variant, but in reality the shop only contains a small subset of those SKU codes. Instead of using the Shopify API to search for these SKU codes, I first produce an intermediate file that sets up all the work to be done against the actual shop inventory. Once this file is prepared it can reduce tens of thousands of API calls down to the minimum needed, making any inventory update as easy as possible to manage.
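The preparation step might look something like the sketch below. It assumes the shop’s variants are already on hand as a list of hashes and that the feed has been reduced to a SKU => quantity hash; the helper build_work_list, the field names, and the choice to skip variants whose quantity has not changed are my assumptions for illustration.

```ruby
require "json"

# Build the list of updates worth sending to the API.
# feed_quantities: hash of SKU => quantity from the supplier feed.
# shop_variants:   array of hashes describing variants that exist in the shop
#                  (hypothetical shape: :id, :sku, :quantity).
def build_work_list(feed_quantities, shop_variants)
  shop_variants.each_with_object([]) do |variant, work|
    new_quantity = feed_quantities[variant[:sku]]
    next if new_quantity.nil?                    # SKU is not in the feed
    next if new_quantity == variant[:quantity]   # nothing to change (assumed rule)

    work << { variant_id: variant[:id], sku: variant[:sku], quantity: new_quantity }
  end
end

feed_quantities = { "ABC-1" => 12, "ABC-2" => 0, "XYZ-9" => 7 }
shop_variants = [
  { id: 101, sku: "ABC-1", quantity: 5 },
  { id: 102, sku: "ABC-2", quantity: 0 },
  { id: 103, sku: "NOT-IN-FEED", quantity: 3 }
]

work = build_work_list(feed_quantities, shop_variants)

# The intermediate file is just this list serialized, for example with
# File.write("inventory_work.json", work_json).
work_json = JSON.pretty_generate(work)
```

Here three feed rows and three shop variants collapse to a single update, because only one variant is both in the feed and out of date.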
After the data preparation phase I have some tough decisions to make about how to use the Shopify API. Since a script only gets 500 API calls per 300 seconds, any script updating inventory has to handle this limitation gracefully. The typical approach of detecting a 429 (Too Many Requests) response and then going to sleep is the worst in my opinion, as it ties down a worker thread for no reason and hence does not scale nicely.
I commit my work to a cache that my background jobs can access. The first time a background job to update inventory commences, it reads the cache looking for work to do. If work is found, the script hits the API and chews through the work until the limits are reached. At that point all the completed work has been removed from the cache, so it is smaller. The background job schedules itself to restart in 301 seconds and terminates. Once this process has chewed through the whole cache, leaving it empty, it emails the merchant a comprehensive report on the inventory updated and terminates until the next inventory update is scheduled. For convenience I provide the merchant with a manual button they can click to initiate an update, or I set a job to run at a scheduled time daily or hourly.
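To make the flow concrete, here is a minimal sketch of one run of such a job. Everything in it is a stand-in: the cache is an in-memory hash, and update, reschedule and report are hypothetical callables in place of a real API client, job scheduler and mailer. The budget of 500 calls and the 301 second delay come from the figures above.

```ruby
CALL_BUDGET = 500      # API calls allowed per window
WINDOW_SECONDS = 300   # length of the rate limit window

# One run of the job: spend at most `budget` calls, then either report
# (cache drained) or ask to be run again after the window has passed.
# cache is a hash with :pending and :completed arrays (assumed shape).
def run_inventory_job(cache, update:, reschedule:, report:, budget: CALL_BUDGET)
  calls = 0
  while calls < budget && !cache[:pending].empty?
    item = cache[:pending].first
    update.call(item)                            # one API call per item
    cache[:completed] << cache[:pending].shift   # drop it only once it succeeded
    calls += 1
  end

  if cache[:pending].empty?
    report.call(cache[:completed])               # e.g. email the merchant
  else
    reschedule.call(WINDOW_SECONDS + 1)          # restart in 301 seconds
  end
  calls
end

# Demo with 1,200 queued updates. The loop stands in for the scheduler
# actually re-running the job after each delay.
cache = { pending: (1..1200).map { |id| { variant_id: id, quantity: 1 } }, completed: [] }
delays = []
reports = []
runs = 0

until cache[:pending].empty?
  run_inventory_job(cache,
                    update: ->(item) { item },
                    reschedule: ->(seconds) { delays << seconds },
                    report: ->(items) { reports << items.size })
  runs += 1
end
```

Because each run exits as soon as the budget is spent, no worker sits asleep waiting for the limit to reset; the only state carried between runs is the shrinking cache.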
Some merchants are using this pattern I set up for them to process hundreds of suppliers and tens of thousands of variants for their shops. The pattern has proven invaluable since it can quickly be tailored to handle many kinds of data feeds, formats and quirks. It relies on scripts that access the API from background jobs capable of scheduling themselves with respect to API limits, and because those jobs work from a data cache of prepared datasets, no complex data persistence issues creep into the algorithm.