Many MGH staff have had questions about enterprise-wide technology, including the occasional issue. Here, Keith Jennings, MGH/MGPO chief information officer, sheds some light on a variety of things from the Information Systems (IS) perspective. This is the first in a series of Hotline articles about all things IS-related.
Over the past few months, there have been recurring issues with Home (H) Drive slowness, Shared File Area (Isilon) slowness and problems with Citrix Workspace applications and the Virtual Desktop Environment. This Q&A explains what Isilon is, how Isilon issues affect staff and what IS has done to correct the platform.
What is Isilon?
Isilon is a high-performance storage platform from Dell/EMC that Partners HealthCare Information Systems (PHS IS) uses to store, manage and secure information across the Partners enterprise. A key feature of Isilon is its scalability – the ability for customers to keep adding more storage as their needs grow. Isilon is made up of multiple nodes of 100 terabytes each, and as our storage needs grow we purchase and install additional nodes. This platform is an industry leader used by many organizations, including some of our peer institutions such as Harvard and the Broad Institute. Partners has four separate Isilon platforms installed in our datacenters that together store more than a petabyte of data – equal to almost 60,000 movies or 20 million 4-drawer filing cabinets. It is the largest of these platforms – which we refer to as the General Purpose or GP Isilon – that has been having issues.
Why do Isilon issues affect us?
PHS IS uses the GP Isilon platform in many ways to support the needs of our clinicians and staff. That includes housing the H Drives and Shared File Areas, storing streaming data from neurophysiology monitors and storing content for various internet and intranet sites. The GP platform also houses connections/initialization files for many applications launched from the Partners/“P” menu, from within the Citrix MyApps/Workspace portal and the Virtual Desktop environment.
While PeC/Epic, often our bellwether for severity, was not affected by this spate of issues, the footprint of the GP Isilon platform is so large that the recent issues with it affected just about everyone at the MGH and across Partners.
What exactly happened?
About three months ago, various nodes in the GP Isilon began to go into spasm and either stop working or work extremely slowly. Users were unable to access or save files stored on the troubled nodes or, in some cases, could open or save files, but only after a 5- to 10-minute wait, making the application unusable from a practical standpoint. The same hang time would occur with some applications on the Partners menu or the Citrix MyApps portal.
When a node went into spasm, we could occasionally get that node functioning again. While this took time, it kept the disruption localized to the users with data on the nodes in question. Other times, we had to reboot the entire GP Isilon platform, which restored the nodes but resulted in a 20- to 30-minute outage for all users.
How were the issues identified and fixed?
At first, the only thing we could determine was that the nodes having issues tended to be older nodes installed a few years ago when we began using the Isilon platform. The newer nodes seemed unaffected.
Unfortunately, we also had little instrumentation allowing us to see inside the nodes or the platform. Much like your car dashboard may have a speedometer, tachometer and check-engine light, we have a growing set of tools we can use to monitor our servers in the datacenter and the Partners computer network. But at that point, we had little information on what was going on inside the Isilon nodes. We knew when a node was down but couldn’t see when trouble was starting.
PHS IS storage engineers, working with Dell/EMC, took the traditional first step of adding memory and processor power to the older nodes hoping that would reduce or eliminate the problem. Despite the increased horsepower, older nodes on the GP Isilon continued to experience spasms and outages.
Ongoing reviews proved inconclusive, while the outages continued to occur periodically. Dell/EMC suggested we purchase new nodes to replace the older, issue-prone nodes, but could not guarantee resolution. Since a key selling point of the Isilon platform was the ability to expand by adding new nodes without having to immediately upgrade existing ones, we engaged in negotiations and settled on a “try before you buy” model, where we would put new nodes into service, retire the suspect older nodes and would only pay for the new ones if the Isilon issues subsided.
The new nodes arrived at our datacenter in mid-October and were placed in service right after Halloween. The plan was to install the new nodes, take two weeks to migrate all the data from the older nodes onto the new ones and then remove the old nodes. This plan immediately reduced the load placed on the suspect older nodes without incurring the multi-hour downtime that an immediate one-for-one replacement would require.
What was the root cause of all these issues?
Unfortunately, even with the new nodes and the reduced load on the older ones, the older nodes continued to occasionally spasm. This reinforced for PHS IS and Dell/EMC that there was an underlying and as-yet undiscovered root cause. Dell/EMC sent a team of experienced engineers to us and we set up a 24/7 war room in Assembly Row to review, inspect and research the issue.
After six days, we appeared to have found the root cause. While the Isilon platform is intended to support nodes of various ages and performance, we discovered a bug in the underlying firmware – software deep inside the platform that is not visible to or customizable by a customer – that was causing our older nodes to fail. Being in the firmware, the bug exists in every Isilon installation, at Partners and at every other customer site. However, it took a unique set of circumstances – the size and makeup of our nodes, the amount of data we store and how often we read or update it – to trigger the bug.
Dell/EMC took the findings back to their engineering department and updated their firmware to address the bug. The updated firmware was applied to our GP Isilon platform the week of Nov. 13, and since then the platform has remained stable. Although the new nodes may not have been strictly necessary, we are keeping them and have removed the older nodes from the platform.
An additional positive outcome from the war room session with Dell/EMC was the creation of a set of monitoring tools that we will be able to use going forward. These tools were initially designed to help diagnose the error, but we can now use them to monitor the health of the nodes and the platform, and to respond quickly to any new issues that arise.
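For readers curious about what node-level health monitoring can look like, here is a minimal sketch in Python. It is not the tooling built with Dell/EMC, which this article does not describe. The node names, the sample response times and both thresholds are made up for illustration. The idea is simply to flag a node when its typical response time starts to climb, rather than waiting until it stops answering.

```python
from statistics import median

# Hypothetical thresholds, in seconds; not values used by PHS IS or Dell/EMC.
WARN_SECONDS = 2.0
CRITICAL_SECONDS = 300.0  # roughly the 5-minute waits described in this article


def classify_node(latencies):
    """Return 'down', 'critical', 'warning' or 'ok' for one node.

    `latencies` is a list of recent response times in seconds; None marks
    a probe that never returned. A node with no answered probes is 'down'.
    """
    answered = [t for t in latencies if t is not None]
    if not answered:
        return "down"
    typical = median(answered)
    if typical >= CRITICAL_SECONDS:
        return "critical"
    if typical >= WARN_SECONDS:
        return "warning"
    return "ok"


def nodes_needing_attention(samples):
    """Map node name -> status for every node that is not 'ok'."""
    statuses = {name: classify_node(times) for name, times in samples.items()}
    return {name: status for name, status in statuses.items() if status != "ok"}


if __name__ == "__main__":
    # Made-up sample data for three hypothetical nodes.
    samples = {
        "node-old-1": [0.4, 3.1, 4.0, 6.5],
        "node-old-2": [None, None, None],
        "node-new-1": [0.2, 0.3, 0.2, 0.4],
    }
    for name, status in nodes_needing_attention(samples).items():
        print(name, status)
```

In this sketch, the "warning" level plays the role of the check-engine light: it marks a node whose responses are slowing down while it is still answering.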
This article was originally published in the 12/01/17 Hotline issue.