In my last blog posting, I discussed the inner workings of the File
System Crash storage mechanism.
Today, I'm going to talk about how that system has been retrofitted
into the modern Crash Storage API. First, however, that's going to
require another excursion into the past.
The early installation of Socorro had
hardware resource problems. It wasn't clear at the beginning exactly
how many crashes we'd be getting or how big the crashes were going
to be. For the first couple of years we were starved for
processing power and disk space. If I recall correctly, our machines
were surplussed from AMO. Our Postgres server had only 4G of RAM and
non-local disk storage that couldn't rival the performance of the
laptops given to employees of that era.
Our mandate was for a developer to
be able to see the results of processing any crash within sixty
seconds of a request. At the time, just running MDSW would take
thirty seconds. It was clear to me that we could not afford to
process every crash that we received. If we tried, we'd slip further and
further behind. I recall at one point having a backlog of over five
million crashes. It was taking days between submission and processing.
We decided to start processing a
sampling of crashes and arbitrarily chose fifteen percent. I
implemented a sampling system that eventually evolved into the
selective throttling system that is in use today. This throttling
system divided the population of crashes into two sets: processed and
deferred. The mandate was to store all crashes. Any crash, even if
not initially processed, had to be eligible for processing on demand
within one minute. My aforementioned priority processing scheme was
to handle the sixty-second on-demand requirement.
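To make the idea concrete, here is a minimal sketch of that kind of sampling throttle. The function and constant names are illustrative rather than Socorro's actual implementation, but the decision is the one described above: accept a fixed percentage of crashes for immediate processing and defer the rest.

    import random

    ACCEPT = "accept"   # crash will be processed right away
    DEFER = "defer"     # crash is stored but not processed unless requested

    def throttle(raw_crash, sample_rate=0.15):
        """Decide whether a crash is processed now or deferred.

        An illustrative sketch of the original fifteen percent sampling
        scheme, not the selective throttling rules Socorro uses today
        (the real throttler inspects the crash; plain sampling ignores it).
        """
        if random.random() < sample_rate:
            return ACCEPT
        return DEFER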
Disk space constraints led us to save
the standard and deferred crash populations separately. We used the
file system storage scheme twice for raw crash storage. “Standard
Storage” was for crashes destined for processing. “Deferred
Storage” was for crashes that were not processed unless
specifically requested. We used the file system crash storage a
third time to store the processing results in “Processed Storage”.
The latter two file system storage schemes were never in need of the
indexing by date, so the internal file system branch “date”
didn't exist for them.
The file system storage was retired in
2010 when we graduated to HBase. Sadly HBase was initially very
unstable and we lost crashes when it was down. We called the File
System Storage back from the old folks' home to serve as a buffer
between the collectors and HBase. The collectors push the crashes
into the file system storage because of its proven stability. A new
app, appropriately called the crash mover, then walks the “date”
tree and moves crashes into HBase. This allows the collectors to be
immune to direct trouble with HBase.
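In rough sketch form, the crash mover is just a loop: find newly collected crashes on local disk, push each one into HBase, and only then remove the local copy. The method names below are assumptions based on the Crash Storage API described in this series, not verbatim Socorro code.

    def crash_mover(fs_storage, hbase_storage):
        """Move crashes from local file system storage into HBase.

        'fs_storage' and 'hbase_storage' stand in for the two crash
        storage implementations; the method names are illustrative.
        """
        # the file system storage's "date" branch is what lets us find
        # newly arrived crashes in roughly the order they came in
        for crash_id in fs_storage.new_crashes():
            raw_crash = fs_storage.get_raw_crash(crash_id)
            dump = fs_storage.get_raw_dump(crash_id)
            hbase_storage.save_raw_crash(raw_crash, dump, crash_id)
            # only after HBase has the crash do we drop the local copy
            fs_storage.remove(crash_id)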
The substitution of the File System
Storage within the collector was my inspiration for the Crash Storage
API, now being deployed three years later. In this modern Crash
Storage world, we have three classes that implement file system crash
storage schemes: FileSystemRawCrashStorage,
FileSystemThrottledCrashStorage, and FileSystemCrashStorage. With these
classes, Socorro is able to scale from tiny installations receiving
only a handful of crashes per day to huge ones that receive millions
per day.
FileSystemRawCrashStorage
This class is the simplest. It has one file system root for all crashes without regard for the throttling status of any given crash. This is the crash storage class used by the collectors. Since the throttle status is saved within the crash itself, and the crash mover doesn't care about that status, we don't need two different storage locations. This class declines to implement processed crash storage.
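The shape of such a class, in terms of the Crash Storage API, might look roughly like the sketch below. Treat the names and signatures as an illustration of the idea rather than the actual Socorro source.

    class FileSystemRawCrashStorageSketch:
        """Illustrative shape of the simplest file system storage class.

        One root holds every crash regardless of throttle status, and the
        processed-crash half of the API is deliberately left unimplemented.
        """

        def __init__(self, fs_root):
            self.fs_root = fs_root

        def save_raw_crash(self, raw_crash, dump, crash_id):
            # write the crash metadata and dump under the single root
            ...

        def get_raw_crash(self, crash_id):
            # read the crash metadata back from the single root
            ...

        def save_processed(self, processed_crash):
            # this class declines to implement processed crash storage
            raise NotImplementedError("use FileSystemCrashStorage instead")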
FileSystemThrottledCrashStorage
This class couples two instances of the
file system crash storage. It has file system roots for a standard
tree of crashes (to be processed) and deferred storage. Unlike the
original system, when a crash from deferred storage is called up for
processing, it isn't moved from deferred to standard storage. I'm
undecided if that is a flaw or not. The other flaw that I see in
this implementation is the “date” branch in the deferred storage.
Deferred storage is never indexed by date, so this extra file system
wrangling to support it is unnecessary.
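In sketch form, the routing logic amounts to something like the following. The "legacy_processing" field name, the exception class, and the fall-back-on-fetch behavior are my shorthand for the throttle status carried in the crash; again, an illustration of the idea rather than the real class.

    class CrashIDNotFound(KeyError):
        """Assumed exception for a crash that isn't in a given store."""

    class FileSystemThrottledCrashStorageSketch:
        """Two single-root stores coupled together: standard and deferred."""

        def __init__(self, standard_store, deferred_store):
            self.standard = standard_store
            self.deferred = deferred_store

        def save_raw_crash(self, raw_crash, dump, crash_id):
            # route on the throttle decision saved in the crash itself
            if raw_crash.get("legacy_processing") == "defer":
                self.deferred.save_raw_crash(raw_crash, dump, crash_id)
            else:
                self.standard.save_raw_crash(raw_crash, dump, crash_id)

        def get_raw_crash(self, crash_id):
            # check standard storage first, then fall back to deferred;
            # note the crash is *not* promoted out of deferred storage
            try:
                return self.standard.get_raw_crash(crash_id)
            except CrashIDNotFound:
                return self.deferred.get_raw_crash(crash_id)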
FileSystemCrashStorage
This is the complete package – it
implements the entire crash storage API: standard, deferred, and
processed. In the original file storage system, processed storage
was a separate class independent of the raw crash storage. This
class takes all storage types and unifies them under the banner of
the Crash Storage API.
So why do these all have separate roots? Can't they be combined? It makes no sense to combine the standard and deferred storage. The effect would be just like using standard storage alone. The processed storage can share a root with either the standard or deferred storage without contention. The main reason that these have separately configurable roots is that I want to give people who deploy Socorro the same opportunity that we had to distribute storage over different file systems.
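For example, a large installation might spread the three roots across different mount points, while a tiny one points them all at the same disk. A sketch of that idea, using illustrative names rather than the actual configuration options:

    # hypothetical settings, each root on its own file system; the keys
    # and paths are illustrative, not Socorro's actual configuration
    crash_storage_roots = {
        "standard": "/mnt/fast_disk/standard",    # crashes queued for processing
        "deferred": "/mnt/bulk_disk/deferred",    # rarely read, slower disk is fine
        "processed": "/mnt/bulk_disk/processed",  # processor output
    }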