When it comes to where big data is headed next, it is often healthcare that drives the discussion.
Although large, UC Irvine Medical Center presents a typical healthcare data situation. The center houses records for more than a million patients, spanning radiology images, semi-structured reports, unstructured physicians’ notes, and additional spreadsheet data. Where UC Irvine stands out is in how it has taken on this mass of patient information. The center has created a data lake (built on a Hadoop architecture), giving it the ability to store disparate records in their native formats until they’re ready for parsing at a later date, which is very different from the forced up-front integration we see in data warehousing scenarios.
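The "store now, parse later" pattern described above can be sketched in a few lines. This is an illustrative toy, not UC Irvine's actual system: records land in the lake untouched, in their native formats, with a small metadata sidecar so later jobs can discover and interpret them (schema-on-read). The directory layout and metadata fields here are assumptions for the example.

```python
import json
import tempfile
from pathlib import Path

def land_raw_record(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Write a record into the lake unchanged, tagged only with its source."""
    target_dir = lake_root / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # native format; no transformation at ingest
    # A sidecar file records minimal metadata so later jobs can find
    # and interpret the raw file without a schema being imposed now.
    meta = {"source": source, "filename": filename, "bytes": len(payload)}
    (target_dir / (filename + ".meta.json")).write_text(json.dumps(meta))
    return target

lake = Path(tempfile.mkdtemp())
land_raw_record(lake, "radiology", "scan_001.dcm", b"\x00\x01raw-image-bytes")
land_raw_record(lake, "physician_notes", "note_001.txt", b"Pt stable, follow up in 2 wks.")

# Nothing was parsed at write time; each source keeps its own format.
print(sorted(p.name for p in (lake / "radiology").iterdir()))
# → ['scan_001.dcm', 'scan_001.dcm.meta.json']
```

Contrast this with a warehouse, where every record would be forced into one shared schema before it could be stored at all.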
The Obama Administration released Big Data: Seizing Opportunities, Preserving Values in May of this year. The primary goal of the report was to increase effectiveness in harnessing big data for the benefit of the American public. The Centers for Medicare & Medicaid Services (CMS) has launched the Virtual Research Data Center (VRDC), a portal that gives researchers access to massive amounts of health data from their own computers. That means greater data transparency and, likely, better insights into improving current healthcare systems.
Where Lakes Come In
Healthcare is a vast land of data, much of it unstructured and “unclean.” Initiatives like the VRDC take aim at making this data useful, and this is where data lakes come in. These massive stores house data in its rawest form, giving administrators the option to format and standardize it when needed to make it machine-readable and easy to use. Of course, large stores of data like these carry risks, especially in healthcare. Hypothetically, they could mean access to patient-level information spanning a patient’s entire life, so access would have to be closely monitored to maintain any level of safety and security.
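"Format and standardize it when needed" is the schema-on-read idea, and a minimal sketch makes it concrete. In this invented example (field names and date formats are hypothetical), two source systems recorded admissions with different keys and date conventions; the raw records are kept as-is, and a standardization step maps them onto one canonical shape only when an analysis asks for it.

```python
from datetime import datetime

# Raw records as two hypothetical source systems produced them,
# stored untouched in the lake.
RAW_RECORDS = [
    {"pt_id": "A17", "admit": "03/12/2014"},            # system 1: US-style dates
    {"patient_id": "B42", "admit_date": "2014-03-15"},  # system 2: ISO dates
]

def standardize(record: dict) -> dict:
    """Map whichever keys and date format a source used onto one canonical schema."""
    patient = record.get("patient_id") or record.get("pt_id")
    raw_date = record.get("admit_date") or record.get("admit")
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):  # try each known format in turn
        try:
            admitted = datetime.strptime(raw_date, fmt).date()
            break
        except ValueError:
            continue
    return {"patient_id": patient, "admit_date": admitted.isoformat()}

# Standardization happens at read time, not at ingest.
clean = [standardize(r) for r in RAW_RECORDS]
print(clean)
# → [{'patient_id': 'A17', 'admit_date': '2014-03-12'},
#    {'patient_id': 'B42', 'admit_date': '2014-03-15'}]
```

Because the raw records survive unchanged, the canonical schema can be revised later and the data re-standardized without any loss.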
What Lakes Have To Offer
Returning to the UC Irvine example, keeping data in its native format provides multiple benefits. It helps maintain data provenance and fidelity, allowing analyses to be performed in different contexts. It also makes different data analysis projects possible; in healthcare, that means functions like predicting the likelihood of readmission, giving a facility the chance to take measures in advance to reduce that number.
We’re only scratching the surface of data lakes here, but PricewaterhouseCoopers has published an article that covers more of their intricacies. It discusses basic Hadoop architecture for scalable data lake infrastructure, the basic function of data lakes, how they solve enterprise-level issues of data accessibility and integration, how data flows through a lake, and how a lake matures with increasing usage across the enterprise.