Becoming a Data Scientist: My Year-Long Hiatus from Medical School

An abridged version was also published on the UBC Master of Data Science blog

Part 1: Why I Did It

In my second year of medical school, I decided to open an unfamiliar email titled “UBC Centennial Symposium on Health Informatics”. What was Health Informatics? I had no idea, but I intended to find out. The symposium featured the UK’s National Health Service (NHS) and showcased its use of large population datasets to improve the quality of NHS services in various ways - by providing research support, industry development, and quality improvement. I was impressed with how advanced the NHS was in using data to actually improve care. I went into medicine because caring for and establishing relationships with patients fulfills a personal desire to help people one on one. Applying data science to healthcare seemed to offer the opportunity to positively impact millions at a time, in addition to one at a time.

So why now? Well, we are reaching the critical mass needed to drive healthcare’s transition to digital, following many other industries. This is due to a combination of more health data, better health data, rapid growth in the field of data science, better and faster computation, and a broader cultural shift in healthcare as a whole.

  • More Health Data
    • We live in an increasingly digital world, where by one widely cited estimate 90% of all data was generated in the past two years, and healthcare is no exception. The beginnings of change are apparent: paper records are giving way to electronic medical records (EMRs) as these become universally adopted, the field of genomics is maturing, and patient-generated data is gaining momentum.
  • Better Health Data
    • There is a worldwide effort to improve the quality and accessibility of the health data that we capture. One issue with health data is interoperability - different systems don’t always talk nicely to one another. If two different clinicians describe a similar disease but use different words (e.g., ‘Heart failure with preserved ejection fraction’ vs. ‘Congestive heart failure’), how can we bring these data together? There have been substantial (but incomplete) efforts to develop health data standards that enable this communication (e.g., FHIR 3.0 and SMART on FHIR). Another issue is extraction - many clinical documents are unstructured. Although there’s a lot of valuable data in a clinician’s typed note, it’s difficult to pull out the exact data we care about. There have been strides towards this with machine learning models (natural language processing), and with companies such as Flatiron Health making it their mission to structure data so that it’s usable.
  • Data Science
    • The field of data science (which encompasses machine learning and artificial intelligence) is rapidly evolving. With new tools and technologies we can handle larger amounts of data, in addition to using that data more intelligently for descriptive, diagnostic, and predictive analytics.
  • Better and Faster Computation
    • With the combination of Moore’s law and the easy accessibility of the ‘cloud’, right now is the easiest it’s ever been to get computing resources. If data is the fuel, computation resources are the engines, and in the past few years we’ve quickly upgraded from a 1000cc engine to a jet engine.
  • Cultural Shift
    • Finally, there has been a major cultural shift within and outside of healthcare. The medical community has embraced the necessity and role of real-world evidence (RWE) - data from EMRs - in filling the gap between efficacy and effectiveness. The FDA has endorsed RWE to support regulatory decision making. The American Medical Association has released its first policy recommendation on AI. The NHS has (controversially) partnered with Google. Private corporations are making efforts to scale healthcare’s previously impenetrable walls. Apple has announced personal health records, with multiple American medical facilities allowing users to access and download their own patient data from the hospital onto their phone. Respected doctors are embracing private innovation, with leaders like Dr. Eric Topol and Dr. Atul Gawande (recently appointed CEO of the Amazon-Berkshire Hathaway-JPMorgan Chase joint venture) at the forefront.

So Why Data Science?

I saw the University of British Columbia (UBC) Master of Data Science as providing me the tools to realize two main objectives: connecting the two fields of healthcare and data science, and identifying healthcare challenges amenable to data-driven solutions - solutions that I could personally execute. Clearly, change is fast approaching in healthcare, and I wanted to contribute. I was in the process of completing clerkship at UBC medical school (hospital day and night and studying in between - think Scrubs), and felt that it was now or never: the field was expanding so quickly that I wanted to inject my education with technical literacy right away. My thought was that this would kick-start my technical knowledge so that, as I become a more rounded clinician, I can actively incorporate what I learn about data science into my practice.

As someone who is already embedded in the healthcare system, I believe that I’m uniquely poised as a data ambassador to communicate between technology and healthcare, and to help propagate the inevitable data revolution. Rather than being on the sidelines while data science is employed across clinical decisions, research, policy, innovation and governance, I envision myself as a data-literate doctor who can understand both clinicians and data scientists. I believe it will become even more vital to have data-literate doctors working with healthcare authorities, with hospital management, in research and as interfaces with government and private corporations.

Learning the science of data - programming, statistics, machine learning - equips me with the practical skills to build a data-driven tool. More importantly, it supplies me with an understanding of what is and isn’t feasible with technology, and above all how to communicate with all the brilliant doctors and data scientists who will make healthcare better.

Part 2: What I Learned

UBC’s Master of Data Science is a 10-month intensive program focused on developing statistics, computing and machine learning expertise, working with multiple real-life data sets. It’s topped off by a 2-month industry capstone project, where students get to apply all that they’ve learned. My team worked with health tech startup QxMD, building a recommendation system for medical research papers - our recommender outperforms the current system and will be deployed to production in July 2018 (500,000 users).

Let’s back up a bit. I started this Master’s with a very basic knowledge of programming - nothing more than an introductory computer science course under my belt. The curriculum ramps up quickly, starting with basic programming and statistics and then building rapidly with each subsequent week. Math and statistics went from subjects I avoided to subject matter that I can think critically about, and I went from virtually no programming skills to being confident in my coding abilities. I went from knowing AI as a buzzword (or something from a Hollywood movie) to applying various models in practice, knowing when to apply each, and understanding their limitations (and there are many).

While there are countless lessons I’ve learned in the past year, the following are a succinct distillation:

  • Understanding the Data Scientist’s Tools (and their limitations)
    • A data scientist’s role is largely to obtain data, make it usable, then decide how to use it and finally evaluate how it was used. This seems simple enough, but upon further examination each step requires countless decisions, and each option has its own strengths and weaknesses. Every step of the process has consequences: for instance, even how one chooses to handle missing data can have significant repercussions on the output. There are many models to choose from, each satisfying different purposes with different assumptions and biases. All of these decisions are important to understand when evaluating a model’s output. The program provided me with the ability to cut through the ‘big data’ hype and think critically about the choices being made and their consequences.
  • Coding
    • While I can confidently code in Python and R, the larger impact is that I’m no longer intimidated by the prospect of learning a new coding language. I learned how to learn online: finding online resources, reading documentation, and then acquiring the necessary knowledge to execute a task - whether it be making a personal blog/website, writing a daily email reminder script with JavaScript or creating a web app with Django.
  • Thinking Statistically
    • In medicine we are taught to think as Bayesians. This is the idea that a doctor has a belief about a patient’s diagnosis that is updated with each additional morsel of information. For example, let’s say a patient comes into the emergency room complaining of chest pain. Are they having a heart attack? If they’re a smoker, diabetic, and have a family history of heart disease (all risk factors for coronary artery disease), the doctor will be more convinced that this person may indeed be having a heart attack. I found the transition from the ‘human’ Bayesian thought I’d learned in medical school to the actual math to be fascinating. A belief about a diagnosis was just a probability distribution (like a Gaussian curve) which was molded and transformed with each additional piece of information. Soon other thoughts found their data science analogues: pursuing a goal in my personal life felt like an algorithm’s attempt at finding a global maximum, and I started thinking in machine learning terms like ‘objective function’. In addition to thinking more statistically in my daily life, I have a newfound appreciation (and skepticism) for research. With more baseline knowledge, reading research papers has become approachable and interesting (a previously foreign concept).
  • Speaking Data
    • I learned the language of data flow. APIs, databases (SQL vs. NoSQL), data transfer formats (XML, JSON), Docker and other tools and terminology were previously meaningless, and now are dense with meaning, experience, and opinions. My favourite analogy, from an a16z podcast, elegantly connects my two worlds by comparing data flow to blood flow in the body’s cardiovascular system. The beauty of being able to speak both ‘medical’ and ‘data’ is being able to explain complex concepts to different audiences based upon their domain knowledge. In my final project at the health tech startup (QxMD), I explained advanced machine learning concepts to a medical audience with medical analogies, and then explained medical concepts to a tech audience with tech analogies. The application of machine learning to healthcare requires a combination of data and medical literacy, and I feel confident that MDS has prepared me well.
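
To make the chest pain example from ‘Thinking Statistically’ a bit more concrete, here is a minimal Python sketch of how a belief can be updated one finding at a time using Bayes’ rule in odds form. The starting probability and the likelihood ratios are made-up numbers chosen for easy arithmetic, not clinical values, and the sketch assumes the findings are independent of one another given the diagnosis (a simplification that rarely holds exactly in practice).

```python
def update_probability(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def sequential_update(prior, likelihood_ratios):
    """Return the belief after each finding, in the order the findings are given.

    Assumes the findings are conditionally independent given the diagnosis.
    """
    beliefs = [prior]
    for ratio in likelihood_ratios.values():
        beliefs.append(update_probability(beliefs[-1], ratio))
    return beliefs


# Hypothetical numbers for illustration only, NOT clinical values.
PRIOR_HEART_ATTACK = 0.10
ILLUSTRATIVE_LIKELIHOOD_RATIOS = {
    "smoker": 1.5,
    "diabetic": 2.0,
    "family history of heart disease": 1.5,
}

beliefs = sequential_update(PRIOR_HEART_ATTACK, ILLUSTRATIVE_LIKELIHOOD_RATIOS)
labels = ["before any history", *ILLUSTRATIVE_LIKELIHOOD_RATIOS]
for label, belief in zip(labels, beliefs):
    print(f"{label}: {belief:.2f}")
```

Each risk factor nudges the belief upward, which is the ‘human’ reasoning from the emergency room written out as arithmetic. A likelihood ratio below 1 (for example, a reassuring test result) would pull the belief back down in the same way.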

What next? The medical field recognizes that being a good doctor requires ‘lifelong learning’. Both medicine and data science are vast and ever-expanding, and so is their intersection. MDS has provided the tools and critical thinking to take the next steps into this intersection, the rapidly evolving field of Health Informatics. I’m excited to throw myself back into medicine equipped with this new data science mindset, mindful of the huge potential these disciplines offer each other.