Essential Reading for Data and Machine Learning
Ian Hellström | 25 February 2020 | 10 min read
Here’s my list of essential reading for people in the data and machine learning space.
This guide is to help build a foundation in technology, data engineering, machine learning, and tech leadership. If you already have such a foundation and merely want to stay up to date with current trends, please check out my recommendations instead.
Quite a bit of open-source data technology originated in some form or another at Google. Their research papers are therefore worth reading, particularly:
- BigQuery is the query engine Dremel atop Colossus
- Spanner and how it achieves global availability and consistency while also ensuring partition tolerance
Another worthy addition is Microsoft’s 1999 paper on sampling over joins. For local development or CI/CD, a sample of real data is often helpful besides data generation for unit tests, but it’s not as straightforward to grab a random subset of data when joins are involved. This paper talks about that problem in depth.
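To make the problem a bit more concrete: if you sample two tables independently and then join them, most sampled rows lose their partner. One pragmatic workaround for development data is to sample on the join key instead, so that both sides keep the same keys. The sketch below is not the algorithm from the paper, merely an illustration under my own assumptions: the table and column names are made up, and hashing the key is only one of several ways to get a consistent subset. Note that it keeps whole key groups, so it is not a uniform random sample of the join result.

```python
import hashlib


def keep_key(key, rate):
    """Deterministically decide whether a join key belongs to the sample."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def sample_by_key(rows, key_field, rate):
    """Keep only the rows whose join key falls inside the sample."""
    return [row for row in rows if keep_key(row[key_field], rate)]


def join(left, right, key_field):
    """Plain inner hash join on a single key."""
    index = {}
    for row in right:
        index.setdefault(row[key_field], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key_field], [])]


# Hypothetical tables: 1,000 customers with 5 orders each.
customers = [{"customer_id": i, "name": f"customer-{i}"} for i in range(1000)]
orders = [{"order_id": i, "customer_id": i % 1000} for i in range(5000)]

rate = 0.1
sampled_customers = sample_by_key(customers, "customer_id", rate)
sampled_orders = sample_by_key(orders, "customer_id", rate)

# Every sampled order still finds its customer after the join.
sampled_join = join(sampled_orders, sampled_customers, "customer_id")
```

Because the decision depends only on the key, the same customers are picked on every run, which is handy for reproducible CI data.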
Designing Data-Intensive Applications is Martin Kleppmann’s book on data systems and architecture. It gives a modern perspective on data technologies that anyone in the field ought to have thumbed through. It even references yours truly with regard to various open-source streaming technologies.
Hot off the press! In Fundamentals of Software Architecture, Mark Richards and Neal Ford go through various architectural styles and fallacies. It’s a great and modern overview of various architectures and relevant technologies. It mentions architecture decision records (ADRs) as a best practice.
Coding or at least involvement in code reviews is not listed as a core expectation of architects. That, in my experience, leads to PowerPoint architects who are too far removed from the code base and nitty-gritty of technology to make sensible architectural decisions.
Their 20-minute rule is simple yet valuable:
The idea of the [20-minute rule] is to devote at least 20 minutes a day to your career as an architect by learning something new or diving deeper into a specific topic.
That’s a worthwhile investment for any engineer!
In Your Code as a Crime Scene, Adam Tornhill shows techniques that blend complexity metrics, familiar to anyone who’s ever used static code analysis tools, with organizational metrics to reveal hotspots of possible issues.
The book is not at all about becoming a black-belt git blame user, but a structured approach to applying profiling, inspired by crime fighting, to figure out where the technical debt lurks, in a prioritized list that can be acted upon.
The scripts are somewhat obsolete as they morphed into a product: CodeScene.
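For a feel of what a hotspot analysis might look like, here is a minimal sketch. It is not taken from the book's scripts or from CodeScene. I am assuming that the input is a list of changed file paths, one per line, such as you could get from git log with --name-only and an empty commit format, and that lines of code serve as a crude stand-in for complexity. The score (revisions times lines of code) and all file names are my own simplification.

```python
from collections import Counter


def change_frequencies(git_log_text):
    """Count how often each file path appears in a log of changed files."""
    return Counter(
        line.strip() for line in git_log_text.splitlines() if line.strip()
    )


def hotspots(frequencies, lines_of_code):
    """Rank files by revisions x lines of code, highest score first."""
    scores = {
        path: revisions * lines_of_code.get(path, 0)
        for path, revisions in frequencies.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical log: blank lines separate commits, other lines are changed files.
log = """
src/billing.py
src/api.py

src/billing.py

src/billing.py
README.md
"""

# Hypothetical size of each file in lines of code.
loc = {"src/billing.py": 900, "src/api.py": 300, "README.md": 40}

ranking = hotspots(change_frequencies(log), loc)
# [('src/billing.py', 2700), ('src/api.py', 300), ('README.md', 40)]
```

Files that are both large and frequently changed float to the top, which gives you a prioritized list of candidates to inspect first.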
While not related to data, machine learning, or leadership at all, this book 1-ups your shell-fu by showing what can be done in a basic terminal, which in the long run will make you more productive, unless you learn how to use the shell for making games…
Python is the language for data science and machine learning, but Scala is the language of choice for data engineering. For the latter, I have previously recommended material. For Python it’s often best to look for resources on whatever library you intend to use, for instance, scikit-learn, TensorFlow, PyTorch, but there are many more.
A comparison of end-to-end machine learning platforms is available right here on Databaseline. An important resource related to that is the article on Hidden Technical Debt in Machine Learning Systems, which was presented at NIPS 2015, and contains a lot of the pitfalls of ML-powered solutions:
There is sometimes a hard line between ML research and engineering, but this can be counter-productive for long-term system health. It is important to create team cultures that reward deletion of features, reduction of complexity, improvements in reproducibility, stability, and monitoring to the same degree that improvements in accuracy are valued. In our experience, this is most likely to occur within heterogeneous teams with strengths in both ML research and engineering.
Federated learning is an approach in which individual devices train a model locally and communicate with a central model to give and receive updates from other devices in the network. This is not relevant for a large number of businesses, but if you make apps for mobile devices or manage a fleet of IoT devices, it may be an idea worth exploring, especially when latency or privacy is a concern. There are even ways to preserve privacy, as described here.
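The loop of local training and central aggregation can be sketched in a few lines. The example below uses federated averaging, which is one well-known scheme and my own choice for illustration; the linked material may describe something different. The model is a single weight for y = weight * x, and the device data, learning rate, and number of rounds are all made up.

```python
def local_update(weight, data, learning_rate=0.01, epochs=5):
    """Run a few gradient-descent steps for y = weight * x on one device."""
    for _ in range(epochs):
        gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * gradient
    return weight


def federated_round(global_weight, devices):
    """One round: devices train locally, the server averages their weights."""
    total = sum(len(data) for data in devices)
    updates = [(local_update(global_weight, data), len(data)) for data in devices]
    # Weight each device's result by the number of samples it trained on.
    return sum(weight * count for weight, count in updates) / total


# Hypothetical local data sets; every point lies on the line y = 3x.
devices = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(1.5, 4.5), (2.5, 7.5), (0.5, 1.5)],
]

weight = 0.0
for _ in range(50):
    weight = federated_round(weight, devices)
```

Only the weights travel between devices and server; the raw data points stay where they are, which is the whole appeal when privacy matters.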
In ‘Learning from Data’, the authors do not teach you how to do machine learning, but they explain why machines can learn. It is a theoretical book that revolves mostly around the VC dimension, but that does not mean it’s hard to understand. It’s the best book on the topic.
The Elements of Statistical Learning is a classic in the field of traditional machine learning algorithms. A simpler version, called An Introduction to Statistical Learning, is available, too. Both include code in R.
Mr Murphy’s very accessible and somewhat Bayesian approach to ML comes with lots of Matlab code. It mostly focuses on traditional (parametric) algorithms rather than deep and/or wide neural networks. For an even more rigorous Bayesian treatment of classical algorithms, please check out Pattern Recognition and Machine Learning.
It’s a true whopper of a book with more than 1,100 pages, which does not make it a great choice for reading on the beach—unless you need kindling for a bonfire. The book’s scope is larger than most books I have listed because it discusses artificial intelligence in more general terms, including, but not limited to, logic, robotics, and perception and vision. It has a short chapter on philosophical matters, although ethics does not receive as much space as it ought to have. Still, I can recommend it to anyone who wants to know more about AI and not merely about the application of ML.
Goodfellow, Bengio, and Courville give away a book on deep learning, if you care to read it in a browser. With 56 pages of references, there is also enough material for anyone who needs even more detail or background.
Probabilistic Graphical Models
Sucar’s book is more accessible than the tome by Koller and Friedman. Both books provide plenty of details on the inner workings and applications of PGMs. In case you have not yet heard of PGMs, they are used in causal inference, speech recognition, computer vision, and graphical models for protein structure, to name but a few applications.
A bit of an oddball in this category, but definitely worth a look, since it contains interviews with 25 data scientists. The book does not provide any technical details, but for anyone looking for career advice or wanting to know what it’s like to be a data scientist at different organizations and at various levels, there are plenty of insights.
A classic by Fred Brooks. It tells how nine women cannot have a baby in one month, although many software projects are run in exactly that way. We’re late? Let’s add more resources. But,
[people] and months are interchangeable commodities only when a task can be partitioned among many workers with no communication among them.
[a]dding manpower to a late software project makes it later.
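A quick bit of arithmetic shows why communication spoils the party. If every person on a team has to coordinate with every other person, the number of communication paths is n(n - 1)/2, a pairwise model that Brooks himself brings up. The team sizes below are merely illustrative.

```python
def communication_paths(people):
    """Number of distinct pairs in a team where everyone talks to everyone."""
    return people * (people - 1) // 2


paths = {team_size: communication_paths(team_size) for team_size in (3, 6, 12)}
# {3: 3, 6: 15, 12: 66}
```

Doubling a team from 6 to 12 people more than quadruples the number of paths, from 15 to 66, and that is before anyone new has been brought up to speed.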
In The Manager’s Path, Camille Fournier maps a path from engineer to CTO, although in many cases engineers who desire a leadership position choose to stay individual contributors (e.g. as principal engineers or architects) or become tech leads. The book warns against promoting the ‘alpha geek’ into a position of leadership that includes responsibility beyond technical matters.
A great line from the book is by Caitie McCaffrey:
Being a tech lead is an exercise in influencing without authority.
It can indeed be quite a challenge to lead without having the ability to tell people what to do. But of course a tech lead’s main skill lies in ‘the willingness to step away from the code and figure out how to balance your technical commitments with the work the whole team needs.’
The book collects a lot of advice for anyone who aspires to lead engineers or already does.
Are techies really that different? If that were a headline, it’d violate Betteridge’s law. The book describes how leading technologists often requires an approach different from what is taught in general management programmes.
In ‘Talking with Tech Leads’, Patrick Kua converses with novice and veteran tech leads. A lot (but not all) of their advice is valuable, and it may depend on your style of leadership. One of my favourite quotations from the book is by Roy Osherove:
Whatever questions I’ve had, I’ve received either incomplete or just horrible answers from people whom I looked up to in other leadership roles. […] It seemed that everyone around me was winging it as much as I was.
Susan Cain’s book talks about introverts. Why? Because roughly half the population is introverted or at least has traits in line with introversion, yet the world often expects extroverts. No self-respecting leader can be ignorant of different personalities and still pretend to lead. To those who are introverted, the book may also set your mind at ease: yes, you’re different but not alone and definitely not abnormal. It’s society that has primed us all to think extroversion is the gold standard.
This is a very short book, in which Seth Godin rants about the importance of initiative. If you have been guilty of saying, ‘Someone really ought to do something about this,’ only to cross your arms, lean back, and watch nothing happen, then this book may have some insights for you.
The author asks the same question in the subtitle that Suzy Welch recommends to determine whether to quit: When was the last time you did something for the first time? If the answer is never or a while ago, then you’re not learning anything, and it may be time to move on.
A great paragraph is on why it’s important to improve the entire team and not merely allow the star members to develop their skills by working on the latest and greatest projects:
In the short run, playing your strongest player, following the playbook, rewarding someone who has done it before—these are all great ways to win. In the long run, though, all you’ve done is taught conformity and punished initiation.
Zapier’s free e-book on their decade of being a fully remote company contains lots of information and links to even more information, in case you ever want to convince your organization of the benefits of fully remote work.
Emily Chang talks about the brogrammer culture that pervades much of Silicon Valley and has even reached the shores of those who seek to imitate its so-called ‘meritocracy’:
Privilege accumulates as you advance in life. If the college you attend is the basis of your future employment networks, then it is impossible to say that your employment success is solely based on merit.
To ensure a workforce representative of the population at large, leadership needs to be sensitive to ingrained behaviour and language that impedes efforts to increase diversity. After all, how a company phrases job ads and who is promoted for what reasons reveal more about the culture than a set of made-up corporate values.
Bias in Data and Applications
No essential reading guide would be complete without resources on bias. This is especially important in data tech because biased data and/or algorithms can exacerbate inequalities.
Winner of the Euler Book Prize and written by Cathy O’Neil, Weapons of Math Destruction (WMD) goes on a tour through ‘Big Data’ and shows how
[t]he privileged […] are processed more by people, the masses by machines.
This exacerbates any bias already present in algorithms, but what’s perhaps even worse is that
[t]he human victims of WMDs […] are held to a far higher standard of evidence than the algorithms themselves.
As O’Neil writes,
the point is not whether some people benefit. It’s that so many suffer. These models, powered by algorithms, slam doors in the face of millions of people, often for the flimsiest of reasons, and offer no appeal. They’re unfair.
This book is written by Caroline Criado Perez who, among many things, spearheaded the effort to have Jane Austen on the £10 bank note. It exposes ‘data bias in a world designed for men’. You may not know it, but the world is really designed for Reference Man. No, that’s not a terrible new superhero who has memorized the entirety of Wikipedia. It’s the misguided belief that a white, fairly svelte (70 kg) man in his late twenties can stand in for all humanity when it comes to product design. He obviously cannot, but people often forget lessons of old or fail to apply them out of sheer laziness.
The book is an eye opener and offers many important pieces of information, such as:
Recent research has emerged showing that while women tend to assess their intelligence accurately, men of average intelligence think they are more intelligent than two-thirds of people.
Ms Criado Perez also mentions studies to allay anyone’s fears about quotas that aim to increase diversity:
[Q]uotas, which, contrary to popular misconception, were recently found by a London School of Economics study to ‘weed out incompetent men’ rather than promote unqualified women.
And to those who are reluctant to believe women can lead competently and be at least as innovative as men, she has cold hard statistics:
For every dollar of funding, female-owned start-ups generate seventy-eight cents, compared to male-owned start-ups which generate thirty-one cents.
All I can say is: read it. As I have said before: it’s about damn time we stop treating women as an edge case.
Sara Ann Marie Wachter-Boettcher has had her share of issues with forms that do not allow her full name to be entered, but in ‘Technically Wrong’ she talks about more than that. It’s an exposé of an industry that does not seem to care to get it right for anyone but themselves: young, white men.
This is why I think of non-inclusive forms as parallel to microaggressions: the daily little snubs and slights that marginalized groups face in the world—like when strangers reach out and touch a black woman’s hair. Or when an Asian American is hounded about where they’re really from (no one ever wants to take “Sacramento” as an answer). […] When systems don’t allow users to express their identities, companies end up with data that doesn’t reflect the reality of their users. And […] when companies (and, increasingly, their artificial-intelligence systems) rely on that information to make choices about how their products work, they can wreak havoc—affecting everything from personal safety to political contests to prison sentences.
She talks about research on why diversity matters:
[W]hen we work only with those similar to us, we often “think we all hold the same information and share the same perspective[.] This perspective […] is what hinders creativity and innovation.”
Perhaps the pithiest observation of the book is this:
The only thing that’s normal is diversity.