Traits of Data scientist
Being relentless in a lazy way
Most of your model assumptions are incorrect. He mentions 2/3 of his ideas failed. Being lazy means you DO move on to the next idea and don’t stick to it.
Humility
Being at best competent as a statician and scientist. You know enough to talk to expert but you’re not an expert but this allows you to learn.
Math vs. Computer Science
Need to be good at both. Best to atleast study one of these and otherwise study science. Just studying Science (biology, neuroscience, ..) will give you the tools like relentlessness.
Big Oh notation is important as it gives you the implications of runtimes of programs
Know your data structures
Computer science has algorithms and data structures, Data science has random variables and probably modeling
The difficulty is to find the context when none is provided. Give someone a challenge and mention that they should use a Hashmap. Chances are they won’t have a problem implementing it. Give someone a challenge without any context (don’t mention hashmap) and now you have a proper indication of a real-world response.
Statisticians typically don’t have the computer science habits. Modular software, testing, code reviews, continuous integration, source control, ..
Kaggle competitions
Kaggle skips a lot of steps. A lot of the work has been done for you. Implementing the machine learning part is typically one of the last steps.
Hadoop & data science
If the traditional infrastructure isn’t sufficient, a scientist will find a way to get it to run. It’s part of the job.
Unit of analysis problem
Relational databases are bad at:
- COUNT DISTINCT - Star models are not optimized for this
- Cursors - If you are using cursors, there is something wrong
- ALTER TABLE_OF_DOOM - That one table in your data warehouse that you cannot touch
Databases are optimized to analyse transactions. Financial reporting, enterprise resource planning, …
Asking questions where all the things you need to know are not in a single table row is typically unperformant in a database.
Hadoop is simply a file system. You can structure your data as you like. Pull all the fields you need into a single record.
Where should you work when starting out
Start up - Wearing two hats, jack of all trades.
Close to the money - You can convince people how much money you save or make the company.
Education and growth
Data scientists usually work in lonely teams, by themselves. No contact and no sharing of data because it’s too valuable. No sharing of best practices. Need for communities.
Realtime ML
Realtime ML is rarely necessary. Pattern in data doesn’t change so fast.
Sometimes writing code to test ‘how something will turn out’ takes longer than implementing an experiment on live users and seeing the result.
Valuable stats content to learn
- Linaer Regression
- T Test
- Binominal distributions
- Confidence intervals
Josh Wills
Funny and down to earth guy. And I like that he’s wearing sandals :)