*Simon: It took me a while to complete this post because I did not like the first approach I took (which was basically summarising the slides). I think it is more interesting to share a high-level overview of what the lesson is about and to talk about some of the cool things that I learnt! I am still refining this process as I go along, so please bear with me. Without further ado, let’s begin!*

# TL;DR

Lesson 2 was a very thorough introduction to all things data: what data is, the different types of data, storage of data, common problems with data, etc. There was a bit of math involved as well, but nothing a quick Google search can’t solve! Subsequently, I was exposed to the idea of effective visualisation of data (and the really cool Anscombe’s quartet!) and I did my very first lab as well. I already knew some Python and statistics from my days in uni but this refresher definitely helped!

# TIL

## What is messy data?

| | Fri | Sat | Sun |
|---|---|---|---|
| Morning | 15 | 158 | 10 |
| Afternoon | 2 | 90 | 20 |
| Evening | 55 | 12 | 45 |

From a layman’s perspective, I thought this data was considered pretty clean, but apparently there is a mistake here… The values `Fri`, `Sat`, `Sun` and `Morning`, `Afternoon`, `Evening` correspond to values of the attributes `Day` and `Time` respectively. The supposedly cleaner way to show this data is the following:

| ID | Time | Day | Number |
|---|---|---|---|
| 1 | Morning | Fri | 15 |
| 2 | Morning | Sat | 158 |
| 3 | Morning | Sun | 10 |
| 4 | Afternoon | Fri | 2 |
| 5 | Afternoon | Sat | 90 |
| 6 | Afternoon | Sun | 20 |
| 7 | Evening | Fri | 55 |
| 8 | Evening | Sat | 12 |
| 9 | Evening | Sun | 45 |

I understand the reasoning for this, and while I do not necessarily agree that the first table is messy, if this is the convention then I just have to get used to it!
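To see how the conversion works in practice, here is a minimal sketch in plain Python that reshapes the wide table into tidy rows, one per observation (in pandas you would typically reach for `pd.melt` instead):

```python
# The "messy" wide table: Time down the side, Day across the top.
wide = {
    "Morning":   {"Fri": 15, "Sat": 158, "Sun": 10},
    "Afternoon": {"Fri": 2,  "Sat": 90,  "Sun": 20},
    "Evening":   {"Fri": 55, "Sat": 12,  "Sun": 45},
}

# Tidy (long) form: each row records one observation with explicit
# Time and Day attributes, matching the second table above.
tidy = [
    {"ID": i + 1, "Time": time, "Day": day, "Number": number}
    for i, (time, day, number) in enumerate(
        (time, day, number)
        for time, days in wide.items()
        for day, number in days.items()
    )
]

print(tidy[0])  # {'ID': 1, 'Time': 'Morning', 'Day': 'Fri', 'Number': 15}
```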

## Getting better at Statistics

A tip that helped me to understand and apply statistics better: you need to be able to differentiate between a sample and a population. It might seem obvious in hindsight but trust me, sometimes you might accidentally use a formula that only applies to populations when what you actually want is the formula for a sample.
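The standard library makes this distinction explicit, which is a nice way to see the difference concretely (the data values here are just illustrative):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative sample

# Population variance divides by n; sample variance divides by n - 1
# (Bessel's correction), so applying the wrong one gives a different answer.
print("population variance:", statistics.pvariance(data))  # divides by n
print("sample variance:", statistics.variance(data))       # divides by n - 1
```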

Also, thank you math.stackexchange for the mathjax cheatsheet! Now I can write beautiful maths like this:

$$median =
\begin{cases}
x_{\frac{n+1}{2}}, & \text{if $n$ is odd} \\
\frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}, & \text{if $n$ is even}
\end{cases}$$
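As a sanity check, here is the same piecewise formula written out in Python (the subscripts above are 1-indexed order statistics, hence the `- 1` when indexing):

```python
import statistics

def median(xs):
    """Median via the piecewise formula above (1-indexed order statistics)."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]                  # x_{(n+1)/2}
    return (s[n // 2 - 1] + s[n // 2 + 1 - 1]) / 2  # (x_{n/2} + x_{n/2+1}) / 2

print(median([5, 1, 3]))     # 3
print(median([5, 1, 3, 7]))  # 4.0
```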

## Anscombe’s Quartet

Anscombe’s Quartet comprises 4 different datasets that have almost identical descriptive statistics (mean, sample variance of x, etc.), but when you plot these datasets out and compare them side by side, they are very different from one another!

The purpose of Anscombe’s quartet is to “demonstrate both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties”.
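You can verify this yourself with the published quartet values; the sketch below compares just the first two datasets (datasets I–III share the same x values):

```python
import statistics

# x values shared by datasets I-III; y values differ per dataset.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# Near-identical descriptive statistics, despite very different shapes
# when plotted (y1 is roughly linear with noise, y2 is a clean parabola).
print("means:", statistics.mean(y1), statistics.mean(y2))
print("sample variances:", statistics.variance(y1), statistics.variance(y2))
```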

## Effective Visualisation

When trying to get your findings across, there are no hard and fast rules that you **must** adhere to, but there are definitely some recommended approaches for highlighting certain features. For instance, if you want to focus your audience’s attention on the descriptive statistics and outliers in your dataset, you might use something like a box plot, but if you want to show a relationship between 2 numerical attributes, you might use a scatter plot instead. Here are some more tips from Tableau.
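A quick matplotlib sketch of the two choices side by side, using made-up data (the `Agg` backend just renders off-screen so no display is needed):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import random

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100)]
ys = [2 * x + random.gauss(0, 0.5) for x in xs]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.boxplot(xs)        # highlights median, quartiles and outliers of one attribute
ax1.set_title("Box plot of x")
ax2.scatter(xs, ys)    # highlights the relationship between two numerical attributes
ax2.set_title("x vs y")
fig.savefig("plots.png")
```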

## Lab

### Things I found pretty cool

- Many ways to slice a list!

```python
arr = [0, 1, 2, 3, 4]
print(arr[:-2])    # [0, 1, 2]
print(arr[:-2:2])  # [0, 2]
print(arr[-1:])    # [4]
print(arr[-1])     # 4
```

- List comprehensions (similar to JavaScript’s Array.prototype.map)

```python
even = [i for i in range(10) if i % 2 == 0]
print(even)  # [0, 2, 4, 6, 8]
```

- Lambda functions (anonymous functions; no need to explicitly write `return`)

```python
square = lambda x: x * x
print(square(3))  # 9
```

- Using `.get(key, default)` to access values in a dictionary instead of just `dict[key]`, because it avoids a `KeyError` when the key is missing.
- `zip()` combines 2 iterable sequences together and forms a new sequence with each element being a key-value pair.

```python
number_list = [1, 2, 3]
str_list = ['one', 'two', 'three']
result = zip(number_list, str_list)
result_set = set(result)
print(result_set)  # {(2, 'two'), (1, 'one'), (3, 'three')} (set order may vary)
```
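To make the `.get` point concrete, a tiny sketch with a made-up dictionary:

```python
counts = {"apple": 3}

# counts["pear"] would raise KeyError; .get returns the default instead.
print(counts.get("apple", 0))  # 3
print(counts.get("pear", 0))   # 0
```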

- Useful libraries for statistics - numpy, matplotlib.pyplot, scipy.stats & statsmodels

### Proof of work

- Naive prime number generation, running time = $O(n^2)$

```python
primes = [2]
for i in range(3, 50, 2):
    isPrime = True
    for j in range(3, i, 2):
        if i % j == 0:
            isPrime = False
            break
    if isPrime:
        primes.append(i)
print(primes)
```

- is_prime function, running time = $O(\sqrt n)$

```python
def is_prime(n):
    if n == 2 or n == 3:
        return True
    if n % 2 == 0 or n < 2:
        return False
    for i in range(3, int(n**0.5) + 1, 2):  # start from 3, go up to sqrt(n), skip even numbers
        if n % i == 0:
            return False
    return True
```

### Other useful resources

- Python Data Science Handbook by Jake Vanderplas
- Chris Albon’s website

Last modified on 11 October 2020.
