Simon Tan
Musings of a curious developer ت
Data & Data Exploration
Lesson 2 of CS109: what data is and how to go about exploring it.

Simon: It took me a while to complete this post because I did not like the first approach I took (which was basically summarising the slides). I think it is more interesting to share a high-level overview of what the lesson is about and to talk about some of the cool things that I learnt! I am still refining this process as I go along, so please bear with me. Without further ado, let’s begin!

TL;DR

Lesson 2 was a very thorough introduction to all things data: what data is, the different types of data, how data is stored, common problems with data, and so on. There was a bit of math involved as well, but nothing a quick Google search can’t solve! After that, I was introduced to the idea of effective data visualisation (and the really cool Anscombe’s quartet!) and did my very first lab. I already knew some Python and statistics from my days in uni, but this refresher definitely helped!

TIL

What is messy data?

            Fri   Sat   Sun
Morning      15   158    10
Afternoon     2    90    20
Evening      55    12    45

At first glance, I thought this data was pretty clean, but apparently there is a problem: the column headers Fri, Sat, Sun and the row labels Morning, Afternoon, Evening are not attribute names. They are values of the attributes Day and Time respectively. The supposedly cleaner way to show this data is to give each attribute its own column and each observation its own row:

ID  Time       Day  Number
1   Morning    Fri  15
2   Morning    Sat  158
3   Morning    Sun  10
4   Afternoon  Fri  2
5   Afternoon  Sat  90
6   Afternoon  Sun  20
7   Evening    Fri  55
8   Evening    Sat  12
9   Evening    Sun  45

I understand the reasoning for this, though I do not necessarily agree that the first table is messy. But if this is the convention, then I just have to get used to it!
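To make the difference concrete, here is a minimal sketch of going from the first layout to the second in plain Python. The names `days`, `wide` and `long_rows` are made up for illustration, and this is only one possible way to do it (a library such as pandas could do the same job).

```python
# Wide layout: one row per Time, one column per Day (values copied from the first table).
days = ["Fri", "Sat", "Sun"]
wide = {
    "Morning":   [15, 158, 10],
    "Afternoon": [2, 90, 20],
    "Evening":   [55, 12, 45],
}

# Long layout: one row per (Time, Day) observation, each attribute in its own field.
long_rows = []
for time, counts in wide.items():
    for day, number in zip(days, counts):
        long_rows.append(
            {"ID": len(long_rows) + 1, "Time": time, "Day": day, "Number": number}
        )

for row in long_rows:
    print(row)
```

Once the data is in the long layout, questions like "which Day has the highest Number?" become a simple loop or filter over rows instead of a lookup across column headers.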

Getting better at Statistics

A tip that helped me to understand and apply statistics better is that you need to be able to differentiate between sample and population. It might be quite obvious in hindsight, but trust me, sometimes you might accidentally use a formula that is only applicable to populations (for example, dividing by $n$ when computing the variance) when in reality what you want is the formula for a sample (dividing by $n - 1$).
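As a small illustration of why the distinction matters, here is a sketch using Python's standard `statistics` module, which has separate functions for the two cases. The numbers in `data` are made up for the example.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up numbers for illustration

# Population variance: sum of squared deviations divided by n.
population_var = statistics.pvariance(data)

# Sample variance: sum of squared deviations divided by n - 1.
sample_var = statistics.variance(data)

print(population_var)  # 4.0
print(sample_var)      # a bit larger, because the divisor is 7 instead of 8
```

The sample version is always the larger of the two for the same data, and the gap shrinks as the dataset grows.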

Also, thank you math.stackexchange for the MathJax cheatsheet! Now I can write beautiful maths like this:

$$\text{median} = \begin{cases} x_{\frac{n+1}{2}}, & \text{if $n$ is odd} \\
\frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}, & \text{if $n$ is even} \end{cases}$$

(Here $x_i$ is the $i$-th smallest value, counting from 1.)
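For anyone who prefers code to notation, the median can be sketched in a few lines of Python. This is just an illustration, not code from the lesson: sort the values, take the middle one when $n$ is odd, and average the two middle ones when $n$ is even. The only tricky part is shifting the 1-based positions in the formula to Python's 0-based indexes.

```python
import statistics

def median(values):
    """Median of a non-empty list of numbers."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("median of empty data")
    if n % 2 == 1:
        # Odd n: the single middle value, position (n + 1) / 2 counting from 1.
        return xs[(n + 1) // 2 - 1]
    # Even n: the average of the two middle values.
    return (xs[n // 2 - 1] + xs[n // 2]) / 2

print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5

# Sanity check against the standard library.
print(median([4, 1, 3, 2]) == statistics.median([4, 1, 3, 2]))  # True
```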

Anscombe’s Quartet

Anscombe’s quartet comprises 4 different datasets that have almost identical descriptive statistics (mean, sample variance of x, etc.), but when you plot these datasets and compare them side by side, they look very different from one another!

[Figure: Anscombe’s quartet]

The purpose of Anscombe’s quartet is to “demonstrate both the importance of graphing data before analyzing it and the effect of outliers and other influential observations on statistical properties”.
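The "almost identical statistics" part is easy to check without any plotting. Below is a minimal sketch using only the standard library. The numbers are the quartet's values as they are commonly published, typed in by hand, and the variable names (`x123`, `quartet`) are my own.

```python
import statistics

# Anscombe's quartet as commonly published; datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Print mean and sample variance of x and y for each dataset.
for name, (x, y) in quartet.items():
    print(
        name,
        statistics.mean(x),
        statistics.variance(x),
        round(statistics.mean(y), 2),
        round(statistics.variance(y), 2),
    )
```

The four printed rows come out nearly the same, which is exactly why the summary numbers alone would never reveal how different the datasets look when plotted.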

Effective Visualisation

When trying to communicate your findings, there are no hard and fast rules that you must adhere to, but there are definitely some recommended approaches for highlighting certain features. For instance, if you want to focus your audience’s attention on the descriptive statistics and outliers in your dataset, you might use something like a box plot, but if you want to show a relationship between 2 numerical attributes, you might use a scatter plot instead. Here are some more tips from Tableau.

Lab

Things I found pretty cool

  • Many ways to slice a list!
    arr = [0, 1, 2, 3, 4]
    print(arr[:-2])     # [0, 1, 2]
    print(arr[:-2:2])   # [0, 2]
    print(arr[-1:])     # [4]
    print(arr[-1])      # 4
    
  • List comprehensions (similar to chaining JavaScript’s Array.prototype.filter and Array.prototype.map)
    even = [i for i in range(10) if i % 2 == 0]
    print(even) # [0, 2, 4, 6, 8]
    
  • Lambda functions (anonymous functions, no need to explicitly write return)
    square = lambda x: x*x
    print(square(3)) # 9
    
  • Using .get(key, default) to access values in a dictionary instead of just dict[key], because it returns the default (None if you leave it out) for a missing key instead of raising KeyError.
  • zip() combines 2 or more iterables and yields tuples that pair up the elements at the same position.
    number_list = [1, 2, 3]
    str_list = ['one', 'two', 'three']
    result = zip(number_list, str_list)  # a lazy iterator of (number, string) tuples
    result_set = set(result)
    print(result_set)  # e.g. {(2, 'two'), (1, 'one'), (3, 'three')}, set order is arbitrary
    
  • Useful libraries for statistics - numpy, matplotlib.pyplot, scipy.stats & statsmodels
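The .get point in the list above is easier to see with a tiny example. This is just a sketch, and the `stock` dictionary is hypothetical:

```python
stock = {"apple": 3, "pear": 0}  # hypothetical dictionary for illustration

print(stock.get("apple", 0))   # 3
print(stock.get("banana", 0))  # 0, the key is missing so the default comes back
print(stock.get("banana"))     # None, the default when none is given

# Plain indexing raises KeyError for a missing key instead.
try:
    stock["banana"]
except KeyError as err:
    print("KeyError:", err)
```

Note that .get does not add the missing key to the dictionary; it only changes what you get back when looking it up.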

Proof of work

  • Naive prime number generation, running time = $O(n^2)$
    primes = [2]
    for i in range(3, 50, 2):
        isPrime = True
        for j in range(3, i, 2):
            if i % j == 0:
                isPrime = False
                break
        if isPrime:
            primes.append(i)
    print(primes)
    
  • is_prime function, running time = $O(\sqrt n)$
    def is_prime(n):
        if n == 2 or n == 3:
            return True
        if n < 2 or n % 2 == 0:
            return False
        # try odd divisors from 3 up to and including floor(sqrt(n))
        for i in range(3, int(n**0.5) + 1, 2):
            if n % i == 0:
                return False
        return True
    

Other useful resources


Last modified on 11 October 2020.