HackerRank is an excellent website for solving prompt-based coding challenges, preparing for coding interviews, searching for jobs, and seeing how the community has approached solutions over time. The author wanted to dive into the Python-focused solutions and is in no way affiliated with HackerRank.


The Challenge: Mean, Median, Mode

From 10 Days of Statistics Day 0: Mean, Median, and Mode:

Output Format

Print 3 lines of output in the following order:

  • Print the mean on a new line, to a scale of 1 decimal place.
  • Print the median on a new line, to a scale of 1 decimal place.
  • Print the mode on a new line; if more than one such value exists, print the numerically smallest one.

Sample Input

10
64630 11735 14216 99233 14470 4978 73429 38120 51135 67060

Sample Output

43900.6
44627.5
4978

The top-voted Python 3 solution came out to be:

Python 3 - Dont reinvent the wheel ;)

import numpy as np
from scipy import stats

size = int(input())
numbers = list(map(int, input().split()))
print(np.mean(numbers))
print(np.median(numbers))
print(int(stats.mode(numbers)[0]))

To those who were introduced to Python through data science courses and tools, this may look like exactly the right solution. That only holds, though, if the project already includes the SciPy package.

Wait, Why Could This Be Bad Practice?

The scipy and numpy packages are third-party libraries, and they would have to be added to a requirements.txt, setup.py, Pipfile, or other dependency configuration in order to make use of them in a project. This adds complexity by piling onto the software supply chain.1

Installing scipy also installs numpy as a dependency.

In 2019, numpy had an Arbitrary Code Execution (ACE) vulnerability raised over numpy.load unpickling data by default, a default that has since been changed. The pickle module is known for this risk, and the Python docs carry a big red warning about it.2

Using these third-party packages is overkill for a project that doesn’t already contain them, unless you really want to follow long GitHub issue threads and Common Vulnerabilities and Exposures (CVE) database entries (such as CVE-2019-6446 in this case), trying to work out how big a problem something is, or whether it is a problem at all.

Using Standard Libraries

How can we solve this problem with standard libraries that come with Python?

# With standard lib imports only
from statistics import mean, median

def basicstats(numbers):
    print(round(mean(numbers),1))
    print(median(numbers))
    print(max(sorted(numbers), key=numbers.count))

input() # Don't need array length, so ignore input
numbers = list(map(float, input().split()))
basicstats(numbers)
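One way to check this solution against the sample input without typing at a prompt is to have the function return its results instead of printing them. The following is only a sketch: basic_stats_values is a hypothetical name introduced here, and the sample string stands in for the second input() call.

```python
from statistics import mean, median

def basic_stats_values(numbers):
    """Return (mean, median, mode) instead of printing them."""
    return (
        round(mean(numbers), 1),
        median(numbers),
        max(sorted(numbers), key=numbers.count),
    )

# The sample input from the challenge, parsed the same way as above.
sample = "64630 11735 14216 99233 14470 4978 73429 38120 51135 67060"
numbers = list(map(float, sample.split()))

result = basic_stats_values(numbers)
print(result)  # (43900.6, 44627.5, 4978.0)
```

The mode comes back as 4978.0 rather than 4978 because the input was parsed with float; whether that difference matters depends on how strictly the output is compared.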

Detailed Code Breakdown

from statistics import mean, median
  • statistics has been included with Python 3 since Python 3.4 (released in 2014).
  • We only want mean and median from this library, so we are explicitly importing each rather than importing the entire library.
  • Why aren’t we using mode from statistics? In Python versions before 3.8, mode raises an error when the data has more than one mode: "…if there is not exactly one most common value, StatisticsError is raised."3 From Python 3.8 onward it returns the first mode encountered instead, which is still not guaranteed to be the smallest.
    • This is a problem because of the challenge's last requirement for mode output: "…if more than one such [mode] value exists, print the numerically smallest one."
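On Python 3.8 or later, statistics.multimode offers another route to the same requirement: it returns every value tied for most common, so the smallest can be picked with min(). A minimal sketch, assuming Python 3.8+; smallest_mode is a hypothetical helper name and is not part of the original solution:

```python
from statistics import multimode

def smallest_mode(numbers):
    """Return the numerically smallest of the most common values."""
    # multimode() returns a list of all values tied for the highest count.
    # For empty input it returns [], so min() would raise ValueError here.
    return min(multimode(numbers))

print(smallest_mode([3, 1, 3, 1, 2]))  # 1 and 3 are tied, so this prints 1
```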
input() # Don't need array length, so ignore input
numbers = list(map(float, input().split()))
  • We do nothing with the first input(), which is meant to be a count of numbers being input in the second prompt. This is dropped because it is not needed in order to produce the mean, median, and mode output.
  • For numbers, let’s start from the inside-most parentheses and move outward:
    • input().split() breaks apart the single-string input into a list of strings, as split() defaults to whitespace as the sep delimiter: “If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].”4
    • map(float, input().split()): Here, map() is being used to convert the resulting list of strings into float type values.
    • list(map(...)): We need to convert the map back into a list because map() returns an iterator, which can only be iterated over once. If all we wanted was the median, for example, we wouldn’t need the conversion, because we would not need the values again after the median is returned.
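The parsing steps above can be tried in isolation. In this sketch the string raw is a made-up stand-in for what input() would return, with a double space added on purpose to show how split() treats runs of whitespace:

```python
raw = "64630 11735  14216"          # stand-in for input(); note the double space

tokens = raw.split()                # no sep given: whitespace runs act as one separator
print(tokens)                       # ['64630', '11735', '14216']

lazy = map(float, tokens)           # an iterator; nothing has been converted yet
first_pass = sum(lazy)              # consuming it once works as expected
second_pass = list(lazy)            # the iterator is now exhausted
print(first_pass, second_pass)      # 90581.0 []

numbers = list(map(float, tokens))  # a list can be traversed as often as needed
print(len(numbers), max(numbers))   # 3 64630.0
```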

NOTE: Instead of list(map(...)), we could use a list comprehension5 like so:

numbers = [float(number) for number in input().split()]

This is argued to be the better approach on StackOverflow,6 and for an interesting historical side note, you can read about how map() was nearly removed from Python 3 at one point.7

After we have our list of floats, basicstats(numbers) is called, running the following:

def basicstats(numbers):
    print(round(mean(numbers),1))
    print(median(numbers))
    print(max(sorted(numbers), key=numbers.count))
  • print(round(mean(numbers), 1)): Start from the inside-most parentheses and move outward to see what we are printing:
    • mean(numbers): Simply returns the mean without a third-party package!
    • round(mean(numbers), 1): Rounds the resulting float to one digit after the decimal point (per requirements).
  • print(median(numbers)): Simply prints the median without a third-party package!
  • print(max(sorted(numbers), key=numbers.count)): How is this providing the mode?
    • sorted(numbers): First, we need the list sorted, as we are meant to return only the lowest-value mode if there is more than one. This is needed for max(...) to return the lowest value we want.
    • max(sorted(numbers), key=numbers.count): Providing key=numbers.count as an argument ensures we get the value with the highest count within the list. max() returns only a single value, and when several values tie it returns the first one it encounters, which is the lowest because we passed in sorted(numbers).
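The effect of sorting on the tie-break can be seen with a small made-up list in which 1.0 and 3.0 both appear twice (the values are illustrative, not from the challenge):

```python
numbers = [3.0, 1.0, 3.0, 1.0, 2.0]

# Unsorted: max() keeps the first element that reaches the highest count.
unsorted_pick = max(numbers, key=numbers.count)
print(unsorted_pick)  # 3.0

# Sorted: the smallest tied value is encountered first, so it wins.
sorted_pick = max(sorted(numbers), key=numbers.count)
print(sorted_pick)    # 1.0
```

Note that numbers.count scans the whole list each time it is called, and max() calls it once per element, so this approach does quadratic work as the input grows.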

Optional Approach to Retrieving Mode: Using Counter()

Instead of max(), we could alternatively use Counter()8 from collections, which is argued to be a better approach to this problem.9 Counter() was added to the collections module back in Python 2.7.0 (released in 2010):

# With standard lib imports only
from statistics import mean, median
from collections import Counter

def basicstats(numbers):
    print(round(mean(numbers),1))
    print(median(numbers))
    # Optional approach to 'mode'
    print(Counter(sorted(numbers)).most_common(1)[0][0])

input() # Don't need array length, so ignore input
numbers = list(map(float, input().split()))
basicstats(numbers)
  • Counter(sorted(numbers)).most_common(1)[0][0], working from the inside out:
    • sorted(numbers) is needed for the later call to most_common() to return the lowest mode.
    • Counter(...): Creates a dictionary-like object mapping each element in the list to its count.
    • Counter(...).most_common(1): Returns a list of tuples. Using 1 as an argument means it returns only one tuple, for the first value that appears most often.
    • Counter(...).most_common(1)[0][0]: The first [0] selects the tuple at index 0 of the list, and the second [0] selects the value at index 0 of that tuple.
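Each step of that chain can be inspected on its own. This sketch uses a small made-up list with a tie between 1.0 and 3.0, and it assumes Python 3.7 or later, where Counter remembers insertion order and most_common() lists equal counts in the order first encountered:

```python
from collections import Counter

numbers = [3.0, 1.0, 3.0, 1.0, 2.0]

counts = Counter(sorted(numbers))  # counts are built in ascending value order
ranked = counts.most_common()      # every (value, count) pair, highest count first
print(ranked)                      # [(1.0, 2), (3.0, 2), (2.0, 1)]

top = counts.most_common(1)        # a list holding only the leading pair
print(top)                         # [(1.0, 2)]

mode = top[0][0]                   # first tuple in the list, then its value
print(mode)                        # 1.0
```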

Conclusion

There are many ways to come to a solution, and depending on the situation, some are better than others. If packages like scipy and/or numpy are already included within a project, it certainly makes sense to use them.

Still, it is a great idea to check whether built-in or standard libraries can solve a problem before reaching for third-party solutions. This helps you:

  • Learn what Python is capable of out-of-the-box
  • Make your code more portable for use in other projects without installing additional resources
  • Reduce the security complexity of the software supply chain1 by avoiding unnecessary inclusion of third-party packages
