Suppose we have a collection of y articles (say y = 100), of which x are bad ones. For example, out of the first 100 notes and essays (y) at this website, 99 are duds; the remaining one is the default Hello World post. In the next 100, the number of duds decreases to, say, 98, with the additional good one written by a guest writer, and so on.
Likewise, if we check 100 essays of another website and count the duds, it could be a different number. If it is Notebooks, it could be as small as 1 or 2. After scrutinizing enough websites, we can hopefully conclude that the number of duds (x) is, in principle, unpredictable. It is a random variable. A discrete one at that, as we can safely assume y and x to be integers (i.e. no dud half essays).
Further, once we perform this utterly useless operation for many websites, we could even be in a position to associate a probability with each discrete value of x (the number of dud essays) out of y = 100 essays of a randomly selected website.
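As a minimal sketch of how such probabilities could be estimated, the Python snippet below turns a list of dud counts into relative frequencies. The survey numbers and the function name `empirical_pmf` are made up for illustration; they are not measurements of any real website.

```python
from collections import Counter

def empirical_pmf(dud_counts):
    """Estimate P(x = k) as the fraction of surveyed websites with exactly k duds."""
    n = len(dud_counts)
    counts = Counter(dud_counts)
    return {k: c / n for k, c in sorted(counts.items())}

# Made-up survey: duds found among 100 essays at each of 10 hypothetical websites.
survey = [99, 98, 2, 1, 50, 50, 98, 2, 50, 99]
pmf = empirical_pmf(survey)

for k, p in pmf.items():
    print(f"P(x = {k}) is estimated as {p}")
```

With only ten websites the estimates are crude; the more websites we check, the more trustworthy these relative frequencies become.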
We can observe a discrete random variable x to take, say, s different discrete values $x_1, x_2, \ldots, x_s$, each of which can have a different probability $p_1, p_2, \ldots, p_s$. The sum of all the probabilities is the probability that a trial (checking 100 essays at a website) will yield one of the s possible values (i.e. x will be one of $x_1, x_2, \ldots, x_s$). In our example, x can be any integer from 0 to 100, so s can be at most 101. Since every trial must yield one of these values, the sum of the probabilities is one: $\sum_{i=1}^{s} p_i = 1$.
That is, if we check 100 essays at a random website, the number of dud essays can be anywhere between 0 and 100. To describe the behavior of the random variable, it is often sufficient to know two of its properties: its expectation and its variance.
The expectation, or expected value, of the random variable is the average of the values it assumes over a large number of trials. It can be found as the sum of the products of each possible value $x_i$ and its probability of occurrence $p_i$:
$$E(x) = \sum_{i=1}^{s} x_i \, p_i$$
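The probability-weighted sum can be sketched in a few lines of Python. The distribution below is a scaled-down, hypothetical one (a handful of dud counts with invented probabilities), and the function name `expectation` is my own. The simulation at the end illustrates the "large number of trials" reading: the sample average should land close to the computed value.

```python
import random

def expectation(pmf):
    """Return the sum of value * probability over all possible values."""
    return sum(x * p for x, p in pmf.items())

# Hypothetical distribution over a few dud counts (probabilities sum to 1).
pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}
expected = expectation(pmf)  # 0*0.1 + 1*0.2 + 2*0.4 + 3*0.3, about 1.9

# Simulate many trials drawn from the same distribution.
random.seed(0)  # fixed seed so the run is repeatable
trials = random.choices(list(pmf), weights=list(pmf.values()), k=100_000)
sample_mean = sum(trials) / len(trials)

print(expected, sample_mean)
```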
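We can also ask, for a particular trial, how much the discrete random variable x would scatter or deviate from its expected value E(x). That is, we can ask for the expected value of (x - E(x)). Unfortunately, this always turns out to be zero, because the positive and negative deviations cancel. Writing $p_i$ for the probability of the value $x_i$ and using $\sum_{i=1}^{s} p_i = 1$:
$$E(x - E(x)) = \sum_{i=1}^{s} (x_i - E(x)) \, p_i = \sum_{i=1}^{s} x_i \, p_i - E(x) \sum_{i=1}^{s} p_i = E(x) - E(x) = 0$$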
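A quick numerical check of this cancellation, using a hypothetical distribution with invented probabilities:

```python
# Hypothetical distribution over a few dud counts (probabilities sum to 1).
pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

# Expected value: probability-weighted sum of the values.
mu = sum(x * p for x, p in pmf.items())

# Expected deviation from the mean: the weighted deviations cancel out.
mean_deviation = sum((x - mu) * p for x, p in pmf.items())

print(mean_deviation)  # zero, up to floating-point rounding
```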
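So we usually take the expected value of the square of the deviation rather than the deviation itself, and define it as the variance V, where $p_i$ is the probability of the value $x_i$:
$$V = E\big((x - E(x))^2\big) = \sum_{i=1}^{s} (x_i - E(x))^2 \, p_i$$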
And the square root of the variance V is called the standard deviation of a random variable.
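The variance and standard deviation can be computed the same way. This is only a sketch on a hypothetical distribution; the helper names `expectation` and `variance` are illustrative.

```python
import math

def expectation(pmf):
    """Probability-weighted sum of the values."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Expected squared deviation from the expected value."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Hypothetical distribution over a few dud counts (probabilities sum to 1).
pmf = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

v = variance(pmf)          # about 0.89
std_dev = math.sqrt(v)     # about 0.943

print(v, std_dev)
```

Squaring keeps every deviation non-negative, so nothing cancels; the square root then brings the result back to the same unit as x (number of essays).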
Having performed this experiment many times, we can use the above characteristics to describe how the discrete random variable x is likely to behave in future. That is, we can estimate, for a future trial of checking 100 essays at a random website, how many of them could be duds.
For instance, if our website experiment is performed on websites originating from two different continents, it is possible that we could observe two entirely different probability distributions (of the number of dud essays in each 100 essays), each with its own E and V values. This would suggest that the websites of one continent are superior in some way to those of the other.
On the other hand, from my experience of reading websites for the past two years, my conjecture is that if we check only two continents, we could end up with two different probability distributions (of the number of dud essays in every 100 essays at each website), but we would arrive at a single E value and just a handful of variances, depending on the nature of the website content (science, politics, celebrity life, etc.).
Of course, all of this depends on the basis on which one classifies an essay as a dud. For instance, what about this one?