CST383: learning log 3

This week was about exploring data quickly and choosing the right visualization. Using df.info(), df.describe(), and df.nunique() helped me get a fast sense of a dataset, especially figuring out which variables are actually continuous vs categorical even when everything is stored as numbers.
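A minimal sketch of that first-look workflow, assuming a made-up DataFrame (the column names and the cutoff of 3 distinct values are illustrative choices, not from any course dataset). Every column is stored as a number, but df.nunique() hints at which ones behave like categories.

```python
import pandas as pd

# Hypothetical toy data: all columns are numeric, but not all are continuous.
df = pd.DataFrame({
    "height_cm": [162.5, 170.1, 155.3, 181.0, 175.4, 168.2],
    "num_doors": [2, 4, 4, 2, 4, 4],
    "region_code": [1, 3, 2, 1, 3, 2],
})

df.info()             # column dtypes and non-null counts
print(df.describe())  # count, mean, std, min, quartiles, max per column

unique_counts = df.nunique()  # number of distinct values per column
print(unique_counts)

# Assumed rule of thumb: a column with very few distinct values is a
# candidate for being categorical, even if its dtype is numeric.
threshold = 3
categorical_candidates = unique_counts[unique_counts <= threshold].index.tolist()
print(categorical_candidates)
```

The cutoff is a judgment call; on a real dataset it would depend on how many rows there are and what the columns mean.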

For single continuous variables, I got more comfortable with density plots, histograms, and box plots. Density plots are good for seeing the overall shape, but the bandwidth can really change how the plot looks, so it’s easy to over-interpret noise. To me, histograms are easier to read, but they depend a lot on the bin choices. Box plots are useful for medians, quartiles, and outliers, but they don’t show shape very well. The skew examples made it clear to me why log transforms are useful, even though reading log axes still takes practice.
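To make the skew and bin-choice points concrete, here is a small sketch on synthetic data (a lognormal sample generated here, not a dataset from the course). It compares skewness before and after a log transform and counts the same values with two different bin settings, without drawing any plots.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up right-skewed sample: lognormal values are always positive,
# so taking the log is safe here.
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_raw = values.skew()          # strongly positive for this sample
skew_log = np.log(values).skew()  # close to zero after the log transform
print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")

# Same data, two bin choices: the coarse version hides most of the shape.
counts_coarse, edges_coarse = np.histogram(values, bins=5)
counts_fine, edges_fine = np.histogram(values, bins=50)
print(counts_coarse)
print(counts_fine)
```

With only 5 bins almost every value lands in the first bin, while 50 bins show the long right tail, which is the bin sensitivity described above.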

With two continuous variables, things clicked more when we worked with joint and conditional probabilities directly in Pandas using boolean conditions and .mean(). Correlation also made more sense this week. I understand now that correlation measures linear relationships only, and zero correlation doesn’t mean two variables aren’t related at all.
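A short sketch of both ideas, using invented numbers (the hours and score columns and the cutoffs are hypothetical). The mean of a boolean Series is the fraction of True values, so it works as an empirical probability. The second part shows a pair of variables with essentially zero correlation even though one is an exact function of the other.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [50, 55, 52, 70, 65, 80, 85, 90],
})

long_study = df["hours"] > 4
high_score = df["score"] >= 70

# Joint probability: fraction of rows where both conditions hold.
p_joint = (long_study & high_score).mean()

# Conditional probability P(high_score | long_study):
# restrict to the rows where long_study is True, then take the mean.
p_cond = high_score[long_study].mean()

print(p_joint, p_cond)

# Correlation only captures linear association: y depends entirely on x,
# but the relationship is symmetric around zero, so r is about 0.
x = pd.Series(np.arange(-5, 6), dtype=float)
y = x ** 2
r = x.corr(y)
print(r)
```

The conditional value also equals p_joint divided by long_study.mean(), which is a quick way to check the boolean indexing.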

I’m still unsure about how to choose good or ideal parameters for the plots, like histogram bins or density bandwidth, without just guessing. Overall, I believe that this week helped me think more carefully about how visualization choices can change the story the data seems to tell.
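One possible starting point for that open question, offered as an assumption rather than something covered in the course: NumPy ships named bin rules such as "sturges" and "fd" (Freedman-Diaconis), and Silverman's rule of thumb gives a default bandwidth for a Gaussian kernel density. The sketch below applies them to a synthetic normal sample; they are reasonable defaults to adjust from, not ideal answers.

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)  # synthetic data

# Named bin rules built into NumPy; each returns the bin edges.
edges_sturges = np.histogram_bin_edges(sample, bins="sturges")
edges_fd = np.histogram_bin_edges(sample, bins="fd")
print(len(edges_sturges) - 1, "bins (Sturges)")
print(len(edges_fd) - 1, "bins (Freedman-Diaconis)")

# Silverman's rule of thumb for a Gaussian kernel bandwidth:
# 0.9 * min(std, IQR / 1.34) * n^(-1/5)
n = sample.size
std = sample.std(ddof=1)
q75, q25 = np.percentile(sample, [75, 25])
iqr = q75 - q25
silverman_bw = 0.9 * min(std, iqr / 1.34) * n ** (-1 / 5)
print(f"Silverman bandwidth: {silverman_bw:.3f}")
```

Trying a value somewhat smaller and somewhat larger than these defaults, and checking whether the main features of the plot survive, is a way to avoid pure guessing.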
