$\begingroup$

I have two sns.pairplot figures. How should I interpret them?

[Figure: first pair plot]

[Figure: second pair plot]

As I understand it, sns.pairplot shows the distribution of each feature on the diagonal and the pairwise relationships between features off the diagonal, so it is possible to identify in which space (pair of features) the classes will be well separated from each other. Looking at the pictures, I see very few dependencies in the first one, unlike the second. This is confirmed by the correlation matrix. If so, is it necessary to remove one feature from each dependent pair? Are my statements correct?

$\endgroup$

1 Answer

$\begingroup$

What is interesting about pair plots is that a single line of code shows the relationship between every possible pair of features. The direct drawbacks are that they often take quite a long time to run (though, since the grid is symmetric, we can plot only half of it and cut the run time almost in half) and that the individual plots are very small.

As a consequence, it is advisable to save the plot as an image file after the first run and then comment out the block of code or the function call that draws the pair plot. You can then open the image file and zoom/pan to study each plot.

For a classification model, using the hue parameter to differentiate the classes is also interesting, as some patterns can appear. As the plots are very crowded, I would try a lower alpha (more transparency) to better see where the points/samples are concentrated.

There are not always many interesting conclusions to draw from a pair plot, but it is nonetheless a necessary step in my EDA, because it shows every pairwise relationship at once.

As an example, let's take the top-left corner of the 2nd pair plot, which shows the relationships between D1, D3 and D4 (though we don't know the meaning of the features).

[Figure: top-left corner of the 2nd pair plot, showing D1, D3 and D4]

We can notice on this example that:

  • there is no clear separation between classes (it would be too easy!)
  • we don't see a clear functional relationship between variables (linear, square root, exponential, ...) of the kind that sometimes appears on pair plots.
  • looking at the distributions of the features (the plots on the diagonal), we see that they are right-skewed (tail on the right). So when doing feature engineering, I would add the log of these features to see if I get better results with them.
  • D3 <= D1, except when D1 = 0. Knowing what D1 and D3 are, does it make sense? Why this exception for D1 = 0? Are these points outliers to remove or correct?
  • on the plot between D4 and D3, the points are well spread, but there seems to be a concentration of points where D4 = D3. Why?
  • on the plot between D4 and D1, we see a few points with D4 < 0. Is that possible, or should we remove these few points? Also, it seems that for the minority class D4 doesn't go over 700. Is there a reason why this could happen? If so, can we use this information?
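Some of these observations can be checked with numbers before deciding what to do. The sketch below uses made-up data (the real D1, D3 and D4 are not available), counts the rows that would need an explanation, and compares skewness before and after a log transform. It uses np.log1p, i.e. log(1 + x), rather than a plain log so that zeros are handled; that choice is an assumption on my side.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed data standing in for the real features (assumption).
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "D1": rng.exponential(100.0, n),
    "D3": rng.exponential(50.0, n),
    "D4": rng.exponential(150.0, n) - 5.0,  # shifted so that a few values are negative
})

# Count the suspicious rows instead of silently dropping them.
d3_above_d1 = df["D3"] > df["D1"]
negative_d4 = df["D4"] < 0
print("rows with D3 > D1:", int(d3_above_d1.sum()))
print("rows with D4 < 0:", int(negative_d4.sum()))

# Add log-transformed copies of the non-negative features only.
for col in ["D1", "D3"]:
    df[f"log_{col}"] = np.log1p(df[col])

# Skewness near 0 means a roughly symmetric distribution.
print(df[["D1", "log_D1", "D3", "log_D3"]].skew())
```

Whether the flagged rows are errors or legitimate values depends on what the features mean, so the counts are a starting point for that question rather than a filter to apply blindly.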

I would also look in detail at the last row (assuming that the target is the last feature in the list given to the pair plot) to see the relationship between each feature and the target. This is more useful for a regression problem, though. Here, as we want to classify samples into 2 classes, boxplots would be more interesting than scatter plots for studying how the distribution of each feature differs between classes.
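A per-class boxplot for each feature could look like the sketch below. Again the data is made up, and the names D1, D3, D4, target and the output file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data standing in for the real features and a binary target (assumption).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "D1": rng.exponential(100.0, n),
    "D3": rng.exponential(50.0, n),
    "D4": rng.exponential(150.0, n),
    "target": rng.integers(0, 2, n),
})
features = ["D1", "D3", "D4"]

# One panel per feature, one box per class.
fig, axes = plt.subplots(1, len(features), figsize=(4 * len(features), 4))
for ax, col in zip(axes, features):
    sns.boxplot(data=df, x="target", y=col, ax=ax)
    ax.set_title(col)
fig.tight_layout()
fig.savefig("boxplots_by_class.png", dpi=150)

# The same comparison as numbers: summary statistics per class.
print(df.groupby("target")[features].describe().T)
```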

So, in a nutshell, pair plots can quickly bring out characteristics that would be very difficult to see by looking only at the numbers (like an exponential relationship or outliers). They also make us ask questions about what we see, and by looking for answers we gain a better understanding of the data, which can lead to interesting insights that help us build a better model.

$\endgroup$
