[Visualization] How to Use Seaborn

SweetDev 2022. 2. 3. 15:08

Intro

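You may wonder how Seaborn differs from Matplotlib.

The GeeksforGeeks article linked at the end of this post explains it well.

In my experience, matplotlib is the most basic library and lets you customize every detail, but there are too many things to take care of. Seaborn, in contrast, looks good by default and is very easy to use, but is harder to customize.

The linked article also states that seaborn uses more memory.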

Install and Import

! pip install seaborn
import seaborn as sns
import matplotlib.pyplot as plt  # needed for the plt.subplots / plt.show calls below

Plot types - API reference

Relational plot
- relplot, scatterplot, lineplot
Distribution
- displot, histplot, kdeplot, ecdfplot, rugplot, distplot (deprecated since v0.11; use displot or histplot)
Categorical
- catplot, stripplot, swarmplot, boxplot, violinplot, boxenplot, pointplot, barplot, countplot
Regression
- lmplot, regplot, residplot
Matrix
- heatmap, clustermap
Multiples
- FacetGrid, pairplot, PairGrid, jointplot, JointGrid
Style, Color
- set_theme, set_palette, ...

Categorical

Count Plot (bar chart)

Parameters

x
y
data
hue
- hue_order
palette
color
saturation
ax

Code

sns.countplot(x='gender',data=student,
              hue='race/ethnicity', 
              hue_order=sorted(student['race/ethnicity'].unique()),
              saturation=0.3
             )

Displot (distribution plot)

sns.displot(df, x="age", stat="density")
plt.show()

Box Plot

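When the data is not close to a normal distribution, it can be better to pick representative values in a different way. A quantile is a position value based on the sorted order of the data, and it is usually expressed as a percentile.

Quartiles : the observations that split the data into four equal parts
- min, 25% (lower quartile = Q1), 50% (median), 75% (upper quartile = Q3), max
- “maximum”: Q3 + 1.5*IQR, where IQR = Q3 - Q1
- “minimum”: Q1 - 1.5*IQR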
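
As a small worked example of the definitions above, the sketch below computes the quartiles, the IQR, and the two whisker bounds for a made-up list of scores (the numbers are hypothetical, not taken from the dataset used in this post). It uses only the standard library. Note that a box plot draws each whisker at the furthest observation that still lies inside these bounds, and anything outside is drawn as an outlier point.

```python
import statistics

# Hypothetical scores, chosen only to illustrate the arithmetic.
scores = [35, 48, 52, 55, 60, 61, 63, 67, 70, 72, 98]

# method="inclusive" interpolates linearly between observations,
# which matches the default percentile rule used by NumPy.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1                   # interquartile range
lower_bound = q1 - 1.5 * iqr    # the "minimum" of the box plot
upper_bound = q3 + 1.5 * iqr    # the "maximum" of the box plot

# Observations outside the bounds are the ones drawn as outliers.
outliers = [s for s in scores if s < lower_bound or s > upper_bound]

print(q1, median, q3)            # 53.5 61.0 68.5
print(lower_bound, upper_bound)  # 31.0 91.0
print(outliers)                  # [98]
```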

https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51

This was hard to understand, so I drew a normal distribution as a box plot.
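
The experiment can be reproduced roughly as follows. This is a minimal sketch rather than the original code: the sample size, the seed, and the variable names are assumptions. For a standard normal sample the bounds Q1 - 1.5*IQR and Q3 + 1.5*IQR land near ±2.7 standard deviations, so roughly 99% of the points fall inside the whiskers.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove it in a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Assumed setup: 10,000 draws from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Same bounds as in the quartile definitions above.
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Share of the sample that lies inside the whisker bounds.
inside = np.mean((sample >= lower) & (sample <= upper))

fig, ax = plt.subplots(1, 1, figsize=(10, 3))
sns.boxplot(x=sample, ax=ax)
ax.set_title(f"share inside the whisker bounds: {inside:.3f}")
# In a notebook (without the Agg line), call plt.show() here.
```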

Code

fig, ax = plt.subplots(1,1, figsize=(10, 5))

sns.boxplot(x='race/ethnicity', y='math score', data=student,
            hue='gender', 
            order=sorted(student['race/ethnicity'].unique()),
            width=0.3,
            linewidth=2,
            fliersize=10,
            ax=ax)

plt.show()

Violin Plot

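A box plot shows the representative values well, but it falls short of showing the actual distribution.

A violin plot is one of the representations better suited to conveying this extra information about the distribution.

Here the white dot marks the median (50%), and the thick black bar in the middle marks the IQR (the 25% to 75% range).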

fig, ax = plt.subplots(1,1, figsize=(12, 5))
sns.violinplot(x='math score', data=student, ax=ax)
plt.show()

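A violin plot is a representation of a distribution that can easily be misread.

The data is not continuous (the smooth curve comes from a kernel density estimate).
The continuous representation also introduces information loss and error.
The plot extends over ranges where no data exists.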

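The following parameters help reduce these misreadings and increase the amount of information shown.

bw : how finely the distribution is drawn (the KDE bandwidth)
- ‘scott’, ‘silverman’, float
cut : how far the tails extend past the extreme data points (0 trims them to the data range)
- float
inner : how the interior is drawn
- “box”, “quartile”, “point”, “stick”, None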
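
To see what these parameters change, the sketch below draws the same column twice: once with the defaults and once with cut=0 and inner="stick". The `student` DataFrame here is randomly generated stand-in data, not the dataset used in this post. With cut=0 the violin stops at the smallest and largest observation, so it no longer suggests values that were never measured, and the sticks show where the individual observations sit.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove it in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in data (assumption): 300 synthetic scores.
rng = np.random.default_rng(0)
student = pd.DataFrame({"math score": rng.normal(66, 15, 300)})

fig, axes = plt.subplots(2, 1, figsize=(12, 6))

# Default settings: the density is extended past the extreme data points.
sns.violinplot(x="math score", data=student, ax=axes[0])
axes[0].set_title("default")

# cut=0 trims the density to the observed range; inner="stick" draws one line per observation.
sns.violinplot(x="math score", data=student, cut=0, inner="stick", ax=axes[1])
axes[1].set_title('cut=0, inner="stick"')

fig.tight_layout()
# In a notebook (without the Agg line), call plt.show() here.
```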

joint plot

pairplot

Visualizes the relationship between every pair of features.

facet grid

Lets you look at relationships not only between features, but also across the categories within a feature!

https://www.geeksforgeeks.org/difference-between-matplotlib-vs-seaborn/
