dplyr is a robust package that enables simple and efficient data manipulation. It is part of the larger tidyverse package collection, which aims to streamline data science work in R. In this blog post, we walk through the core functions offered by dplyr and demonstrate them with an example.
dplyr comes as part of the tidyverse package bundle, but you can also install it on its own:
# Install dplyr separately
install.packages("dplyr")
# Or install the entire tidyverse
install.packages("tidyverse")
Once installed, load dplyr into your R environment using:
library(dplyr)
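If you rerun a script often, you may not want to reinstall the package every time. The following is a minimal sketch, assuming a CRAN mirror is configured, that installs dplyr only when it is missing and then reports the loaded version:

```r
# Install dplyr only if it is not already available on this machine
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load the package and check which version is in use
library(dplyr)
dplyr_version <- packageVersion("dplyr")
print(dplyr_version)
```

requireNamespace() returns FALSE instead of raising an error when the package is absent, which is why it suits this kind of check.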
dplyr provides several functions that make data manipulation straightforward, and many of them closely resemble SQL syntax:
filter(): Extract subsets of rows from a data frame based on a specific condition.
select(): Choose specific columns from a data frame.
mutate(): Add new variables/columns or modify existing ones in a data frame.
summarise() or summarize(): Create summary statistics for different columns in a data frame.
arrange(): Rearrange rows in a data frame in ascending or descending order based on one or more columns.
group_by(): Group data by one or more variables, often used before summarise() to create summaries for different groups.
%>%: The pipe operator, used to chain multiple functions in a sequence, enhancing readability and compactness of the code.
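To illustrate how select(), group_by(), and summarise() fit together, here is a short sketch using R's built-in mtcars dataset. The dataset choice and the summary column names (mean_mpg, max_hp, n_cars) are assumptions made for illustration and do not come from the original post:

```r
library(dplyr)

# Summarise fuel economy and horsepower per cylinder count
summary_by_cyl <- mtcars %>%
  select(cyl, mpg, hp) %>%      # keep only the columns we need
  group_by(cyl) %>%             # one group per cylinder count
  summarise(
    mean_mpg = mean(mpg),       # average miles per gallon in the group
    max_hp   = max(hp),         # highest horsepower in the group
    n_cars   = n()              # number of cars in the group
  ) %>%
  arrange(desc(mean_mpg))       # most fuel-efficient group first

print(summary_by_cyl)
```

This reads much like a SQL query with SELECT, GROUP BY, and ORDER BY clauses, which is the similarity mentioned above.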
Let’s put dplyr into action with a practical example. Here, we create and manipulate a data frame using various dplyr functions:
library(dplyr)
# Create a sample data frame
data <- data.frame(
  ID = 1:5,
  Age = c(25, 30, 35, 40, 45),
  Score = c(90, 85, 88, 92, 89)
)
# Use dplyr to manipulate the data frame
result <- data %>%
  filter(Age > 30) %>%
  mutate(Score_Adjusted = Score + 5) %>%
  arrange(desc(Score_Adjusted))
The script stores the manipulated data frame in result. Printing result shows the following output:
  ID Age Score Score_Adjusted
1  4  40    92             97
2  5  45    89             94
3  3  35    88             93
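To see what each stage of the pipeline contributes, the same operations can be written as separate steps. This is a sketch only; the intermediate names older, adjusted, and sorted are hypothetical and chosen for illustration:

```r
library(dplyr)

# Same sample data frame as in the example above
data <- data.frame(
  ID = 1:5,
  Age = c(25, 30, 35, 40, 45),
  Score = c(90, 85, 88, 92, 89)
)

# Step 1: keep only the rows where Age is greater than 30
older <- filter(data, Age > 30)

# Step 2: add a column that raises each Score by 5
adjusted <- mutate(older, Score_Adjusted = Score + 5)

# Step 3: sort the rows from highest to lowest adjusted score
sorted <- arrange(adjusted, desc(Score_Adjusted))

print(sorted)
```

Each function takes a data frame as its first argument and returns a new one, which is what allows the pipe operator to chain them without intermediate variables.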
The pipeline works in three steps. First, it keeps only the rows where Age is greater than 30. Next, it creates a new column, Score_Adjusted, by adding 5 to each value in the Score column. Finally, it sorts the rows in descending order of the Score_Adjusted column, giving us the final output.
The dplyr package in R offers a compact set of functions that make data manipulation easy and efficient. Its integration with the tidyverse makes it a vital tool in the data scientist’s toolkit, facilitating faster and more enjoyable data science operations in R. Explore the package’s documentation and tutorials for more detailed information and examples, and embark on your data science journey with dplyr!