
NumPy Array Manipulation: Indexing, Slicing, Reshaping, Joining, and Splitting

Hey, it's me, Eshan. I just love tech.

In our previous deep-dive, we explored the hidden memory costs of standard Python lists and learned how to generate lightning-fast, fixed-type NumPy arrays from scratch.

But generating data is only the first step. Data manipulation in Python is virtually synonymous with NumPy array manipulation: even newer, incredibly popular tools like Pandas are built directly on top of the NumPy array.

Whether you are cropping a bounding box out of an image for Computer Vision, appending a new column of features to a dataset, or splitting your data into training and testing sets for a Deep Learning neural network, you will be relying on these foundational array manipulations.

In this comprehensive guide, we will cover six core categories of array operations:

  1. Attributes of Arrays: Determining size, shape, memory consumption, and data types.

  2. Indexing of Arrays: Getting and setting the value of individual array elements.

  3. Slicing of Arrays: Getting and setting smaller subarrays within a larger array.

  4. Reshaping of Arrays: Changing the dimensional structure of an array.

  5. Joining Arrays: Combining multiple distinct arrays into a single structure.

  6. Splitting Arrays: Breaking a single array down into multiple smaller arrays.

Let's begin by generating some sample data.


1. NumPy Array Attributes: Inspecting Your Data

Before we manipulate arrays, we need to generate a few standard multi-dimensional arrays. We will use NumPy's random number generator.

Pro-Tip: The Random Seed. Whenever you generate random data for machine learning, always set a seed. This ensures that the pseudo-random number generator produces the exact same "random" arrays every time the code is run, which is critical for reproducibility when debugging models.

import numpy as np

# Seed the generator for reproducibility
np.random.seed(0) 

# Generate three different arrays
x1 = np.random.randint(10, size=6)           # 1D array (Vector)
x2 = np.random.randint(10, size=(3, 4))      # 2D array (Matrix)
x3 = np.random.randint(10, size=(3, 4, 5))   # 3D array (Tensor/Volume)

Every NumPy array comes with built-in attributes that allow you to instantly inspect its structure.

Dimensional Attributes

  • ndim: The number of dimensions (axes).

  • shape: A tuple representing the exact size of each dimension.

  • size: The total number of individual elements across the entire array.

print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

# Output:
# x3 ndim:  3
# x3 shape: (3, 4, 5)
# x3 size:  60

Memory Attributes

Knowing exactly how much RAM your dataset consumes is a vital skill. NumPy provides instant access to this metadata:

  • dtype: The exact data type of the elements (e.g., int64).

  • itemsize: The size (in bytes) of a single array element.

  • nbytes: The total size (in bytes) of the entire array.

print("dtype:", x3.dtype)
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

# Output:
# dtype: int64
# itemsize: 8 bytes
# nbytes: 480 bytes

Mathematical check: nbytes is exactly equal to itemsize multiplied by size (8 x 60 = 480).
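You can verify this relationship directly in code. One caveat: the default integer dtype is platform-dependent (for example, it is 32-bit on some Windows builds), so the itemsize of 8 bytes is an assumption here rather than a guarantee.

```python
import numpy as np

np.random.seed(0)
x3 = np.random.randint(10, size=(3, 4, 5))

# The total byte count is always itemsize * size,
# whatever the platform's default integer width happens to be.
print(x3.size)                      # 60
print(x3.itemsize * x3.size)        # 480 when itemsize is 8 bytes
print(x3.nbytes == x3.itemsize * x3.size)  # True
```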


2. Array Indexing: Accessing Single Elements

If you are familiar with standard Python list indexing, NumPy's 1D indexing will feel entirely natural. It uses a zero-based index system.

One-Dimensional Indexing

# Our array: [5, 0, 3, 3, 7, 9]
print(x1[0])  # Output: 5 (The first element)
print(x1[4])  # Output: 7 (The fifth element)

You can also use negative indices to count backward from the end of the array. This is incredibly useful in time-series data when you want the "most recent" entry.

print(x1[-1]) # Output: 9 (The last element)
print(x1[-2]) # Output: 7 (The second to last element)

Multi-Dimensional Indexing (The NumPy Way)

This is where NumPy diverges from standard Python. If you have a list of lists in Python, accessing a nested element requires chaining brackets: my_list[0][1].

NumPy arrays use a much cleaner comma-separated tuple of indices.

# Our 2D array (x2):
# [[3, 5, 2, 4],
#  [7, 6, 8, 8],
#  [1, 6, 7, 7]]

print(x2[0, 0])  # Output: 3 (Row 0, Column 0)
print(x2[2, 0])  # Output: 1 (Row 2, Column 0)
print(x2[2, -1]) # Output: 7 (Row 2, Last Column)

Modifying Values and The Silent Truncation Pitfall

You can use standard index notation to overwrite elements.

x2[0, 0] = 12

⚠️ DANGER: The Fixed-Type Truncation Trap. Unlike Python lists, NumPy arrays have a fixed data type. If you try to insert a floating-point value into an integer array, NumPy will silently truncate the decimal without throwing an error or warning.

# x1 is an integer array
x1[0] = 3.14159  

print(x1)
# Output: [3, 0, 3, 3, 7, 9]

Notice that 3.14159 became 3. If you do not monitor your dtypes, this silent truncation can completely ruin mathematical accuracy in a machine learning model!
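One simple defense is to convert the array to a floating-point dtype with astype() before assigning decimal values. Note that astype() returns a new copy, so the original integer array is left untouched. A minimal sketch:

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])   # integer dtype

# astype returns a new floating-point copy,
# so assignments now keep their decimals
x1_float = x1.astype(np.float64)
x1_float[0] = 3.14159

print(x1_float[0])  # 3.14159
print(x1[0])        # 5 -- the original integer array is unchanged
```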


3. Array Slicing: Accessing Subarrays

To access an entire sub-section of an array, we use slice notation, marked by the colon (:) character. The syntax universally follows this pattern:

x[start:stop:step]

If any of these are unspecified, they default to start=0, stop=size of dimension, and step=1.

One-Dimensional Subarrays

x = np.arange(10)
# Array: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(x[:5])   # First five elements: [0, 1, 2, 3, 4]
print(x[5:])   # Elements after index 5: [5, 6, 7, 8, 9]
print(x[4:7])  # Middle subarray: [4, 5, 6]
print(x[::2])  # Every other element (step by 2): [0, 2, 4, 6, 8]
print(x[1::2]) # Every other element, starting at index 1: [1, 3, 5, 7, 9]

Reversing an Array: A highly elegant trick in Python/NumPy is using a negative step value. When the step is negative, the defaults for start and stop are swapped, giving you a perfectly reversed array instantly.

print(x[::-1])  # All elements, reversed: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Multi-Dimensional Subarrays

Multi-dimensional slices follow the exact same logic, simply separated by commas.

# First two rows, first three columns
print(x2[:2, :3])
# Output:
# [[12,  5,  2],
#  [ 7,  6,  8]]

# All rows, every other column
print(x2[:3, ::2])
# Output:
# [[12,  2],
#  [ 7,  8],
#  [ 1,  7]]

# Reversing an entire 2D matrix (both rows and columns reversed)
print(x2[::-1, ::-1])

The Power of No-Copy Views

In standard Python lists, slicing creates a copy of the data: if you modify the slice, the original list remains untouched. NumPy array slices instead return views rather than copies. When you extract a subarray, you are looking at the exact same physical memory buffer through a smaller window, so modifying the slice modifies the original dataset! This is incredibly efficient for processing massive datasets in place without exhausting your RAM.

(If you explicitly need an isolated copy, use the .copy() method: x2[:2, :2].copy())
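Here is a quick demonstration of the difference between a view and an explicit copy:

```python
import numpy as np

grid = np.arange(1, 10).reshape((3, 3))

view = grid[:2, :2]       # a view: shares memory with grid
view[0, 0] = 99
print(grid[0, 0])         # 99 -- modifying the view changed the original

isolated = grid[:2, :2].copy()  # an independent copy
isolated[0, 0] = -1
print(grid[0, 0])         # still 99 -- the copy is isolated
```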


4. Reshaping Arrays

In machine learning, algorithms are incredibly strict about the dimensional shape of the data they receive. For example, Scikit-Learn expects a 2D matrix of features (samples, features), even if you only have one feature.

The most flexible way to alter dimensional structure is the reshape() method.

# Put the numbers 1 through 9 into a 3x3 grid
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
# Output:
# [[1, 2, 3],
#  [4, 5, 6],
#  [7, 8, 9]]

Note: For reshape to work, the initial size must exactly match the reshaped size (9 = 3 x 3).
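Relatedly, you can pass -1 for one dimension and let NumPy infer it from the total size, a handy shortcut when you only care about one dimension:

```python
import numpy as np

x = np.arange(12)

# Pass -1 for one dimension and NumPy computes it from the total size
print(x.reshape((3, -1)).shape)  # (3, 4)
print(x.reshape((-1, 2)).shape)  # (6, 2)

# Only one dimension may be -1, and the total size must still divide evenly
```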

1D to 2D Conversion (Row and Column Vectors)

Converting a flat 1D array into a 2D row or column vector is a daily task in data engineering. You can use reshape(), or the visually explicit np.newaxis keyword.

x = np.array([1, 2, 3]) # Currently a 1D array of shape (3,)

# Convert to a 1x3 Row Vector 
x[np.newaxis, :]
# Output: array([[1, 2, 3]])

# Convert to a 3x1 Column Vector 
x[:, np.newaxis]
# Output: 
# array([[1],
#        [2],
#        [3]])

5. Joining Arrays: Concatenation and Stacking

Often, you will have multiple datasets that you need to merge. For instance, combining data from two different sensors, or adding a new column of engineered features to an existing matrix.

np.concatenate

The most basic joining routine is np.concatenate. It takes a tuple or list of arrays as its first argument.

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

# Joining two 1D arrays
np.concatenate([x, y])
# Output: array([1, 2, 3, 3, 2, 1])

# You can join more than two at once!
z = [99, 99, 99]
np.concatenate([x, y, z])
# Output: array([ 1,  2,  3,  3,  2,  1, 99, 99, 99])

When concatenating 2D arrays, you must pay attention to the axis parameter.

  • axis=0 (the default) stacks them vertically (adding rows).

  • axis=1 stacks them horizontally (adding columns).

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# Concatenate along the first axis (axis=0, vertical)
np.concatenate([grid, grid])
# Output:
# [[1, 2, 3],
#  [4, 5, 6],
#  [1, 2, 3],
#  [4, 5, 6]]

# Concatenate along the second axis (axis=1, horizontal)
np.concatenate([grid, grid], axis=1)
# Output:
# [[1, 2, 3, 1, 2, 3],
#  [4, 5, 6, 4, 5, 6]]

Stacking with Mixed Dimensions (vstack and hstack)

np.concatenate can be strict and confusing when you are trying to combine arrays of different dimensions (like putting a 1D array on top of a 2D matrix). For these tasks, it is vastly cleaner to use np.vstack (vertical stack) and np.hstack (horizontal stack).

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# Vertically stack a 1D array onto a 2D grid
np.vstack([x, grid])
# Output:
# [[1, 2, 3],
#  [9, 8, 7],
#  [6, 5, 4]]

# Horizontally stack a column vector to a 2D grid
y = np.array([[99],
              [99]])
np.hstack([grid, y])
# Output:
# [[ 9,  8,  7, 99],
#  [ 6,  5,  4, 99]]

(There is also np.dstack which stacks arrays along the third axis, representing depth.)
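As a quick sketch of how np.dstack behaves: stacking two 2x2 matrices produces a 2x2x2 volume in which matching elements are paired along the new depth axis.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

stacked = np.dstack([a, b])  # stacks along a new third (depth) axis
print(stacked.shape)         # (2, 2, 2)
print(stacked[0, 0])         # [1 5] -- a[0,0] and b[0,0] paired along depth
```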


6. Splitting Arrays

The exact opposite of concatenation is splitting. In Machine Learning, this is the fundamental operation used to break a massive dataset into a "Training Set" and a "Testing Set", or to separate your Features (X) from your Target Labels (y).

The routines are np.split, np.hsplit (horizontal), and np.vsplit (vertical).

Instead of telling NumPy how many arrays you want, you pass a list of indices representing the split points.

The Golden Rule of Splitting: N split points will always lead to N + 1 subarrays.

x = [1, 2, 3, 99, 99, 3, 2, 1]

# We pass two split points (index 3 and index 5).
# This results in 3 separate arrays.
x1, x2, x3 = np.split(x, [3, 5])

print(x1) # Elements up to index 3 (not inclusive): [1, 2, 3]
print(x2) # Elements from index 3 up to index 5:    [99, 99]
print(x3) # Elements from index 5 to the end:       [3, 2, 1]
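As a side note, you can also pass a single integer instead of a list of split points to request that many equal-length pieces. np.split insists the array divide evenly; np.array_split relaxes that rule and allows unequal pieces.

```python
import numpy as np

x = np.arange(8)

# An integer asks for N equal-length pieces (length must divide evenly)
parts = np.split(x, 4)
print(parts)  # [array([0, 1]), array([2, 3]), array([4, 5]), array([6, 7])]

# np.array_split tolerates a remainder, producing unequal pieces
loose = np.array_split(np.arange(7), 3)
print([len(p) for p in loose])  # [3, 2, 2]
```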

Splitting Multi-Dimensional Grids

The specialized directional splitters (vsplit and hsplit) are perfect for 2D matrices.

grid = np.arange(16).reshape((4, 4))
# grid is:
# [[ 0,  1,  2,  3],
#  [ 4,  5,  6,  7],
#  [ 8,  9, 10, 11],
#  [12, 13, 14, 15]]

# Split vertically after the 2nd row (index 2)
upper, lower = np.vsplit(grid, [2])
print(upper)
# [[0 1 2 3]
#  [4 5 6 7]]

print(lower)
# [[ 8  9 10 11]
#  [12 13 14 15]]


# Split horizontally after the 2nd column (index 2)
left, right = np.hsplit(grid, [2])
print(left)
# [[ 0  1]
#  [ 4  5]
#  [ 8  9]
#  [12 13]]

print(right)
# [[ 2  3]
#  [ 6  7]
#  [10 11]
#  [14 15]]

(Similarly, np.dsplit will split 3D arrays along the third depth axis).
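A minimal np.dsplit sketch: splitting a 2x3x4 volume at depth index 2 yields two 2x3x2 halves.

```python
import numpy as np

volume = np.arange(24).reshape((2, 3, 4))

# Split along the third (depth) axis at index 2
front, back = np.dsplit(volume, [2])
print(front.shape)  # (2, 3, 2)
print(back.shape)   # (2, 3, 2)
```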


Free Resources to Dive Deeper

Mastering these manipulations takes practice. If you want to test these exact functions and read more about the computer science behind them, check out these free resources:



Data Science

Part 5 of 6

Learn data science through practical, beginner-friendly posts covering Python, NumPy, pandas, Matplotlib, data cleaning, analysis, visualization, and essential workflows. This series is designed to help you understand how raw data becomes meaningful insight.
