Tech

Neural representation of action symbols in primate frontal cortex

flipbeans

Posted on

May 21, 2026

Subjects and surgical procedures

Data were acquired from two adult male macaques (Macaca mulatta, average weight of 17 kg (subject 1 (S1)) and 10 kg (S2), average age of 9 years (S1) and 7 years (S2)). This sample size was chosen to match the standard for neural recording studies of behaviour in monkeys^22,24. All animal procedures complied with the NIH Guide for the Care and Use of Laboratory Animals and were approved by the Institutional Animal Care and Use Committee of the Rockefeller University (protocol 24066-H).

After undergoing initial task training in their home cages, the subjects underwent two surgeries: the first to implant an acrylic head implant with a headpost and the second to implant electrode arrays. Both surgeries followed standard protocols, including for anaesthetic, aseptic and postoperative treatments. In the first surgery, a custom-designed MR-compatible Ultem headpost was implanted, surrounded by a bone cement cranial implant, or a headcap (Metabond, Parkell and Palacos, Heraeus), which was secured to the skull using MR-compatible ceramic screws (Rogue Research). After a 6-month interval, to allow bone to grow around the screws and for the subject to acclimate to performing the task during head fixation via the headpost, we performed a second surgery to implant 16 floating microelectrode arrays (32-channel FMA, Microprobes for Life Science) using standard procedures⁶⁹. In brief, after performing a craniotomy and durotomy over the target area, arrays were inserted one by one stereotactically while held at the end of a stereotaxic arm with a vacuum suction attachment (Microprobes). Using vacuum suction enabled us to release the arrays, after insertion, with minimal mechanical perturbation by turning off the suction. After all arrays had been implanted, the dura mater was loosely sutured and covered with DuraGen (Integra LifeSciences). The craniotomy was closed with bone cement.

We used standard density arrays (1.8 mm × 4 mm) for all areas, except SMA and preSMA, for which we used four high-density arrays (1.6 mm × 2.95 mm). Four additional electrodes on each array served as the reference (two electrodes) and the ground (two electrodes). Two arrays each were targeted to multiple areas of the frontal cortex, with locations identified stereotactically, and planned using brain surface reconstructions derived from anatomical MRI scans (3D Slicer 5.6.2). Locations were selected on the basis of their published functional and anatomical properties (see below), anatomical sulcal landmarks and a standard macaque brain atlas⁷⁰. During surgery, locations were further adjusted on the basis of cortical landmarks and to avoid visible blood vessels. Arrays were implanted in the right hemisphere (contralateral to the arm used for behaviour).

Array locations are depicted in Fig. 3 and Extended Data Fig. 6 and confirmed with intraoperative photographs. For M1, we targeted hand and arm representations (F1) directly medial to the bend of the central sulcus (which corresponds roughly to the intersection of the central sulcus and the arcuate spur if the latter was extended caudally), based on retrograde labelling from the spinal cord and microstimulation of M1⁷¹ and M1 recordings⁷². For PMd, we placed both arrays lateral to the precentral dimple, with one (more caudal) array directly medial to the arcuate spur (the arm representation^71,72,73, F2), and the other more rostral (straddling F2 and F7). For PMv, we targeted areas caudal to the inferior arm of the arcuate sulcus (F5), which are associated with hand movements based on retrograde labelling from the spinal cord⁷¹ and M1⁷⁴, microstimulation⁷⁵ and functional studies^51,76,77 and with decision making^77,78. These areas contain neurons interconnected with PFC⁷⁴. For SMA (F3) and preSMA (F6), we targeted the medial wall of the hemisphere, with the boundary between SMA and preSMA defined as the anterior–posterior location of the genu of the arcuate sulcus, consistent with previous studies finding differences across this boundary in anatomical connectivity (for example, direct spinal projections in SMA but not preSMA⁷⁹) and function^42,80. SMA arrays were largely in the arm representation⁷⁹. For dlPFC, we targeted the region immediately dorsal to the principal sulcus (46d), following previous studies of action sequencing^27,28,81 and other cognitive functions⁸². For vlPFC, we targeted the inferior convexity ventral to the principal sulcus, with one (more rostral) array directly ventral to the principal sulcus (46v) and the other rostral to the inferior arm of the arcuate sulcus (45A/B) based on evidence that encoding of abstract concepts occurs in regions that broadly span these two locations^29,83,84, including a possibly heightened role (compared with dlPFC) in encoding abstract concepts in a manner invariant to temporal or spatial parameters^84,85,86. For FP, we targeted a rostral location similar to previous recording and imaging studies (one array fully in area 10, the other straddling areas 9 and 10)^87,88, including areas associated with executive functions⁸⁹. In general, array locations targeted the cortical convexity immediately next to sulci, instead of in the banks, to allow shorter insertion depths that minimize the risk of missing the target or damaging blood vessels. The exceptions were SMA and preSMA in the medial wall, for which this was not possible. To avoid damaging the superior sagittal sinus, we positioned the arrays laterally (2 mm from midline) and slanted the electrodes medially (Extended Data Fig. 6).

The lengths of each electrode were custom designed to target half-way through the grey matter and to substantially vary across the array to maximize sampling of the cortical depth. The following electrode lengths were used (in mm): 1.5−3.5 (M1), 1.5−3.1 (PMd and PMv), 2.8−5.8 (SMA and preSMA), 1.5−2.5 (dlPFC and vlPFC) and 1.5−2.6 (FP) for S1; and 1.7−3.75 (M1), 1.5−3.3 (PMv), 1.5−3.1 (PMd), 2.65−5.95 (SMA and preSMA), 1.75−3.15 (dlPFC), 1.35−3.2 (vlPFC) and 1.6−2.9 (FP) for S2. Reference electrodes were longer (6 mm) to anchor the arrays. All electrodeswere Pt/Ir (0.5 MΩ), except 4, which were Ir (10 kΩ). Array connectors (Omnetics, A79022) were housed in custom-made Ultem pedestals (Crist), which were secured with bone cement onto the cranial implant. Four pedestals were used per subject, holding 5, 5, 4 and 2 connectors each.

Behavioural task

Task overview

The subjects were seated comfortably in the dark with their head restrained by headpost fixation. They faced a touchscreen (Elo 1590L 15-inch E334335, PCAP, 768 × 1,024 pixels, refresh rate of 60 Hz, with a matte screen protector to reduce finger friction) that presented images and was drawn on. The touchscreen location was optimized to allow each subject to easily draw at all relevant locations on the screen (23–26 cm away; diagram in Extended Data Fig. 1). Both subjects decided on their own over the course of learning to perform the task with the left hand. The chairs were designed to minimize movements of the torso and legs (by using a loosely restricting ‘belly plate’) and the non-drawing arm (by resting on the belly plate and having movement restricted to within the chair). Gravity-delivered reward (water–juice mixture) was controlled by the opening and closing of a solenoid pinch valve (Cole-Parmer, 1/8-inch inner diameter). The subjects were water-regulated, with careful monitoring to ensure that consumption met the minimum requirement per day (typically exceeding it), and body weight was closely monitored to ensure good health. The task was controlled with custom-written software using the MonkeyLogic (v.2.2.45) behavioural control and data acquisition MATLAB package⁹⁰ (PC: Windows 10 Pro, Intel Core i7-4790K, 32GB RAM; DAQ: National Instruments PCIe-6343). All stimuli (images of line figures defined as point sets, with points rendered large enough to appear as continuous curves) were also generated with custom-written MATLAB (R2021a) code. Images were presented in a workspace area on the screen (16.6 cm × 16.9 cm, corresponding to approximately 37° by 38° visual angle). Shape components in images were on average 4.0 cm (9°) (maximum of width and height).

Each recording session consisted of 2−3.5 h of recording. We collected 5–20 trials per condition (that is, each unique image for Figs. 4 and 5 and the single-shape task in Fig. 6, and each primitive stroke for the character task in Fig. 6). All trials were shuffled across all conditions in the session and presented in a randomly interleaved fashion, except for one case, the experiment in Fig. 6, in which character and single-shape tasks were switched in blocks.

Early training

Before surgery, the naive subjects underwent initial training on core task components (that is, to trace images accurately using a sequence of discrete strokes). Early training took place in the home cage using custom-built rigs that were attached to an opening in the cage using the same hardware and software described above, except for the computer (Lenovo IdeaPad 14-inch laptop, Windows 10, AMD Ryzen 5 3500U, 8GB RAM) and DAQ (National Instruments USB-6001). This initial training progressed through seven stages. (1) Touch circle. The subjects were rewarded for touching a circle anywhere within its bounds. The circle started large, filling the entire screen, and shrank over trials to enforce more accurate touches. (2) Touch with a single finger. We shrank the circle until it was so small that it could only be touched with a single finger. The trial aborted if the subject touched outside the circle or with multiple fingers simultaneously. (3) Hold still. The subjects were rewarded for keeping their fingertip still on a dot, with the duration of this hold increasing across trials for up to a few seconds. (4) Track moving dot. The subjects had to track the dot with their finger as it moved (a lag between dot and finger was allowed). (5) Trace a line. We increased the speed of the moving dot over trials until eventually the dot moved so fast that the line it traced immediately appeared. We then positioned the line at locations far from the hold position to train the subject to raise its finger from the hold position and to trace lines at arbitrary locations, angles and lengths. (6) Trace single shape. We presented shapes of increasing difficulty (gradually morphing across trials from a straight line), including arcs, L-shapes, squiggles and circles. Across these stages, the shapes were presented at random locations. We did not enforce any particular tracing trajectory for each shape, which allowed the subjects to choose on their own. (7) Trace multiple shapes. We presented images composed of multiple disconnected shapes. This trained the subjects to understand that they should use multiple strokes to trace multiple shapes. At this point, the subject understood the basic structure of the task: to trace shapes using multiple strokes if needed. The progression across these stages was not determined by strict quantitative criteria but instead on a combination of quantitative and qualitative evaluations of task performance.

The subjects then practised various tasks to incentivize the learning of stroke primitives (consistent stroke trajectories for each shape). They practised single-shape trials using the set of diverse simple shapes in Fig. 1e, varying randomly in shape and location across trials. We chose this set of shapes to cover a range of trajectory profiles (by varying rotation, the number of direction changes and whether shapes were curved or linear) and yet were simple enough to draw with one stroke and to combine multiple (two to six) shapes into single-character images. We did not constrain the subjects to learn specific stroke trajectories for each primitive; therefore, differences between primitives reflected each subject’s own learning trajectory for how to draw each shape. Note that our study did not depend on using an optimal set of shapes or primitives but instead depended on learning primitives and then demonstrating behavioural generalization using these primitives, as we found for our subjects. For S2, the four S-shapes and four arc shapes were also sometimes presented as rectilinear versions. For S-shapes these resembled Z, and for arc these resembled squares missing one edge. S2 drew these rectilinear versions using curved strokes similar to those used for S and arc shapes. Therefore we combined data for each rectilinear shape with its respective curved version. On different days, the subjects also practised multishape and character tasks.

Trial structure

Trials (event sequence in Fig. 1c, screen schematic in Extended Data Fig. 1c) began when the subject pressed and held a finger fixation button (blue square) at the bottom of the screen (note that button always means a virtual button). After a random delay (uniform, earliest 0.4–0.6 s and latest 0.8–1.0 s across experiments for S1 and 0.8–1.1 s for S2) the image appeared (dark grey on a light-grey background). After a random delay (uniform, ranging from 0.6–1.0 s to 1.2–1.6 s across experiments for S1, and 1.1–1.5 s to 1.8–2.4 s for S2), a go cue (400 Hz tone and image blank for 300 ms) was presented. During this delay between image presentation and the go cue (planning epoch), the finger had to be kept still on the fixation button, but the subject was free to look anywhere. After the go cue, the subject was free to move their hand towards the image and start drawing. Immediately after the finger was raised from the fixation button, it disappeared and a ‘done button’ (green square) appeared at its location and stayed there. During drawing, the image stayed visible and the finger left a trail of black ‘ink’ on the screen. The subject signalled drawing completion by pressing the done button (effectively no time limit was imposed). This was followed by performance feedback, which spanned four modalities, each signalling performance: (1) screen colour, (2) sound, (3) duration of delay before getting reward (time out) and (4) reward. First, screen colour and sound were signalled, followed by the time out, and then the reward (detailed below). In addition to this feedback at the end of the trial, we also provided online feedback by immediately aborting the trial in case of serious errors; for example, touching far from any image points or for single-shape and multishape trials (but not for character trials), using more than one stroke per shape component. These online abort modes were turned off for trials testing novel characters.

Screen image changes (including image presentation and other trial events) were recorded using photodiodes (Adafruit Light Sensor ALS-PT19), and sounds were recorded using an electret microphone (Adafruit Maxim MAX4466, 20-20KHz). We performed eye tracking (ISCAN), but did not enforce eye fixation.

Scoring behavioural performance

Behaviour was scored by aggregating multiple metrics or factors. There were three classes of factors. The factors that had the greatest influence on the final aggregate score measured image similarity, or the similarity of the final drawing to the target image (ignoring its temporal trajectory). We also computed factors that reflected behavioural efficiency and, in some cases, factors that were task-specific. These scores were computed using behavioural data (a sequence of touched xy coordinates with gaps between strokes) and image data (a set of xy coordinates). Below, we describe the factors and then how they were aggregated into a single score.

Image similarity

This included two factors: drawing-image overlap and Hausdorff distance. Drawing-image overlap was the fraction of the image points that were touched (within a margin of error) by at least one of the drawn points. A subset of the image points were weighted more heavily because they captured characteristic features of the shape (for example, the corners and end points of an L-shape). Hausdorff distance is a metric used to measure the distance between the set of drawn points and the set of image points (definition below).

Behavioural efficiency

To incentivize efficiency, we included a factor that compares the cumulative distance travelled in the drawing (that is, the amount of ink) to the cumulative distance of the edges of the figure in the image, with its value negatively proportional to the excess of drawn ink over image ink.

Task-specific factors

During practice trials for the character task (see the section ‘Task types’), we also included factors that capture the extent to which drawn strokes matched the shapes used in the image. This included two factors: one proportional to the similarity of the number of strokes and the number of image shapes, and the other proportional to the spatial alignment of the drawn strokes to the image shapes. Importantly, these factors were included only for practice images and not for novel test images.

The final score aggregated the image similarity, behavioural efficiency and task-specific factors, with more weight on image similarity factors. We first rescaled factors linearly between 0 and 1 (where 1 means good performance), with the dynamic range set by a lower and upper bound. These bounds were adaptively updated on every trial based on the distribution of factor values in the last 50 trials (lower bound set to the 1st percentile and upper bound to the 53rd percentile), which ensured that the dynamic range of feedback matched the dynamic range of behavioural performance from recent history. We then weighted each factor to tune its relative contribution (using weights hand-tuned for each experiment; generally highest for image similarity) and computed the final scalar score (range 0 to 1) using the worst factor after weighting:

$$_A=mathoplimits_{i}(1-{w}_{i}(1-{f}_{i}))$$

where i indexes the factors, w_i are the weights (between 0 and 1) and f_i are the factor values (between 0 and 1). We also gave each trial a categorical score: great (s_scal > 0.82), good (0.65 < s_scal ≤ 0.82), OK (0.15 < s_scal ≤ 0.65) or fail (s_scal ≤ 0.15).

The scalar and categorical scores determined the feedback across the four different modalities. The meaning of screen colour and sound were learned, whereas delay and reward had intrinsic value. For screen colour, a linear interpolation between two colours, such that a score of 0 was mapped to red (RGB: 1, 0.2, 0) and 1 was mapped to green (0.2, 1, 0.2). For sound cue, a sound was determined by s_scal: if great, then three pulses (1,300 Hz, 0.16 s on and off); if good, then a single pulse (1,000 Hz, 0.4 s); if OK, then no sound; if fail, then a single pulse (120 Hz, 0.27 s). For delay until reward, a nonlinear mapping was generated from score to delay before reward. We first applied a linear mapping from the scalar score, such that a score of 0 was mapped to a long delay (5 s + a random uniform jitter of 0–2.5 s), and a score of 1 was mapped to 0 s of delay. Furthermore, if s_scal was great, good or OK, this delay was reduced by multiplying by 0.65. Finally, for reward, the open duration of the solenoid gating the juice line was defined as

$${rm{r}}{rm{e}}{rm{w}}{rm{a}}{rm{r}}{rm{d}}={C}times mtimes atimes {s}_{{rm{s}}{rm{c}}{rm{a}}{rm{l}}}$$

where C is a constant in dimensions of time (0.15−0.6 s, manually set depending on the difficulty of the task); m is a multiplier that gives a bonus for good performance and further penalizes bad performance, depending on the value of s_scal, great (1.3), good (1.0), OK (0.8) or fail (0); a is a random variable sampled from the uniform distribution a ~ 0.75 + 0.5 × U(0,1); and s_scal is defined as above. On average, including failed trials, the subjects received around 0.35 ml reward per trial. The temporal order in which these four feedback signals were delivered is described above (see the section ‘Trial structure’).

Task types

Single-shape task

The single-shape task presented one of the practised simple shapes or, in the categories experiment, sometimes a morphed shape. The subjects were only allowed to use a single stroke (triggering online abort if more was used). In four single-shape sessions for S1, the ending of the drawing epoch was triggered by stroke completion (that is, on finger raise) not on the pressing of the done button as in all other sessions and experiments.

To test for motor invariance (Figs. 2a–e and 4), we presented images of practised shapes, varying across trials in location, size or both. For location variation, images spanned 321 pixels (9.6 cm) in the x and y dimensions (measuring between shape centres), which is 2.38 times the average size of shapes (135 pixels, 4.0 cm, maximum across width and height). For size variation, the maximum size was 2.5 times larger than the smallest (in diameter), except for two experiments for S1, in which the ratio was 2.0. The location and size variation in Fig. 2b,c is representative (S1, n = 2 sessions varying in location, n = 3 varying in size; S2, n = 3 sessions varying in location, n = 2 varying in size).

Size and location variation was chosen based on previous studies of M1 activity and electromyography of muscles controlling the arm during reaching in macaques^91,92. For reaches performed along the coronal plane around 25 cm from the subject, similar to the geometry of the touchscreen relative to the subject in our study, M1 and electromyographic activity are substantially affected by translating the reach location by 10 cm (ref. ⁹¹), similar to the location variation in our study (9.6 cm), and by varying the scale of the reach by 2-fold (7 cm to 12 cm)⁹², similar to the size variation in our study (2–2.5-fold).

To test for categorical structure (Figs. 2f–j and 5), we constructed morph sets (S1, n = 7 morph sets across 3 sessions; S2, n = 13 morph sets across 4 sessions), each consisting of two practised shapes and four to five images that morph between those shapes through linear interpolation along shape parameters, such as the extent of closure of the top of the U (Fig. 2g). Across morph sets, we varied different image parameters (Extended Data Fig. 3).

Multishape task

Each image was composed of two to four shapes positioned at random, nonoverlapping, locations spanning the space of the screen (possible locations include the four corners and the centre). The subjects were allowed to draw the shapes in any order and to use any trajectory for each shape, but were constrained to use one stroke per shape and to not trace in the gaps between shapes. We present results averaged across two sessions, one from each subject (Extended Data Fig. 4). For each trial, an image was constructed by sampling shapes randomly without replacement. This led to n = 531 (S1) and n = 278 (S2) unique images.

Character task

Each image was generated by connecting two to six simple shapes into a single character by sampling from a generative model as follows. A character with N shapes was defined by randomly sampling N shapes and N – 1 relations, where each relation (indexed i) defines the locations of the attachment points on shapes i and i + 1, which in turn define how the shapes connect to each other. This approach is similar to a previous generative model for handwritten characters⁹. Generated characters were only kept if there was minimal crossing of shapes over each other.

For experiments testing behavioural generalization to novel characters (Fig. 2k–p), we mixed practised and novel characters (practised, n = 189 (mean, range 22–491) per day; novel, 48 (mean, range 0–155) per day). For analyses, we labelled as ‘novel’ only the very first trial for a given character. Because of random sampling in generating characters, it would in principle be possible that characters generated on different days are in fact identical. To avoid this possibility, we ensured post hoc that all characters labelled novel were different from every previously encountered character across all days (quantified using the Hausdorff distance).

For neural experiments comparing single-shape and character tasks (Fig. 6), we analysed the sessions for which we collected data from both the single-shape and character tasks (S1, n = 9 sessions, median N matching primitives between single-shape and character tasks = 9 (range 5−12); S2, n = 9 sessions, median N matching primitives = 10 (range 2−14)). We switched between single-shape and character tasks using a block design (2−5 blocks each per session), except one session for S2, which used random interleaving across trials.

Behavioural data analysis

Preprocessing of touchscreen data

Touchscreen data were represented as time series of (x, y) coordinates in units of pixels (conversion: 33.6 pixels per cm) and sampled at 60 Hz, which we upsampled to 500 Hz (performed in MonkeyLogic to align all behavioural signals, including trial event markers and eye tracking) and low-pass filtered to keep only drawing-related movements (15 Hz). Strokes were segmented based on the time of first touch (onset) and the time of last touch (offset) with 500 Hz resolution.

For some analyses (Fig. 1f and as input to the trajectory distance below) we further computed stroke instantaneous velocity and speed as follows. Extracted strokes were low-pass filtered (12.5 Hz) and downsampled to 25 Hz. We then used the standard five-point stencil method to compute a finite difference approximation of the derivative (separately for the x and y coordinates):

$${f}^{{prime} }[n]=frac{f[n-2]-8f[n-1]+8f[n+1]-f[n+2]}{12,{text{h}}}$$

where f[n] is a discrete time series (that is, the x or y coordinates) indexed by integer n, and h is the sampling period in seconds. The resulting velocity time series was upsampled to the original 500 Hz sampling rate with a cubic spline. Speed was computed as the norm of the (x, y) velocity at each time point.

Computing the trajectory distance

To quantify the similarity between any two strokes based on their spatiotemporal trajectories while ignoring their relative size and location, we devised a trajectory distance metric, a scalar dissimilarity score based on the dynamic time warping distance between two strokes represented as velocity time series v₁ and v₂. To compute trajectory distance between two strokes, we spatially rescaled each stroke (while maintaining its xy aspect ratio) to make the diagonal of its bounding box unit length 1. We then linearly interpolated each stroke to the same number of points (70) to allow point-by-point comparison between strokes. This was done spatially by interpolating based on fraction of cumulative distance travelled (so that the distances between successive points were the same over the entire stroke) to capture the spatiotemporal trajectory, as in a previous analyses of strokes in handwriting⁹. Interpolated trajectories were converted to velocity time series as described above. We then computed the dynamic time-warping distance between velocities v₁ and v₂:

$${D}_{text{DTW}}({v}_{1},{v}_{2})=frac{{text{min}}_{{rm{pi }}}{sum }_{(i,j)in {rm{pi }}}d(i,j)}{{text{N}}}$$

where i and j index the two velocity trajectories, N is the number of points (70), and π is a set of (i, j) pairs representing a contiguous path from (0, 0) to (N, N). The local distance metric d(i, j) is the Euclidean distance plus a regularization factor to discourage excessive warping:

$$d(i,j)=|{v}_{1}[i]-{v}_{2}[,j]|+lambda times |i-j|,$$

$$lambda =0.045{langle |{v}_{n}[i]|rangle }_{i,n}.$$

For the regularization parameter, λ, the purpose of the summation term was to rescale it to match the magnitude of velocities. The resulting distance D_DTW was then rescaled between 0 and 1 to return the trajectory distance:

$${D}_{mathrm{traj}}({v}_{1},{v}_{2})=1-frac{1}{{D}_{mathrm{DTW}}({v}_{1},{v}_{2})+1}.$$

Computing the image distance

To compare the similarity of two images—each a set of (x, y) points—we used a modified version of the Hausdorff distance, a distance metric commonly used in machine vision for comparing the similarity between two point sets based on shape attributes⁹³. There are, in principle, at least 24 variants of the Hausdorff distance based on possible formula variations⁹³. Here we used a variant that is minimally susceptible to outlier points because it takes means instead of minima and maxima (variant 23 in the referenced study⁹³). Image distance was computed as follows: (1) each image was centred so its centre of mass was at (0, 0); (2) the image distance was then computed. First, we defined the distance between two points, d(a, b), as the Euclidean distance. We also defined the distance between a point and a set of points, d(a, B), and the distance from set A to set B, d(A, B), as follows:

$$d(a,B)=mathop{min }limits_{bin B},d(a,b),$$

$$d(A,B)=frac{1}{|A|}sum _{ain A}d(a,B).$$

The image distance was then defined as:

$${D}_{mathrm{image}}(A,B)=frac{d(A,B)+d(B,A)}{2}.$$

Computing the primitive alignment score

For experiments on categorical structure, we generated a set of images with each set containing four to five novel images that morph between one primitive (P1) and another primitive (P2) to create a morph set. Each trial presented a single image from one morph set. We sought to quantify the relative similarity between the data of a given trial—its behavioural, image or neural data (see below)—and data for the two primitives, P1 and P2, in its morph set. To do so, we devised a primitive alignment score defined as:

$$a=frac{{d}_{1}}{{d}_{1}+{d}_{2}}$$

where d₁ is the average of the distances between a given trial and each of the P1 trials, and d₂ the average distance to the P2 trials. A score closer to 0 implies similarity to P1, and a score close to 1 implies similarity to P2 (note that in practice the primitive alignments for P1 and P2 data are not exactly 0 and 1 owing to trial-by-trial variation). The particular metric used for these distances depended on the analysis: image distance (for images), trajectory distance (for drawings) or Euclidean distance between population activity vectors (for neural activity). We confirmed that primitive alignment scores for image data varied linearly with morph number (Fig. 2j and Extended Data Fig. 3c), which ensured that any deviation from linearity in behavioural or neural data could not trivially be the consequence of how the score is defined.

Classifying strokes from the character task

To assess whether the subjects drew characters by reusing their own stroke primitives, we scored the fraction of character strokes that were high-quality matches to the subject’s own primitives and the fraction that were high-quality matches to the other subject’s primitives. If the fraction of matches to a subject’s own primitives was high, and to the other subject’s primitives was low, then this was evidence for recombining the subject’s own primitives.

This assessment was done by assigning each stroke the label of its most similar primitive using the trajectory distance and then defining this as a high-quality match only if the trajectory distance was sufficiently low (defined below).

First, each stroke was assigned its best-matching primitive, p*, from a set of primitives (the choice of primitive set—same or different subject—depending on the analysis; see below):

$${p}^{ast }=mathop{text{argmin}}limits_{p},d(s,{mu }_{p})$$

where p indexes the primitives, s is the stroke trajectory, µ_p is the mean trajectory for primitive p (averaged over trials from the single-shape task), and d(.,.) is the trajectory distance.

Second, the quality of the assignment of the stroke to p* was scored as high if (d(s,{mu }_{{p}^{ast }}) < {{D}}_{max,{{text{p}}}^{ast }}) and low if (d(s,{mu }_{{p}^{ast }})ge {{D}}_{max,{p}^{ast }}), where ({{D}}_{max,{{text{p}}}^{ast }}) is an upper bound on trajectory distances that would be expected from trial-by-trial variation in primitive p*. It is the 97.5th percentile of the distribution of trajectory distances from single-shape trials, determined separately for each primitive, which we consider as a good (arguably conservative) estimate of trial-by-trial variation. This is because the single-shape task presents no ambiguity as to what primitive needs to be drawn and can therefore be considered ‘ground truth’.

These steps assigned each stroke a class tuple (p*,quality), where high quality was interpreted as the stroke matching the primitive set and low quality meant failure to match any primitive. Note that the non-uniformity of the frequency distribution of matches across primitives (Fig. 2o) resembles non-uniformity seen for language and other behaviours⁹⁴. Also, note that even the primitives with a low frequency of match (Fig. 2o) can be considered legitimate matches because they satisfied the criteria to be labelled as high quality.

In summary analyses, we pooled all cases of high-quality matches into a single match class (regardless of the assigned primitive), and all low-quality matches into a single no-match class (Fig. 2p). To test whether a given subject’s character strokes aligned better with its own primitives compared with the other subject’s primitives (Fig. 2p), we performed the above analysis separately for all four combinations of stroke data (2 subjects) × primitives (2 subjects) using only images drawn by both the subjects.

The control analysis of testing primitive reuse using a simulated set of remixed primitives was performed as follows. We generated remixed primitive sets by mixing subparts of different primitives. Given two actual primitives, we extracted the first half of the first primitive (defined by the distance travelled) and the second half of the second primitive and connected them by aligning the offset of the first half to the onset of the second half, smoothing the connection with a sigmoidal weighting function. To sample a remixed primitive set, we first generated a pool of all possible remixed primitives using every possible pair of actual primitives. We then sampled (without replacement) a set of remixed primitives from this pool, keeping only remixed primitives that satisfied the following constraints: (1) no self-intersection, and (2) no excessively sharp turns, detected as curvature at any point along the inner 80% of the stroke exceeding 0.8. Curvature was defined in a standard manner as the inverse of the radius of curvature,

$$frac{dot{x}ddot{y}-dot{y}ddot{x}}{{({dot{x}}^{2}+{dot{y}}^{2})}^{frac{3}{2}}}$$

where (dot{x}) and (dot{y}) are the velocity components and (ddot{x}) and (ddot{y}) are the acceleration components. Once a candidate set of remixed primitives was sampled, it had to pass further constraints: (1) each actual primitive contributed its first half or second half to a maximum of two remixed primitives; (2) the trajectory distance between no pair of remixed primitives was lower than the minimum trajectory distance between actual primitives; and (3) the trajectory distance between no pair of remixed primitive and actual primitive was less than the minimum trajectory distance between actual primitives. These constraints ensured that remixed primitives in each set were different from each other and from the actual primitives (visually apparent in Extended Data Fig. 5).

We then labelled strokes from character drawings using each remixed primitive set using the same approach as for actual primitives, except the following. The analysis using actual primitives used trajectory distance thresholds (({{D}}_{max,{{text{p}}}^{ast }})) determined empirically for each primitive based on single-shape trials (see above). Because remixed primitives were never drawn, a similar approach could not be used. Instead, thresholds for remixed primitives were assigned from the pool of thresholds for the actual primitives, in a manner meant to increase the stroke–primitive match rate (and thus allow a stronger, or more conservative, test that the remixed primitives do not match the character strokes) by assigning the largest (most lenient) threshold to the worst-matching remixed primitive, the second largest to the second worst, and so on, where worse means having larger average trajectory distance to character strokes.

Analysis of kinematic separability of primitives

To test whether each primitive was decodable from every other primitive based on single-trial kinematics (stroke trajectory) (Extended Data Fig. 2), strokes were first converted from time series (x and y position) to eight-dimensional vectors as follows. Similar to the trajectory distance computation (above), strokes were normalized in time (linear interpolation to 50 time points) and space (rescaling each stroke, while maintaining its xy aspect ratio, to make the diagonal of its bounding box unit length 1). The x and y coordinates were concatenated to a 100-dimensional vector and then reduced to 8 dimensions using principal component analysis (PCA). Decoding was performed on these 8D representations using a linear support vector machine classifier (SVC) (LinearSVC, scikit-learn (v.1.3.0), regularization parameter C set to 0.1), with 10-fold cross-validation. By performing this procedure for every pair of primitives (each time returning a single decoding accuracy score), we populated the matrix shown in Extended Data Fig. 2.

Neural recordings

Recordings were acquired using a Tucker-Davis Technologies (TDT) system, including headstage (Z-Series 32 Channel Omnetics, LP32CH-Z), amplifier (PZ5M-256), processor (RZ2) and storage (RS4), sampled at 25 kHz (local reference mode), controlled with TDT Synapse (v98) software run on a Windows 10 PC (Intel Core i7-3770, 32GB RAM) and saved to disk. Analog and digital task-related signals, including behavioural events (photodiode, audio and trial event markers) and eye tracking (ISCAN, 125 Hz), were synchronized to external triggers recorded by the neural data acquisition system.

Neural data preprocessing

Spike sorting

We extracted for later analysis both single-unit (SU) and multiunit (MU) spike clusters from the stored broadband signal as follows. MU clusters consisted of identified spikes, which were not isolatable into distinct SU clusters. We used a three-step approach for extracting and clustering spikes, with a first pass using Kilosort (v.2.5)⁹⁵ to extract putative spike clusters, a second pass using a custom-written program to label these clusters as SU, MU or noise, and a final manual curation step. Although Kilosort classifies clusters, we did not use those labels.

For Kilosort, we used default parameters, except AUCsplit (0.90), Th ([6, 4]) and lam (10), which we optimized using parameter sweeps on data from representative sessions and by manual evaluation of results.

We next refined cluster labels. For each cluster, we removed outlier waveforms (exceeding a 3 times interquartile-range threshold for any of the minima, maxima or sum-of-squares). Waveforms were then shifted slightly in time if needed (<1 ms) to improve their alignment by troughs (or peaks, for positive-going waveforms). For each cluster, we computed two features: the signal-to-noise ratio (SNR) and inter-spike-interval violations (ISIVs). The SNR was defined as the ratio of the peak-to-trough difference (of the average spike waveform) tothe standard deviation (averaged across time bins). If a cluster contained both positive-going and negative-going waveforms, the SNR was computed separately for these two subsets of data and then averaged. ISIVs were defined as the fraction of inter-spike intervals less than a refractory period (1.5 ms). On the basis of these SNR and ISIVs, we provisionally classified clusters as SU (if either (SNR > 9.6 and ISIV < 0.05) or (SNR > 6.9 and ISIV < 0.01)), noise (SNR < 3.9) or MU (the remaining clusters).

We then manually curated these clusters. We visualized waveforms for every cluster to either confirm its label (MU, SU or noise) or to manually re-assign it to a different label (including artefact) using a custom-written MATLAB GUI. We also manually checked whether to merge multiple SU clusters on a single channel into a single SU cluster if they exhibited high waveform similarity, inversely correlated spike count frequency over the course of the session or a negative peak close to zero lag in a cross-correlogram of spike times. Finally, for each channel, all MU clusters were merged into a single MU cluster. Combining SU and MU, this process produced the following number of units per area (mean ± s.d. across sessions) for the different brain regions: for S1, M1 (59.9 ± 12.5), PMd (44.1 ± 6.2), PMv (34.2 ± 7.3), SMA (63.0 ± 7.9), preSMA (75.4 ± 17.7), dlPFC (47.8 ± 17.2), vlPFC (43.3 ± 9.9) and FP (19.2 ± 3.8); for S2, M1 (40.7 ± 13.1), PMd (54.7 ± 5.5), PMv (71.1 ± 6.6), SMA (53.4 ± 7.2), preSMA (57.9 ± 11.1), dlPFC (24.6 ± 4.8), vlPFC (38.6 ± 13.7) and FP (42.6 ± 5.0).

Converting spike times to firing rates

Single-trial spike trains were converted to firing rate functions by smoothing with a 0.025 s Gaussian kernel (0.01 s slide). We removed units with very low firing rates (if the 80th percentile of their firing rates across all trials and time bins was less than 1 Hz). We also removed units for which the firing rates were unstable, either due to high systematic drift in the firing rate over the session (m/u > 0.2, where m is the slope from regressing the square-root-transformed firing rate versus time (in hours), and u is the mean firing rate) or due to large fluctuations in the firing rate across the session. For the latter, we excluded units satisfying either (s_max – s_min)/s_mean > 1.15 or (u_max – u_min)/u_mean > 0.65, where the session is first split into contiguous 50-trial bins, the across-trial standard deviation in the square-root-transformed firing rate is computed in each bin, and then s_max, s_min and s_mean are defined as the maximum, minimum and mean standard deviation across bins, respectively; u_max, u_min and u_mean are defined similarly, except using the within-bin mean firing rate instead of the within-bin standard deviation. We then processed firing rates as follows. We first square-root transformed activity to normalize its variance. Following a common approach in analyses of population firing rates⁹⁶, we z-scored the activity of each unit to ensure that neurons with highly different firing rates contributed similarity to population analyses, but in a ‘soft’ manner so that higher-firing-rate neurons contributing relatively more:

$${x}_{i}^{{rm{norm}}}(t)=frac{{x}_{i}(t)-{mu }_{x}}{{sigma }_{x}+C}$$

where x_i(t) is firing rate for trial i at time bin t; µ_x and σ_x are the mean and standard deviation, respectively (across trials and time bins), and C is an additive factor to ensure softness, where C = min(m) + 3 Hz, where m is a vector of mean firing rates across units. All subsequent analyses used this normalized firing rate.

Time-warping neural activity to a common trial template

For the figure showing the average firing rates over the entire trial (Fig. 3c), we first time-warped each trial to a common trial template. We defined a set of events that occur across trials as anchors (fixation touch, image onset, go cue, finger raise off fixation, stroke onset, stroke offset, touch done button and reward). We included only single-stroke trials. We first generated a median trial. For each segment (that is, the time window between a pair of successive anchor events), we found its median duration and then concatenated these median segments to construct a median trial. We then aligned each trial to this median trial at the anchor events, warping time linearly in each segment. This warping did not change the firing rate values, just their timing. To avoid sharp discontinuities at anchor points, we smoothed the final firing rates at the times of the anchor points (2.5 ms Gaussian kernel).

Sorting of units (rows) in the resulting firing rate plot (Fig. 3c) was performed in a cross-validated manner. Sort indices were determined using one subset of trials (n = 50) and then applied to sort the remaining subset of trials that are plotted (n = 235).

Neural data analyses

PCA

We performed dimensionality reduction on the neural population activity, in general because high-dimensional noise can reduce the interpretability of the Euclidean distance⁹⁷, and in one case to identify a potential linear projection of population activity (that is, a subspace) that preferentially encodes primitives, a standard approach^28,98. We represent data for a single area from a single session and in a specific within-trial time window as a matrix X of size N × KT, where N, K and T are the number of units, trials and time bins, respectively, constructed by concatenating time bins from all trials along the second dimension. Data were first binned in time (0.15 s window, 0.02 s slide) before constructing this data matrix. Instead of applying PCA on single-trial data X, we applied PCA on trial-averaged data X_C, to minimize the influence of trial-by-trial variation (noise). X_C holds the mean activity for each trial condition, of size N × K_CT, where K_C is the number of unique conditions, with the specific conditions depending on the experiment (see below). We performed PCA on X_C and retained the top eight PCs. The specific trial-averaged conditions used for identifying PCs were the following. For analysis of motor invariance (Fig. 4), the conditions were each unique primitive (averaging over location or size), which resulted in identifying PCs that preferentially encoded primitives if such PCs exist. For analysis of categorical structure (Fig. 5), PCA was performed separately for each morph set, and the conditions were the unique images (that is, the two end point shapes and the morphed shapes in between). For the analysis of primitive representational reuse in characters (Fig. 6), the conditions were each combination of primitive and task type.

We performed PCA in a cross-validated manner to ensure that it was not overfitting to noise. We partitioned trials into two subsets (in a stratified manner): one training set that was used only for identifying the PCs and a test set that was projected onto these PCs and then used for all subsequent analyses. We performed eight randomized train–test splits (including all downstream analyses) and averaged their results.

Representing time-varying population activity as a vector

For some analyses, we captured single-trial time-varying population activity during the planning epoch (0.05–0.6 s after image onset) as a vector, which could then be visualized after dimensionality reduction (Extended Data Fig. 7) or used in decoding analyses (Extended Data Fig. 12). Starting from the population data of a region, represented as a K × N × T matrix, where K, N and T are the numbers of trials, units, and time bins, respectively, (with time binned using a 0.2 s window, 0.1 s slide), we concatenated the N channels’ length T vectors end-to-end to construct the data matrix D, of size K × NT, where each trial is represented by a vector of length N × T.

Nonlinear dimensionality reduction

To perform nonlinear dimensionality reduction of population activity to two dimensions (Extended Data Fig. 7), we used the uniform manifold approximation and projection (UMAP), performed on D, using values of 40 and 0.1 for the parameters n_neighbors and min_dist.

Computing the neural distance

To quantify the similarity of population activity between two sets of trials, such as trials for conditions A and B, each a specific conjunctive value of task-relevant variables, we devised the ‘neural distance’ metric. Inspired by a previously described ‘normalized distance’⁹⁹, it is the average pairwise Euclidean distance across conditions A and B, minus the average within-condition distance. This subtraction ensures the useful property that this distance is unbiased, in that the expected value of neural distance between two sets of trials sampled from the same distribution is zero (unlike the mean Euclidean distance, which is biased upwards²⁶). Moreover, the resulting distance is normalized by dividing by an upper-bound distance to normalize it between 0 and 1. The neural distance is defined as:

$${D}_{{AB}}^{* }={D}_{{AB}}-({D}_{{AA}}+{D}_{{BB}})/2$$

where the normalized Euclidean distance between sets of trials (indexed by i and j) in conditions A and B is:

$${D}_{AB}={leftlangle frac{1}{{d}_{max}(t)}{langle |{x}_{i}(t)-{x}_{j}(t)|rangle }_{iin A,jin B}rightrangle }_{tin {{t}_{1},ldots ,{t}_{n}}}.$$

Here, x_i(t) is the population activity vector at time t (in a window between times t₁ and t_n), and d_max is an upper bound (98th percentile) of the distances between all pairs of different trials combined across all conditions.

Computing the encoding strength of a variable

To compute how strongly a given variable is encoded in population activity (for example, primitive encoding in Fig. 4j), we computed the mean effect of that variable on population activity in terms of neural distance while controlling for the other relevant variables. Consider an experiment in which conditions vary along two variables, primitive and location, represented as the tuple (p, l), where p and l represent the primitives and locations, respectively. Primitive encoding is the average neural distance across all pairs of conditions that have different primitives but same locations:

$$text{primitive encoding},=,{langle {D}_{(p,l)({p}^{{prime} },{l}^{{prime} })}^{* }rangle }_{pne {p}^{{prime} },l={l}^{{prime} }}.$$

Location encoding is defined analogously:

$$text{location encoding},=,{langle {D}_{(p,l)({p}^{{prime} },{l}^{{prime} })}^{* }rangle }_{p={p}^{{prime} },lne {l}^{{prime} }}.$$

This approach generalizes to any pair of variables, such as primitive and task type in Fig. 6.

Statistically comparing brain regions in the strength of variable encoding

In analyses that compared the encoding strengths of a particular pair of variables (for example, primitive versus location in Fig. 4j), we used the following procedure to compare each brain region with every other brain region in how strongly they encode these two variables. (1) For each variable and pair of brain regions, we compared the strength of encoding of that variable in these regions. This involved first extracting a dataset of neural distances for each pair of trial conditions for each of the two brain regions. For example, if the variable was primitive, then each of the two brain regions would contribute a dataset consisting of neural distances for all pairs of trial conditions with different primitives but the same location. The datasets for these two regions would be combined and we then fit a linear least-squares regression model to test for an effect of brain region on neural distance y, controlling for the effect of trial–condition pair:

$$y={beta }_{0}+{beta }_{r}{X}_{r}+mathop{sum }limits_{j=1}^{{N}_{c}}{gamma }_{j}{Z}_{j}+{epsilon }$$

where β₀ is the intercept term, X_r is 0 or 1 depending on the brain region, and Z_j is an indicator variable representing the trial–pair condition, with γ_j as their coefficients, and ({epsilon }) is a noise term. Finally, we extracted the P value for β_r (two-sided t-test), which represents the significance of the difference between these two regions in encoding strength for this variable. (2) This procedure was performed once for each combination of variable and brain region pair, which resulted in 56 P values, corrected for multiple comparisons using the Bonferroni method (2 variables × 28 brain region pairs = 56 comparisons). (3) Using these 56 P values, we summarized each region with 2 numbers representing the number of regions in comparison to which this region more strongly encodes these two variables. For example, for primitive × location experiments, each region was scored with a tuple (N_prim,N_loc), where N_prim is the number of other regions this region beats in the pairwise tests of primitive encoding (and analogous for N_loc, except for location encoding). In summary plots (Figs. 4j and 6f), the resulting (N_prim,N_loc) tuples for each region are represented as colours using a two-dimensional colour map.

Specific analyses

Analysis of motor invariance in neural activity

Dimensionality reduction was performed as described above using a time window of 0.05–0.6 s after image onset (to avoid including data after the go cue, which, for a subset of these motor-invariance experiments, occurred 0.6–1.0 s after image onset) for fitting the PCs and for analyses that involve time-averaging (Fig. 4i–k).

To test cross-condition decoder generalization (Fig. 4k), we used a linear SVC with a one-versus-the-rest scheme for multiclass classification (LinearSVC, scikit-learn (v.1.3.0), regularization parameter C set to 0.1). We report the test accuracy, linearly rescaled so that chance level (inverse of the number of classes) and 1 were mapped to 0 and 1. Because decoders were trained and tested on different conditions (with non-overlapping trials), there was no concern of overfitting. Decoding was performed separately for, and averaged across, time bins (0.05–0.6 s relative to image onset).

Analysis of categorical structure in neural activity

Dimensionality reduction was performed as described above using a time window from 0.05 s to 0.9 s after image onset for fitting the PCs. For analyses involving time averaging (Fig. 5d–g), we used a window late in the planning epoch (0.6–1.0 s), when separation for A1 and A2 trials was the greatest (Fig. 5h). The primitive alignment index was computed as above using the Euclidean distance.

To assess whether primitives were associated with distinct activity patterns, we asked whether single-trial activity for each primitive was separable from the activity of every other primitive using a decoding approach (Extended Data Fig. 12). We used activity during the planning epoch (0.05–0.6 s after image onset) represented as a dataset D (size K × NT, where each trial is represented by a vector of length N × T; see the section ‘Representing time-varying population activity as a vector’), with its dimensionality further reduced to K × 50 by performing PCA and keeping the top 50 PCs. For each pair of primitives, decoding was performed using a linear SVC (LinearSVC, scikit-learn (v.1.3.0), regularization parameter C set to 0.1) in a cross-condition manner to assess generalizable decoding of primitives; on each iteration, we trained the model using data for one location (or size, for sessions with size variation) and tested using held-out data from all other locations (or sizes), and then averaging over analyses for different locations (or sizes). Because decoding can be biased upwards for regions with more units, to fairly compare regions that had the strongest primitive encoding (preSMA, SMA, PMv, PMd and vlPFC), we first matched their numbers of units by randomly subsampling (without replacement) to match the area with the fewest units and averaged the results from repeating this ten times. By performing this overall procedure for every pair of primitives and separately for every brain region, each time returning a single decoding accuracy score, we populated the matrices in Extended Data Fig. 12. We combined results across five sessions for each subject (S1, 2 varying in location, 3 varying in size; S2, 3 varying in location, 2 varying in size).

Analysis of recombination of primitive representations in the character task

We analysed primitives that were performed in both single-shape (instructed by the shape image) and character trials (the subject’s choice) in the same session. For character trials, we used only strokes scored as high-quality primitive matches. Dimensionality reduction was performed as above, using a time window of –0.8 to 0.3 s relative to the stroke onset for fitting the PCs. For analyses involving time averaging, we used a window –0.5 to –0.05 s relative to the stroke onset. Neural distance, primitive encoding and task-type encoding were computed analogous to above, using primitive and task type (instead of location or size) as the two variables.

For analyses comparing strokes from the single-shape task to the non-first stroke of the character and multishape tasks (Extended Data Fig. 14), all methods were the same as in the main analysis comparing single-shape strokes to the first stroke of character drawings except two. First, we used a shorter analysis time window of –0.35 to –0.05 s relative to the stroke onset (instead of –0.5 to –0.05 s) to match the duration of the gap between strokes (mean of around 0.3 s) in character drawings. Second, we preprocessed neural data to account for a strong effect of the reaching movement to initiate the trial (that is, reaching from the hold button towards the image; see Fig. 1c). This effect is clear in population trajectories (Extended Data Fig. 14a). To accurately assess whether primitive-encoding activity was reused across task types, it was important to correct for this effect of initial reach because it was present for single-shape trials but not for the non-first strokes of the character and multishape tasks. We performed this correction in a simple manner by subtracting, for all strokes, the across-stroke mean effect of the initial reach. This correction was performed separately for each time bin as follows. For each unit, we estimated the mean effect of initial reach by using linear least-squares regression:

$$y={beta }_{0}+{beta }_{f}{X}_{f}+{beta }_{t}{X}_{t},+mathop{sum }limits_{j=1}^{{N}_{p}}{gamma }_{{j}},{Z}_{j}+{epsilon }.$$

Here, y is the firing rate, β₀ is the intercept term, β_t is the effect of task type, X_f is 0 or 1 depending on whether the stroke is the first stroke (always 1 for single-shape data), X_t is 0 or 1 depending on the task type (single-shape or character), Z_j is an indicator variable for the primitive class, and ({epsilon }) is a noise term. To correct for the effect of initial reach, we subtracted β_f (the mean effect of initial reach) from neural activity for all first strokes. We did not perform this correction for comparisons of single-shape strokes to the first stroke in character drawings (Fig. 6) because, there, all strokes included the initial reaching movement.

Analysis of primitive encoding aligned to visual fixations

We assessed the extent to which PMv activity, aligned to visual fixation events during the multishape task, preferentially encodes either the primitive associated with the visually fixated shape or the first primitive the subject is planning to draw (Extended Data Fig. 15). We first converted time-varying eye tracking data (x and y position time series) to a sequence of fixation and saccade events using Cluster Fix¹⁰⁰, which, in brief, uses k-means clustering on distance, velocity, acceleration and angular velocity, and then assigns clusters to fixation and saccade events. Each fixation event was assigned a visually fixated primitive label, defined as the closest shape. If all shapes were further than 70 pixels (2.08 cm), then this fixation event was considered to be looking away from all shapes and was therefore excluded from further analysis. Each fixation event was also assigned a planned primitive label, based on which primitive the subject would draw first on that trial.

We assessed the extent to which fixation-aligned neural activity encoded each primitive using a decoding approach, whereby the strength of representation of a given primitive was defined as the probability score returned by a decoder trained to classify that primitive versus all other primitives. Decoders were trained using single-shape trials from the same session. Data from the planning epoch were cut into multiple short-duration snippets, each a data point used for training the decoder. Specifically, neural data were represented as a matrix X of size N × KT, where N, K and T are the number of units, trials and time bins, respectively, constructed by concatenating time bins from all trials along the second dimension (data were first binned in time using a 0.3 s window with 0.1 s slide) before constructing this data matrix. These KT training data points (each N-dimensional), each with an associated primitive label, were used to train a multilabel, logistic regression, one-versus-rest classifier (scikit-learn), which pools multiple independent classifiers, one per primitive. This resulted in a multilabel decoder that could be applied to neural data at any time point to return one probability score per primitive.

Analysis of encoding of kinematic variables in neural activity

To test for encoding of generic kinematics (Extended Data Fig. 16), we used a standard encoding model approach, which was based on a previous study of handwriting kinematics encoded in the human motor cortex¹⁰¹. We assessed the fraction of variance in neural activity explained by a linear mapping from moment-by-moment finger kinematics to neural activity. As in that study¹⁰¹, we used activity in the top ten dimensions found after performing PCA. PCA was performed as described above, except, instead of identifying the PCs using trial-averaged data, we did so using single-trial data to retain activity potentially related to trial-by-trial variation in kinematics. Activity at each time point was modelled as:

Here f_t is the neural activity at time bin t in the top 10 neural PCs, E is a 10 × 2 matrix mapping kinematics to neural activity (that is, preferred directions), v_t is the 2 × 1 finger velocity, and b is an intercept term. The fraction of variance accounted for (FVAF) was computed as:

$${rm{F}}{rm{V}}{rm{A}}{rm{F}}=1-frac{{{rm{S}}{rm{S}}}_{text{err}}}{{{rm{S}}{rm{S}}}_{text{tot}}}=1-frac{{sum }_{t=1}^{T}{(E{v}_{t}-{f}_{t})}^{{rm{T}}}(E{v}_{t}-{f}_{t})}{{sum }_{t=1}^{T}{f}_{t}^{{rm{T}}}{f}_{t}}.$$

Here, SS_tot is the total variance, SS_err is the sum of squared errors, and T is the total number of time steps across all trials. Note that T (upright) represents the matrix transpose operator.

To test how strongly activity encoded kinematics in a manner generalizing across primitives, models were trained on one subset of primitives (all except one) and tested on one held-out primitive. This was performed once for each primitive, with the final FVAF taken as the mean FVAF across all train–test splits. We used sessions performing the single-shape task with location and size variation (S1, n = 30 primitives across 5 sessions; S2, n = 42 primitives across 3 sessions, which excludes 2 sessions with fewer than 6 primitives). We included data from throughout the stroke duration.

To allow for the possibility that kinematics better relate to neural activity at a non-zero lag, we performed this analysis multiple times with different time lags between neural and behavioural data, varying from –0.3 to 0.3 s (0.05 s increments). The final scalar summary averaged the results in a time window from –0.15 s to –0.05 s (with neural leading behaviour), consistent with a peak lag of around –0.1 s (neural leading) found in previous studies¹⁰² and in our analysis (Extended Data Fig. 16).

Note that the FVAF values we found in M1 are comparable to the average value of 0.3 in a previous study of the motor cortex in human handwriting¹⁰¹. Our finding of lower values (ranging from around 0.0 to 0.2, Extended Data Fig. 16b) is consistent with our use of trial-level instead of trial-averaged data (which introduces more variability) and of testing generalization across primitives (note that if we instead test generalization to held-out trials, this results in higher FVAF values ranging from around 0.1 to 0.3).

Statistics and reproducibility

The findings from this study resulted from experiments that produced similar results across multiple independent repetitions: location invariance (two sessions for S1 and three for S2); categorical structure (three sessions for S1 and four for S2); and recombination in characters (nine sessions for S1 and nine for S2). Randomization was performed by having all comparisons between experimental conditions done within the same animal, with all experimental conditions (task variants × stimuli) presented randomly. No blinding in group allocation was necessary as each subject was tested in all experimental conditions. Blinding of the subject and experimenter during data collection was effectively implemented owing to the randomization and balancing of conditions across trials. All behavioural and neural analyses were performed using custom-written Python (v.3.8) code unless otherwise noted, incorporating the analysis and plotting libraries numpy (v.1.24.3), scipy (v.1.10.1), scikit-learn (v.1.3.0), pandas (v.2.0.3), seaborn (v.0.12.2), elephant (v.1.0.0) and statsmodels (v.0.14.0).

The following are detailed descriptions of statistics that did not fit in the figure legends.

For Fig. 2e, n = 648 (same shape, size and location: YYY), 1,296 (YYN), 1,296 (YNY), 2,592 (YNN), 5,184 (NYY), from a pool of 729 trials (81 conditions × 9 trials each). Two-sided Wilcoxon signed-rank tests comparing NYY to others: versus YNY (W = 0, ***P = 5.36 × 10⁻¹⁵), versus YYN (W = 0, ***P = 5.36 × 10⁻¹⁵), versus YNN (W = 0, ***P = 5.36 × 10⁻¹⁵); n = 81 shape, size and location conditions.

For Fig. 2j, test for sigmoidal nonlinearity: ^###P = 1.91 × 10⁻⁶, two-sided Wilcoxon signed-rank test (W = 0, n = 20) that drawing < image (U1) and drawing > image (U2). Test for trial-by-trial switching: ***P = 1.91 × 10⁻⁶, two-sided Wilcoxon signed-rank test (W = 0, n = 20) that A2 > A1 (drawing).

For Fig. 2p, for remix primitive sets, three independent simulations are plotted separately. Wilcoxon signed-rank tests: results were highly significant when comparing match to one’s own primitives versus other. For S1 data, S1 versus S2 (W = 0, ***P = 6.75 × 10⁻²⁵), versus S1 remix (W = 125, ***P = 3.43 × 10⁻²²), versus S2 remix (W = 0, ***P = 8.75 × 10⁻²⁵). For S2 data, S2 versus S1 (W = 49, ***P = 6.26 × 10⁻²³), versus S2 remix (W = 424.5, ***P = 3.61 × 10⁻¹⁶), versus S1 remix (W = 0, ***P = 5.82 × 10⁻²⁵). Results were also highly significant when comparing one’s own remixed primitives versus others. For S1 data, S1 remix versus S2 (W = 995, ^###P = 1.41 × 10⁻¹¹), versus S2 remix (W = 351, ^###P = 1.90 × 10⁻²⁰). For S2 data, S2 remix versus S1 (W = 731.5, ^###P = 1.09 × 10⁻¹¹), versus S1 remix (W = 1, ^###P = 2.85 × 10⁻²⁰). For remixed primitives, the least significant of the three simulations is shown. Sample sizes were identical for all tests (n = 141 characters, combining novel and practised).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.