Monday 17 October 2022

Judging in PS - proposal

In my previous post I considered how PS can be judged in a competitive setting. Is there a more definitive way of considering PS that divides its elements in a more logical fashion?

In my post on approaching technical PS, I mentioned that spinning can be considered at various depths - superficially, as core parameters and what tricks were done, and with increasing depth, as how links complement each other to create effect, how the ideas shown in the combo relate to other existing combos, and so on. What happens if we try this approach for the 3 ‘main’ concrete criteria - execution, difficulty, and originality?


The first idea I had was a direct application of this to the combo itself - considering it on a ‘micro’ scale (trick technique, separate tricks and links) and a ‘macro’ scale (what each link contributes to the combo, and what the combo contributes to that spinner’s other combos, the other combos in that tournament, and all existing combos). Personally, I found this way of considering PS extremely useful in evolving my spinning from 2016 to 2021, when I was training seriously: it allowed me to join the dots of the abstract vision in my head and create actual combos that progressed along the path I dreamt of. The problem with this combo-based depth division for practical judging is that it obscures some basic questions, like ‘is the overall hardness of the combo micro or macro?’


The solution to this problem is to use the depth consideration on the 3 ‘main’ criteria themselves, rather than on the links in the combo. What do we end up with then?


Execution-related


Superficial: Technique perfection - whether the tricks are performed properly (with regard to the mod’s rotations, its position relative to the fingers, and whether the accelerations and decelerations are performed well). Of course, technique level is far more nuanced than this: think of Hash, Noel, and Dary as aesthetic-based examples, or specific ways of performing hard tricks - doing PD fl around rev with minimal hand movement, or pen spinning parallel to the ground - as technical-based examples. To accommodate higher-level technique, higher scores can be given for exceptional mastery in this subcriteria.


Deep: Effect - how visual elements come together to create the overall impressions of the combo. Note that visual elements are not necessarily what one finds subjectively appealing (nor a matter of whose technique is better in performance). For example, Nine's WC22 R2 and Saltient's WT21 R4 make distinctly impactful use of visual elements while not necessarily having the most perfect technique. Depending on the event, less intuitive evolutions of effect - like Dary or Noel's JEB Spinfest 2019 - that require more specific thought to implement may be weighted over more intuitive evolutions that require less specific experimentation.


Difficulty-related


Superficial: Hardness - how hard the breakdown is. Consider the mechanics shown and the precision required (whether margins of error over rotation speed, position of the mod, etc.) or specific movements in tight timeframes (links that require control over ¼ rotations, like moonwalk inverse side sonic with minimal change in palm orientation, or the fast hold-release transitions in palm down 1p2h twirl fall). Since this subcriteria only considers how hard the breakdown is, it does not penalise a combo made of filler plus a short hard sequence relative to a combo with the same overall hardness distributed evenly.


Deep: Density - how the various mechanics interact with each other, and their arrangement throughout the entire combo. Are there any filler moments where the chain of rising difficulty is broken? The range of skill sets mastered, and whether there are interactions beyond those required in learning the separate tricks, are also considered (e.g. pinky bust cardioid - seasick - curled pinky bak 1.5 in my WT21 R3 requires learning different mechanics beyond the separate tricks). I considered renaming this subcriteria due to the confusion and arguments ‘density’ has created in the past, but regardless of what name is chosen, it still makes sense to consider the deeper elements of difficulty like this.


Originality-related


Superficial: Novelty - whether the material, or similar material, has been done before, and how often. While this risks judges forgetting old videos, and carries theoretical risks of encouraging people to hide material, delete old content, or send subpar collab submissions, it makes sense as the most basic way of considering originality. The considerations around when to release material are something people in many artforms and professions deal with on a daily basis.


Deep: Conception - whether there are deeper overarching abstract ideas explored or conveyed. This is potentially the most nebulous criteria, because it deals with perceived intent, which can vary depending on the audience, the competitor’s posted explanations, and previous combos. In theory, it’s possible a very experienced judge may see a deeper significance behind the combo that the competitor missed. While very few spinners or combos step into this territory (as such, the majority of submissions will score quite low in this area), the works that reach it can create new paths or new ways of considering PS - e.g. OhYeah's WPSAL 2017 1p2h for 1p2h interactions, or Saltient's WT21 combos for visual structures and alien impressions, or RPD's PSO20 multipen for creating 1p1h-like flow with 2p1h mechanics.


Some of the combos I put the most work into aimed to step into this territory - my WT21 R5, which exploited the properties of power- and timing-based difficulty, and my WT19 R5, a condensation of complex mechanics. Another combo of mine that touches on conception, albeit at a shallower depth, is my 2p2h combo in the tag with Supawit, which explored a new form of transfers. Initially, I only saw these transfers as cool visual effects rather than as a generalisable solution to ‘how do I transfer each pen to a different hand without the mods leaving contact with the hands?’


Presentation will remain as before, i.e. up to -2 points depending on how detrimental the presentation is to understanding what’s going on in the video. Of course, the choice of background colour, lighting, mod colour, and filming angle plays a huge role in the final effects and impressions - worth far more than 0-2 points - and a larger variety of material requires an appropriate choice from an ever-increasing range of setups. It is worth considering whether the detrimental aspects of presentation could be the ‘superficial’ consideration, with the ‘deep’ consideration being how the setup relates to the choice and performance of the material. The problem with putting more weight into presentation is that, out of all the subcriteria, it is likely the most prone to excusable arbitrary exploitation.
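The presentation rule above can be made concrete with a minimal sketch - the function name and scale here are my own invention for illustration, not part of any official criteria:

```python
def apply_presentation_penalty(total: float, penalty: float) -> float:
    """Deduct up to 2 points for presentation that obscures the combo.

    The penalty is clamped to the 0-2 range, matching the 'up to -2 points'
    rule: presentation can only subtract, and never by more than 2.
    """
    return total - max(0.0, min(2.0, penalty))

# A combo totalling 28 with a moderate presentation issue:
score = apply_presentation_penalty(28, 1.5)  # 26.5
# Even an unwatchably bad setup cannot cost more than 2 points:
floor = apply_presentation_penalty(28, 5.0)  # 26.0
```

The clamp is the whole point: however bad the setup, the deduction is capped, which is exactly why the text argues presentation is currently worth less than its real influence on impressions.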


If you have read up to here, then I’m very grateful. While it’s unlikely you will agree with everything said here (it would be strange if you did), I hope my thoughts will leave an impression on you and provoke some thinking of your own.


To finish off, I'll give what is probably my favourite quote - a reply by ZUN (the creator of the Touhou project series, who is self-taught in both music composition and computer programming) when asked what made Super Mario Brothers and Street Fighter 2 notable when they released:


“Those games were revolutionary because they had things like different systems from games before them, creating new atmospheres within themselves. Later, people would say stuff like "that game engine was revolutionary" or "the characters had a lot of appeal", but at the time, no one really thought about the individual aspects because they were too busy playing. Games don't become hits because of those kinds of reasons. The systems in those games weren't just the pinnacle of all the games made up to that point, there was also a decisive difference. If I had to put it into words, I would say they "created a new world". Though it's a little different from the usual meaning, let's just go with that.” - ZUN.


Will your spinning create a new world? It's up to you.


Judging in PS - considerations

Hello everyone, it’s been a while. While I don’t train that seriously nowadays, I still pick up the mod pretty often and think about PS quite a lot, and I’ve gained some interesting insights from reading about medical education as well. This is the first post about PS judging in this 'series'; an attempt at addressing many of the questions raised here is made in the second post.

So PSO22 is coming around, and it will once again try a slightly different way of judging. Anyone who’s been around the competitive PS scene for a few years will know the countless discussions (in worse cases: drama, arguments, salt, grudges) surrounding any attempt to assess our artform.


One can ask whether PS (or arts in general) should be competitive, or have criteria, or have numerical scores - unfortunately, human nature dictates that humans are competitive, and competitions, criteria, and scoring exist in PS and other arts, for better or worse. So let’s move onto more practical questions.


The most important question is ‘what purpose does PS competition serve?’ It is the most important question because everything else about how we judge follows from the answer.


Q: Is it to award a title to someone? 


A: We will never agree on who the #1 is, because different people prioritise different elements of PS (be it finesse of technique, technical skill, innovation, or other things). A title in itself has reduced meaning if it was awarded through unreliable or invalid methods of assessment, or if the competitors were not all ‘serious’ about preparation. Unlike in professional sports or arts, a title in PS is not tied to one’s primary income. Nonetheless, because competition exists, communities have kept trying different ways of assessing PS for competing.


Q: Is it to define what makes a better spinner or a better combo?


A: On a general level, it’s easy to define a ‘good combo’ - all its elements contribute to visuals and mechanics: combos that don’t do this will have wasteful or detrimental material. Of course, individuals may disagree on whether certain elements contribute in a positive way. It follows that a ‘good spinner’ is someone who makes many ‘good combos’. Do competitions exist merely to be satisfied with attaining ‘good’ rather than ‘exceptional’ or ‘groundbreaking’?


Q: Is it to promote activity and progression in the hobby?


A: I feel this is getting closer than the previous 2 questions. Direct competition fuels improvement and drives people to experiment outside of their comfort zones, in a way that PS collabs do not seem to. While there have been many historic CV combos in the aesthetic sense, the majority of groundbreaking combos in more technical (i.e. material and theory-based) aspects have been in tournaments. A perfectly disciplined human would continue pushing themselves in the same way regardless of whether an easily tangible goal like CV/tourney/solo exists, but there are no perfect humans. The excitement and discussions about tournament submissions and results also increase activity.


If we summarise the above, we end up with ‘competition should reward different aspects of PS in the many ways one can make a good combo, while giving further rewards to people who push the boundaries’. Perhaps personal projects like solo videos are more suited to experimental revolutionary material, but it is illogical if the highest level competitive event of our artform does not differentiate revolutionary combos or revolutionary spinners.


Before I discuss the system I want to try in the themes I’m judging for PSO22 (which can be generalised for themeless battles like WT), I will discuss what has been tried or suggested before, but didn’t work that well.


Q: Why can’t we just use comments only, no subcriteria?


A: In a world with great judges, PS competitions would produce reasonable winners with comments only. We don’t live in a perfect world. While numbered scores are arbitrary in their divisions and judges may not follow the example videos for what a certain score represents, a comments-only system will change those 5-7 arbitrary numbers into 1 large arbitrary result (i.e. the vote towards who wins). This works if we trust that the judges represent the views we desire for that competition.


It’s easier to think about a recent example - it’s justifiable to vote Mond over Saltient in WT21 R3 by prioritising execution. On a personal level, it is equally valid to prioritise basic control, or finesse of technique, or technical difficulty, or novel material. But how well does this align with what international PS competitions aim to do? Would voting Mond in a comments-only judgment promote spinners to continue pushing boundaries, or would it encourage them to stay in their comfort zones? While subcriteria cannot stop this (and should not explicitly stop specific judgments), judges should be held accountable for considering the specified elements. Comments-only does not address disagreements about judges overly prioritising certain aspects of spinning, nor does it address different understandings of elements like ‘structure’ or ‘creativity’. Having no subcriteria would make criticisms harder to specify, since the judge can just brush them off with ‘my general impression was this, I already explained myself, I define this element differently to how you do’, etc.


There have been events with comments-only judging where judges provide examples of combos they prefer (which may work for smaller events, and was done in one JEB tournament before), but for an international event this risks suggesting spinners should replicate existing impressions rather than create more evolved versions of their own paths. In past arguments over numbered scores, the disagreement was often about the outcome rather than the scores (even when reasonable, detailed justifications above and beyond the initially submitted comments were given, people still complained) - comments-only would not fix the fact that people get annoyed over their friend or favourite not winning, since that is an issue of sportsmanship and maturity.


Q: What if we make judges assess some varied, tough combos before they are allowed to judge the real event?


A: This was tried in the WT21 judge selections. It didn’t work as well as expected. From previous events, e.g. WT17 and WT19, judging appears to improve in quality as the tournament progresses (judges get more experience judging the spinners in the event, and fewer combos are sent, meaning more time spent assessing each combo and more detailed comments). Humans pay more attention to their performance when they know they are being assessed. While making sample judgments helps exclude some blatantly ‘off’ judgments, it isn’t that great at stopping strange judgments in the first half of a WT. Judge selection is surely more influential in determining results than the fine details of the criteria itself, but it is harder to deal with; judgments in previous events are the best determiner. Sample judgments still have a role in assessing new candidates, while allowing discussion before the actual event.


Q: Assuming one concedes the above and agrees to using subcriteria, why should they be given numerical scores? Numbers are annoying, introduce more variability and create strange inconsistencies.


A: Judges could be instructed to consider and comment specifically on various subcriteria in their comments-only judgments. However, it would be impossible to know whether the final win/loss vote appropriately accounted for those elements; or the judge could make a final vote contrary to what the tone of their comments indicates.


A system where the judge votes on which spinner did better in a given subcriteria (e.g. spinner A is better in exec, spinner B is better in difficulty), then totals those votes (perhaps with weighting for whatever aspects the tournament or theme wants to prioritise), could be tried. However, this would not account for large gaps in the respective elements. It was tried in SCT18 and worked well, since the competitors who passed the qualification round were all solid, but for a larger event with more varied submissions I doubt it would work. It’s strange if someone who submits a barely landed combo, or an extremely easy combo, or an extremely uncreative combo (e.g. a 2/10 vs an 8/10 in current scoring) suffers the same penalty as someone who sends something just slightly worse (a 7/10 vs an 8/10).
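The failure mode described above - a vote tally erasing the size of the gap - can be sketched in a few lines. The function, subcriteria names, and scores are all invented for illustration; only the voting idea comes from the text:

```python
def vote_winner(scores_a: dict, scores_b: dict, weights: dict = None) -> str:
    """SCT18-style tally: one (optionally weighted) vote per subcriteria won."""
    weights = weights or {k: 1 for k in scores_a}
    votes_a = sum(weights[k] for k in scores_a if scores_a[k] > scores_b[k])
    votes_b = sum(weights[k] for k in scores_b if scores_b[k] > scores_a[k])
    return "A" if votes_a > votes_b else "B" if votes_b > votes_a else "tie"

# Spinner A is solid across the board; spinner B barely lands the combo
# but squeaks ahead on difficulty and originality.
a = {"execution": 8, "difficulty": 8, "originality": 8}
b = {"execution": 2, "difficulty": 9, "originality": 9}

vote_winner(a, b)           # "B" - wins the tally 2 votes to 1
sum(a.values())             # 24 - yet A's numeric total is clearly higher
sum(b.values())             # 20
```

The tally treats an 8-vs-2 loss in execution identically to an 8-vs-7 loss, which is exactly the 2/10-vs-7/10 objection raised above; numeric totals preserve that gap.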


If we are to use numbers, it may help to have fewer subcriteria with unified score weightings (e.g. all subcriteria scored out of 5, rather than some out of 10, some out of 5, and some out of 3; if certain elements are to be prioritised, it’s easier to apply a separate multiplier afterwards). In a talk about how communication and professionalism can be assessed in medical students (given to my uni’s medicine faculty by a professor of medical education), more detailed subcriteria served only to annoy the examiners while making the results less reliable on statistical analysis. The WT19 criteria had too many subcriteria, with some unintuitive definitions that overlapped in places.
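The "uniform scale plus a separate multiplier" idea can be sketched as follows - the subcriteria names reuse the labels from the first post, but the function, weights, and scores are assumptions of mine, not a proposed official scheme:

```python
# Hypothetical: every subcriteria scored 0-5, with theme priorities
# applied as multipliers afterwards rather than baked into uneven scales.
SUBCRITERIA = ["technique", "effect", "hardness", "density", "novelty", "conception"]

def weighted_total(scores: dict, multipliers: dict = None) -> float:
    """Sum uniform 0-5 scores, applying optional per-subcriteria multipliers."""
    multipliers = multipliers or {}
    return sum(scores[s] * multipliers.get(s, 1.0) for s in SUBCRITERIA)

# A theme prioritising originality could simply double those two subcriteria:
theme_weights = {"novelty": 2.0, "conception": 2.0}
scores = {"technique": 4, "effect": 3, "hardness": 4,
          "density": 3, "novelty": 5, "conception": 2}

weighted_total(scores, theme_weights)  # 4 + 3 + 4 + 3 + 10 + 4 = 28
```

Keeping the raw scale uniform and pushing all prioritisation into the multipliers means judges only ever reason about one 0-5 scale, which is the simplification the paragraph argues for.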


Q: Why don’t we try to get consensus about the criteria by asking a lot of representatives from different countries?


A: This approach works in established fields with established theories and established experts (who usually conduct such discussions in a mutually understood language), and it has been done for many curricula, guidelines, and regulations in professional fields. Even if we could overcome the language barriers between countries, PS still struggles to explain many foundational concepts in any given language - e.g. the English-speaking community is still struggling to express ‘good structure’ or ‘good pacing’ in words. And while there are many established 'good' spinners, they may not be able to express their understanding in words, they probably do not agree with each other (and may never agree with certain other established spinners in the discussion), and their views may not align with what the tournament aims to prioritise.


There are many ways to describe important elements, which in turn have varying overlap, which then have varied practicality when being used as subcriteria. Fortunately, the discussions stemming from previous competitions serve as a good proxy for this topic. Unfortunately, these past discussions tell us that we are unlikely to achieve a consensus. There have been arguments for years that come up again every time we have a world event about: weight of execution-related elements, how one assesses difficulty, what degree or kinds of repetition are bad structure, what the penalty for reusing identical or similar linkages should be (and where the material was previously shown), what constitutes good flow, and so on. More discussion is a good thing, but the impression I got is that we’ve ended up repeating old points without introducing any new helpful ideas.


Q: I feel [insert element here] is important. Why shouldn’t it be a subcriteria?


A: There are many ways to break up the puzzle pieces of what a ‘good combo’ consists of. However, just because a certain grouping of elements makes a good term - e.g. ‘pacing’, ‘tension’, ‘structure’, ‘coherence’, or ‘refinement’ - does not mean it makes a good subcriteria. To elaborate, ‘structure’ can be considered as the arrangement of mechanics (like ‘density’), the arrangement of visual impressions (like ‘effect’), or the arrangement of new ideas (like ‘integration’). While ‘structure’ is a useful way of considering PS, it is hard to use as a subcriteria. A similar point can be made about ‘density’, ‘integration’, and ‘effect’, which were part of the WT19 subcriteria under ‘effectiveness’. Similarly, ‘coherence’ overlaps with ‘structure’ and is influenced by ‘pacing’ and ‘tension’. ‘Pacing’ can be considered as the use of speed, the effect of movements of hand, wrist, and mod, and the visual effect of the mod during the tricks performed, which in turn is influenced by the chosen angle and background/mod colours, etc. While many abstract terms are useful for general discussion, is there a more practical way of dividing things for judging purposes?


Q: So you’ve raised all these criticisms but what’s your constructive proposal? If you are only criticising but not making active suggestions, what’s the point?


A: If you’ve read up to here, congratulations! Now we can move onto my suggestions for addressing a lot of these things: judging proposals. I won’t claim there is any definitive ‘solution’ since there surely isn’t one, and there won’t be a way to satisfy everyone, but at least I can offer what is (probably) a more practical way of breaking things down.