Test data:
Grumpier Old Men (1995)
Death Note: Desu nôto (2006–2007)
Irwin & Fran 2013
9500 Liberty (2009)
Captive Women (1000 Years from Now) (3000 A.D.) (1952)
The Garden of Afflictions 2017
The Naked Truth (1957) (Your Past Is Showing)
Conquest 1453 (Fetih 1453) (2012)
Commune, La (Paris, 1871) (2000)
1013 Briar Lane
Returns:
1995
2006
2013
2009
1952
2017
1957
1453 <--
1871 <--
<-- There is nothing here, as 1013 is not a year
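As you can see above, two of the titles give the wrong result (marked with arrows). 1013 Briar Lane is correct: it returns nothing, because 1013 is not a year.
Here is my code: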
import pyspark.sql.functions as F
from pyspark.sql.functions import regexp_extract, col

bracket_regexp = "((?<=\()\d{4}(?=[^\(]*$))"
movies_DF = movies_DF.withColumn('yearOfRelease', regexp_extract("title", bracket_regexp + "|(\d{4}$)", 0))
movies_DF.display(10000)
How can I get the correct year from the title string?
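One possible direction, sketched in plain Python `re` rather than Spark so it can be checked quickly: only accept a four-digit number in parentheses when the closing parenthesis (or a year range such as 2006–2007) follows it directly, and otherwise fall back to four digits at the very end of the title. This assumes the expected results for the two flagged titles are 2012 and 2000, which the post does not state explicitly. The names `YEAR_RE` and `extract_year` are hypothetical, and the pattern has only been reasoned through against the ten titles above.

```python
import re

# Assumption: a release year is either "(YYYY)" / "(YYYY-YYYY)" / "(YYYY–YYYY)",
# or a bare YYYY at the very end of the title.
YEAR_RE = re.compile(r"\((\d{4})(?:[-–]\d{4})?\)|(\d{4})$")


def extract_year(title):
    """Return the release year as a string, or "" when none is found."""
    match = YEAR_RE.search(title)
    if match is None:
        return ""
    # Group 1 is the parenthesised year, group 2 the trailing bare year.
    return match.group(1) or match.group(2)


titles = [
    "Grumpier Old Men (1995)",
    "Death Note: Desu nôto (2006–2007)",
    "Irwin & Fran 2013",
    "9500 Liberty (2009)",
    "Captive Women (1000 Years from Now) (3000 A.D.) (1952)",
    "The Garden of Afflictions 2017",
    "The Naked Truth (1957) (Your Past Is Showing)",
    "Conquest 1453 (Fetih 1453) (2012)",
    "Commune, La (Paris, 1871) (2000)",
    "1013 Briar Lane",
]

for title in titles:
    print(repr(extract_year(title)), title)
```

Because the year lands in one of two capture groups, a single `regexp_extract(..., 0)` call would return the parentheses as well. In PySpark this would probably need one `regexp_extract` per group, combined afterwards; that part is untested here.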