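I know that the XML package is widely used across the R universe.
So please treat this as a follow-up post and/or reference, with hopefully informative yet concise illustrations of the problem.
The problem
Once an XML/HTML document has been parsed so that it can be searched with XPath, C pointers are used internally (AFAIU). At least on MS Windows (I'm running Windows 8.1, 64-bit), these references do not seem to be recognized correctly by the garbage collector. As a result, the consumed memory is not freed properly, which at some point freezes the R process.
Central findings so far
It seems to me that XML::free and/or gc fail to reclaim all memory when XML/HTML documents are parsed via xmlParse or htmlParse and subsequently processed with xpathApply and the like:
The memory usage that the OS reports for the task (Rterm.exe) grows significantly faster than the memory that the R process reports "from within R" (function memory.size), which increases only moderately by comparison. See the list elements mem_r, mem_os and ratio before and after a substantial parsing cycle below.
All in all, even after throwing in everything that has been recommended (free, rm and gc), memory usage always increases when xmlParse and the like are called. It is just a matter of how much. So IMHO something is still not working correctly.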
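For reference, the recommended cleanup (free, rm and gc) looks roughly like this in isolation. This is only a minimal sketch: it assumes the XML package is installed, and the inline document and the XPath expression are made up for illustration.

```r
library(XML)

# Parse a small inline document (hypothetical content, for illustration only)
doc <- xmlParse("<cars><car>Mazda RX4</car><car>Datsun 710</car></cars>",
                asText = TRUE)

# Search with XPath; this step works on the internal C-level document
vals <- unlist(xpathApply(doc, "//car", xmlValue))

# Everything that has been recommended for releasing the memory again
free(doc)        # release the C-level document
rm(doc)          # drop the R reference to it
invisible(gc())  # trigger a garbage collection
```

The values are extracted as plain character strings before free() is called, so nothing refers to the document afterwards.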
Illustration
I borrowed the profiling code from Duncan's Omegahat git repository.
Some preparations:
Sys.setenv("LANGUAGE"="en")
require("compiler")
require("XML")
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252
attached base packages:
[1] compiler stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] XML_3.98-1.1
The functions we need:
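getTaskMemoryByPid <- cmpfun(function(
  pid = Sys.getpid()
) {
  ## Ask the Windows task list for this process, as CSV
  cmd <- sprintf("tasklist /FI \"pid eq %s\" /FO csv", pid)
  ## Column 5 holds the memory usage, e.g. "51.072 K"
  mem <- read.csv(text = shell(cmd, intern = TRUE), stringsAsFactors = FALSE)[, 5]
  ## Strip "." (thousands separator in this locale), whitespace and the unit; K -> MB
  mem <- as.numeric(gsub("\\.|\\s|K", "", mem)) / 1000
  mem
}, options = list(suppressAll = TRUE))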
memoryLeak <- cmpfun(function(
  x = system.file("exampleData", "mtcars.xml", package = "XML"),
  n = 10000,
  use_text = FALSE,
  xpath = FALSE,
  free_doc = FALSE,
  clean_up = FALSE,
  detailed = FALSE
) {
  if (use_text) {
    x <- readLines(x)
  }
  ## Before //
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  prof_1 <- memory.profile()
  mem_before <- list(mem_r = mem_r, mem_os = mem_os, ratio = mem_os / mem_r)
  ## Per run //
  mem_perrun <- lapply(1:n, function(ii) {
    doc <- xmlParse(x, asText = use_text)
    if (xpath) {
      res <- xpathApply(doc = doc, path = "/blah", fun = xmlValue)
      rm(res)
    }
    if (free_doc) {
      free(doc)
    }
    rm(doc)
    out <- NULL
    if (detailed) {
      out <- list(
        profile = memory.profile(),
        size = memory.size()
      )
    }
    out
  })
  has_perrun <- any(sapply(mem_perrun, length) > 0)
  if (!has_perrun) {
    mem_perrun <- NULL
  }
  ## Garbage collect //
  mem_gc <- NULL
  if (clean_up) {
    gc()
    tmp <- gc()
    mem_gc <- list(gc_mb = tmp["Ncells", "(Mb)"])
  }
  ## After //
  mem_os <- getTaskMemoryByPid()
  mem_r <- memory.size()
  prof_2 <- memory.profile()
  ## mem_os must be stored here as well: increase_os below reads mem_after$mem_os
  mem_after <- list(mem_r = mem_r, mem_os = mem_os, ratio = mem_os / mem_r)
  list(
    before = mem_before,
    perrun = mem_perrun,
    gc = mem_gc,
    after = mem_after,
    comparison_r = data.frame(
      before = prof_1,
      after = prof_2,
      increase = round((prof_2 / prof_1) - 1, 4)
    ),
    increase_r = (mem_after$mem_r / mem_before$mem_r) - 1,
    increase_os = (mem_after$mem_os / mem_before$mem_os) - 1
  )
}, options = list(suppressAll = TRUE))
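One caveat about getTaskMemoryByPid: the gsub pattern removes ".", which is the thousands separator tasklist prints under the German locale used here. On an English-locale system the separator is ",". A locale-tolerant variant could simply drop everything that is not a digit. The helper name parseTasklistMem is hypothetical, and the sketch assumes tasklist always reports whole kilobytes.

```r
# Hypothetical helper: convert a tasklist memory string such as "51.072 K"
# (German locale) or "51,072 K" (English locale) to MB, using the same
# K / 1000 conversion as getTaskMemoryByPid.
parseTasklistMem <- function(x) {
  digits <- gsub("[^0-9]", "", x)  # drop separators, whitespace and the unit
  as.numeric(digits) / 1000
}

mem_mb <- parseTasklistMem(c("51.072 K", "51,072 K"))
```

Both inputs yield the same value, so the rest of the profiling code would not need to know which locale produced the string.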
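Results
Scenario 1
Quick facts: garbage collection enabled, XML doc is parsed n times but not searched via xpathApply.
Note the ratio of OS memory to R memory:
Before: 1.364832
After: 1.322702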
res <- memoryLeak(clean_up=TRUE,n=50000)
save(res,file=file.path(tempdir(),"memory-profile-1.rdata"))
> res
$before
$before$mem_r
[1] 37.42
$before$mem_os
[1] 51.072
$before$ratio
[1] 1.364832
$perrun
NULL
$gc
$gc$gc_mb
[1] 45
$after
$after$mem_r
[1] 63.21
$after$mem_os
[1] 83.608
$after$ratio
[1] 1.322702
$comparison_r
before after increase
NULL 1 1 0.0000
symbol 7387 7392 0.0007
pairlist 190383 390633 1.0518
closure 5077 55085 9.8499
environment 1032 51032 48.4496
promise 5226 105226 19.1351
language 54675 54791 0.0021
special 44 44 0.0000
builtin 648 648 0.0000
char 8746 8763 0.0019
logical 9081 9084 0.0003
integer 22804 22807 0.0001
double 2773 2783 0.0036
complex 1 1 0.0000
character 44522 94569 1.1241
... 0 0 NaN
any 0 0 NaN
list 19946 19951 0.0003
expression 1 1 0.0000
bytecode 16049 16050 0.0001
externalptr 1487 1487 0.0000
weakref 391 391 0.0000
raw 392 392 0.0000
S4 1392 1392 0.0000
$increase_r
[1] 0.6892036
$increase_os
[1] 0.6370614
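To make the derived figures easier to follow, here is how ratio, increase_r and increase_os come out of the four reported memory values. This is just the arithmetic from memoryLeak replayed on the scenario 1 numbers printed above; the variable names are chosen for this sketch.

```r
# Values copied from the scenario 1 output (MB)
before <- list(mem_r = 37.42, mem_os = 51.072)
after  <- list(mem_r = 63.21, mem_os = 83.608)

# OS-reported memory relative to R-reported memory
ratio_before <- before$mem_os / before$mem_r
ratio_after  <- after$mem_os / after$mem_r

# Relative growth over the parsing cycle
increase_r  <- after$mem_r / before$mem_r - 1
increase_os <- after$mem_os / before$mem_os - 1

round(c(ratio_before, ratio_after, increase_r, increase_os), 4)
```

In this scenario both views grow at a similar rate, so the ratio stays roughly constant.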
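Scenario 2
Quick facts: garbage collection enabled, free is called explicitly, XML doc is parsed n times but not searched via xpathApply.
Note the ratio of OS memory to R memory:
Before: 1.315249
After: 1.222143
res <- memoryLeak(clean_up=TRUE,free_doc=TRUE,n=50000)
save(res,file=file.path(tempdir(),"memory-profile-2.rdata"))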
> res
$before
$before$mem_r
[1] 63.48
$before$mem_os
[1] 83.492
$before$ratio
[1] 1.315249
$perrun
NULL
$gc
$gc$gc_mb
[1] 69.3
$after
$after$mem_r
[1] 95.92
$after$mem_os
[1] 117.228
$after$ratio
[1] 1.222143
$comparison_r
before after increase
NULL 1 1 0.0000
symbol 7454 7454 0.0000
pairlist 392455 592466 0.5096
closure 55104 105104 0.9074
environment 51032 101032 0.9798
promise 105226 205226 0.9503
language 55592 55592 0.0000
special 44 44 0.0000
builtin 648 648 0.0000
char 8847 8848 0.0001
logical 9141 9141 0.0000
integer 23109 23111 0.0001
double 2802 2807 0.0018
complex 1 1 0.0000
character 94775 144781 0.5276
... 0 0 NaN
any 0 0 NaN
list 20174 20177 0.0001
expression 1 1 0.0000
bytecode 16265 16265 0.0000
externalptr 1488 1487 -0.0007
weakref 392 391 -0.0026
raw 393 392 -0.0025
S4 1392 1392 0.0000
$increase_r
[1] 0.5110271
$increase_os
[1] 0.4040627
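Scenario 3
Quick facts: garbage collection enabled, free is called explicitly, XML doc is parsed n times and searched via xpathApply each time.
Note the ratio of OS memory to R memory:
Before: 1.220429
After: 13.15629 (!)
res <- memoryLeak(clean_up=TRUE,free_doc=TRUE,xpath=TRUE,n=50000)
save(res,file=file.path(tempdir(),"memory-profile-3.rdata"))
> res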
$before
$before$mem_r
[1] 95.94
$before$mem_os
[1] 117.088
$before$ratio
[1] 1.220429
$perrun
NULL
$gc
$gc$gc_mb
[1] 93.4
$after
$after$mem_r
[1] 124.64
$after$mem_os
[1] 1639.8
$after$ratio
[1] 13.15629
$comparison_r
before after increase
NULL 1 1 0.0000
symbol 7454 7460 0.0008
pairlist 592458 793042 0.3386
closure 105104 155110 0.4758
environment 101032 151032 0.4949
promise 205226 305226 0.4873
language 55592 55882 0.0052
special 44 44 0.0000
builtin 648 648 0.0000
char 8847 8867 0.0023
logical 9142 9162 0.0022
integer 23109 23112 0.0001
double 2802 2832 0.0107
complex 1 1 0.0000
character 144775 194819 0.3457
... 0 0 NaN
any 0 0 NaN
list 20174 20177 0.0001
expression 1 1 0.0000
bytecode 16265 16265 0.0000
externalptr 1488 1487 -0.0007
weakref 392 391 -0.0026
raw 393 392 -0.0025
S4 1392 1392 0.0000
$increase_r
[1] 0.2991453
$increase_os
[1] 13.00485
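A rough back-of-the-envelope reading of these numbers: if the gap between mem_os and mem_r is taken as memory that R does not account for, its growth can be spread over the parse runs. The sketch assumes n = 50000 as in scenario 1 (the environment count in comparison_r grows by exactly 50000, which is consistent with that) and uses the same 1 MB = 1000 K convention as getTaskMemoryByPid. It is an estimate, not a measurement.

```r
n <- 50000  # assumption: same number of runs as in scenario 1

# Memory the OS sees but R does not report (MB), from the scenario 3 output
gap_before <- 117.088 - 95.94
gap_after  <- 1639.8 - 124.64

# Growth of that gap per parse-and-search run, in K
growth_per_run_k <- (gap_after - gap_before) / n * 1000
growth_per_run_k
```

That works out to roughly 30 K per run that only the OS sees, which is why the effect only becomes obvious after many thousands of iterations.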
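I also tried different versions. Well, I tried to try ;-)
From source, from omegahat.org
FYI: the latest Rtools 3.1 is installed and included in the Windows PATH (e.g. installing stringr from source works fine).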
> install.packages("XML",repos="http://www.omegahat.org/R",type="source")
trying URL 'http://www.omegahat.org/R/src/contrib/XML_3.98-1.tar.gz'
Content type 'application/x-gzip' length 1543387 bytes (1.5 Mb)
opened URL
downloaded 1.5 Mb
* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB,LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
The downloaded source packages are in
'C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\downloaded_packages'
Warning messages:
1: running command '"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" CMD INSTALL -l "R:\home\apps\lsqmapps\apps\r\R-3.1.0\library" C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/downloaded_packages/XML_3.98-1.tar.gz' had status 1
2: In install.packages("XML",repos = "http://www.omegahat.org/R",:
installation of package 'XML' had non-zero exit status
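From GitHub
I did not follow the recommendation in the README of the GitHub repo, because it points to this directory, which contains only a tar.gz of version 3.94-0 (while CRAN was at 3.98-1.1 at the time).
Even though the GitHub repo is said not to follow the standard R package structure, I tried install_github anyway, and it failed ;-)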
require("devtools")
> install_github(repo="XML",username="omegahat")
Installing github repo XML/master from omegahat
Downloading master.zip from https://github.com/omegahat/XML/archive/master.zip
Installing package from C:\Users\RAPPST~1\AppData\Local\Temp\RtmpQFZ2Ck/master.zip
Installing XML
"R:/home/apps/lsqmapps/apps/r/R-3.1.0/bin/x64/R" --vanilla CMD INSTALL \
"C:\Users\rappster_admin\AppData\Local\Temp\RtmpQFZ2Ck\devtools15c82d7c2b4c\XML-master" \
--library="R:/home/apps/lsqmapps/apps/r/R-3.1.0/library" --with-keep.source \
--install-tests
* installing *source* package 'XML' ...
Please define LIB_XML (and LIB_ZLIB,LIB_ICONV)
Warning: running command 'sh ./configure.win' had status 1
ERROR: configuration failed for package 'XML'
* removing 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
* restoring previous 'R:/home/apps/lsqmapps/apps/r/R-3.1.0/library/XML'
Error: Command failed (1)
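The relevant functions (from the xml2 package) are:
read_xml() to read an XML file
xml_children() to get a node's child nodes
xml_text() to get the text inside a tag
xml_attrs() to get a character vector of a node's attributes and values, which can be converted to a named list with as.list()
Note that you still need to make sure you rm() your XML node objects once you are done with them and force a garbage collection with gc(), but the memory then actually does get released to the O/S (disclaimer: only tested on Windows 7, but this seems to be the most "memory-leaky" platform anyway).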
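Put together, the functions listed above can be used along these lines. This is a minimal sketch that assumes the xml2 package is installed; the inline document is made up for illustration.

```r
library(xml2)

# Parse a small inline document (hypothetical content, for illustration only)
doc <- read_xml(
  "<cars><car id='1' make='Mazda'>RX4</car><car id='2' make='Datsun'>710</car></cars>"
)

nodes <- xml_children(doc)        # child nodes of the root
texts <- xml_text(nodes)          # text inside each <car> tag
attrs <- xml_attrs(nodes[[1]])    # named character vector for the first node
attr_list <- as.list(attrs)       # the same attributes as a named list

# Drop the node objects and the document, then force a garbage collection
rm(nodes, doc)
invisible(gc())
```

Only plain R vectors and lists (texts, attrs, attr_list) survive the cleanup, so no reference to the parsed document remains.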
Hope this helps someone!